* btrfs RAID 10 truncates files over 2G to 4096 bytes.
@ 2016-07-02 23:36 Tomasz Kusmierz
2016-07-04 21:13 ` Henk Slager
0 siblings, 1 reply; 16+ messages in thread
From: Tomasz Kusmierz @ 2016-07-02 23:36 UTC (permalink / raw)
To: linux-btrfs
Hi,
My setup is that I use one filesystem for / and /home (on an SSD) and a
larger RAID 10 for /mnt/share (6 x 2TB).
Today I discovered that 14 files that are supposed to be over 2GB are
in fact just 4096 bytes. I've checked the content of those 4KB and they
seem to contain the data that was at the beginning of each file.
I've experienced this problem in the past (3 - 4 years ago?) but
attributed it to a different problem that I discussed with you guys here
(corruption due to non-ECC RAM). At that time I deleted the affected
files (56 of them); a similar problem appeared again between one and two
years ago, and I believe I deleted those files as well.
I periodically (once a month) run a scrub on my system to catch any
errors sneaking in. I believe I did a balance about half a year ago, to
reclaim space after I deleted a large database.
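For reference, the monthly maintenance described above could be run roughly as
follows; this is only a sketch, using the /mnt/share mount point from this mail,
and the usage=50 threshold is just an arbitrary example value:

btrfs scrub start -Bd /mnt/share           # scrub in the foreground, per-device statistics
btrfs scrub status /mnt/share              # check the error counters afterwards
btrfs balance start -dusage=50 /mnt/share  # optional: compact data chunks that are at most half full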
root@noname_server:/mnt/share# btrfs fi show
Label: none uuid: 060c2345-5d2f-4965-b0a2-47ed2d1a5ba2
Total devices 1 FS bytes used 177.19GiB
devid 3 size 899.22GiB used 360.06GiB path /dev/sde2
Label: none uuid: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1
Total devices 6 FS bytes used 4.02TiB
devid 1 size 1.82TiB used 1.34TiB path /dev/sdg1
devid 2 size 1.82TiB used 1.34TiB path /dev/sdh1
devid 3 size 1.82TiB used 1.34TiB path /dev/sdi1
devid 4 size 1.82TiB used 1.34TiB path /dev/sdb1
devid 5 size 1.82TiB used 1.34TiB path /dev/sda1
devid 6 size 1.82TiB used 1.34TiB path /dev/sdf1
root@noname_server:/mnt/share# uname -a
Linux noname_server 4.4.0-28-generic #47-Ubuntu SMP Fri Jun 24
10:09:13 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
root@noname_server:/mnt/share# btrfs --version
btrfs-progs v4.4
root@noname_server:/mnt/share#
The problem is that stuff on this filesystem moves so slowly that it's
hard to remember historical events ... it's like AWS Glacier. What I
can state with 100% certainty is that:
- the affected files are 2GB and over (safe to assume 4GB and over)
- the affected files were only ever read (and some not even read), never
written after being put into storage
- in the past I assumed the files were affected because of their size,
but I have quite a few ISO files and some backups of virtual machines
... no problems there - it seems the problem originates in one folder &
size > 2GB & extension .mkv
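As a rough illustration (the directory name victim_folder is hypothetical, taken
from later in this thread), the affected and unaffected files could be listed
with something like:

find /mnt/share/victim_folder -name '*.mkv' -size 4096c                  # files truncated to exactly one 4K block
find /mnt/share/victim_folder -name '*.mkv' -size +2G -printf '%s %p\n'  # large .mkv files that are still intact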
^ permalink raw reply [flat|nested] 16+ messages in thread* Re: btrfs RAID 10 truncates files over 2G to 4096 bytes. 2016-07-02 23:36 btrfs RAID 10 truncates files over 2G to 4096 bytes Tomasz Kusmierz @ 2016-07-04 21:13 ` Henk Slager 2016-07-04 21:28 ` Tomasz Kusmierz 2016-07-05 4:36 ` Duncan 0 siblings, 2 replies; 16+ messages in thread From: Henk Slager @ 2016-07-04 21:13 UTC (permalink / raw) To: Tomasz Kusmierz; +Cc: linux-btrfs On Sun, Jul 3, 2016 at 1:36 AM, Tomasz Kusmierz <tom.kusmierz@gmail.com> wrote: > Hi, > > My setup is that I use one file system for / and /home (on SSD) and a > larger raid 10 for /mnt/share (6 x 2TB). > > Today I've discovered that 14 of files that are supposed to be over > 2GB are in fact just 4096 bytes. I've checked the content of those 4KB > and it seems that it does contain information that were at the > beginnings of the files. > > I've experienced this problem in the past (3 - 4 years ago ?) but > attributed it to different problem that I've spoke with you guys here > about (corruption due to non ECC ram). At that time I did deleted > files affected (56) and similar problem was discovered a year but not > more than 2 years ago and I believe I've deleted the files. > > I periodically (once a month) run a scrub on my system to eliminate > any errors sneaking in. I believe I did a balance a half a year ago ? > to reclaim space after I deleted a large database. > > root@noname_server:/mnt/share# btrfs fi show > Label: none uuid: 060c2345-5d2f-4965-b0a2-47ed2d1a5ba2 > Total devices 1 FS bytes used 177.19GiB > devid 3 size 899.22GiB used 360.06GiB path /dev/sde2 > > Label: none uuid: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1 > Total devices 6 FS bytes used 4.02TiB > devid 1 size 1.82TiB used 1.34TiB path /dev/sdg1 > devid 2 size 1.82TiB used 1.34TiB path /dev/sdh1 > devid 3 size 1.82TiB used 1.34TiB path /dev/sdi1 > devid 4 size 1.82TiB used 1.34TiB path /dev/sdb1 > devid 5 size 1.82TiB used 1.34TiB path /dev/sda1 > devid 6 size 1.82TiB used 1.34TiB path /dev/sdf1 > > root@noname_server:/mnt/share# uname -a > Linux noname_server 4.4.0-28-generic #47-Ubuntu SMP Fri Jun 24 > 10:09:13 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux > root@noname_server:/mnt/share# btrfs --version > btrfs-progs v4.4 > root@noname_server:/mnt/share# > > > Problem is that stuff on this filesystem moves so slowly that it's > hard to remember historical events ... it's like AWS glacier. What I > can state with 100% certainty is that: > - files that are affected are 2GB and over (safe to assume 4GB and over) > - files affected were just read (and some not even read) never written > after putting into storage > - In the past I've assumed that files affected are due to size, but I > have quite few ISO files some backups of virtual machines ... no > problems there - seems like problem originates in one folder & size > > 2GB & extension .mkv In case some application is the root cause of the issue, I would say try to keep some ro snapshots done by a tool like snapper for example, but maybe you do that already. It sounds also like this is some kernel bug, snaphots won't help that much then I think. ^ permalink raw reply [flat|nested] 16+ messages in thread
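A minimal way to keep such read-only snapshots without snapper, assuming
/mnt/share is itself a subvolume and already has a .snapshots directory (both
assumptions), would be something like:

btrfs subvolume snapshot -r /mnt/share /mnt/share/.snapshots/share-$(date +%Y%m%d)  # create a read-only snapshot
btrfs subvolume list /mnt/share                                                     # confirm the snapshot exists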
* Re: btrfs RAID 10 truncates files over 2G to 4096 bytes. 2016-07-04 21:13 ` Henk Slager @ 2016-07-04 21:28 ` Tomasz Kusmierz 2016-07-05 23:30 ` Henk Slager 2016-07-05 4:36 ` Duncan 1 sibling, 1 reply; 16+ messages in thread From: Tomasz Kusmierz @ 2016-07-04 21:28 UTC (permalink / raw) To: Henk Slager; +Cc: linux-btrfs I did consider that, but: - some files were NOT accessed by anything with 100% certainty (well if there is a rootkit on my system or something in that shape than maybe yes) - the only application that could access those files is totem (well Nautilius checks extension -> directs it to totem) so in that case we would hear about out break of totem killing people files. - if it was a kernel bug then other large files would be affected. Maybe I’m wrong and it’s actually related to the fact that all those files are located in single location on file system (single folder) that might have a historical bug in some structure somewhere ? I did forgot to add that file system was created a long time ago and it was created with leaf & node size = 16k. (ps. this email client on OS X is driving me up the wall … have to correct the corrections all the time :/) > On 4 Jul 2016, at 22:13, Henk Slager <eye1tm@gmail.com> wrote: > > On Sun, Jul 3, 2016 at 1:36 AM, Tomasz Kusmierz <tom.kusmierz@gmail.com> wrote: >> Hi, >> >> My setup is that I use one file system for / and /home (on SSD) and a >> larger raid 10 for /mnt/share (6 x 2TB). >> >> Today I've discovered that 14 of files that are supposed to be over >> 2GB are in fact just 4096 bytes. I've checked the content of those 4KB >> and it seems that it does contain information that were at the >> beginnings of the files. >> >> I've experienced this problem in the past (3 - 4 years ago ?) but >> attributed it to different problem that I've spoke with you guys here >> about (corruption due to non ECC ram). At that time I did deleted >> files affected (56) and similar problem was discovered a year but not >> more than 2 years ago and I believe I've deleted the files. >> >> I periodically (once a month) run a scrub on my system to eliminate >> any errors sneaking in. I believe I did a balance a half a year ago ? >> to reclaim space after I deleted a large database. >> >> root@noname_server:/mnt/share# btrfs fi show >> Label: none uuid: 060c2345-5d2f-4965-b0a2-47ed2d1a5ba2 >> Total devices 1 FS bytes used 177.19GiB >> devid 3 size 899.22GiB used 360.06GiB path /dev/sde2 >> >> Label: none uuid: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1 >> Total devices 6 FS bytes used 4.02TiB >> devid 1 size 1.82TiB used 1.34TiB path /dev/sdg1 >> devid 2 size 1.82TiB used 1.34TiB path /dev/sdh1 >> devid 3 size 1.82TiB used 1.34TiB path /dev/sdi1 >> devid 4 size 1.82TiB used 1.34TiB path /dev/sdb1 >> devid 5 size 1.82TiB used 1.34TiB path /dev/sda1 >> devid 6 size 1.82TiB used 1.34TiB path /dev/sdf1 >> >> root@noname_server:/mnt/share# uname -a >> Linux noname_server 4.4.0-28-generic #47-Ubuntu SMP Fri Jun 24 >> 10:09:13 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux >> root@noname_server:/mnt/share# btrfs --version >> btrfs-progs v4.4 >> root@noname_server:/mnt/share# >> >> >> Problem is that stuff on this filesystem moves so slowly that it's >> hard to remember historical events ... it's like AWS glacier. 
What I >> can state with 100% certainty is that: >> - files that are affected are 2GB and over (safe to assume 4GB and over) >> - files affected were just read (and some not even read) never written >> after putting into storage >> - In the past I've assumed that files affected are due to size, but I >> have quite few ISO files some backups of virtual machines ... no >> problems there - seems like problem originates in one folder & size > >> 2GB & extension .mkv > > In case some application is the root cause of the issue, I would say > try to keep some ro snapshots done by a tool like snapper for example, > but maybe you do that already. It sounds also like this is some kernel > bug, snaphots won't help that much then I think. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: btrfs RAID 10 truncates files over 2G to 4096 bytes. 2016-07-04 21:28 ` Tomasz Kusmierz @ 2016-07-05 23:30 ` Henk Slager 2016-07-06 1:24 ` Tomasz Kusmierz [not found] ` <0EBF76CB-A350-4108-91EF-076A73932061@gmail.com> 0 siblings, 2 replies; 16+ messages in thread From: Henk Slager @ 2016-07-05 23:30 UTC (permalink / raw) To: Tomasz Kusmierz; +Cc: linux-btrfs On Mon, Jul 4, 2016 at 11:28 PM, Tomasz Kusmierz <tom.kusmierz@gmail.com> wrote: > I did consider that, but: > - some files were NOT accessed by anything with 100% certainty (well if there is a rootkit on my system or something in that shape than maybe yes) > - the only application that could access those files is totem (well Nautilius checks extension -> directs it to totem) so in that case we would hear about out break of totem killing people files. > - if it was a kernel bug then other large files would be affected. > > Maybe I’m wrong and it’s actually related to the fact that all those files are located in single location on file system (single folder) that might have a historical bug in some structure somewhere ? I find it hard to imagine that this has something to do with the folderstructure, unless maybe the folder is a subvolume with non-default attributes or so. How the files in that folder are created (at full disktransferspeed or during a day or even a week) might give some hint. You could run filefrag and see if that rings a bell. > I did forgot to add that file system was created a long time ago and it was created with leaf & node size = 16k. If this long time ago is >2 years then you have likely specifically set node size = 16k, otherwise with older tools it would have been 4K. Have you created it as raid10 or has it undergone profile conversions? It could also be that the ondisk format is somewhat corrupted (btrfs check should find that ) and that that causes the issue. In-lining on raid10 has caused me some trouble (I had 4k nodes) over time, it has happened over a year ago with kernels recent at that time, but the fs was converted from raid5. You might want to run the python scrips from here: https://github.com/knorrie/python-btrfs so that maybe you see how block-groups/chunks are filled etc. > (ps. this email client on OS X is driving me up the wall … have to correct the corrections all the time :/) > >> On 4 Jul 2016, at 22:13, Henk Slager <eye1tm@gmail.com> wrote: >> >> On Sun, Jul 3, 2016 at 1:36 AM, Tomasz Kusmierz <tom.kusmierz@gmail.com> wrote: >>> Hi, >>> >>> My setup is that I use one file system for / and /home (on SSD) and a >>> larger raid 10 for /mnt/share (6 x 2TB). >>> >>> Today I've discovered that 14 of files that are supposed to be over >>> 2GB are in fact just 4096 bytes. I've checked the content of those 4KB >>> and it seems that it does contain information that were at the >>> beginnings of the files. >>> >>> I've experienced this problem in the past (3 - 4 years ago ?) but >>> attributed it to different problem that I've spoke with you guys here >>> about (corruption due to non ECC ram). At that time I did deleted >>> files affected (56) and similar problem was discovered a year but not >>> more than 2 years ago and I believe I've deleted the files. >>> >>> I periodically (once a month) run a scrub on my system to eliminate >>> any errors sneaking in. I believe I did a balance a half a year ago ? >>> to reclaim space after I deleted a large database. 
>>> >>> root@noname_server:/mnt/share# btrfs fi show >>> Label: none uuid: 060c2345-5d2f-4965-b0a2-47ed2d1a5ba2 >>> Total devices 1 FS bytes used 177.19GiB >>> devid 3 size 899.22GiB used 360.06GiB path /dev/sde2 >>> >>> Label: none uuid: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1 >>> Total devices 6 FS bytes used 4.02TiB >>> devid 1 size 1.82TiB used 1.34TiB path /dev/sdg1 >>> devid 2 size 1.82TiB used 1.34TiB path /dev/sdh1 >>> devid 3 size 1.82TiB used 1.34TiB path /dev/sdi1 >>> devid 4 size 1.82TiB used 1.34TiB path /dev/sdb1 >>> devid 5 size 1.82TiB used 1.34TiB path /dev/sda1 >>> devid 6 size 1.82TiB used 1.34TiB path /dev/sdf1 >>> >>> root@noname_server:/mnt/share# uname -a >>> Linux noname_server 4.4.0-28-generic #47-Ubuntu SMP Fri Jun 24 >>> 10:09:13 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux >>> root@noname_server:/mnt/share# btrfs --version >>> btrfs-progs v4.4 >>> root@noname_server:/mnt/share# >>> >>> >>> Problem is that stuff on this filesystem moves so slowly that it's >>> hard to remember historical events ... it's like AWS glacier. What I >>> can state with 100% certainty is that: >>> - files that are affected are 2GB and over (safe to assume 4GB and over) >>> - files affected were just read (and some not even read) never written >>> after putting into storage >>> - In the past I've assumed that files affected are due to size, but I >>> have quite few ISO files some backups of virtual machines ... no >>> problems there - seems like problem originates in one folder & size > >>> 2GB & extension .mkv >> >> In case some application is the root cause of the issue, I would say >> try to keep some ro snapshots done by a tool like snapper for example, >> but maybe you do that already. It sounds also like this is some kernel >> bug, snaphots won't help that much then I think. > ^ permalink raw reply [flat|nested] 16+ messages in thread
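For completeness, the filefrag suggestion above can be made more informative with
the verbose flag, which prints every extent with its logical and physical offsets
(the path below is only an example):

filefrag -v /mnt/share/victim_folder/some_file.mkv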
* Re: btrfs RAID 10 truncates files over 2G to 4096 bytes. 2016-07-05 23:30 ` Henk Slager @ 2016-07-06 1:24 ` Tomasz Kusmierz [not found] ` <0EBF76CB-A350-4108-91EF-076A73932061@gmail.com> 1 sibling, 0 replies; 16+ messages in thread From: Tomasz Kusmierz @ 2016-07-06 1:24 UTC (permalink / raw) To: Henk Slager; +Cc: linux-btrfs On 6 Jul 2016, at 00:30, Henk Slager <eye1tm@gmail.com <mailto:eye1tm@gmail.com>> wrote: > > On Mon, Jul 4, 2016 at 11:28 PM, Tomasz Kusmierz <tom.kusmierz@gmail.com <mailto:tom.kusmierz@gmail.com>> wrote: >> I did consider that, but: >> - some files were NOT accessed by anything with 100% certainty (well if there is a rootkit on my system or something in that shape than maybe yes) >> - the only application that could access those files is totem (well Nautilius checks extension -> directs it to totem) so in that case we would hear about out break of totem killing people files. >> - if it was a kernel bug then other large files would be affected. >> >> Maybe I’m wrong and it’s actually related to the fact that all those files are located in single location on file system (single folder) that might have a historical bug in some structure somewhere ? > > I find it hard to imagine that this has something to do with the > folderstructure, unless maybe the folder is a subvolume with > non-default attributes or so. How the files in that folder are created > (at full disktransferspeed or during a day or even a week) might give > some hint. You could run filefrag and see if that rings a bell. files that are 4096 show: 1 extent found > >> I did forgot to add that file system was created a long time ago and it was created with leaf & node size = 16k. > > If this long time ago is >2 years then you have likely specifically > set node size = 16k, otherwise with older tools it would have been 4K. You are right I used -l 16K -n 16K > Have you created it as raid10 or has it undergone profile conversions? Due to lack of spare disks (it may sound odd for some but spending for more than 6 disks for home use seems like an overkill) and due to last I’ve had I had to migrate all data to new file system. This played that way that I’ve: 1. from original FS I’ve removed 2 disks 2. Created RAID1 on those 2 disks, 3. shifted 2TB 4. removed 2 disks from source FS and adde those to destination FS 5 shifted 2 further TB 6 destroyed original FS and adde 2 disks to destination FS 7 converted destination FS to RAID10 FYI, when I convert to raid 10 I use: btrfs balance start -mconvert=raid10 -dconvert=raid10 -sconvert=raid10 -f /path/to/FS this filesystem has 5 sub volumes. Files affected are located in separate folder within a “victim folder” that is within a one sub volume. > > It could also be that the ondisk format is somewhat corrupted (btrfs > check should find that ) and that that causes the issue. 
root@noname_server:/mnt# btrfs check /dev/sdg1 Checking filesystem on /dev/sdg1 UUID: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1 checking extents checking free space cache checking fs roots checking csums checking root refs found 4424060642634 bytes used err is 0 total csum bytes: 4315954936 total tree bytes: 4522786816 total fs tree bytes: 61702144 total extent tree bytes: 41402368 btree space waste bytes: 72430813 file data blocks allocated: 4475917217792 referenced 4420407603200 No luck there :/ > In-lining on raid10 has caused me some trouble (I had 4k nodes) over > time, it has happened over a year ago with kernels recent at that > time, but the fs was converted from raid5 Could you please elaborate on that ? you also ended up with files that got truncated to 4096 bytes ? > You might want to run the python scrips from here: > https://github.com/knorrie/python-btrfs <https://github.com/knorrie/python-btrfs> Will do. > so that maybe you see how block-groups/chunks are filled etc. > >> (ps. this email client on OS X is driving me up the wall … have to correct the corrections all the time :/) >> >>> On 4 Jul 2016, at 22:13, Henk Slager <eye1tm@gmail.com <mailto:eye1tm@gmail.com>> wrote: >>> >>> On Sun, Jul 3, 2016 at 1:36 AM, Tomasz Kusmierz <tom.kusmierz@gmail.com <mailto:tom.kusmierz@gmail.com>> wrote: >>>> Hi, >>>> >>>> My setup is that I use one file system for / and /home (on SSD) and a >>>> larger raid 10 for /mnt/share (6 x 2TB). >>>> >>>> Today I've discovered that 14 of files that are supposed to be over >>>> 2GB are in fact just 4096 bytes. I've checked the content of those 4KB >>>> and it seems that it does contain information that were at the >>>> beginnings of the files. >>>> >>>> I've experienced this problem in the past (3 - 4 years ago ?) but >>>> attributed it to different problem that I've spoke with you guys here >>>> about (corruption due to non ECC ram). At that time I did deleted >>>> files affected (56) and similar problem was discovered a year but not >>>> more than 2 years ago and I believe I've deleted the files. >>>> >>>> I periodically (once a month) run a scrub on my system to eliminate >>>> any errors sneaking in. I believe I did a balance a half a year ago ? >>>> to reclaim space after I deleted a large database. >>>> >>>> root@noname_server:/mnt/share# btrfs fi show >>>> Label: none uuid: 060c2345-5d2f-4965-b0a2-47ed2d1a5ba2 >>>> Total devices 1 FS bytes used 177.19GiB >>>> devid 3 size 899.22GiB used 360.06GiB path /dev/sde2 >>>> >>>> Label: none uuid: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1 >>>> Total devices 6 FS bytes used 4.02TiB >>>> devid 1 size 1.82TiB used 1.34TiB path /dev/sdg1 >>>> devid 2 size 1.82TiB used 1.34TiB path /dev/sdh1 >>>> devid 3 size 1.82TiB used 1.34TiB path /dev/sdi1 >>>> devid 4 size 1.82TiB used 1.34TiB path /dev/sdb1 >>>> devid 5 size 1.82TiB used 1.34TiB path /dev/sda1 >>>> devid 6 size 1.82TiB used 1.34TiB path /dev/sdf1 >>>> >>>> root@noname_server:/mnt/share# uname -a >>>> Linux noname_server 4.4.0-28-generic #47-Ubuntu SMP Fri Jun 24 >>>> 10:09:13 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux >>>> root@noname_server:/mnt/share# btrfs --version >>>> btrfs-progs v4.4 >>>> root@noname_server:/mnt/share# >>>> >>>> >>>> Problem is that stuff on this filesystem moves so slowly that it's >>>> hard to remember historical events ... it's like AWS glacier. 
What I >>>> can state with 100% certainty is that: >>>> - files that are affected are 2GB and over (safe to assume 4GB and over) >>>> - files affected were just read (and some not even read) never written >>>> after putting into storage >>>> - In the past I've assumed that files affected are due to size, but I >>>> have quite few ISO files some backups of virtual machines ... no >>>> problems there - seems like problem originates in one folder & size > >>>> 2GB & extension .mkv >>> >>> In case some application is the root cause of the issue, I would say >>> try to keep some ro snapshots done by a tool like snapper for example, >>> but maybe you do that already. It sounds also like this is some kernel >>> bug, snaphots won't help that much then I think. ^ permalink raw reply [flat|nested] 16+ messages in thread
[parent not found: <0EBF76CB-A350-4108-91EF-076A73932061@gmail.com>]
* Re: btrfs RAID 10 truncates files over 2G to 4096 bytes. [not found] ` <0EBF76CB-A350-4108-91EF-076A73932061@gmail.com> @ 2016-07-06 1:25 ` Henk Slager 2016-07-06 12:20 ` Tomasz Kusmierz 0 siblings, 1 reply; 16+ messages in thread From: Henk Slager @ 2016-07-06 1:25 UTC (permalink / raw) To: Tomasz Kusmierz; +Cc: linux-btrfs On Wed, Jul 6, 2016 at 2:32 AM, Tomasz Kusmierz <tom.kusmierz@gmail.com> wrote: > > On 6 Jul 2016, at 00:30, Henk Slager <eye1tm@gmail.com> wrote: > > On Mon, Jul 4, 2016 at 11:28 PM, Tomasz Kusmierz <tom.kusmierz@gmail.com> > wrote: > > I did consider that, but: > - some files were NOT accessed by anything with 100% certainty (well if > there is a rootkit on my system or something in that shape than maybe yes) > - the only application that could access those files is totem (well > Nautilius checks extension -> directs it to totem) so in that case we would > hear about out break of totem killing people files. > - if it was a kernel bug then other large files would be affected. > > Maybe I’m wrong and it’s actually related to the fact that all those files > are located in single location on file system (single folder) that might > have a historical bug in some structure somewhere ? > > > I find it hard to imagine that this has something to do with the > folderstructure, unless maybe the folder is a subvolume with > non-default attributes or so. How the files in that folder are created > (at full disktransferspeed or during a day or even a week) might give > some hint. You could run filefrag and see if that rings a bell. > > files that are 4096 show: > 1 extent found I actually meant filefrag for the files that are not (yet) truncated to 4k. For example for virtual machine imagefiles (CoW), one could see an MBR write. > I did forgot to add that file system was created a long time ago and it was > created with leaf & node size = 16k. > > > If this long time ago is >2 years then you have likely specifically > set node size = 16k, otherwise with older tools it would have been 4K. > > You are right I used -l 16K -n 16K > > Have you created it as raid10 or has it undergone profile conversions? > > Due to lack of spare disks > (it may sound odd for some but spending for more than 6 disks for home use > seems like an overkill) > and due to last I’ve had I had to migrate all data to new file system. > This played that way that I’ve: > 1. from original FS I’ve removed 2 disks > 2. Created RAID1 on those 2 disks, > 3. shifted 2TB > 4. removed 2 disks from source FS and adde those to destination FS > 5 shifted 2 further TB > 6 destroyed original FS and adde 2 disks to destination FS > 7 converted destination FS to RAID10 > > FYI, when I convert to raid 10 I use: > btrfs balance start -mconvert=raid10 -dconvert=raid10 -sconvert=raid10 -f > /path/to/FS > > this filesystem has 5 sub volumes. Files affected are located in separate > folder within a “victim folder” that is within a one sub volume. > > > It could also be that the ondisk format is somewhat corrupted (btrfs > check should find that ) and that that causes the issue. 
> > > root@noname_server:/mnt# btrfs check /dev/sdg1 > Checking filesystem on /dev/sdg1 > UUID: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1 > checking extents > checking free space cache > checking fs roots > checking csums > checking root refs > found 4424060642634 bytes used err is 0 > total csum bytes: 4315954936 > total tree bytes: 4522786816 > total fs tree bytes: 61702144 > total extent tree bytes: 41402368 > btree space waste bytes: 72430813 > file data blocks allocated: 4475917217792 > referenced 4420407603200 > > No luck there :/ Indeed looks all normal. > In-lining on raid10 has caused me some trouble (I had 4k nodes) over > time, it has happened over a year ago with kernels recent at that > time, but the fs was converted from raid5 > > Could you please elaborate on that ? you also ended up with files that got > truncated to 4096 bytes ? I did not have truncated to 4k files, but your case lets me think of small files inlining. Default max_inline mount option is 8k and that means that 0 to ~3k files end up in metadata. I had size corruptions for several of those small sized files that were updated quite frequent, also within commit time AFAIK. Btrfs check lists this as errors 400, although fs operation is not disturbed. I don't know what happens if those small files are being updated/rewritten and are just below or just above the max_inline limit. The only thing I was thinking of is that your files were started as small, so inline, then extended to multi-GB. In the past, there were 'bad extent/chunk type' issues and it was suggested that the fs would have been an ext4-converted one (which had non-compliant mixed metadata and data) but for most it was not the case. So there was/is something unclear, but full balance or so fixed it as far as I remember. But it is guessing, I do not have any failure cases like the one you see. > You might want to run the python scrips from here: > https://github.com/knorrie/python-btrfs > > Will do. > > so that maybe you see how block-groups/chunks are filled etc. > > (ps. this email client on OS X is driving me up the wall … have to correct > the corrections all the time :/) > > On 4 Jul 2016, at 22:13, Henk Slager <eye1tm@gmail.com> wrote: > > On Sun, Jul 3, 2016 at 1:36 AM, Tomasz Kusmierz <tom.kusmierz@gmail.com> > wrote: > > Hi, > > My setup is that I use one file system for / and /home (on SSD) and a > larger raid 10 for /mnt/share (6 x 2TB). > > Today I've discovered that 14 of files that are supposed to be over > 2GB are in fact just 4096 bytes. I've checked the content of those 4KB > and it seems that it does contain information that were at the > beginnings of the files. > > I've experienced this problem in the past (3 - 4 years ago ?) but > attributed it to different problem that I've spoke with you guys here > about (corruption due to non ECC ram). At that time I did deleted > files affected (56) and similar problem was discovered a year but not > more than 2 years ago and I believe I've deleted the files. > > I periodically (once a month) run a scrub on my system to eliminate > any errors sneaking in. I believe I did a balance a half a year ago ? > to reclaim space after I deleted a large database. 
> > root@noname_server:/mnt/share# btrfs fi show > Label: none uuid: 060c2345-5d2f-4965-b0a2-47ed2d1a5ba2 > Total devices 1 FS bytes used 177.19GiB > devid 3 size 899.22GiB used 360.06GiB path /dev/sde2 > > Label: none uuid: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1 > Total devices 6 FS bytes used 4.02TiB > devid 1 size 1.82TiB used 1.34TiB path /dev/sdg1 > devid 2 size 1.82TiB used 1.34TiB path /dev/sdh1 > devid 3 size 1.82TiB used 1.34TiB path /dev/sdi1 > devid 4 size 1.82TiB used 1.34TiB path /dev/sdb1 > devid 5 size 1.82TiB used 1.34TiB path /dev/sda1 > devid 6 size 1.82TiB used 1.34TiB path /dev/sdf1 > > root@noname_server:/mnt/share# uname -a > Linux noname_server 4.4.0-28-generic #47-Ubuntu SMP Fri Jun 24 > 10:09:13 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux > root@noname_server:/mnt/share# btrfs --version > btrfs-progs v4.4 > root@noname_server:/mnt/share# > > > Problem is that stuff on this filesystem moves so slowly that it's > hard to remember historical events ... it's like AWS glacier. What I > can state with 100% certainty is that: > - files that are affected are 2GB and over (safe to assume 4GB and over) > - files affected were just read (and some not even read) never written > after putting into storage > - In the past I've assumed that files affected are due to size, but I > have quite few ISO files some backups of virtual machines ... no > problems there - seems like problem originates in one folder & size > > 2GB & extension .mkv > > > In case some application is the root cause of the issue, I would say > try to keep some ro snapshots done by a tool like snapper for example, > but maybe you do that already. It sounds also like this is some kernel > bug, snaphots won't help that much then I think. > > > ^ permalink raw reply [flat|nested] 16+ messages in thread
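If small-file inlining were suspected, the limit mentioned above can be inspected
and changed through the max_inline mount option; a sketch, assuming the option is
accepted on remount (setting it to 0 disables inlining of file data into metadata
entirely):

grep /mnt/share /proc/mounts              # non-default options such as max_inline show up here
mount -o remount,max_inline=0 /mnt/share  # stop inlining small files from now on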
* Re: btrfs RAID 10 truncates files over 2G to 4096 bytes. 2016-07-06 1:25 ` Henk Slager @ 2016-07-06 12:20 ` Tomasz Kusmierz 2016-07-06 21:41 ` Henk Slager 2016-07-06 23:22 ` Kai Krakow 0 siblings, 2 replies; 16+ messages in thread From: Tomasz Kusmierz @ 2016-07-06 12:20 UTC (permalink / raw) To: Henk Slager; +Cc: linux-btrfs > On 6 Jul 2016, at 02:25, Henk Slager <eye1tm@gmail.com> wrote: > > On Wed, Jul 6, 2016 at 2:32 AM, Tomasz Kusmierz <tom.kusmierz@gmail.com> wrote: >> >> On 6 Jul 2016, at 00:30, Henk Slager <eye1tm@gmail.com> wrote: >> >> On Mon, Jul 4, 2016 at 11:28 PM, Tomasz Kusmierz <tom.kusmierz@gmail.com> >> wrote: >> >> I did consider that, but: >> - some files were NOT accessed by anything with 100% certainty (well if >> there is a rootkit on my system or something in that shape than maybe yes) >> - the only application that could access those files is totem (well >> Nautilius checks extension -> directs it to totem) so in that case we would >> hear about out break of totem killing people files. >> - if it was a kernel bug then other large files would be affected. >> >> Maybe I’m wrong and it’s actually related to the fact that all those files >> are located in single location on file system (single folder) that might >> have a historical bug in some structure somewhere ? >> >> >> I find it hard to imagine that this has something to do with the >> folderstructure, unless maybe the folder is a subvolume with >> non-default attributes or so. How the files in that folder are created >> (at full disktransferspeed or during a day or even a week) might give >> some hint. You could run filefrag and see if that rings a bell. >> >> files that are 4096 show: >> 1 extent found > > I actually meant filefrag for the files that are not (yet) truncated > to 4k. For example for virtual machine imagefiles (CoW), one could see > an MBR write. 117 extents found filesize 15468645003 good / bad ? > >> I did forgot to add that file system was created a long time ago and it was >> created with leaf & node size = 16k. >> >> >> If this long time ago is >2 years then you have likely specifically >> set node size = 16k, otherwise with older tools it would have been 4K. >> >> You are right I used -l 16K -n 16K >> >> Have you created it as raid10 or has it undergone profile conversions? >> >> Due to lack of spare disks >> (it may sound odd for some but spending for more than 6 disks for home use >> seems like an overkill) >> and due to last I’ve had I had to migrate all data to new file system. >> This played that way that I’ve: >> 1. from original FS I’ve removed 2 disks >> 2. Created RAID1 on those 2 disks, >> 3. shifted 2TB >> 4. removed 2 disks from source FS and adde those to destination FS >> 5 shifted 2 further TB >> 6 destroyed original FS and adde 2 disks to destination FS >> 7 converted destination FS to RAID10 >> >> FYI, when I convert to raid 10 I use: >> btrfs balance start -mconvert=raid10 -dconvert=raid10 -sconvert=raid10 -f >> /path/to/FS >> >> this filesystem has 5 sub volumes. Files affected are located in separate >> folder within a “victim folder” that is within a one sub volume. >> >> >> It could also be that the ondisk format is somewhat corrupted (btrfs >> check should find that ) and that that causes the issue. 
>> >> >> root@noname_server:/mnt# btrfs check /dev/sdg1 >> Checking filesystem on /dev/sdg1 >> UUID: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1 >> checking extents >> checking free space cache >> checking fs roots >> checking csums >> checking root refs >> found 4424060642634 bytes used err is 0 >> total csum bytes: 4315954936 >> total tree bytes: 4522786816 >> total fs tree bytes: 61702144 >> total extent tree bytes: 41402368 >> btree space waste bytes: 72430813 >> file data blocks allocated: 4475917217792 >> referenced 4420407603200 >> >> No luck there :/ > > Indeed looks all normal. > >> In-lining on raid10 has caused me some trouble (I had 4k nodes) over >> time, it has happened over a year ago with kernels recent at that >> time, but the fs was converted from raid5 >> >> Could you please elaborate on that ? you also ended up with files that got >> truncated to 4096 bytes ? > > I did not have truncated to 4k files, but your case lets me think of > small files inlining. Default max_inline mount option is 8k and that > means that 0 to ~3k files end up in metadata. I had size corruptions > for several of those small sized files that were updated quite > frequent, also within commit time AFAIK. Btrfs check lists this as > errors 400, although fs operation is not disturbed. I don't know what > happens if those small files are being updated/rewritten and are just > below or just above the max_inline limit. > > The only thing I was thinking of is that your files were started as > small, so inline, then extended to multi-GB. In the past, there were > 'bad extent/chunk type' issues and it was suggested that the fs would > have been an ext4-converted one (which had non-compliant mixed > metadata and data) but for most it was not the case. So there was/is > something unclear, but full balance or so fixed it as far as I > remember. But it is guessing, I do not have any failure cases like the > one you see. When I think of it, I did move this folder first when filesystem was RAID 1 (or not even RAID at all) and then it was upgraded to RAID 1 then RAID 10. Was there a faulty balance around August 2014 ? Please remember that I’m using Ubuntu so it was probably kernel from Ubuntu 14.04 LTS Also, I would like to hear it from horses mouth: dos & donts for a long term storage where you moderately care about the data: RAID10 - flaky ? would RAID1 give similar performance ? leaf & node size = 16k - pointless / flaky / untested / phased out ? growing FS: add disks and rebalance and then change to different RAID level or it doesn’t matter ?! RAID level on system data - am I an idiot to just even touch it ? > >> You might want to run the python scrips from here: >> https://github.com/knorrie/python-btrfs >> >> Will do. >> >> so that maybe you see how block-groups/chunks are filled etc. >> >> (ps. this email client on OS X is driving me up the wall … have to correct >> the corrections all the time :/) >> >> On 4 Jul 2016, at 22:13, Henk Slager <eye1tm@gmail.com> wrote: >> >> On Sun, Jul 3, 2016 at 1:36 AM, Tomasz Kusmierz <tom.kusmierz@gmail.com> >> wrote: >> >> Hi, >> >> My setup is that I use one file system for / and /home (on SSD) and a >> larger raid 10 for /mnt/share (6 x 2TB). >> >> Today I've discovered that 14 of files that are supposed to be over >> 2GB are in fact just 4096 bytes. I've checked the content of those 4KB >> and it seems that it does contain information that were at the >> beginnings of the files. >> >> I've experienced this problem in the past (3 - 4 years ago ?) 
but >> attributed it to different problem that I've spoke with you guys here >> about (corruption due to non ECC ram). At that time I did deleted >> files affected (56) and similar problem was discovered a year but not >> more than 2 years ago and I believe I've deleted the files. >> >> I periodically (once a month) run a scrub on my system to eliminate >> any errors sneaking in. I believe I did a balance a half a year ago ? >> to reclaim space after I deleted a large database. >> >> root@noname_server:/mnt/share# btrfs fi show >> Label: none uuid: 060c2345-5d2f-4965-b0a2-47ed2d1a5ba2 >> Total devices 1 FS bytes used 177.19GiB >> devid 3 size 899.22GiB used 360.06GiB path /dev/sde2 >> >> Label: none uuid: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1 >> Total devices 6 FS bytes used 4.02TiB >> devid 1 size 1.82TiB used 1.34TiB path /dev/sdg1 >> devid 2 size 1.82TiB used 1.34TiB path /dev/sdh1 >> devid 3 size 1.82TiB used 1.34TiB path /dev/sdi1 >> devid 4 size 1.82TiB used 1.34TiB path /dev/sdb1 >> devid 5 size 1.82TiB used 1.34TiB path /dev/sda1 >> devid 6 size 1.82TiB used 1.34TiB path /dev/sdf1 >> >> root@noname_server:/mnt/share# uname -a >> Linux noname_server 4.4.0-28-generic #47-Ubuntu SMP Fri Jun 24 >> 10:09:13 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux >> root@noname_server:/mnt/share# btrfs --version >> btrfs-progs v4.4 >> root@noname_server:/mnt/share# >> >> >> Problem is that stuff on this filesystem moves so slowly that it's >> hard to remember historical events ... it's like AWS glacier. What I >> can state with 100% certainty is that: >> - files that are affected are 2GB and over (safe to assume 4GB and over) >> - files affected were just read (and some not even read) never written >> after putting into storage >> - In the past I've assumed that files affected are due to size, but I >> have quite few ISO files some backups of virtual machines ... no >> problems there - seems like problem originates in one folder & size > >> 2GB & extension .mkv >> >> >> In case some application is the root cause of the issue, I would say >> try to keep some ro snapshots done by a tool like snapper for example, >> but maybe you do that already. It sounds also like this is some kernel >> bug, snaphots won't help that much then I think. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: btrfs RAID 10 truncates files over 2G to 4096 bytes. 2016-07-06 12:20 ` Tomasz Kusmierz @ 2016-07-06 21:41 ` Henk Slager 2016-07-06 22:16 ` Tomasz Kusmierz 2016-07-06 23:22 ` Kai Krakow 1 sibling, 1 reply; 16+ messages in thread From: Henk Slager @ 2016-07-06 21:41 UTC (permalink / raw) To: Tomasz Kusmierz; +Cc: linux-btrfs On Wed, Jul 6, 2016 at 2:20 PM, Tomasz Kusmierz <tom.kusmierz@gmail.com> wrote: > >> On 6 Jul 2016, at 02:25, Henk Slager <eye1tm@gmail.com> wrote: >> >> On Wed, Jul 6, 2016 at 2:32 AM, Tomasz Kusmierz <tom.kusmierz@gmail.com> wrote: >>> >>> On 6 Jul 2016, at 00:30, Henk Slager <eye1tm@gmail.com> wrote: >>> >>> On Mon, Jul 4, 2016 at 11:28 PM, Tomasz Kusmierz <tom.kusmierz@gmail.com> >>> wrote: >>> >>> I did consider that, but: >>> - some files were NOT accessed by anything with 100% certainty (well if >>> there is a rootkit on my system or something in that shape than maybe yes) >>> - the only application that could access those files is totem (well >>> Nautilius checks extension -> directs it to totem) so in that case we would >>> hear about out break of totem killing people files. >>> - if it was a kernel bug then other large files would be affected. >>> >>> Maybe I’m wrong and it’s actually related to the fact that all those files >>> are located in single location on file system (single folder) that might >>> have a historical bug in some structure somewhere ? >>> >>> >>> I find it hard to imagine that this has something to do with the >>> folderstructure, unless maybe the folder is a subvolume with >>> non-default attributes or so. How the files in that folder are created >>> (at full disktransferspeed or during a day or even a week) might give >>> some hint. You could run filefrag and see if that rings a bell. >>> >>> files that are 4096 show: >>> 1 extent found >> >> I actually meant filefrag for the files that are not (yet) truncated >> to 4k. For example for virtual machine imagefiles (CoW), one could see >> an MBR write. > 117 extents found > filesize 15468645003 > > good / bad ? 117 extents for a 1.5G file is fine, with -v option you could see the fragmentation at the start, but this won't lead to any hint why you have the truncate issue. >>> I did forgot to add that file system was created a long time ago and it was >>> created with leaf & node size = 16k. >>> >>> >>> If this long time ago is >2 years then you have likely specifically >>> set node size = 16k, otherwise with older tools it would have been 4K. >>> >>> You are right I used -l 16K -n 16K >>> >>> Have you created it as raid10 or has it undergone profile conversions? >>> >>> Due to lack of spare disks >>> (it may sound odd for some but spending for more than 6 disks for home use >>> seems like an overkill) >>> and due to last I’ve had I had to migrate all data to new file system. >>> This played that way that I’ve: >>> 1. from original FS I’ve removed 2 disks >>> 2. Created RAID1 on those 2 disks, >>> 3. shifted 2TB >>> 4. removed 2 disks from source FS and adde those to destination FS >>> 5 shifted 2 further TB >>> 6 destroyed original FS and adde 2 disks to destination FS >>> 7 converted destination FS to RAID10 >>> >>> FYI, when I convert to raid 10 I use: >>> btrfs balance start -mconvert=raid10 -dconvert=raid10 -sconvert=raid10 -f >>> /path/to/FS >>> >>> this filesystem has 5 sub volumes. Files affected are located in separate >>> folder within a “victim folder” that is within a one sub volume. 
>>> >>> >>> It could also be that the ondisk format is somewhat corrupted (btrfs >>> check should find that ) and that that causes the issue. >>> >>> >>> root@noname_server:/mnt# btrfs check /dev/sdg1 >>> Checking filesystem on /dev/sdg1 >>> UUID: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1 >>> checking extents >>> checking free space cache >>> checking fs roots >>> checking csums >>> checking root refs >>> found 4424060642634 bytes used err is 0 >>> total csum bytes: 4315954936 >>> total tree bytes: 4522786816 >>> total fs tree bytes: 61702144 >>> total extent tree bytes: 41402368 >>> btree space waste bytes: 72430813 >>> file data blocks allocated: 4475917217792 >>> referenced 4420407603200 >>> >>> No luck there :/ >> >> Indeed looks all normal. >> >>> In-lining on raid10 has caused me some trouble (I had 4k nodes) over >>> time, it has happened over a year ago with kernels recent at that >>> time, but the fs was converted from raid5 >>> >>> Could you please elaborate on that ? you also ended up with files that got >>> truncated to 4096 bytes ? >> >> I did not have truncated to 4k files, but your case lets me think of >> small files inlining. Default max_inline mount option is 8k and that >> means that 0 to ~3k files end up in metadata. I had size corruptions >> for several of those small sized files that were updated quite >> frequent, also within commit time AFAIK. Btrfs check lists this as >> errors 400, although fs operation is not disturbed. I don't know what >> happens if those small files are being updated/rewritten and are just >> below or just above the max_inline limit. >> >> The only thing I was thinking of is that your files were started as >> small, so inline, then extended to multi-GB. In the past, there were >> 'bad extent/chunk type' issues and it was suggested that the fs would >> have been an ext4-converted one (which had non-compliant mixed >> metadata and data) but for most it was not the case. So there was/is >> something unclear, but full balance or so fixed it as far as I >> remember. But it is guessing, I do not have any failure cases like the >> one you see. > > When I think of it, I did move this folder first when filesystem was RAID 1 (or not even RAID at all) and then it was upgraded to RAID 1 then RAID 10. > Was there a faulty balance around August 2014 ? Please remember that I’m using Ubuntu so it was probably kernel from Ubuntu 14.04 LTS All those conversions should work, many people like yourself here on the ML do this. However, as you say, you use Ubuntu 14.04 LTS which has 3.13 base I see on distrowatch. What patches Canonical did add to that version, how they match with the many kernel.org patches over the last 2 years and when/if you upgraded the kernel, is what you would have to get clear for yourself in order have a chance to come to a reproducible case. And even then, the request will be to compile and/or install a kernel.org version. > Also, I would like to hear it from horses mouth: dos & donts for a long term storage where you moderately care about the data: 'moderately care about the data' is not of interest for btrfs-developers paid by commercial companies IMHO, lets see what happens... > RAID10 - flaky ? would RAID1 give similar performance ? I personally have not lost any data when using btrfs raid10 and I also can't remember any report w.r.t. on this ML. I choose raid10 over raid1 as I planned/had to use 4 HDD's anyhow and then raid10 at least reads from 2 devices so that Gbps ethernet is almost always saturated. 
That is what I had with XFS with 2 disk raid0. The troubles I mentioned w.r.t. small files must have been a leftover from when that fs was btrfs raid5. Also the 2 file corruptions I have ever seen were inside multi-GB (VM) images and from btrfs raid5 times. I converted to raid10 in summer 2015 (kernel 4.1.6) and the 1st scrub after that corrected several errors. I did several add, delete, dd of disks etc after that but no dataloss. I must say that I have been using mostly self-compiled mainline/stable kernel.org kernels as my distrobase was 3.11 and that version could do raid5 only as a sort of 2-disk raid0. > leaf & node size = 16k - pointless / flaky / untested / phased out ? This 16k is the default since a year or so, it was 4k. You can find the (performance) reasoning by C.Mason in this ML. So you took the right decision 2 years ago. I recently re-created the from-raid5-converted-raid10 fs to a new raid10 fs for 16k node-size. The 4k fs with quite some snapshot and heavy fragmentation was fast enough because of 300G SSD blockcaching, but I wanted to use SSD storage a bit more efficient. > growing FS: add disks and rebalance and then change to different RAID level or it doesn’t matter ?! With raid56 there are issues, but for other profiles I personally have no doubts, also looking at this ML. Things like replacing a running rootfs partition on SSD to a 3 HDD btrfs raid1+single works I can say. > RAID level on system data - am I an idiot to just even touch it ? You can even balance a 32M system chunk part of a raid1 to another device, so no issue I would say. >>> You might want to run the python scrips from here: >>> https://github.com/knorrie/python-btrfs >>> >>> Will do. >>> >>> so that maybe you see how block-groups/chunks are filled etc. >>> >>> (ps. this email client on OS X is driving me up the wall … have to correct >>> the corrections all the time :/) >>> >>> On 4 Jul 2016, at 22:13, Henk Slager <eye1tm@gmail.com> wrote: >>> >>> On Sun, Jul 3, 2016 at 1:36 AM, Tomasz Kusmierz <tom.kusmierz@gmail.com> >>> wrote: >>> >>> Hi, >>> >>> My setup is that I use one file system for / and /home (on SSD) and a >>> larger raid 10 for /mnt/share (6 x 2TB). >>> >>> Today I've discovered that 14 of files that are supposed to be over >>> 2GB are in fact just 4096 bytes. I've checked the content of those 4KB >>> and it seems that it does contain information that were at the >>> beginnings of the files. >>> >>> I've experienced this problem in the past (3 - 4 years ago ?) but >>> attributed it to different problem that I've spoke with you guys here >>> about (corruption due to non ECC ram). At that time I did deleted >>> files affected (56) and similar problem was discovered a year but not >>> more than 2 years ago and I believe I've deleted the files. >>> >>> I periodically (once a month) run a scrub on my system to eliminate >>> any errors sneaking in. I believe I did a balance a half a year ago ? >>> to reclaim space after I deleted a large database. 
>>> >>> root@noname_server:/mnt/share# btrfs fi show >>> Label: none uuid: 060c2345-5d2f-4965-b0a2-47ed2d1a5ba2 >>> Total devices 1 FS bytes used 177.19GiB >>> devid 3 size 899.22GiB used 360.06GiB path /dev/sde2 >>> >>> Label: none uuid: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1 >>> Total devices 6 FS bytes used 4.02TiB >>> devid 1 size 1.82TiB used 1.34TiB path /dev/sdg1 >>> devid 2 size 1.82TiB used 1.34TiB path /dev/sdh1 >>> devid 3 size 1.82TiB used 1.34TiB path /dev/sdi1 >>> devid 4 size 1.82TiB used 1.34TiB path /dev/sdb1 >>> devid 5 size 1.82TiB used 1.34TiB path /dev/sda1 >>> devid 6 size 1.82TiB used 1.34TiB path /dev/sdf1 >>> >>> root@noname_server:/mnt/share# uname -a >>> Linux noname_server 4.4.0-28-generic #47-Ubuntu SMP Fri Jun 24 >>> 10:09:13 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux >>> root@noname_server:/mnt/share# btrfs --version >>> btrfs-progs v4.4 >>> root@noname_server:/mnt/share# >>> >>> >>> Problem is that stuff on this filesystem moves so slowly that it's >>> hard to remember historical events ... it's like AWS glacier. What I >>> can state with 100% certainty is that: >>> - files that are affected are 2GB and over (safe to assume 4GB and over) >>> - files affected were just read (and some not even read) never written >>> after putting into storage >>> - In the past I've assumed that files affected are due to size, but I >>> have quite few ISO files some backups of virtual machines ... no >>> problems there - seems like problem originates in one folder & size > >>> 2GB & extension .mkv >>> >>> >>> In case some application is the root cause of the issue, I would say >>> try to keep some ro snapshots done by a tool like snapper for example, >>> but maybe you do that already. It sounds also like this is some kernel >>> bug, snaphots won't help that much then I think. > ^ permalink raw reply [flat|nested] 16+ messages in thread
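As a concrete form of the system-chunk point above, a filtered balance that
converts only metadata and system chunks, skipping chunks already in the target
profile via the 'soft' modifier, could look something like this (raid1 as the
target profile is just an example; -f is required whenever -sconvert is given):

btrfs balance start -mconvert=raid1,soft -sconvert=raid1,soft -f /mnt/share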
* Re: btrfs RAID 10 truncates files over 2G to 4096 bytes. 2016-07-06 21:41 ` Henk Slager @ 2016-07-06 22:16 ` Tomasz Kusmierz 0 siblings, 0 replies; 16+ messages in thread From: Tomasz Kusmierz @ 2016-07-06 22:16 UTC (permalink / raw) To: Henk Slager; +Cc: linux-btrfs > On 6 Jul 2016, at 22:41, Henk Slager <eye1tm@gmail.com> wrote: > > On Wed, Jul 6, 2016 at 2:20 PM, Tomasz Kusmierz <tom.kusmierz@gmail.com> wrote: >> >>> On 6 Jul 2016, at 02:25, Henk Slager <eye1tm@gmail.com> wrote: >>> >>> On Wed, Jul 6, 2016 at 2:32 AM, Tomasz Kusmierz <tom.kusmierz@gmail.com> wrote: >>>> >>>> On 6 Jul 2016, at 00:30, Henk Slager <eye1tm@gmail.com> wrote: >>>> >>>> On Mon, Jul 4, 2016 at 11:28 PM, Tomasz Kusmierz <tom.kusmierz@gmail.com> >>>> wrote: >>>> >>>> I did consider that, but: >>>> - some files were NOT accessed by anything with 100% certainty (well if >>>> there is a rootkit on my system or something in that shape than maybe yes) >>>> - the only application that could access those files is totem (well >>>> Nautilius checks extension -> directs it to totem) so in that case we would >>>> hear about out break of totem killing people files. >>>> - if it was a kernel bug then other large files would be affected. >>>> >>>> Maybe I’m wrong and it’s actually related to the fact that all those files >>>> are located in single location on file system (single folder) that might >>>> have a historical bug in some structure somewhere ? >>>> >>>> >>>> I find it hard to imagine that this has something to do with the >>>> folderstructure, unless maybe the folder is a subvolume with >>>> non-default attributes or so. How the files in that folder are created >>>> (at full disktransferspeed or during a day or even a week) might give >>>> some hint. You could run filefrag and see if that rings a bell. >>>> >>>> files that are 4096 show: >>>> 1 extent found >>> >>> I actually meant filefrag for the files that are not (yet) truncated >>> to 4k. For example for virtual machine imagefiles (CoW), one could see >>> an MBR write. >> 117 extents found >> filesize 15468645003 >> >> good / bad ? > > 117 extents for a 1.5G file is fine, with -v option you could see the > fragmentation at the start, but this won't lead to any hint why you > have the truncate issue. > >>>> I did forgot to add that file system was created a long time ago and it was >>>> created with leaf & node size = 16k. >>>> >>>> >>>> If this long time ago is >2 years then you have likely specifically >>>> set node size = 16k, otherwise with older tools it would have been 4K. >>>> >>>> You are right I used -l 16K -n 16K >>>> >>>> Have you created it as raid10 or has it undergone profile conversions? >>>> >>>> Due to lack of spare disks >>>> (it may sound odd for some but spending for more than 6 disks for home use >>>> seems like an overkill) >>>> and due to last I’ve had I had to migrate all data to new file system. >>>> This played that way that I’ve: >>>> 1. from original FS I’ve removed 2 disks >>>> 2. Created RAID1 on those 2 disks, >>>> 3. shifted 2TB >>>> 4. removed 2 disks from source FS and adde those to destination FS >>>> 5 shifted 2 further TB >>>> 6 destroyed original FS and adde 2 disks to destination FS >>>> 7 converted destination FS to RAID10 >>>> >>>> FYI, when I convert to raid 10 I use: >>>> btrfs balance start -mconvert=raid10 -dconvert=raid10 -sconvert=raid10 -f >>>> /path/to/FS >>>> >>>> this filesystem has 5 sub volumes. Files affected are located in separate >>>> folder within a “victim folder” that is within a one sub volume. 
>>>> >>>> >>>> It could also be that the ondisk format is somewhat corrupted (btrfs >>>> check should find that ) and that that causes the issue. >>>> >>>> >>>> root@noname_server:/mnt# btrfs check /dev/sdg1 >>>> Checking filesystem on /dev/sdg1 >>>> UUID: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1 >>>> checking extents >>>> checking free space cache >>>> checking fs roots >>>> checking csums >>>> checking root refs >>>> found 4424060642634 bytes used err is 0 >>>> total csum bytes: 4315954936 >>>> total tree bytes: 4522786816 >>>> total fs tree bytes: 61702144 >>>> total extent tree bytes: 41402368 >>>> btree space waste bytes: 72430813 >>>> file data blocks allocated: 4475917217792 >>>> referenced 4420407603200 >>>> >>>> No luck there :/ >>> >>> Indeed looks all normal. >>> >>>> In-lining on raid10 has caused me some trouble (I had 4k nodes) over >>>> time, it has happened over a year ago with kernels recent at that >>>> time, but the fs was converted from raid5 >>>> >>>> Could you please elaborate on that ? you also ended up with files that got >>>> truncated to 4096 bytes ? >>> >>> I did not have truncated to 4k files, but your case lets me think of >>> small files inlining. Default max_inline mount option is 8k and that >>> means that 0 to ~3k files end up in metadata. I had size corruptions >>> for several of those small sized files that were updated quite >>> frequent, also within commit time AFAIK. Btrfs check lists this as >>> errors 400, although fs operation is not disturbed. I don't know what >>> happens if those small files are being updated/rewritten and are just >>> below or just above the max_inline limit. >>> >>> The only thing I was thinking of is that your files were started as >>> small, so inline, then extended to multi-GB. In the past, there were >>> 'bad extent/chunk type' issues and it was suggested that the fs would >>> have been an ext4-converted one (which had non-compliant mixed >>> metadata and data) but for most it was not the case. So there was/is >>> something unclear, but full balance or so fixed it as far as I >>> remember. But it is guessing, I do not have any failure cases like the >>> one you see. >> >> When I think of it, I did move this folder first when filesystem was RAID 1 (or not even RAID at all) and then it was upgraded to RAID 1 then RAID 10. >> Was there a faulty balance around August 2014 ? Please remember that I’m using Ubuntu so it was probably kernel from Ubuntu 14.04 LTS > > All those conversions should work, many people like yourself here on > the ML do this. However, as you say, you use Ubuntu 14.04 LTS which > has 3.13 base I see on distrowatch. What patches Canonical did add to > that version, how they match with the many kernel.org patches over the > last 2 years and when/if you upgraded the kernel, is what you would > have to get clear for yourself in order have a chance to come to a > reproducible case. And even then, the request will be to compile > and/or install a kernel.org version. This was the kernel during the migration from old FS … I keep updating my machine fairly regurarly so kernel did change couple of times since then. 
Though I appreciate the point about the difficulty of getting a reproducible case ;)

>> Also, I would like to hear it from the horse's mouth: dos & don'ts for long term storage where you moderately care about the data:

> 'moderately care about the data' is not of interest for btrfs-developers paid by commercial companies IMHO, let's see what happens…

It's always nice to get splashed in the face by the lukewarm coffee mug of a developer who felt severely underappreciated by my comment :)

>> RAID10 - flaky ? would RAID1 give similar performance ?

> I personally have not lost any data when using btrfs raid10 and I also can't remember any such report on this ML. I chose raid10 over raid1 as I planned/had to use 4 HDDs anyhow, and raid10 at least reads from 2 devices so that Gbps ethernet is almost always saturated. That is what I had with XFS on a 2-disk raid0.
>
> The troubles I mentioned w.r.t. small files must have been a leftover from when that fs was btrfs raid5. Also the 2 file corruptions I have ever seen were inside multi-GB (VM) images and from btrfs raid5 times. I converted to raid10 in summer 2015 (kernel 4.1.6) and the 1st scrub after that corrected several errors. I did several add, delete, dd of disks etc. after that but no data loss.
>
> I must say that I have been using mostly self-compiled mainline/stable kernel.org kernels, as my distro base was 3.11 and that version could do raid5 only as a sort of 2-disk raid0.

Thanks for that, I guess I wasn't insane going for raid10.

>> leaf & node size = 16k - pointless / flaky / untested / phased out ?

> This 16k has been the default for a year or so, it used to be 4k. You can find the (performance) reasoning by C. Mason in this ML. So you took the right decision 2 years ago.
> I recently re-created the raid10 fs that had been converted from raid5 as a new raid10 fs with 16k node size. The 4k fs, with quite a few snapshots and heavy fragmentation, was fast enough because of 300G SSD block caching, but I wanted to use the SSD storage a bit more efficiently.

Thanks. Actually, ever since I first heard of btrfs through the Avi Miller video I worked it out as "Hey, I ain't got that many small files, and who cares, btrfs will put them into metadata rather than occupy a whole node, so it's a win-win for me !". I was just wondering whether that data / those demos were out of date or had been proven faulty over time. Anyway, thanks for clearing it up.

FYI to the people who write the documentation, demos & wiki: some of us do actually read this stuff and it helps ! Please, more demos / examples / howtos / corner-case explanations / tricks / dos & don'ts !!! Keep making it more approachable by mere mortals !!!

>> growing FS: add disks and rebalance and then change to a different RAID level, or does it not matter ?!

> With raid56 there are issues, but for the other profiles I personally have no doubts, also judging by this ML. Things like moving a running rootfs partition on an SSD to a 3-HDD btrfs raid1+single setup work, I can say.

Thanks !

>> RAID level on system data - am I an idiot to even touch it ?

> You can even balance the 32M system chunk of a raid1 to another device, so no issue I would say.

Could I use RAID1 across all 6 drives to eliminate any future problems ?

>>>> You might want to run the python scripts from here:
>>>> https://github.com/knorrie/python-btrfs
>>>>
>>>> Will do.
>>>>
>>>> so that maybe you see how block-groups/chunks are filled etc.
>>>>
>>>> (ps.
>>>> this email client on OS X is driving me up the wall … have to correct
>>>> the corrections all the time :/)

^ permalink raw reply	[flat|nested] 16+ messages in thread
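For reference, the disk-shuffling migration described above (steps 1-7) maps roughly onto commands like the following. This is only a sketch; the device names, mount points and directory names are placeholders, not anything taken from the thread:

    # steps 1-2: build a new raid1 fs from the two disks removed from the old array
    mkfs.btrfs -d raid1 -m raid1 /dev/sdx1 /dev/sdy1
    mount /dev/sdx1 /mnt/new

    # steps 3-6: copy a batch over, then add the disks freed from the shrinking old fs
    rsync -a /mnt/old/some_dir/ /mnt/new/some_dir/
    btrfs device add /dev/sdz1 /dev/sdw1 /mnt/new

    # step 7: convert data, metadata and system chunks to raid10 (-f is needed for -sconvert)
    btrfs balance start -dconvert=raid10 -mconvert=raid10 -sconvert=raid10 -f /mnt/new

    # verify the resulting profiles
    btrfs filesystem df /mnt/new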
* Re: btrfs RAID 10 truncates files over 2G to 4096 bytes. 2016-07-06 12:20 ` Tomasz Kusmierz 2016-07-06 21:41 ` Henk Slager @ 2016-07-06 23:22 ` Kai Krakow 2016-07-06 23:51 ` Tomasz Kusmierz 2016-07-07 1:46 ` Chris Murphy 1 sibling, 2 replies; 16+ messages in thread From: Kai Krakow @ 2016-07-06 23:22 UTC (permalink / raw) To: linux-btrfs Am Wed, 6 Jul 2016 13:20:15 +0100 schrieb Tomasz Kusmierz <tom.kusmierz@gmail.com>: > When I think of it, I did move this folder first when filesystem was > RAID 1 (or not even RAID at all) and then it was upgraded to RAID 1 > then RAID 10. Was there a faulty balance around August 2014 ? Please > remember that I’m using Ubuntu so it was probably kernel from Ubuntu > 14.04 LTS > > Also, I would like to hear it from horses mouth: dos & donts for a > long term storage where you moderately care about the data: RAID10 - > flaky ? would RAID1 give similar performance ? The current implementation of RAID0 in btrfs is probably not very optimized. RAID0 is a special case anyways: Stripes have a defined width - I'm not sure what it is for btrfs, probably it's per chunk, so it's 1GB, maybe it's 64k **. That means your data is usually not read from multiple disks in parallel anyways as long as requests are below stripe width (which is probably true for most access patterns except copying files) - there's no immediate performance benefit. This holds true for any RAID0 with read and write patterns below the stripe size. Data is just more evenly distributed across devices and your application will only benefit performance-wise if accesses spread semi-random across the span of the whole file. And at least last time I checked, it was stated that btrfs raid0 does not submit IOs in parallel yet but first reads one stripe, then the next - so it doesn't submit IOs to different devices in parallel. Getting to RAID1, btrfs is even less optimized: Stripe decision is based on process pids instead of device load, read accesses won't distribute evenly to different stripes per single process, it's only just reading from the same single device - always. Write access isn't faster anyways: Both stripes need to be written - writing RAID1 is single device performance only. So I guess, at this stage there's no big difference between RAID1 and RAID10 in btrfs (except maybe for large file copies), not for single process access patterns and neither for multi process access patterns. Btrfs can only benefit from RAID1 in multi process access patterns currently, as can btrfs RAID0 by design for usual small random access patterns (and maybe large sequential operations). But RAID1 with more than two disks and multi process access patterns is more or less equal to RAID10 because stripes are likely to be on different devices anyways. In conclusion: RAID1 is simpler than RAID10 and thus its less likely to contain flaws or bugs. **: Please enlighten me, I couldn't find docs on this matter. -- Regards, Kai Replies to list-only preferred. ^ permalink raw reply [flat|nested] 16+ messages in thread
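Kai's point about pid-based mirror selection on raid1 is easy to probe, at least roughly. A sketch, assuming two large test files and cold caches (the paths are placeholders):

    # one reader: a single process sticks to one mirror
    sync; echo 3 > /proc/sys/vm/drop_caches
    dd if=/mnt/share/big1 of=/dev/null bs=1M

    # two readers: separate pids may be served by different mirrors (no guarantee with the pid-based choice)
    sync; echo 3 > /proc/sys/vm/drop_caches
    dd if=/mnt/share/big1 of=/dev/null bs=1M &
    dd if=/mnt/share/big2 of=/dev/null bs=1M &
    wait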
* Re: btrfs RAID 10 truncates files over 2G to 4096 bytes. 2016-07-06 23:22 ` Kai Krakow @ 2016-07-06 23:51 ` Tomasz Kusmierz 2016-07-07 0:32 ` Kai Krakow 2016-07-07 1:46 ` Chris Murphy 1 sibling, 1 reply; 16+ messages in thread From: Tomasz Kusmierz @ 2016-07-06 23:51 UTC (permalink / raw) To: Kai Krakow; +Cc: linux-btrfs

> On 7 Jul 2016, at 00:22, Kai Krakow <hurikhan77@gmail.com> wrote:
>
> [...]
>
> In conclusion: RAID1 is simpler than RAID10 and thus its less likely to
> contain flaws or bugs.

:O

It's an eye opener - I think this should end up on the btrfs wiki … seriously!

Anyway, my use case for this is "storage", therefore I predominantly copy large files.
^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: btrfs RAID 10 truncates files over 2G to 4096 bytes. 2016-07-06 23:51 ` Tomasz Kusmierz @ 2016-07-07 0:32 ` Kai Krakow 0 siblings, 0 replies; 16+ messages in thread From: Kai Krakow @ 2016-07-07 0:32 UTC (permalink / raw) To: linux-btrfs Am Thu, 7 Jul 2016 00:51:16 +0100 schrieb Tomasz Kusmierz <tom.kusmierz@gmail.com>:

> [...]
>
> Anyway my use case for this is "storage" therefore I predominantly
> copy large files.

Then RAID10 may be your best option - for local operations. Copying large files, even a modern single SATA spindle can saturate a gigabit link.
So, if your use case is NAS, and you don't use server side copies (like modern versions of NFS and Samba support), you won't benefit from RAID10 vs RAID1 - so just use the simpler implementation. My personal recommendation: Add a small, high quality SSD to your array and configure btrfs on top of bcache, configure it for write-around caching to get best life-time and data safety. This should cache mostly meta data access in your usecase and improve performance much better than RAID10 over RAID1. I can recommend Crucial MX series from personal experience, choose 250GB or higher as 120GB versions of Crucial MX suffer much lower durability for caching purposes. Adding bcache to an existing btrfs array is a little painful but easily doable if you have enough free space to temporarily sacrifice one disk. BTW: I'm using 3x 1TB btrfs mraid1/draid0 with a single 500GB bcache SSD in write-back mode and local operation (it's my desktop machine). The performance is great, bcache decouples some of the performance downsides the current btrfs raid implementation has. I do daily backups, so write-back caching is not a real problem (in case it fails), and btrfs draid0 is also not a problem (mraid1 ensures meta data integrity, so only file contents are at risk, and covered by backups). With this setup I can easily saturate my 6Gb onboard SATA controller, the system boots to usable desktop in 30 seconds from cold start (including EFI firmware), including autologin to full-blown KDE, autostart of Chrome and Steam, 2 virtual machine containers (nspawn-based, one MySQL instance, one ElasticSearch instance), plus local MySQL and ElasticSearch service (used for development and staging purposes), and a local postfix service. Without bcache this machine needs around 2-3 minutes to boot to a usable state. BTW: I found this but it's old: https://btrfs.wiki.kernel.org/index.php/Multi-device_Benchmarks However, it should give you a rough overview about your usage patterns. You can see that RAID1 would saturate a 1 Gb link for single process operations, this is if your usecase is NAS, you're good to go. Simultaneous read+write isn't covered by the test but performance will probably be killed by seek overhead anyways then, except you use bcache (it greatly reduces seeks and, especially in write-back mode, converts them to sequential access patterns but you need to watch the SSD wear closely in write-back mode and swap the SSD early before it dies... in write-around mode, bcache dying is irrelevant to btrfs integrity... write-through doesn't make sense for your usecase). -- Regards, Kai Replies to list-only preferred. ^ permalink raw reply [flat|nested] 16+ messages in thread
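The bcache layout Kai describes would be assembled along these lines. This is a sketch only; device names are placeholders, and note that the filesystem has to be (re)built on the resulting /dev/bcacheN devices, which is why adding bcache to an existing array is "a little painful":

    # turn an HDD partition into a bcache backing device, and the SSD partition into a cache device
    make-bcache -B /dev/sdb1
    make-bcache -C /dev/sdg1
    bcache-super-show /dev/sdg1               # note the cset.uuid of the cache set

    # attach the cache set and select write-around, as suggested for data safety
    echo <cset-uuid> > /sys/block/bcache0/bcache/attach
    echo writearound > /sys/block/bcache0/bcache/cache_mode

    # then create or extend the btrfs array on /dev/bcache0, /dev/bcache1, ... instead of the raw HDDs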
* Re: btrfs RAID 10 truncates files over 2G to 4096 bytes. 2016-07-06 23:22 ` Kai Krakow 2016-07-06 23:51 ` Tomasz Kusmierz @ 2016-07-07 1:46 ` Chris Murphy 2016-07-07 2:24 ` Tomasz Kusmierz 1 sibling, 1 reply; 16+ messages in thread From: Chris Murphy @ 2016-07-07 1:46 UTC (permalink / raw) To: Kai Krakow; +Cc: Btrfs BTRFS On Wed, Jul 6, 2016 at 5:22 PM, Kai Krakow <hurikhan77@gmail.com> wrote: > The current implementation of RAID0 in btrfs is probably not very > optimized. RAID0 is a special case anyways: Stripes have a defined > width - I'm not sure what it is for btrfs, probably it's per chunk, so > it's 1GB, maybe it's 64k **. Stripe element (a.k.a. strip, a.k.a. md chunk) size in Btrfs is fixed at 64KiB. >That means your data is usually not read > from multiple disks in parallel anyways as long as requests are below > stripe width (which is probably true for most access patterns except > copying files) - there's no immediate performance benefit. Most any write pattern benefits from raid0 due to less disk contention, even if the typical file size is smaller than stripe size. Parallelization is improved even if it's suboptimal. This is really no different than md raid striping with a 64KiB chunk size. On Btrfs, it might be that some workloads benefit from metadata raid10, and others don't. I also think it's hard to estimate without benchmarking an actual workload with metadata as raid1 vs raid10. > So I guess, at this stage there's no big difference between RAID1 and > RAID10 in btrfs (except maybe for large file copies), not for single > process access patterns and neither for multi process access patterns. > Btrfs can only benefit from RAID1 in multi process access patterns > currently, as can btrfs RAID0 by design for usual small random access > patterns (and maybe large sequential operations). But RAID1 with more > than two disks and multi process access patterns is more or less equal > to RAID10 because stripes are likely to be on different devices anyways. I think that too would need to be benchmarked and I think it'd need to be aged as well to see the effect of both file and block group free space fragmentation. The devil will be in really minute details, all you have to do is read a few weeks of XFS list stuff with people talking about optimization or bad performance and almost always it's not the fault of the file system. And when it is, it depends on the kernel version as XFS has had substantial changes even over its long career, including (somewhat) recent changes for metadata heavy workloads. > In conclusion: RAID1 is simpler than RAID10 and thus its less likely to > contain flaws or bugs. I don't know about that. I think it's about the same. All multiple device support, except raid56, was introduced at the same time practically from day 2. Btrfs raid1 and raid10 tolerate only exactly 1 device loss, *maybe* two if you're very lucky, so neither of them are really scalable. -- Chris Murphy ^ permalink raw reply [flat|nested] 16+ messages in thread
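As Chris says, only a benchmark of the actual workload settles this. A minimal sketch with fio; the job parameters here are assumptions and should be tuned to the real access pattern:

    # large sequential reads, first a single reader, then four concurrent ones
    fio --name=seqread1 --directory=/mnt/share --rw=read --bs=1M --size=4G --numjobs=1 --direct=1 --group_reporting
    fio --name=seqread4 --directory=/mnt/share --rw=read --bs=1M --size=4G --numjobs=4 --direct=1 --group_reporting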
* Re: btrfs RAID 10 truncates files over 2G to 4096 bytes. 2016-07-07 1:46 ` Chris Murphy @ 2016-07-07 2:24 ` Tomasz Kusmierz 2016-07-07 3:09 ` Chris Murphy 0 siblings, 1 reply; 16+ messages in thread From: Tomasz Kusmierz @ 2016-07-07 2:24 UTC (permalink / raw) To: Chris Murphy; +Cc: Kai Krakow, Btrfs BTRFS

> On 7 Jul 2016, at 02:46, Chris Murphy <lists@colorremedies.com> wrote:
>

Chaps, I didn't want this to spring up as a btrfs performance argument, BUT you are throwing a lot of useful data around, so maybe divert some of it into the wiki ? You know, us normal people might find it useful for making an educated choice in the future :)

Interestingly, on my RAID10 with 6 disks I only get:

dd if=/mnt/share/asdf of=/dev/zero bs=100M
113+1 records in
113+1 records out
11874643004 bytes (12 GB, 11 GiB) copied, 45.3123 s, 262 MB/s

filefrag -v
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..    2471: 2101940598..2101943069:   2472:
   1:     2472..   12583: 1938312686..1938322797:  10112: 2101943070:
   2:    12584..   12837: 1937534654..1937534907:    254: 1938322798:
   3:    12838..   12839: 1937534908..1937534909:      2:
   4:    12840..   34109: 1902954063..1902975332:  21270: 1937534910:
   5:    34110..   53671: 1900857931..1900877492:  19562: 1902975333:
   6:    53672..   54055: 1900877493..1900877876:    384:
   7:    54056..   54063: 1900877877..1900877884:      8:
   8:    54064..   98041: 1900877885..1900921862:  43978:
   9:    98042..  117671: 1900921863..1900941492:  19630:
  10:   117672..  118055: 1900941493..1900941876:    384:
  11:   118056..  161833: 1900941877..1900985654:  43778:
  12:   161834..  204013: 1900985655..1901027834:  42180:
  13:   204014..  214269: 1901027835..1901038090:  10256:
  14:   214270..  214401: 1901038091..1901038222:    132:
  15:   214402..  214407: 1901038223..1901038228:      6:
  16:   214408..  258089: 1901038229..1901081910:  43682:
  17:   258090..  300139: 1901081911..1901123960:  42050:
  18:   300140..  310559: 1901123961..1901134380:  10420:
  19:   310560..  310695: 1901134381..1901134516:    136:
  20:   310696..  354251: 1901134517..1901178072:  43556:
  21:   354252..  396389: 1901178073..1901220210:  42138:
  22:   396390..  406353: 1901220211..1901230174:   9964:
  23:   406354..  406515: 1901230175..1901230336:    162:
  24:   406516..  406519: 1901230337..1901230340:      4:
  25:   406520..  450115: 1901230341..1901273936:  43596:
  26:   450116..  492161: 1901273937..1901315982:  42046:
  27:   492162..  524199: 1901315983..1901348020:  32038:
  28:   524200..  535355: 1901348021..1901359176:  11156:
  29:   535356..  535591: 1901359177..1901359412:    236:
  30:   535592.. 1315369: 1899830240..1900610017: 779778: 1901359413:
  31:  1315370.. 1357435: 1901359413..1901401478:  42066: 1900610018:
  32:  1357436.. 1368091: 1928101070..1928111725:  10656: 1901401479:
  33:  1368092.. 1368231: 1928111726..1928111865:    140:
  34:  1368232.. 2113959: 1899043808..1899789535: 745728: 1928111866:
  35:  2113960.. 2899082: 1898257376..1899042498: 785123: 1899789536: last,eof

If it were possible to read from 6 disks at once, maybe this performance would be better for a linear read. Anyway, this is a huge diversion from the original question, so maybe we will end here ?

^ permalink raw reply	[flat|nested] 16+ messages in thread
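A side note on the numbers above: a single dd is exactly the single-process case Kai described, so it mostly measures one mirror per stripe. A slightly more telling variant might look like this (iostat comes from the sysstat package; the file name is the one from the thread):

    sync; echo 3 > /proc/sys/vm/drop_caches     # make sure the read really hits the disks
    iostat -x 1 > /tmp/iostat.log &             # record per-device utilisation during the copy
    dd if=/mnt/share/asdf of=/dev/null bs=1M
    kill %1                                     # stop the background iostat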
* Re: btrfs RAID 10 truncates files over 2G to 4096 bytes. 2016-07-07 2:24 ` Tomasz Kusmierz @ 2016-07-07 3:09 ` Chris Murphy 0 siblings, 0 replies; 16+ messages in thread From: Chris Murphy @ 2016-07-07 3:09 UTC (permalink / raw) To: Tomasz Kusmierz; +Cc: Chris Murphy, Kai Krakow, Btrfs BTRFS On Wed, Jul 6, 2016 at 8:24 PM, Tomasz Kusmierz <tom.kusmierz@gmail.com> wrote: > you are throwing a lot of useful data, maybe diverting some of it into wiki ? you know, us normal people might find it useful for making educated choice in some future ? :) There is a wiki, and it's difficult for keep up to date as it is. There are just too many changes happening in Btrfs, and really only the devs have a birds eye view of what's going on and what will happen sooner than later. -- Chris Murphy ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: btrfs RAID 10 truncates files over 2G to 4096 bytes. 2016-07-04 21:13 ` Henk Slager 2016-07-04 21:28 ` Tomasz Kusmierz @ 2016-07-05 4:36 ` Duncan 1 sibling, 0 replies; 16+ messages in thread From: Duncan @ 2016-07-05 4:36 UTC (permalink / raw) To: linux-btrfs Henk Slager posted on Mon, 04 Jul 2016 23:13:52 +0200 as excerpted: > [Tomasz Kusmierz wrote...] >> Problem is that stuff on this filesystem moves so slowly that it's hard >> to remember historical events ... it's like AWS glacier. What I can >> state with 100% certainty is that: >> - files that are affected are 2GB and over (safe to assume 4GB and >> over) >> - files affected were just read (and some not even read) never written >> after putting into storage >> - In the past I've assumed that files >> affected are due to size, but I have quite few ISO files some backups >> of virtual machines ... no problems there - seems like problem >> originates in one folder & size > 2GB & extension .mkv This reads to me like a security-video use-case, very large media files that are mostly WORN (write-once read-never). These files would be time- based and likely mostly the same size, tho compression could cause them to vary in size somewhat. I see a comment that I didn't quote, to the effect that you did a balance a half a year ago or so. Btrfs data chunk size is nominally 1 GiB. However, on large enough btrfs, I believe sometimes dependent on striped-raid as well (the exact conditions aren't clear to me), chunks are some multiple of that. With a 6-drive btrfs raid10, which we know does two copies and stripe as wide as possible, so 3-device-wide stripes here with two mirrors of the stripe, I'd guess it's 3 GiB chunks, 1 GiB * 3-device stripe width. Is it possible that it's 3 GiB plus files only that are affected, and that the culprit was a buggy balance shifting around those big chunks half a year or whatever ago? As to your VM images not being affected, their usage is far different, unless they're simply archived images, not actually in use. If they're in-use not archived VM images, they're likely either highly fragmented, or you managed the fragmentation with the use of the NOCOW file attribute. Either way, the way the filesystem treats them as opposed to very large write-once files that are likely using whole data chunks is very different, and it could well be that difference that explains why the video files were affected but the VM images not. Given the evidence, a buggy balance would indeed be my first suspect, but I'm not a dev, and I haven't the foggiest what sort of balance bug might be the trigger here, or whether it has been fixed at all, let alone when, if so. But of course that does suggest a potential partial proof and a test. The partial proof would be that none of the files created after the balance should be affected. And the test, after backing up newer video files if they're likely to be needed, try another balance and see if it eats them too. If it does... If it doesn't with a new kernel and tools, you might try yet another balance with the same kernel and progs you were likely using half a year ago when you did that balance, just to nail down for sure whether it did eat the files back then, so we don't have to worry about some other problem. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 16+ messages in thread
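Duncan's partial proof and test could be approached with something along these lines; the cut-off date is a placeholder for whenever that balance was actually run, and the newer files should be backed up first, as he says:

    # files written after the suspect balance should all still have their full size
    find /mnt/share/victim -name '*.mkv' -newermt '2016-01-01' -size +2G -printf '%s\t%p\n'

    # after backing up the newer videos, repeat a full balance and re-check for 4k-sized victims
    btrfs balance start /mnt/share
    find /mnt/share/victim -name '*.mkv' -size -1M -printf '%s\t%p\n'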
Thread overview: 16+ messages
2016-07-02 23:36 btrfs RAID 10 truncates files over 2G to 4096 bytes Tomasz Kusmierz
2016-07-04 21:13 ` Henk Slager
2016-07-04 21:28 ` Tomasz Kusmierz
2016-07-05 23:30 ` Henk Slager
2016-07-06 1:24 ` Tomasz Kusmierz
[not found] ` <0EBF76CB-A350-4108-91EF-076A73932061@gmail.com>
2016-07-06 1:25 ` Henk Slager
2016-07-06 12:20 ` Tomasz Kusmierz
2016-07-06 21:41 ` Henk Slager
2016-07-06 22:16 ` Tomasz Kusmierz
2016-07-06 23:22 ` Kai Krakow
2016-07-06 23:51 ` Tomasz Kusmierz
2016-07-07 0:32 ` Kai Krakow
2016-07-07 1:46 ` Chris Murphy
2016-07-07 2:24 ` Tomasz Kusmierz
2016-07-07 3:09 ` Chris Murphy
2016-07-05 4:36 ` Duncan