Linux Btrfs filesystem development
* btrfs stuck with lot's of files
@ 2014-12-01 11:46 Peter Volkov
  2014-12-01 18:47 ` Robert White
  2014-12-02  1:33 ` Qu Wenruo
  0 siblings, 2 replies; 10+ messages in thread
From: Peter Volkov @ 2014-12-01 11:46 UTC (permalink / raw)
  To: linux-btrfs@vger.kernel.org

Hi, guys.

We have a problem with a btrfs file system: sometimes it becomes stuck
without leaving me any way to interrupt it (shutdown -r now is unable to
restart the server). By stuck I mean that processes that were previously
able to write to disk can no longer cope with the load, and the load
average goes up:

top - 13:10:58 up 1 day,  9:26,  5 users,  load average: 157.76, 156.61, 149.29
Tasks: 235 total,   2 running, 233 sleeping,   0 stopped,   0 zombie
%Cpu(s): 19.8 us, 15.0 sy,  0.0 ni, 60.7 id,  3.9 wa,  0.0 hi,  0.6 si,  0.0 st
KiB Mem:  65922104 total, 65414856 used,   507248 free,     1844 buffers
KiB Swap:        0 total,        0 used,        0 free. 62570804 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 8644 root      20   0       0      0      0 R  96.5  0.0 127:21.95 kworker/u16:16
 5047 dvr       20   0 6884292 122668   4132 S   6.4  0.2 258:59.49 dvrserver
30223 root      20   0   20140   2600   2132 R   6.4  0.0   0:00.01 top
    1 root      20   0    4276   1628   1524 S   0.0  0.0   0:40.19 init



There are about 300 threads on the server, some of which write to disk.
Some information about this btrfs filesystem: it is a 22-disk file
system with raid1 for metadata and raid0 for data:

 # btrfs filesystem df /store/
Data, single: total=11.92TiB, used=10.86TiB
System, RAID1: total=8.00MiB, used=1.27MiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=46.00GiB, used=33.49GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=128.00KiB
 # btrfs property get /store/
ro=false
label=store
 # btrfs device stats /store/
(shows all zeros)
 # btrfs balance status /store/
No balance found on '/store/'
 # btrfs filesystem show /store/
Btrfs v3.17.1
(btw, is it supposed to show only the version here?)

As for the load, we write quite small files (some of 313K, some of
800K), which is why metadata takes up that much. So, back to the problem.
iostat 1 shows the following:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          16.96    0.00   17.09   65.95    0.00    0.00

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sde               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
sdf               0.00         0.00         0.00          0          0
sdg               0.00         0.00         0.00          0          0
sdj               0.00         0.00         0.00          0          0
sdh               0.00         0.00         0.00          0          0
sdk               0.00         0.00         0.00          0          0
sdi               1.00         0.00       200.00          0        200
sdl               0.00         0.00         0.00          0          0
sdn              48.00         0.00     17260.00          0      17260
sdm               0.00         0.00         0.00          0          0
sdp               0.00         0.00         0.00          0          0
sdo               0.00         0.00         0.00          0          0
sdq               0.00         0.00         0.00          0          0
sdr               0.00         0.00         0.00          0          0
sds               0.00         0.00         0.00          0          0
sdt               0.00         0.00         0.00          0          0
sdv               0.00         0.00         0.00          0          0
sdw               0.00         0.00         0.00          0          0
sdu               0.00         0.00         0.00          0          0


Writes go to a single disk. I tried to debug what is going on in the
kworker and did:

$ echo workqueue:workqueue_queue_work > /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace_pipe > trace_pipe.out2

trace_pipe2.out.xz is attached. Could you comment on what goes wrong
here?

The server has 64 GB of RAM. Is it possible that it is unable to keep all
metadata in memory? If such a memory limit exists, can we increase it?


Thanks in advance for any pointers,
--
Peter.



* Re: btrfs stuck with lot's of files
  2014-12-01 11:46 btrfs stuck with lot's of files Peter Volkov
@ 2014-12-01 18:47 ` Robert White
  2014-12-02  1:50   ` Peter Volkov
  2014-12-02  1:33 ` Qu Wenruo
  1 sibling, 1 reply; 10+ messages in thread
From: Robert White @ 2014-12-01 18:47 UTC (permalink / raw)
  To: Peter Volkov, linux-btrfs@vger.kernel.org

On 12/01/2014 03:46 AM, Peter Volkov wrote:
> Hi, guys.
 > (stuff about getting hung up trying to write to one drive)

That drive (/dev/sdn) is probably starting to fail. Some older drives 
basically go unresponsive when they start to go bad. Particularly if 
they've gone bad enough to have run out of spare tracks/sectors. 
Sometimes they will just refuse to answer. Sometimes they will go into 
"try again" mode, and the same activity will be retried indefinitely. 
This will then fill up your write queues and jam up all sorts of subsystems.

Step 1: Backup your data. Since you didn't RAID your data at all, when 
that drive dies your data is going to go away in fascinating and 
unpredictable ways. (RAID1 metadata with no RAID1 or RAID5 of the data 
means you have essentially no media failure protection.)

Step 2: Turn on SMART (if you can) and check whether the drive is in
its final moments of life. If your disk is all green lights according to
SMART, you may be able to un-jam it by just doing a balance, as explained
below after the next quote.

Step 3: Switch your data mode to RAID5. It will cost you about half of
your currently free data space, but it won't leave you _as_ _vulnerable_
to complete data loss as you are now. SMART might be wrong about your
drive being fine if it says it is.

>   # btrfs filesystem df /store/
> Data, single: total=11.92TiB, used=10.86TiB

Regardless of the above...

You have a terabyte of unused but allocated data storage. You probably
need to balance your system to un-jam that. That's a lot of space that
is unavailable to the metadata (etc).
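
For reference, the "unused but allocated" figure is just total minus used for each line of the `btrfs filesystem df` output quoted above. A minimal Python sketch of that arithmetic, assuming the output keeps the `Kind, profile: total=..., used=...` shape shown in the post; the helper name `parse_df` is made up for illustration.

```python
import re

# Output copied from the original post.
DF_OUTPUT = """\
Data, single: total=11.92TiB, used=10.86TiB
System, RAID1: total=8.00MiB, used=1.27MiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=46.00GiB, used=33.49GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=128.00KiB
"""

# Binary unit suffixes that appear in the output above.
UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}

LINE_RE = re.compile(
    r"^(?P<kind>\w+), (?P<profile>\w+): "
    r"total=(?P<total>[\d.]+)(?P<tunit>(?:[KMGT]i)?B), "
    r"used=(?P<used>[\d.]+)(?P<uunit>(?:[KMGT]i)?B)$"
)


def parse_df(text):
    """Return (kind, profile, total_bytes, used_bytes) for each matching line."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip anything that does not look like a df line
        total = float(m["total"]) * UNITS[m["tunit"]]
        used = float(m["used"]) * UNITS[m["uunit"]]
        rows.append((m["kind"], m["profile"], total, used))
    return rows


for kind, profile, total, used in parse_df(DF_OUTPUT):
    slack_gib = (total - used) / 2**30
    print(f"{kind:13s} {profile:6s} allocated but unused: {slack_gib:10.2f} GiB")
```

The Data line is the one that matters here: 11.92 TiB allocated minus 10.86 TiB used leaves roughly 1.06 TiB sitting in partly filled data chunks.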

ASIDE: Having your metadata set to RAID1 (as opposed to the default of 
DUP) seems a little iffy since your data is still set to DUP. This 
configuration is not going to leave you with a mountable filesystem if 
you lose a disk. I'm not sure if the RAID1 layout is going to want to 
put specific datum in specific places, but it might, which if it does 
might leave you in an irreconcilable position.

Either way, you will probably un-jam your system in the short run by 
doing a balance. A full balance (no filter args at all) would be your 
best bet.

FURTHER ASIDE: raid1 metadata and raid5 data might be good for you. Given
22 volumes and 10% empty space, it would only cost you half of your
existing empty space. If you don't RAID your data, there is no real
point to putting your metadata in RAID.

[Yes, I said my basic points about your current layout two different 
ways and times. You are either "just a little over-committed on space" 
or you are "about to lose all your data" and it's impossible to tell 
which is the case from here.]

Backup your data. NOW!



* Re: btrfs stuck with lot's of files
  2014-12-01 11:46 btrfs stuck with lot's of files Peter Volkov
  2014-12-01 18:47 ` Robert White
@ 2014-12-02  1:33 ` Qu Wenruo
  2014-12-02  2:00   ` Peter Volkov
  2014-12-04 22:58   ` Reiterate: " Peter Volkov
  1 sibling, 2 replies; 10+ messages in thread
From: Qu Wenruo @ 2014-12-02  1:33 UTC (permalink / raw)
  To: Peter Volkov, linux-btrfs@vger.kernel.org


-------- Original Message --------
Subject: btrfs stuck with lot's of files
From: Peter Volkov <pva@gentoo.org>
To: linux-btrfs@vger.kernel.org <linux-btrfs@vger.kernel.org>
Date: 2014-12-01 19:46
> Hi, guys.
>
> We have a problem with btrfs file system: sometimes it became stuck
> without leaving me any way to interrupt it (shutdown -r now is unable to
> restart server). By stuck I mean some processes that previously were
> able to write on disk are unable to cope with load and load average goes
> up:
>
> top - 13:10:58 up 1 day,  9:26,  5 users,  load average: 157.76, 156.61,
> 149.29
> Tasks: 235 total,   2 running, 233 sleeping,   0 stopped,   0 zombie
> %Cpu(s): 19.8 us, 15.0 sy,  0.0 ni, 60.7 id,  3.9 wa,  0.0 hi,  0.6 si,
> 0.0 st
> KiB Mem:  65922104 total, 65414856 used,   507248 free,     1844 buffers
> KiB Swap:        0 total,        0 used,        0 free. 62570804 cached
> Mem
>
>    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+
> COMMAND
>   8644 root      20   0       0      0      0 R  96.5  0.0 127:21.95
> kworker/u16:16
>   5047 dvr       20   0 6884292 122668   4132 S   6.4  0.2 258:59.49
> dvrserver
> 30223 root      20   0   20140   2600   2132 R   6.4  0.0   0:00.01
> top
>      1 root      20   0    4276   1628   1524 S   0.0  0.0   0:40.19
> init
>
>
>
> There are about 300 treads on server, some of which are writing on disk.
> A bit information about this btrfs filesystem: this is 22 disk file
> system with raid1 for metadata and raid0 for data:
>
>   # btrfs filesystem df /store/
> Data, single: total=11.92TiB, used=10.86TiB
> System, RAID1: total=8.00MiB, used=1.27MiB
> System, single: total=4.00MiB, used=0.00B
> Metadata, RAID1: total=46.00GiB, used=33.49GiB
> Metadata, single: total=8.00MiB, used=0.00B
> GlobalReserve, single: total=512.00MiB, used=128.00KiB
>   # btrfs property get /store/
> ro=false
> label=store
>   # btrfs device stats /store/
> (shows all zeros)
>   # btrfs balance status /store/
> No balance found on '/store/'
>   # btrfs filesystem show /store/
> Btrfs v3.17.1
> (btw, is it supposed to have only version here?)
This is a small bug: if there is a trailing '/' in the path given to
'btrfs fi show', it can't recognize it.
A patch has already been sent and may be included in the next version.
>
> As for load we write quite small files of size (some of 313K, some of
> 800K), that's why metadata takes that much. So back to the problem.
> iostat 1 exposes following problem:
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            16.96    0.00   17.09   65.95    0.00    0.00
>
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> sda               0.00         0.00         0.00          0          0
> sdc               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sde               0.00         0.00         0.00          0          0
> sdd               0.00         0.00         0.00          0          0
> sdf               0.00         0.00         0.00          0          0
> sdg               0.00         0.00         0.00          0          0
> sdj               0.00         0.00         0.00          0          0
> sdh               0.00         0.00         0.00          0          0
> sdk               0.00         0.00         0.00          0          0
> sdi               1.00         0.00       200.00          0        200
> sdl               0.00         0.00         0.00          0          0
> sdn              48.00         0.00     17260.00          0      17260
> sdm               0.00         0.00         0.00          0          0
> sdp               0.00         0.00         0.00          0          0
> sdo               0.00         0.00         0.00          0          0
> sdq               0.00         0.00         0.00          0          0
> sdr               0.00         0.00         0.00          0          0
> sds               0.00         0.00         0.00          0          0
> sdt               0.00         0.00         0.00          0          0
> sdv               0.00         0.00         0.00          0          0
> sdw               0.00         0.00         0.00          0          0
> sdu               0.00         0.00         0.00          0          0
>
>
> write goes to one disk. I've tried to debug what's going in kworker and
> did
>
> $ echo workqueue:workqueue_queue_work
>> /sys/kernel/debug/tracing/set_event
> $ cat /sys/kernel/debug/tracing/trace_pipe > trace_pipe.out2
>
> trace_pipe2.out.xz in attachment. Could you comment, what goes wrong
> here?
It seems the attachment was blocked by the mailing list, so I didn't see
it.
>
> Server has 64Gb of RAM. Is it possible that it is unable to keep all
> metadata in memory, can we encrease this memory limit, if exists?
Not possible, it will never happen (if nothing goes wrong....).
The kernel has a page cache mechanism: when memory runs short,
some cached metadata/data can be flushed back (if dirty) to disk to free
space, and re-read from disk later if needed.

So the kernel doesn't need to load all the metadata/data into memory, and
that's mostly impossible for a large fs anyway.

And one important piece of information is missing: the kernel version.

What I can see is only the btrfs-progs version, which doesn't really
help with this kind of kernel hang.

Thanks,
Qu
>
>
> Thanks in advance for any pointers,
> --
> Peter.
>



* Re: btrfs stuck with lot's of files
  2014-12-01 18:47 ` Robert White
@ 2014-12-02  1:50   ` Peter Volkov
  2014-12-02 12:48     ` Duncan
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Volkov @ 2014-12-02  1:50 UTC (permalink / raw)
  To: Robert White; +Cc: linux-btrfs@vger.kernel.org

On Mon, 01/12/2014 at 10:47 -0800, Robert White wrote:
> On 12/01/2014 03:46 AM, Peter Volkov wrote:
>  > (stuff about getting hung up trying to write to one drive)
> 
> That drive (/dev/sdn) is probably starting to fail.
> (about failed drive)

Thank you, Robert, for the answer. It is not likely that the drive is
failing here. A similar condition (writes going to a single drive) happens
with other drives too, i.e. this write pattern may occur with any drive.

After watching what happens for longer I see the following. During a
stall, a single processor core is 100% busy in kernel space (some kworker
is taking 100% CPU). Ftrace reveals that
btrfs_async_reclaim_metadata_space is the most frequently called function.
So it looks like btrfs is doing some operation on metadata, and until
it finishes everything is stuck (practically no writes reach the
disks). So I'm looking for suggestions on how to cope with this.
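
A tally like the one described above can be produced from the saved trace with a few lines of Python. This is only a sketch: it assumes each workqueue_queue_work event line carries a `function=<name>` field, and the helper name `count_queued_functions` is made up for illustration.

```python
import re
import sys
from collections import Counter

# Assumption: every queued-work event line contains "function=<symbol>".
FUNC_RE = re.compile(r"\bfunction=(\S+)")


def count_queued_functions(lines):
    """Count how often each work function appears in an iterable of trace lines."""
    counts = Counter()
    for line in lines:
        m = FUNC_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 count_trace.py trace_pipe.out2
    with open(sys.argv[1], errors="replace") as fh:
        for name, n in count_queued_functions(fh).most_common(10):
            print(f"{n:8d}  {name}")
```

Lines without a `function=` field are ignored, so the script can be pointed at the raw trace_pipe capture without pre-filtering.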

> >   # btrfs filesystem df /store/
> > Data, single: total=11.92TiB, used=10.86TiB
> 
> Reguardless of the above...
> 
> You have a terabyte of unused but allocated data storage. You probably 
> need to balance your system to un-jamb that. That's a lot of space that 
> is unavailable to the metadata (etc).

Well, I'm afraid that a balance will leave the fs "stuck" for even longer.

> ASIDE: Having your metadata set to RAID1 (as opposed to the default of 
> DUP) seems a little iffy since your data is still set to DUP.

That's true. But why is the data duplicated? During btrfs volume creation
I explicitly set the data profile to single with -d.

> FUTHER ASIDE: raid1 metadata and raid5 data might be good for you given 
> 22 volumes and 10% empty empty space it would only cost you half of your 
> existing empty space. If you don't RAID your data, there is no real 
> point to putting your metadata in RAID.

Is raid5 ready for use? As I read in post [1], mentioned on [2], there is
still some way to go before it is stable.

[1]
http://marc.merlins.org/perso/btrfs/post_2014-03-23_Btrfs-Raid5-Status.html
[2] https://btrfs.wiki.kernel.org/index.php/RAID56

--
Peter.



* Re: btrfs stuck with lot's of files
  2014-12-02  1:33 ` Qu Wenruo
@ 2014-12-02  2:00   ` Peter Volkov
  2014-12-04 22:58   ` Reiterate: " Peter Volkov
  1 sibling, 0 replies; 10+ messages in thread
From: Peter Volkov @ 2014-12-02  2:00 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs@vger.kernel.org

On Tue, 02/12/2014 at 09:33 +0800, Qu Wenruo wrote:
> -------- Original Message --------
> Subject: btrfs stuck with lot's of files
> From: Peter Volkov <pva@gentoo.org>
> To: linux-btrfs@vger.kernel.org <linux-btrfs@vger.kernel.org>
> Date: 2014-12-01 19:46
> > Hi, guys.
> >
> > We have a problem with btrfs file system: sometimes it became stuck
> > without leaving me any way to interrupt it (shutdown -r now is unable to
> > restart server). By stuck I mean some processes that previously were
> > able to write on disk are unable to cope with load and load average goes
> > up:
> >
> > top - 13:10:58 up 1 day,  9:26,  5 users,  load average: 157.76, 156.61,
> > 149.29
> > Tasks: 235 total,   2 running, 233 sleeping,   0 stopped,   0 zombie
> > %Cpu(s): 19.8 us, 15.0 sy,  0.0 ni, 60.7 id,  3.9 wa,  0.0 hi,  0.6 si,
> > 0.0 st
> > KiB Mem:  65922104 total, 65414856 used,   507248 free,     1844 buffers
> > KiB Swap:        0 total,        0 used,        0 free. 62570804 cached
> > Mem
> >
> >    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+
> > COMMAND
> >   8644 root      20   0       0      0      0 R  96.5  0.0 127:21.95
> > kworker/u16:16
> >   5047 dvr       20   0 6884292 122668   4132 S   6.4  0.2 258:59.49
> > dvrserver
> > 30223 root      20   0   20140   2600   2132 R   6.4  0.0   0:00.01
> > top
> >      1 root      20   0    4276   1628   1524 S   0.0  0.0   0:40.19
> > init
> >
> >
> >
> > There are about 300 treads on server, some of which are writing on disk.
> > A bit information about this btrfs filesystem: this is 22 disk file
> > system with raid1 for metadata and raid0 for data:
> >
> >   # btrfs filesystem df /store/
> > Data, single: total=11.92TiB, used=10.86TiB
> > System, RAID1: total=8.00MiB, used=1.27MiB
> > System, single: total=4.00MiB, used=0.00B
> > Metadata, RAID1: total=46.00GiB, used=33.49GiB
> > Metadata, single: total=8.00MiB, used=0.00B
> > GlobalReserve, single: total=512.00MiB, used=128.00KiB
> >   # btrfs property get /store/
> > ro=false
> > label=store
> >   # btrfs device stats /store/
> > (shows all zeros)
> >   # btrfs balance status /store/
> > No balance found on '/store/'
> >   # btrfs filesystem show /store/
> > Btrfs v3.17.1
> > (btw, is it supposed to have only version here?)
> This is a small bug that if there is appending '/' in the path for 
> 'btrfs fi show', it can't recognize it....
> Patch is already sent and maybe included next version.
> >
> > As for load we write quite small files of size (some of 313K, some of
> > 800K), that's why metadata takes that much. So back to the problem.
> > iostat 1 exposes following problem:
> >
> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >            16.96    0.00   17.09   65.95    0.00    0.00
> >
> > Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> > sda               0.00         0.00         0.00          0          0
> > sdc               0.00         0.00         0.00          0          0
> > sdb               0.00         0.00         0.00          0          0
> > sde               0.00         0.00         0.00          0          0
> > sdd               0.00         0.00         0.00          0          0
> > sdf               0.00         0.00         0.00          0          0
> > sdg               0.00         0.00         0.00          0          0
> > sdj               0.00         0.00         0.00          0          0
> > sdh               0.00         0.00         0.00          0          0
> > sdk               0.00         0.00         0.00          0          0
> > sdi               1.00         0.00       200.00          0        200
> > sdl               0.00         0.00         0.00          0          0
> > sdn              48.00         0.00     17260.00          0      17260
> > sdm               0.00         0.00         0.00          0          0
> > sdp               0.00         0.00         0.00          0          0
> > sdo               0.00         0.00         0.00          0          0
> > sdq               0.00         0.00         0.00          0          0
> > sdr               0.00         0.00         0.00          0          0
> > sds               0.00         0.00         0.00          0          0
> > sdt               0.00         0.00         0.00          0          0
> > sdv               0.00         0.00         0.00          0          0
> > sdw               0.00         0.00         0.00          0          0
> > sdu               0.00         0.00         0.00          0          0
> >
> >
> > write goes to one disk. I've tried to debug what's going in kworker and
> > did
> >
> > $ echo workqueue:workqueue_queue_work
> >> /sys/kernel/debug/tracing/set_event
> > $ cat /sys/kernel/debug/tracing/trace_pipe > trace_pipe.out2
> >
> > trace_pipe2.out.xz in attachment. Could you comment, what goes wrong
> > here?
> It seems that attachment is blocked by mail-list so I didn't see the 
> attachment.

I've put it here:
https://drive.google.com/file/d/0BygFL6N3ZVUAMWxCQ0tDREE1Uzg/view?usp=sharing

I've also put some additional information in another message that I just
sent to the mailing list.

> > Server has 64Gb of RAM. Is it possible that it is unable to keep all
> > metadata in memory, can we encrease this memory limit, if exists?
> Not possible, it will never happen (if nothing goes wrong....).
> Kernel has the outstanding page cache mechanism, when memory comes short,
> some cached metadata/data can be flushed back(if dirty) to disk to free 
> space.
> And re-read from disk if needed later.
> 
> So kernel don't need to load all the metadata/data into memory, and 
> that's mostly impossible for large fs.

Thanks for this explanation! I'm still looking for suggestions on how to
cope with btrfs_async_reclaim_metadata_space, which is the function
mentioned most frequently in the kworker trace.

> And one missing important informantion: kernel version.

This is kernel 3.16.7-gentoo. 

--
Peter.



* Re: btrfs stuck with lot's of files
  2014-12-02  1:50   ` Peter Volkov
@ 2014-12-02 12:48     ` Duncan
  2014-12-02 18:56       ` Ian Armstrong
  0 siblings, 1 reply; 10+ messages in thread
From: Duncan @ 2014-12-02 12:48 UTC (permalink / raw)
  To: linux-btrfs

Peter Volkov posted on Tue, 02 Dec 2014 04:50:29 +0300 as excerpted:

> On Mon, 01/12/2014 at 10:47 -0800, Robert White wrote:
>> On 12/01/2014 03:46 AM, Peter Volkov wrote:
>>  > (stuff about getting hung up trying to write to one drive)
>> 
>> That drive (/dev/sdn) is probably starting to fail.
>> (about failed drive)
> 
> Thank you Robert for the answer. It is not likely that drive fails here.
> Similar condition (write to a single drive) happens with other drives
> i.e. such write pattern may happen with any drive.
>
> After looking at what happens longer I see the following. During stuck
> single processor core is busy 100% of CPU in kernel space (some kworker
> is taking 100% CPU).

FWIW, agreed that it's unlikely to be the drive, especially if you're not 
seeing bus resets or drive errors in dmesg and smart says the drive is 
fine, as I expect it does/will.  It may be a btrfs bug or scaling issue, 
of which btrfs still has some, or it could simply be the single mode vs 
raid0 mode issue I explain below.

>> >   # btrfs filesystem df /store/
>> > Data, single: total=11.92TiB, used=10.86TiB
>> 
>> Reguardless of the above...
>> 
>> You have a terabyte of unused but allocated data storage. You probably
>> need to balance your system to un-jamb that. That's a lot of space that
>> is unavailable to the metadata (etc).
> 
> Well, I'm afraid that balance will put fs into even longer "stuck".
> 
>> ASIDE: Having your metadata set to RAID1 (as opposed to the default of
>> DUP) seems a little iffy since your data is still set to DUP.
> 
> That's true. But why data is duplicated? During btrfs volume creation
> I've set explicitly -d data single.

I believe Robert mis-wrote (thinko).  The btrfs filesystem df clearly 
shows that your data is in single mode, the data default mode, not dup 
mode, which is normally only available to metadata (not data) on a single-
device filesystem, where it is the metadata default.

However, in the original post you /did/ say raid1 for metadata, raid0 for 
data, and the above btrfs filesystem df again clearly says single, not 
raid0.

Which is very likely to be your problem.  In single mode, btrfs will 
create chunks one at a time, picking the device with the most free space 
to allocate it on.  The normal data chunk size is 1 GiB.  Because of the 
most-free-space allocation rule, with N devices (22 in your case) of the 
same size, after N (22) data chunks are allocated you'll tend to have one 
such chunk on each device.
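
The most-free-space rule described above can be sketched in a few lines of Python. This is a toy model, not the kernel allocator: the 1000 GiB device size is a made-up figure (the thread does not give device sizes), ties are broken by lowest device index purely for determinism, and `allocate_chunks` is a hypothetical name.

```python
def allocate_chunks(free_gib, n_chunks, chunk_gib=1):
    """Place n_chunks one at a time on whichever device has the most free space.

    Returns the list of device indexes chosen, in allocation order.
    """
    free = list(free_gib)  # work on a copy so the caller's list is untouched
    placement = []
    for _ in range(n_chunks):
        # max() returns the first maximum, so ties go to the lowest index.
        dev = max(range(len(free)), key=lambda i: free[i])
        if free[dev] < chunk_gib:
            raise RuntimeError("no device has room for another chunk")
        free[dev] -= chunk_gib
        placement.append(dev)
    return placement


# 22 equal devices, hypothetical 1000 GiB each.
devices = [1000] * 22
placement = allocate_chunks(devices, 22)
print("devices used by the first 22 chunks:", sorted(set(placement)))
print("device for chunk 23:", allocate_chunks(devices, 23)[-1])
```

In this model the first 22 chunks land on 22 different devices, and the 23rd wraps around to a device that already holds one.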

Each of these 1 GiB chunks (along with space freed up by normal delete 
activity in other allocated data chunks) will be filled before another is 
allocated.

Which will mean you're writing a GiB worth of data to one device before 
you switch to the next one.  With your mostly sub-MiB file write pattern, 
that's probably 1500-2000 files written to a chunk on that single device, 
before another chunk is allocated on the next device.

Thus all your activity on that single device!
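
As a rough check of that files-per-chunk estimate, here is the arithmetic as a small Python sketch. It assumes "313K" and "800K" mean KiB, that files pack into a chunk with no overhead, and, for the mixed case, a 50/50 split between the two sizes; the post gives no actual ratio, so that last figure is illustrative only.

```python
CHUNK_KIB = 1024 * 1024  # one 1 GiB data chunk, expressed in KiB


def files_per_chunk(file_kib, chunk_kib=CHUNK_KIB):
    """Whole files of the given size that fit in one chunk (no overhead modelled)."""
    return int(chunk_kib // file_kib)


print(files_per_chunk(313))              # 3350
print(files_per_chunk(800))              # 1310
print(files_per_chunk((313 + 800) / 2))  # 1884, assuming an even mix
```

So depending on the mix, somewhere between about 1300 and 3350 files go to one device before the allocator moves on.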

In raid0 mode, by contrast, the same 1 GiB chunks will be allocated on 
each device, but a stripe of chunks will be allocated across all devices 
(22 in your case) at the same time, and data being written is broken up 
into much smaller per-device strips.  I'm not sure what the actual 
per-device strip size is in raid0 mode, but it's *WELL* under a GiB and 
I believe in the KiB, not MiB, range.  It might be 128 KiB, the 
compression block size when the compress mount option is used.
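
To make the striping effect concrete, here is a small Python sketch that counts how many devices a single contiguous write would touch. The strip size is left as a parameter because this thread does not pin it down; 64 KiB and 128 KiB are tried purely as examples. The model assumes the write starts on a strip boundary and ignores how btrfs actually batches writes; `devices_touched` is a made-up helper name.

```python
KIB = 1024


def devices_touched(write_bytes, strip_bytes, n_devices, start_offset=0):
    """Number of distinct devices hit by one contiguous write under raid0-style striping."""
    first_strip = start_offset // strip_bytes
    last_strip = (start_offset + write_bytes - 1) // strip_bytes
    n_strips = last_strip - first_strip + 1
    # Strips rotate round-robin over the devices, so the count caps at n_devices.
    return min(n_strips, n_devices)


for strip_kib in (64, 128):
    for file_kib in (313, 800):
        n = devices_touched(file_kib * KIB, strip_kib * KIB, n_devices=22)
        print(f"strip {strip_kib:3d} KiB, file {file_kib:3d} KiB -> {n} devices")
```

Even with the larger example strip size, a single file of either size is spread over several devices instead of landing on one.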

Obviously were you using raid0 data, you'd see the load spread out at 
least somewhat better.  But the df says it's single, not raid0.

To get raid0 mode you can use a balance with filters (see the wiki or 
recent btrfs-balance manpage), or blow away the existing filesystem and 
create a new one, setting --data raid0 when you mkfs.btrfs, and restore 
from backups (which you're already prepared to do if you value your data 
in any case[1]).

That missing btrfs filesystem show, due to the terminating / in /store/ 
(simply /store should work) is somewhat frustrating here, as it'd show 
per-device sizes and utilization.  Assuming near same-sized devices, with 
11 TiB of data being far greater than the 1 GiB data chunk size times 22 
devices I'd guess you're pretty evened out, utilization-wise, but the 
output from both show and df is necessary to get the full story.

>> FUTHER ASIDE: raid1 metadata and raid5 data might be good for you given
>> 22 volumes and 10% empty empty space it would only cost you half of
>> your existing empty space. If you don't RAID your data, there is no
>> real point to putting your metadata in RAID.
> 
> Is raid5 ready for use? As I read post[1] mentioned on[2] it is still
> some way to make it stable.

You are absolutely correct.  I'd strongly recommend staying AWAY from 
btrfs raid5/6 modes at this time.  While Robert is becoming an active 
regular and has the technical background to point out some things others 
miss, he's still reasonably new to this list and may not have been aware 
of the incomplete status of raid5/6 modes at this time.

Effectively btrfs raid56 (called raid56, no slash, in btrfs lingo, 
because it's the same code that handles both) at this time can be 
considered a slower raid0, with parity strips that are written but not 
able to be used for full recovery at this point, that will "magically" be 
upgraded to raid56 when the btrfs raid56 recovery code is complete.  
Operationally it works fine, and the parity strips are indeed written.  
It's the scrub and recovery code that's not yet complete.  Which means 
consider it a raid0 in terms of recovery, a total loss if a single device 
is lost, and have your backups and/or willingness to simply say bye to 
the data if a device is lost prepared accordingly, and you won't be 
caught unprepared.

Which since you're using single mode now but thought you were using raid0 
mode already, isn't far from your present situation in any case.  So you 
might actually want to think about raid56 modes if you do a mkfs.btrfs 
for some reason, since you're already going to be prepared for a raid0 
level meltdown, loss of all data that's not backed up, and while you'd 
not get a lot of benefit from it right now, you /would/ get the automatic 
upgrade to actually /recoverable/ raid56 when that code is deployed.

The other alternative, if your devices and thus filesystem size are big 
enough (> 1 TiB per device, > 22 TiB total), would be raid10 mode for the 
data.  Btrfs raid1 and raid10 are exactly two-way, so you'd have 11-way 
striping instead of the 22-way you'd have with raid0 or the effective 
single-device speed you have now due to single-mode data, but would also 
have the two-way mirroring.  In addition to the normal benefits of two-way 
mirroring, that lets you take advantage of btrfs checksumming and data 
integrity features as well, reading from the good copy (and rewriting the 
bad one) if the first copy found doesn't match its checksum.  If I had the 
capacity, raid10 would be my preferred mode here, but it /does/ mean 
halving the effective capacity of the filesystem.
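
The capacity trade-off between the profiles discussed in this thread can be sketched as simple fractions. This Python snippet assumes 22 equal-sized devices and ignores metadata and allocation overhead; `usable_fraction` is a hypothetical helper, not something btrfs-progs provides.

```python
def usable_fraction(profile, n_devices):
    """Fraction of raw capacity available for data, assuming equal-sized devices."""
    if profile in ("single", "raid0"):
        return 1.0  # no redundancy
    if profile in ("raid1", "raid10"):
        return 0.5  # exactly two copies of everything
    if profile == "raid5":
        return (n_devices - 1) / n_devices  # one device's worth of parity
    if profile == "raid6":
        return (n_devices - 2) / n_devices  # two devices' worth of parity
    raise ValueError(f"unknown profile: {profile}")


for profile in ("single", "raid0", "raid10", "raid5", "raid6"):
    print(f"{profile:7s} {usable_fraction(profile, 22):.3f}")
```

With 22 devices, raid5 gives up 1/22 of the raw space (about 4.5%), which is in line with the earlier remark that it would cost roughly half of the free space, while raid10 gives up half of everything.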


Hope that helps and best wishes from a fellow gentooer! =:^)

---
[1] Backups:  While btrfs isn't entirely experimental any more, it's 
still not entirely stable either, and data eating bugs can and do 
happen.  As such, the sysadmin's rule of thumb that says if you don't 
have a backup, you don't care about your data, and an untested backup is 
not a backup, applies even more than it does when your data is on a fully 
mature filesystem.  

Of course the same applies to raid0, so the general btrfs status isn't a 
big change from that in any case and I expect you either already have 
good backups or are prepared to simply lose the data if a device goes bad 
already.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: btrfs stuck with lot's of files
  2014-12-02 12:48     ` Duncan
@ 2014-12-02 18:56       ` Ian Armstrong
  2014-12-02 22:42         ` Duncan
  0 siblings, 1 reply; 10+ messages in thread
From: Ian Armstrong @ 2014-12-02 18:56 UTC (permalink / raw)
  To: linux-btrfs

On Tue, 2 Dec 2014 12:48:21 +0000 (UTC)
Duncan <1i5t5.duncan@cox.net> wrote:

> Peter Volkov posted on Tue, 02 Dec 2014 04:50:29 +0300 as excerpted:
> 
> > On Mon, 01/12/2014 at 10:47 -0800, Robert White wrote:
> >> On 12/01/2014 03:46 AM, Peter Volkov wrote:
> >>  > (stuff about getting hung up trying to write to one drive)
> >> 
> >> That drive (/dev/sdn) is probably starting to fail.
> >> (about failed drive)
> > 
> > Thank you Robert for the answer. It is not likely that drive fails
> > here. Similar condition (write to a single drive) happens with
> > other drives i.e. such write pattern may happen with any drive.
> >
> > After watching it for longer, I see the following: during a
> > stall, a single processor core is 100% busy in kernel space
> > (some kworker is taking 100% CPU).
> 
> FWIW, agreed that it's unlikely to be the drive, especially if you're
> not seeing bus resets or drive errors in dmesg and smart says the
> drive is fine, as I expect it does/will.  It may be a btrfs bug or
> scaling issue, of which btrfs still has some, or it could simply be
> the single mode vs raid0 mode issue I explain below.

I encountered a similar problem here a few days ago on a btrfs raid1
partition while using rsync to clone a (~30GB) directory.

Everything started fine, but I came back an hour later to find rsync had
apparently stalled at about 20% with cpu usage at 100% on a single
kworker thread. I was able to kill rsync eventually, and after a while
(don't know how long, but >10 minutes) cpu usage returned to normal.
Restarting rsync resulted in kworker at 100% cpu in less than a minute.
Once stalled there was little drive access happening. Another raid1
partition (mdadm/ext4) on the same drive pair was having no problems.
Nothing showed in the system logs.

In this instance I'd forgotten to delete a temporary 500GB file before
starting rsync, so although recently balanced (musage=80/dusage=80) it
was running at near capacity.

After a reboot, deleting the 500GB file & running balance, everything
returned to normal. Ran rsync again & it completed fine.

Running slackware current, with Kernel 3.16.4

# btrfs filesystem df /mnt/general
Data, RAID1: total=1.38TiB, used=1.38TiB
System, RAID1: total=32.00MiB, used=256.00KiB
Metadata, RAID1: total=6.00GiB, used=4.67GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

# btrfs filesystem show /mnt/general
Label: none  uuid: 592376ea-769f-4abb-915e-aa5e49162d90
        Total devices 2 FS bytes used 1.38TiB
        devid    1 size 1.79TiB used 1.39TiB path /dev/sda4
        devid    2 size 1.79TiB used 1.39TiB path /dev/sdd4

Btrfs v3.17.2
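
For readers following along: in `btrfs filesystem df` output, `total` is the space allocated to chunks of that type and `used` is how much of those chunks holds data. A minimal sketch of turning that output into fill percentages is below; the names `parse_df` and `UNITS` are made up for this illustration, and the regular expression only assumes the line layout shown above.

```python
import re

# Binary unit multipliers as printed by btrfs-progs in the output above.
UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}

# Matches lines such as "Data, RAID1: total=1.38TiB, used=1.38TiB".
LINE = re.compile(
    r"^(?P<kind>\w+), (?P<profile>\w+): "
    r"total=(?P<total>[\d.]+)(?P<tu>[KMGT]iB|B), "
    r"used=(?P<used>[\d.]+)(?P<uu>[KMGT]iB|B)$"
)


def parse_df(text):
    """Return (kind, profile, percent_used) for each recognised line."""
    rows = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue  # skip anything that is not a df line
        total = float(m["total"]) * UNITS[m["tu"]]
        used = float(m["used"]) * UNITS[m["uu"]]
        pct = 100.0 * used / total if total else 0.0
        rows.append((m["kind"], m["profile"], pct))
    return rows


SAMPLE = """\
Data, RAID1: total=1.38TiB, used=1.38TiB
System, RAID1: total=32.00MiB, used=256.00KiB
Metadata, RAID1: total=6.00GiB, used=4.67GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
"""

if __name__ == "__main__":
    for kind, profile, pct in parse_df(SAMPLE):
        print(f"{kind:14s} {profile:7s} {pct:6.1f}% of allocated chunks used")
```

At the two-decimal precision btrfs prints, the data chunks here are completely full, which fits the "running at near capacity" description.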

-- 
Ian


* Re: btrfs stuck with lot's of files
  2014-12-02 18:56       ` Ian Armstrong
@ 2014-12-02 22:42         ` Duncan
  0 siblings, 0 replies; 10+ messages in thread
From: Duncan @ 2014-12-02 22:42 UTC (permalink / raw)
  To: linux-btrfs

Ian Armstrong posted on Tue, 02 Dec 2014 18:56:13 +0000 as excerpted:

> On Tue, 2 Dec 2014 12:48:21 +0000 (UTC)
> Duncan <1i5t5.duncan@cox.net> wrote:
> 
>> FWIW, agreed that it's unlikely to be the drive, especially if you're
>> not seeing bus resets or drive errors in dmesg and smart says the drive
>> is fine, as I expect it does/will.  It may be a btrfs bug or scaling
>> issue, of which btrfs still has some, or it could simply be the single
>> mode vs raid0 mode issue I explain below.
> 
> I encountered a similar problem here a few days ago on a btrfs raid1
> partition while using rsync to clone a (~30GB) directory.
> 
> Everything started fine, but I came back an hour later to find rsync had
> apparently stalled at about 20% with cpu usage at 100% on a single
> kworker thread. I was able to kill rsync eventually, and after a while
> (don't know how long, but >10 minutes) cpu usage returned to normal.
> Restarting rsync resulted in kworker at 100% cpu in less than a minute.
> Once stalled there was little drive access happening. Another raid1
> partition (mdadm/ext4) on the same drive pair was having no problems.
> Nothing showed in the system logs.
> 
> In this instance I'd forgotten to delete a temporary 500GB file before
> starting rsync, so although recently balanced (musage=80/dusage=80) it
> was running at near capacity.
> 
> After a reboot, deleting the 500GB file & running balance, everything
> returned to normal. Ran rsync again & it completed fine.
> 
> Running slackware current, with Kernel 3.16.4

FWIW that was my point -- there are still such bugs out there, often 
corner-case so they don't affect most folks most of the time, but out 
there.

I had a similar stall recently, a kworker stuck at 100% that went away 
after I killed whatever app had triggered the problem (pan, the news 
program I'm writing this with, as it happens).  In my case I chalked it 
up to a known corner-case bug in my slightly old 3.17.0 kernel (my use-
case doesn't do read-only snapshots so I'm not affected by that known bug 
that effectively blacklists 3.17.0 for some users; this would have been a 
different one).  I don't /know/ it was that bug, but it most likely was, 
as it's a known but rare corner-case that AFAIK is already fixed in the 
late 3.18-rcs.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Reiterate: btrfs stuck with lot's of files
  2014-12-02  1:33 ` Qu Wenruo
  2014-12-02  2:00   ` Peter Volkov
@ 2014-12-04 22:58   ` Peter Volkov
  2014-12-04 23:55     ` Chris Murphy
  1 sibling, 1 reply; 10+ messages in thread
From: Peter Volkov @ 2014-12-04 22:58 UTC (permalink / raw)
  To: linux-btrfs@vger.kernel.org

Hi again, guys. Looking at this issue, I suspect this is a bug in btrfs.
We'll have to clean up this installation soon, so if there are any
requests for debugging, please ask. I'll try to reiterate what was said
in this thread.

Short story: a btrfs filesystem made of 22 1TB disks with lots of files
(~30240000). Write load is 25 MB/s. After some time the filesystem
became unable to cope with this load. At that point `sync` takes ages
to finish and shutdown -r hangs (I guess related to sync).

I also see one kernel kworker that is the main suspect for this
behavior: it takes 100% of a CPU core all the time, jumping from core
to core. At the same time, according to iostat, write/read speed is
close to zero and everything is stuck.

Citing some details from previous messages:

> > top - 13:10:58 up 1 day,  9:26,  5 users,  load average: 157.76, 156.61, 149.29
> > Tasks: 235 total,   2 running, 233 sleeping,   0 stopped,   0 zombie
> > %Cpu(s): 19.8 us, 15.0 sy,  0.0 ni, 60.7 id,  3.9 wa,  0.0 hi,  0.6 si, 0.0 st
> > KiB Mem:  65922104 total, 65414856 used,   507248 free,     1844 buffers
> > KiB Swap:        0 total,        0 used,        0 free. 62570804 cached Mem
> >
> >    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
> >   8644 root      20   0       0      0      0 R  96.5  0.0 127:21.95 kworker/u16:16
> >   5047 dvr       20   0 6884292 122668   4132 S   6.4  0.2 258:59.49 dvrserver
> > 30223 root      20   0   20140   2600   2132 R   6.4  0.0   0:00.01 top
> >      1 root      20   0    4276   1628   1524 S   0.0  0.0   0:40.19 init
> >
> > There are about 300 threads on server, some of which are writing on disk.
> > A bit information about this btrfs filesystem: this is 22 disk file
> > system with raid1 for metadata and raid0 for data:
> >
> >   # btrfs filesystem df /store/
> > Data, single: total=11.92TiB, used=10.86TiB
> > System, RAID1: total=8.00MiB, used=1.27MiB
> > System, single: total=4.00MiB, used=0.00B
> > Metadata, RAID1: total=46.00GiB, used=33.49GiB
> > Metadata, single: total=8.00MiB, used=0.00B
> > GlobalReserve, single: total=512.00MiB, used=128.00KiB
> >   # btrfs property get /store/
> > ro=false
> > label=store
> >   # btrfs device stats /store/
> > (shows all zeros)
> >   # btrfs balance status /store/
> > No balance found on '/store/'

 # btrfs filesystem show
Label: 'store'  uuid: 296404d1-bd3f-417d-8501-02f8d7906bcf
	Total devices 22 FS bytes used 6.50TiB
	devid    1 size 931.51GiB used 558.02GiB path /dev/sdb
	devid    2 size 931.51GiB used 559.00GiB path /dev/sdc
	devid    3 size 931.51GiB used 559.00GiB path /dev/sdd
	devid    4 size 931.51GiB used 559.00GiB path /dev/sde
	devid    5 size 931.51GiB used 559.00GiB path /dev/sdf
	devid    6 size 931.51GiB used 559.00GiB path /dev/sdg
	devid    7 size 931.51GiB used 559.00GiB path /dev/sdh
	devid    8 size 931.51GiB used 559.00GiB path /dev/sdi
	devid    9 size 931.51GiB used 559.00GiB path /dev/sdj
	devid   10 size 931.51GiB used 559.00GiB path /dev/sdk
	devid   11 size 931.51GiB used 559.00GiB path /dev/sdl
	devid   12 size 931.51GiB used 559.00GiB path /dev/sdm
	devid   13 size 931.51GiB used 559.00GiB path /dev/sdn
	devid   14 size 931.51GiB used 559.00GiB path /dev/sdo
	devid   15 size 931.51GiB used 559.00GiB path /dev/sdp
	devid   16 size 931.51GiB used 559.00GiB path /dev/sdq
	devid   17 size 931.51GiB used 559.00GiB path /dev/sdr
	devid   18 size 931.51GiB used 559.00GiB path /dev/sds
	devid   19 size 931.51GiB used 559.00GiB path /dev/sdt
	devid   20 size 931.51GiB used 559.00GiB path /dev/sdu
	devid   21 size 931.51GiB used 559.01GiB path /dev/sdv
	devid   22 size 931.51GiB used 560.01GiB path /dev/sdw

Btrfs v3.17.1
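
As a side note, the per-device `used` column above is chunk allocation, not file data, so summing it shows how much raw space is still unallocated. A quick sketch of that arithmetic, using the figures copied from the listing (the helper name `allocation_summary` is made up for this illustration):

```python
# Per-device figures copied from the `btrfs filesystem show` listing above.
SIZE_GIB = 931.51
USED_GIB = [558.02] + [559.00] * 19 + [559.01, 560.01]  # devid 1 .. 22


def allocation_summary(size_gib, used_gib):
    """Return (allocated_gib, raw_gib, allocated_fraction) for the volume."""
    allocated = sum(used_gib)
    raw = size_gib * len(used_gib)
    return allocated, raw, allocated / raw


if __name__ == "__main__":
    allocated, raw, frac = allocation_summary(SIZE_GIB, USED_GIB)
    print(f"devices:     {len(USED_GIB)}")
    print(f"allocated:   {allocated / 1024:.2f} TiB")
    print(f"raw size:    {raw / 1024:.2f} TiB")
    print(f"unallocated: {(raw - allocated) / 1024:.2f} TiB ({1 - frac:.0%})")
```

By this sum roughly 60% of the raw space is allocated to chunks, so new chunks can still be created on every device.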

> > iostat 1 exposes following problem:
> >
> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >            16.96    0.00   17.09   65.95    0.00    0.00
> >
> > Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> > sda               0.00         0.00         0.00          0          0
> > sdc               0.00         0.00         0.00          0          0
> > sdb               0.00         0.00         0.00          0          0
> > sde               0.00         0.00         0.00          0          0
> > sdd               0.00         0.00         0.00          0          0
> > sdf               0.00         0.00         0.00          0          0
> > sdg               0.00         0.00         0.00          0          0
> > sdj               0.00         0.00         0.00          0          0
> > sdh               0.00         0.00         0.00          0          0
> > sdk               0.00         0.00         0.00          0          0
> > sdi               1.00         0.00       200.00          0        200
> > sdl               0.00         0.00         0.00          0          0
> > sdn              48.00         0.00     17260.00          0      17260
> > sdm               0.00         0.00         0.00          0          0
> > sdp               0.00         0.00         0.00          0          0
> > sdo               0.00         0.00         0.00          0          0
> > sdq               0.00         0.00         0.00          0          0
> > sdr               0.00         0.00         0.00          0          0
> > sds               0.00         0.00         0.00          0          0
> > sdt               0.00         0.00         0.00          0          0
> > sdv               0.00         0.00         0.00          0          0
> > sdw               0.00         0.00         0.00          0          0
> > sdu               0.00         0.00         0.00          0          0

That was the load profile I saw at the time. The write load moved from
disk to disk over time, so I do not suspect a broken disk. Currently
the write profile is different:
https://drive.google.com/file/d/0BygFL6N3ZVUAVmxaZ1Q5VTZpSGc/view?usp=sharing
Sometimes it looks like the above, sometimes it is all zeros; most of
the time the load is very low.

> > write goes to one disk. I've tried to debug what's going in kworker and
> > did
> >
> > $ echo workqueue:workqueue_queue_work > /sys/kernel/debug/tracing/set_event
> > $ cat /sys/kernel/debug/tracing/trace_pipe > trace_pipe.out2

I've put the result here:
https://drive.google.com/file/d/0BygFL6N3ZVUAMWxCQ0tDREE1Uzg/view?usp=sharing
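
For anyone who wants to dig through a capture like this, a small script can tally which work functions are being queued most often. This is only a sketch: it assumes each `workqueue_queue_work` event line contains a `function=<symbol>` field, which should be checked against the actual capture, and the sample lines and function names below are invented for the illustration.

```python
import re
from collections import Counter

# Assumed event layout: "... workqueue_queue_work: ... function=<symbol> ...".
FUNC = re.compile(r"workqueue_queue_work:.*?\bfunction=(\S+)")


def count_functions(lines):
    """Count how often each work function appears in the trace lines."""
    counts = Counter()
    for line in lines:
        m = FUNC.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts


# Invented sample lines; the function names are placeholders, not real symbols.
SAMPLE = [
    "kworker/u16:16-8644 [003] d... 100.000001: workqueue_queue_work: "
    "work struct=ffff880000000001 function=example_work_a "
    "workqueue=ffff880000000002 req_cpu=16 cpu=3",
    "kworker/u16:16-8644 [003] d... 100.000002: workqueue_queue_work: "
    "work struct=ffff880000000003 function=example_work_a "
    "workqueue=ffff880000000002 req_cpu=16 cpu=3",
    "dvrserver-5047 [001] d... 100.000003: workqueue_queue_work: "
    "work struct=ffff880000000004 function=example_work_b "
    "workqueue=ffff880000000005 req_cpu=16 cpu=1",
]

if __name__ == "__main__":
    for name, n in count_functions(SAMPLE).most_common(10):
        print(f"{n:8d}  {name}")
```

To run it against a real capture, pass the open file instead of the sample, e.g. `count_functions(open("trace_pipe.out2"))`.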

> > Server has 64Gb of RAM. 
Kernel is 3.16.7-gentoo.

--
Peter.




* Re: Reiterate: btrfs stuck with lot's of files
  2014-12-04 22:58   ` Reiterate: " Peter Volkov
@ 2014-12-04 23:55     ` Chris Murphy
  0 siblings, 0 replies; 10+ messages in thread
From: Chris Murphy @ 2014-12-04 23:55 UTC (permalink / raw)
  To: linux-btrfs@vger.kernel.org

On Thu, Dec 4, 2014 at 3:58 PM, Peter Volkov <pva@gentoo.org> wrote:
> Hi again, guys. Looking at this issue, I suspect this is a bug in btrfs.
> We'll have to clean up this installation soon, so if there are any
> requests for debugging, please ask. I'll try to reiterate what was said
> in this thread.
>
> Short story: a btrfs filesystem made of 22 1TB disks with lots of files
> (~30240000). Write load is 25 MB/s. After some time the filesystem
> became unable to cope with this load. At that point `sync` takes ages
> to finish and shutdown -r hangs (I guess related to sync).
>
> I also see one kernel kworker that is the main suspect for this
> behavior: it takes 100% of a CPU core all the time, jumping from core
> to core. At the same time, according to iostat, write/read speed is
> close to zero and everything is stuck.
>
> Citing some details from previous messages:
>
>> > top - 13:10:58 up 1 day,  9:26,  5 users,  load average: 157.76, 156.61, 149.29
>> > Tasks: 235 total,   2 running, 233 sleeping,   0 stopped,   0 zombie
>> > %Cpu(s): 19.8 us, 15.0 sy,  0.0 ni, 60.7 id,  3.9 wa,  0.0 hi,  0.6 si, 0.0 st
>> > KiB Mem:  65922104 total, 65414856 used,   507248 free,     1844 buffers
>> > KiB Swap:        0 total,        0 used,        0 free. 62570804 cached Mem
>> >
>> >    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>> >   8644 root      20   0       0      0      0 R  96.5  0.0 127:21.95 kworker/u16:16
>> >   5047 dvr       20   0 6884292 122668   4132 S   6.4  0.2 258:59.49 dvrserver
>> > 30223 root      20   0   20140   2600   2132 R   6.4  0.0   0:00.01 top
>> >      1 root      20   0    4276   1628   1524 S   0.0  0.0   0:40.19 init
>> >
>> > There are about 300 threads on server, some of which are writing on disk.
>> > A bit information about this btrfs filesystem: this is 22 disk file
>> > system with raid1 for metadata and raid0 for data:
>> >
>> >   # btrfs filesystem df /store/
>> > Data, single: total=11.92TiB, used=10.86TiB
>> > System, RAID1: total=8.00MiB, used=1.27MiB
>> > System, single: total=4.00MiB, used=0.00B
>> > Metadata, RAID1: total=46.00GiB, used=33.49GiB
>> > Metadata, single: total=8.00MiB, used=0.00B
>> > GlobalReserve, single: total=512.00MiB, used=128.00KiB
>> >   # btrfs property get /store/
>> > ro=false
>> > label=store
>> >   # btrfs device stats /store/
>> > (shows all zeros)
>> >   # btrfs balance status /store/
>> > No balance found on '/store/'
>
>  # btrfs filesystem show
> Label: 'store'  uuid: 296404d1-bd3f-417d-8501-02f8d7906bcf
>         Total devices 22 FS bytes used 6.50TiB
>         devid    1 size 931.51GiB used 558.02GiB path /dev/sdb
>         devid    2 size 931.51GiB used 559.00GiB path /dev/sdc
>         devid    3 size 931.51GiB used 559.00GiB path /dev/sdd
>         devid    4 size 931.51GiB used 559.00GiB path /dev/sde
>         devid    5 size 931.51GiB used 559.00GiB path /dev/sdf
>         devid    6 size 931.51GiB used 559.00GiB path /dev/sdg
>         devid    7 size 931.51GiB used 559.00GiB path /dev/sdh
>         devid    8 size 931.51GiB used 559.00GiB path /dev/sdi
>         devid    9 size 931.51GiB used 559.00GiB path /dev/sdj
>         devid   10 size 931.51GiB used 559.00GiB path /dev/sdk
>         devid   11 size 931.51GiB used 559.00GiB path /dev/sdl
>         devid   12 size 931.51GiB used 559.00GiB path /dev/sdm
>         devid   13 size 931.51GiB used 559.00GiB path /dev/sdn
>         devid   14 size 931.51GiB used 559.00GiB path /dev/sdo
>         devid   15 size 931.51GiB used 559.00GiB path /dev/sdp
>         devid   16 size 931.51GiB used 559.00GiB path /dev/sdq
>         devid   17 size 931.51GiB used 559.00GiB path /dev/sdr
>         devid   18 size 931.51GiB used 559.00GiB path /dev/sds
>         devid   19 size 931.51GiB used 559.00GiB path /dev/sdt
>         devid   20 size 931.51GiB used 559.00GiB path /dev/sdu
>         devid   21 size 931.51GiB used 559.01GiB path /dev/sdv
>         devid   22 size 931.51GiB used 560.01GiB path /dev/sdw
>
> Btrfs v3.17.1
>
>> > iostat 1 exposes following problem:
>> >
>> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>> >            16.96    0.00   17.09   65.95    0.00    0.00
>> >
>> > Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
>> > sda               0.00         0.00         0.00          0          0
>> > sdc               0.00         0.00         0.00          0          0
>> > sdb               0.00         0.00         0.00          0          0
>> > sde               0.00         0.00         0.00          0          0
>> > sdd               0.00         0.00         0.00          0          0
>> > sdf               0.00         0.00         0.00          0          0
>> > sdg               0.00         0.00         0.00          0          0
>> > sdj               0.00         0.00         0.00          0          0
>> > sdh               0.00         0.00         0.00          0          0
>> > sdk               0.00         0.00         0.00          0          0
>> > sdi               1.00         0.00       200.00          0        200
>> > sdl               0.00         0.00         0.00          0          0
>> > sdn              48.00         0.00     17260.00          0      17260
>> > sdm               0.00         0.00         0.00          0          0
>> > sdp               0.00         0.00         0.00          0          0
>> > sdo               0.00         0.00         0.00          0          0
>> > sdq               0.00         0.00         0.00          0          0
>> > sdr               0.00         0.00         0.00          0          0
>> > sds               0.00         0.00         0.00          0          0
>> > sdt               0.00         0.00         0.00          0          0
>> > sdv               0.00         0.00         0.00          0          0
>> > sdw               0.00         0.00         0.00          0          0
>> > sdu               0.00         0.00         0.00          0          0
>
> That was the load profile I saw at the time. The write load moved from
> disk to disk over time, so I do not suspect a broken disk. Currently
> the write profile is different:
> https://drive.google.com/file/d/0BygFL6N3ZVUAVmxaZ1Q5VTZpSGc/view?usp=sharing
> Sometimes it looks like the above, sometimes it is all zeros; most of
> the time the load is very low.
>
>> > write goes to one disk. I've tried to debug what's going in kworker and
>> > did
>> >
>> > $ echo workqueue:workqueue_queue_work > /sys/kernel/debug/tracing/set_event
>> > $ cat /sys/kernel/debug/tracing/trace_pipe > trace_pipe.out2
>
> I've put the result here:
> https://drive.google.com/file/d/0BygFL6N3ZVUAMWxCQ0tDREE1Uzg/view?usp=sharing
>

Is the Btrfs single profile expected to write to block devices in parallel?

Initially, any write is a new write rather than an overwrite, because
of COW. All writes go into a single chunk on a single device until the
chunk is full, then onto the next device with a new chunk until that
chunk is full. And so on. This behavior only changes once all space is
allocated as a data or metadata chunk on all block devices, which
actually could take some time. If there are many chunks on many
devices that are 90% full, then I don't know how Btrfs decides which
chunks it writes to. But I still don't think it's highly parallelized
like it is on XFS.
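
To make that concrete, here is a toy model of the behavior as I described it above; it is not taken from the Btrfs allocator code. The 1 GiB data chunk size and 64 KiB stripe length are assumptions, and both function names are made up for this sketch. It counts how many devices a single burst of writes would touch under each profile.

```python
CHUNK_MIB = 1024   # assumed size of one data chunk
STRIPE_KIB = 64    # assumed raid0 stripe length


def devices_touched_single(write_mib, n_devices, chunk_mib=CHUNK_MIB,
                           start_offset_mib=0):
    """Single profile as described above: fill one chunk, then the next.

    Each chunk is assumed to live on one device, so a burst touches as
    many devices as the number of chunks it spans.
    """
    first_chunk = start_offset_mib // chunk_mib
    last_chunk = (start_offset_mib + write_mib - 1) // chunk_mib
    return min(last_chunk - first_chunk + 1, n_devices)


def devices_touched_raid0(write_mib, n_devices, stripe_kib=STRIPE_KIB):
    """raid0: consecutive stripes rotate across all devices."""
    stripes = -(-write_mib * 1024 // stripe_kib)  # ceiling division
    return min(stripes, n_devices)


if __name__ == "__main__":
    burst_mib = 25  # about one second of the reported 25 MB/s write load
    print("single:", devices_touched_single(burst_mib, 22), "device(s)")
    print("raid0: ", devices_touched_raid0(burst_mib, 22), "device(s)")
```

In this model one second of the reported load lands on a single device with the single profile and on all 22 with raid0, which would look much like the iostat output quoted above.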

Are reads parallelized in this case? Unless reads and writes are
parallelized, the single profile isn't scalable. So before calling
something a bug, I'd wonder whether the design expects this layout,
rather than raid0, to be used for the intended use case. The chance of
a single drive dying with 22 drives in the volume is astronomically
high, probably 100% over as short as 6 months, and then what?
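
For a rough sense of scale, the chance that at least one of n drives fails within t years can be estimated as 1 - (1 - AFR)^(n*t), assuming independent failures and a constant annualized failure rate (AFR). The AFR values below are illustrative assumptions, not measurements for these drives, and the function name is made up for this sketch.

```python
def p_any_failure(n_drives, afr, years):
    """Probability that at least one drive fails within `years`.

    Assumes independent drives and a constant annualized failure rate,
    so one drive survives `years` with probability (1 - afr) ** years.
    """
    return 1.0 - (1.0 - afr) ** (n_drives * years)


if __name__ == "__main__":
    for afr in (0.02, 0.05, 0.10):  # assumed AFRs, for illustration only
        six_months = p_any_failure(22, afr, 0.5)
        one_year = p_any_failure(22, afr, 1.0)
        print(f"AFR {afr:4.0%}: 6 months {six_months:5.1%}, "
              f"1 year {one_year:5.1%}")
```

Under these assumptions a 5% AFR works out to roughly 43% over six months: short of a certainty, but far too high for a layout where one dead drive takes the filesystem with it.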

I'm unaware of either existing or planned functionality to allow such
a volume to remain functional: to do that, Btrfs needs to delete all
affected files so they're no longer referenced. I've actually thought
of this layout for use with GlusterFS and Ceph, in such a way that a
drive can die and Btrfs informs the distributed filesystem above it
what files are no longer available by this particular storage brick;
next the brick's filesystem can be "cleaned up" by deleting all
missing files, then deleting the missing device, thereby stabilizing
the existing fs. The distributed file system starts replicating
missing files according to its policies.

But right now, if any device dies in your example layout, the
filesystem is functionally lost. Yes, you can get the remaining data
out of it, but it's in a sense 1/22nd broken and not fixable as far as
I know. But I haven't tried fixing this manually, e.g. doing a scrub to
get a listing of missing files, deleting those files, adding a new
device, and deleting the missing device. If the missing files aren't
explicitly deleted, I think the fs still has references to them and
will just return read/corruption errors rather than denying the file
even exists.

-- 
Chris Murphy


Thread overview: 10+ messages
2014-12-01 11:46 btrfs stuck with lot's of files Peter Volkov
2014-12-01 18:47 ` Robert White
2014-12-02  1:50   ` Peter Volkov
2014-12-02 12:48     ` Duncan
2014-12-02 18:56       ` Ian Armstrong
2014-12-02 22:42         ` Duncan
2014-12-02  1:33 ` Qu Wenruo
2014-12-02  2:00   ` Peter Volkov
2014-12-04 22:58   ` Reiterate: " Peter Volkov
2014-12-04 23:55     ` Chris Murphy
