* XFS / xfs_repair - problem reading very large sparse files on very large filesystem
@ 2021-11-04 9:09 Nikola Ciprich
2021-11-04 16:20 ` Eric Sandeen
2021-11-04 23:04 ` Dave Chinner
0 siblings, 2 replies; 10+ messages in thread
From: Nikola Ciprich @ 2021-11-04 9:09 UTC (permalink / raw)
To: linux-xfs; +Cc: nikola.ciprich
Hello fellow XFS users and developers,
we've stumbled upon a strange problem which I think might lie somewhere
in the XFS code.
we have a very large ceph-based storage, on top of which there is a 1.5PiB volume
with an XFS filesystem. This contains very large (i.e. 500TB) sparse files,
partially filled with data.
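for illustration (the file name here is made up), such a sparse file has a huge
apparent size but almost nothing actually allocated:

```shell
# apparent size vs. allocated blocks of a sparse file (ours are ~500TB)
f=/tmp/sparse_demo.img
truncate -s 1G "$f"                                  # apparent size: 1 GiB
stat -c 'apparent=%s bytes allocated=%b blocks' "$f" # allocated: ~0 blocks
rm -f "$f"
```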
the problem is that trying to read those files leads to processes blocked in D
state showing very, very bad performance - ~200KiB/s, 50 IOPS.
I tried running xfs_repair on the volume, but it seems to behave in a very
similar way - it very quickly gets into an almost stalled state, making
almost no progress..
[root@spbstdnas ~]# xfs_repair -P -t 60 -v -v -v -v /dev/sdk
Phase 1 - find and verify superblock...
- max_mem = 154604838, icount = 9664, imem = 37, dblock = 382464425984, dmem = 186750208
Memory available for repair (150981MB) may not be sufficient.
At least 182422MB is needed to repair this filesystem efficiently
If repair fails due to lack of memory, please
increase system RAM and/or swap space to at least 364844MB.
- block cache size set to 4096 entries
Phase 2 - using internal log
- zero log...
zero_log: head block 1454674 tail block 1454674
- scan filesystem freespace and inode maps...
- found root inode chunk
libxfs_bcache: 0x26aa3a0
Max supported entries = 4096
Max supported entries = 4096
Max utilized entries = 4096
Active entries = 4048
Hash table size = 512
Hits = 0
Misses = 76653
Hit ratio = 0.00
MRU 0 entries = 4048 (100%)
MRU 1 entries = 0 ( 0%)
MRU 2 entries = 0 ( 0%)
MRU 3 entries = 0 ( 0%)
MRU 4 entries = 0 ( 0%)
MRU 5 entries = 0 ( 0%)
MRU 6 entries = 0 ( 0%)
MRU 7 entries = 0 ( 0%)
MRU 8 entries = 0 ( 0%)
MRU 9 entries = 0 ( 0%)
MRU 10 entries = 0 ( 0%)
MRU 11 entries = 0 ( 0%)
MRU 12 entries = 0 ( 0%)
MRU 13 entries = 0 ( 0%)
MRU 14 entries = 0 ( 0%)
MRU 15 entries = 0 ( 0%)
Dirty MRU 16 entries = 0 ( 0%)
Hash buckets with 2 entries 5 ( 0%)
Hash buckets with 3 entries 11 ( 0%)
Hash buckets with 4 entries 30 ( 2%)
Hash buckets with 5 entries 36 ( 4%)
Hash buckets with 6 entries 57 ( 8%)
Hash buckets with 7 entries 90 ( 15%)
Hash buckets with 8 entries 80 ( 15%)
Hash buckets with 9 entries 74 ( 16%)
Hash buckets with 10 entries 62 ( 15%)
Hash buckets with 11 entries 31 ( 8%)
Hash buckets with 12 entries 16 ( 4%)
Hash buckets with 13 entries 10 ( 3%)
Hash buckets with 14 entries 7 ( 2%)
Hash buckets with 15 entries 2 ( 0%)
Hash buckets with 16 entries 1 ( 0%)
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
the VM has 200GB of RAM, but xfs_repair does not use more than 1GB, and the
CPU is idle. It just keeps reading at the same slow speed, ~200K/s, 50 IOPS.
I've carefully checked, and the storage itself is much, much faster: I used
blktrace to see which areas of the volume it is currently reading, and trying
fio / dd on those areas shows the device can perform much faster (as can randomly
reading any area of the volume, or random-read or sequential-read fio benchmarks)
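the spot checks were along these lines (the real runs read the sector ranges
blktrace reported on /dev/sdk; shown here against a scratch file so it is safe
to paste):

```shell
img=/tmp/spotcheck.img
# write a 32 MiB scratch region (stand-in for a blktrace-reported range)
dd if=/dev/zero of="$img" bs=1M count=32 conv=fsync 2>/dev/null
# read back from an offset inside it, like probing the device directly
dd if="$img" of=/dev/null bs=1M skip=8 count=16 2>/dev/null
rm -f "$img"
```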
I've found one very old report pretty much resembling my problem:
https://www.spinics.net/lists/xfs/msg06585.html
but it is 10 years old and didn't lead to any conclusion.
Is it possible there is still some bug common to the XFS kernel module and xfs_repair?
I tried the 5.4.135 and 5.10.31 kernels, and xfsprogs 4.5.0 and 5.13.0
(the OS is x86_64 CentOS 7)
any hints on how I could debug this further?
I'd be very grateful for any help
with best regards
nikola ciprich
--
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava
tel.: +420 591 166 214
fax: +420 596 621 273
mobil: +420 777 093 799
www.linuxbox.cz
mobil servis: +420 737 238 656
email servis: servis@linuxbox.cz
-------------------------------------
* Re: XFS / xfs_repair - problem reading very large sparse files on very large filesystem
2021-11-04 9:09 XFS / xfs_repair - problem reading very large sparse files on very large filesystem Nikola Ciprich
@ 2021-11-04 16:20 ` Eric Sandeen
2021-11-05 14:13 ` Nikola Ciprich
2021-11-04 23:04 ` Dave Chinner
1 sibling, 1 reply; 10+ messages in thread
From: Eric Sandeen @ 2021-11-04 16:20 UTC (permalink / raw)
To: Nikola Ciprich, linux-xfs
On 11/4/21 4:09 AM, Nikola Ciprich wrote:
> Hello fellow XFS users and developers,
>
> we've stumbled upon a strange problem which I think might lie somewhere
> in the XFS code.
>
> we have a very large ceph-based storage, on top of which there is a 1.5PiB volume
> with an XFS filesystem. This contains very large (i.e. 500TB) sparse files,
> partially filled with data.
>
> the problem is that trying to read those files leads to processes blocked in D
> state showing very, very bad performance - ~200KiB/s, 50 IOPS.
I'm guessing they are horrifically fragmented? What does xfs_bmap tell you
about the number of extents in one of these files?
When it is blocked, where is it blocked? (try sysrq-w)
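Something like this (the path is a placeholder for one of your big files; the
sysrq dump needs root):

```shell
f=/mnt/big/archive.bin   # placeholder: substitute one of the 500TB files
if command -v xfs_bmap >/dev/null 2>&1 && [ -e "$f" ]; then
    # one output line per extent record; a huge count means heavy fragmentation
    xfs_bmap "$f" | wc -l
else
    echo "on the NAS, run: xfs_bmap <file> | wc -l"
fi
# while a reader is stuck in D state (as root):
#   echo w > /proc/sysrq-trigger && dmesg | tail -n 200
```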
> I tried running xfs_repair on the volume, but it seems to behave in a very
> similar way - it very quickly gets into an almost stalled state, making
> almost no progress..
Perceived performance won't be fixed by repair, but...
> [root@spbstdnas ~]# xfs_repair -P -t 60 -v -v -v -v /dev/sdk
> Phase 1 - find and verify superblock...
> - max_mem = 154604838, icount = 9664, imem = 37, dblock = 382464425984, dmem = 186750208
> Memory available for repair (150981MB) may not be sufficient.
> At least 182422MB is needed to repair this filesystem efficiently
> If repair fails due to lack of memory, please
> increase system RAM and/or swap space to at least 364844MB.
... it /is/ telling you that it would like a lot more memory to do
its job.
> Phase 2 - using internal log
> - zero log...
> zero_log: head block 1454674 tail block 1454674
> - scan filesystem freespace and inode maps...
> - found root inode chunk
...
> Phase 3 - for each AG...
> - scan and clear agi unlinked lists...
> - process known inodes and perform inode discovery...
> - agno = 0
> - agno = 1
> - agno = 2
>
>
> - agno = 3
>
>
> the VM has 200GB of RAM, but xfs_repair does not use more than 1GB, and the
> CPU is idle. It just keeps reading at the same slow speed, ~200K/s, 50 IOPS.
Rather than diagnosing repair at this point, let's first see where you're
blocked when you're reading the sparse files on the filesystem as suggested
above.
-Eric
> I've carefully checked, and the storage itself is much, much faster: I used
> blktrace to see which areas of the volume it is currently reading, and trying
> fio / dd on those areas shows the device can perform much faster (as can randomly
> reading any area of the volume, or random-read or sequential-read fio benchmarks)
>
> I've found one very old report pretty much resembling my problem:
>
> https://www.spinics.net/lists/xfs/msg06585.html
>
> but it is 10 years old and didn't lead to any conclusion.
>
> Is it possible there is still some bug common to the XFS kernel module and xfs_repair?
>
> I tried the 5.4.135 and 5.10.31 kernels, and xfsprogs 4.5.0 and 5.13.0
> (the OS is x86_64 CentOS 7)
>
> any hints on how I could debug this further?
>
> I'd be very grateful for any help
>
> with best regards
>
> nikola ciprich
>
>
* Re: XFS / xfs_repair - problem reading very large sparse files on very large filesystem
2021-11-04 9:09 XFS / xfs_repair - problem reading very large sparse files on very large filesystem Nikola Ciprich
2021-11-04 16:20 ` Eric Sandeen
@ 2021-11-04 23:04 ` Dave Chinner
1 sibling, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2021-11-04 23:04 UTC (permalink / raw)
To: Nikola Ciprich; +Cc: linux-xfs
On Thu, Nov 04, 2021 at 10:09:15AM +0100, Nikola Ciprich wrote:
> Hello fellow XFS users and developers,
>
> we've stumbled upon a strange problem which I think might lie somewhere
> in the XFS code.
>
> we have a very large ceph-based storage, on top of which there is a 1.5PiB volume
> with an XFS filesystem. This contains very large (i.e. 500TB) sparse files,
> partially filled with data.
>
> the problem is that trying to read those files leads to processes blocked in D
> state showing very, very bad performance - ~200KiB/s, 50 IOPS.
It's been told to go slow... :/
> I tried running xfs_repair on the volume, but it seems to behave in a very
> similar way - it very quickly gets into an almost stalled state, making
> almost no progress..
>
> [root@spbstdnas ~]# xfs_repair -P -t 60 -v -v -v -v /dev/sdk
.... because "-P" turns off prefetching and all the IO optimisation
that comes along with the prefetching mechanisms. In effect, "-P"
means "go really slowly".
Try:
# xfs_repair -o bhash_size=101371 -o ag_stride=100 /dev/sdk
to get a good-sized buffer cache and a decent (but not excessive)
amount of concurrency in the scanning process. It may still end
up being slow if it has to single-thread walk a huge btree
(essentially pointer chasing on disk), but at least that won't hold
up all the other scanning that isn't dependent on that huge btree..
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: XFS / xfs_repair - problem reading very large sparse files on very large filesystem
2021-11-04 16:20 ` Eric Sandeen
@ 2021-11-05 14:13 ` Nikola Ciprich
2021-11-05 14:17 ` Nikola Ciprich
0 siblings, 1 reply; 10+ messages in thread
From: Nikola Ciprich @ 2021-11-05 14:13 UTC (permalink / raw)
To: Eric Sandeen; +Cc: linux-xfs, Nikola Ciprich
Hello Eric,
I'm sorry for late reply.
> I'm guessing they are horrifically fragmented? What does xfs_bmap tell you
> about the number of extents in one of these files?
unfortunately, xfs_bmap blocks on this file too:
[ +0.000321] task:xfs_io state:D stack: 0 pid:15728 ppid: 15725 flags:0x00000080
[ +0.000333] Call Trace:
[ +0.000161] __schedule+0x231/0x760
[ +0.000195] ? page_add_new_anon_rmap+0x9e/0x1f0
[ +0.000207] schedule+0x3c/0xa0
[ +0.000175] rwsem_down_write_slowpath+0x32c/0x4e0
[ +0.000216] ? get_page_from_freelist+0x190d/0x1c60
[ +0.000250] xfs_ilock_data_map_shared+0x29/0x30 [xfs]
[ +0.000312] xfs_getbmap+0xe2/0x7b0 [xfs]
[ +0.000197] ? _cond_resched+0x15/0x30
[ +0.000203] ? __kmalloc_node+0x4a4/0x4e0
[ +0.000230] xfs_ioc_getbmap+0xf5/0x270 [xfs]
[ +0.000260] xfs_file_ioctl+0x4da/0xbc0 [xfs]
[ +0.000205] ? __mod_memcg_lruvec_state+0x21/0x100
[ +0.000203] ? page_add_new_anon_rmap+0x9e/0x1f0
[ +0.000209] ? __raw_spin_unlock+0x5/0x10
[ +0.000188] ? __handle_mm_fault+0xbb0/0x1410
[ +0.000221] ? handle_mm_fault+0xd0/0x290
[ +0.000191] __x64_sys_ioctl+0x84/0xc0
[ +0.000181] do_syscall_64+0x33/0x40
[ +0.000188] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ +0.000213] RIP: 0033:0x7fdc81f694a7
[ +0.000192] RSP: 002b:00007ffe98c69998 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ +0.000319] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fdc81f694a7
[ +0.000311] RDX: 00000000010d6a00 RSI: ffffffffc0205838 RDI: 0000000000000003
[ +0.000322] RBP: 0000000000000020 R08: 0000000000000000 R09: 0000000000000600
[ +0.000303] R10: 0000000000000048 R11: 0000000000000246 R12: 0000000000000000
[ +0.000310] R13: 00000000010d6a00 R14: 0000000000000000 R15: 0000000000000000
>
> When it is blocked, where is it blocked? (try sysrq-w)
[ +0.016252] task:pv state:D stack: 0 pid:15507 ppid: 1161 flags:0x00004080
[ +0.000373] Call Trace:
[ +0.000177] __schedule+0x231/0x760
[ +0.000190] schedule+0x3c/0xa0
[ +0.000175] schedule_timeout+0x215/0x2b0
[ +0.000197] ? blk_mq_get_tag+0x244/0x280
[ +0.000201] __down+0x9b/0xf0
[ +0.000189] ? blk_mq_complete_request_remote+0x50/0xc0
[ +0.000223] down+0x3b/0x50
[ +0.000385] xfs_buf_lock+0x2c/0xb0 [xfs]
[ +0.000259] xfs_buf_find.isra.32+0x3d9/0x610 [xfs]
[ +0.000275] xfs_buf_get_map+0x4c/0x2e0 [xfs]
[ +0.000199] ? submit_bio+0x43/0x160
[ +0.000232] xfs_buf_read_map+0x55/0x2c0 [xfs]
[ +0.000237] ? xfs_btree_read_buf_block.constprop.40+0x95/0xd0 [xfs]
[ +0.000328] xfs_trans_read_buf_map+0x123/0x2d0 [xfs]
[ +0.000279] ? xfs_btree_read_buf_block.constprop.40+0x95/0xd0 [xfs]
[ +0.000299] xfs_btree_read_buf_block.constprop.40+0x95/0xd0 [xfs]
[ +0.000301] xfs_btree_lookup_get_block+0x95/0x170 [xfs]
[ +0.000263] ? xfs_bmap_validate_extent+0xa0/0xa0 [xfs]
[ +0.000257] xfs_btree_visit_block+0x85/0xc0 [xfs]
[ +0.000237] ? xfs_bmap_validate_extent+0xa0/0xa0 [xfs]
[ +0.000263] xfs_btree_visit_blocks+0x109/0x120 [xfs]
[ +0.000246] xfs_iread_extents+0x9f/0x170 [xfs]
[ +0.000246] ? xfs_bmapi_read+0x23b/0x2c0 [xfs]
[ +0.000233] xfs_bmapi_read+0x23b/0x2c0 [xfs]
[ +0.000214] ? _cond_resched+0x15/0x30
[ +0.000214] ? down_write+0xe/0x40
[ +0.000230] xfs_read_iomap_begin+0xea/0x1e0 [xfs]
[ +0.000228] iomap_apply+0x94/0x2d0
[ +0.000181] ? iomap_page_mkwrite_actor+0x70/0x70
[ +0.008736] ? iomap_page_mkwrite_actor+0x70/0x70
[ +0.000219] iomap_readahead+0x9a/0x150
[ +0.000207] ? iomap_page_mkwrite_actor+0x70/0x70
[ +0.000216] read_pages+0x8e/0x1f0
[ +0.000183] page_cache_ra_unbounded+0x19d/0x1f0
[ +0.000207] generic_file_buffered_read+0x3f8/0x800
[ +0.000266] xfs_file_buffered_aio_read+0x44/0xb0 [xfs]
[ +0.000280] xfs_file_read_iter+0x68/0xc0 [xfs]
[ +0.000204] new_sync_read+0x118/0x1a0
[ +0.000195] vfs_read+0xf1/0x180
[ +0.000173] ksys_read+0x59/0xd0
[ +0.000187] do_syscall_64+0x33/0x40
[ +0.000186] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ +0.000215] RIP: 0033:0x7f209bf06b40
[ +0.000179] RSP: 002b:00007ffd9d11aeb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ +0.000383] RAX: ffffffffffffffda RBX: 00007ffd9d11b0c0 RCX: 00007f209bf06b40
[ +0.000308] RDX: 0000000000020000 RSI: 00007f209c3d9010 RDI: 0000000000000003
[ +0.000309] RBP: 00007ffd9d11b0c4 R08: 0000000000000000 R09: 0000000000000004
[ +0.000308] R10: 00007ffd9d11a2a0 R11: 0000000000000246 R12: 00000000018290d0
[ +0.000318] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000020000
>
> >I tried running xfs_repair on the volume, but it seems to behave in a very
> >similar way - it very quickly gets into an almost stalled state, making
> >almost no progress..
>
> Perceived performance won't be fixed by repair, but...
>
> >[root@spbstdnas ~]# xfs_repair -P -t 60 -v -v -v -v /dev/sdk
> >Phase 1 - find and verify superblock...
> > - max_mem = 154604838, icount = 9664, imem = 37, dblock = 382464425984, dmem = 186750208
> >Memory available for repair (150981MB) may not be sufficient.
> >At least 182422MB is needed to repair this filesystem efficiently
> >If repair fails due to lack of memory, please
> >increase system RAM and/or swap space to at least 364844MB.
>
> ... it /is/ telling you that it would like a lot more memory to do
> its job.
>
> >Phase 2 - using internal log
> > - zero log...
> >zero_log: head block 1454674 tail block 1454674
> > - scan filesystem freespace and inode maps...
> > - found root inode chunk
> ...
> >Phase 3 - for each AG...
> > - scan and clear agi unlinked lists...
> > - process known inodes and perform inode discovery...
> > - agno = 0
> > - agno = 1
> > - agno = 2
> >
> >
> > - agno = 3
> >
> >the VM has 200GB of RAM, but xfs_repair does not use more than 1GB, and the
> >CPU is idle. It just keeps reading at the same slow speed, ~200K/s, 50 IOPS.
>
> Rather than diagnosing repair at this point, let's first see where you're
> blocked when you're reading the sparse files on the filesystem as suggested
> above.
OK.
please let me know if I can provide any further info
with best regards
nikola ciprich
>
> -Eric
>
> >I've carefully checked, and the storage itself is much, much faster: I used
> >blktrace to see which areas of the volume it is currently reading, and trying
> >fio / dd on those areas shows the device can perform much faster (as can randomly
> >reading any area of the volume, or random-read or sequential-read fio benchmarks)
> >
> >I've found one very old report pretty much resembling my problem:
> >
> >https://www.spinics.net/lists/xfs/msg06585.html
> >
> >but it is 10 years old and didn't lead to any conclusion.
> >
> >Is it possible there is still some bug common to the XFS kernel module and xfs_repair?
> >
> >I tried the 5.4.135 and 5.10.31 kernels, and xfsprogs 4.5.0 and 5.13.0
> >(the OS is x86_64 CentOS 7)
> >
> >any hints on how I could debug this further?
> >
> >I'd be very grateful for any help
> >
> >with best regards
> >
> >nikola ciprich
> >
> >
>
* Re: XFS / xfs_repair - problem reading very large sparse files on very large filesystem
2021-11-05 14:13 ` Nikola Ciprich
@ 2021-11-05 14:17 ` Nikola Ciprich
2021-11-05 14:56 ` Eric Sandeen
0 siblings, 1 reply; 10+ messages in thread
From: Nikola Ciprich @ 2021-11-05 14:17 UTC (permalink / raw)
To: Eric Sandeen; +Cc: linux-xfs, Nikola Ciprich
> > I'm guessing they are horrifically fragmented? What does xfs_bmap tell you
> > about the number of extents in one of these files?
>
> unfortunately, xfs_bmap blocks on this file too:
anyways, trying to run it on another similar file (10TB in size), which doesn't
seem to suffer from this problem, shows 680657 extents,
which I guess is not very good..
here's the output if it is of some use:
https://storage.linuxbox.cz/index.php/s/AyZGW5Xdfxg47f6
* Re: XFS / xfs_repair - problem reading very large sparse files on very large filesystem
2021-11-05 14:17 ` Nikola Ciprich
@ 2021-11-05 14:56 ` Eric Sandeen
2021-11-05 15:59 ` Nikola Ciprich
0 siblings, 1 reply; 10+ messages in thread
From: Eric Sandeen @ 2021-11-05 14:56 UTC (permalink / raw)
To: Nikola Ciprich, Eric Sandeen; +Cc: linux-xfs
On 11/5/21 9:17 AM, Nikola Ciprich wrote:
>>> I'm guessing they are horrifically fragmented? What does xfs_bmap tell you
>>> about the number of extents in one of these files?
>>
>> unfortunately, xfs_bmap blocks on this file too:
>
> anyways, trying to run it on another similar file (10TB in size), which doesn't
> seem to suffer from this problem, shows 680657 extents,
> which I guess is not very good..
>
> here's the output if it is of some use:
>
> https://storage.linuxbox.cz/index.php/s/AyZGW5Xdfxg47f6
Just to be clear - I think Dave and I interpreted your original email slightly
differently.
Are these large files on the 1.5PB filesystem themselves filesystem images,
or some other type of file?
And - the repair you were running was against the 1.5PB filesystem?
(see also Dave's reply in this thread)
-Eric
* Re: XFS / xfs_repair - problem reading very large sparse files on very large filesystem
2021-11-05 14:56 ` Eric Sandeen
@ 2021-11-05 15:59 ` Nikola Ciprich
2021-11-05 16:11 ` Eric Sandeen
0 siblings, 1 reply; 10+ messages in thread
From: Nikola Ciprich @ 2021-11-05 15:59 UTC (permalink / raw)
To: Eric Sandeen; +Cc: Eric Sandeen, linux-xfs, Nikola Ciprich
> >
> >here's the output if it is of some use:
> >
> >https://storage.linuxbox.cz/index.php/s/AyZGW5Xdfxg47f6
>
> Just to be clear - I think Dave and I interpreted your original email slightly
> differently.
>
> Are these large files on the 1.5PB filesystem themselves filesystem images,
> or some other type of file?
>
> And - the repair you were running was against the 1.5PB filesystem?
>
> (see also Dave's reply in this thread)
>
Hello Eric,
I was running fsck on the 1.5PB fs (I interrupted it, as it doesn't seem
to be the main problem now). The large files are archives of video files from
camera streaming software. I don't know much about them; I was told at the
beginning that all writes would be sequential, which apparently they are not,
so for new files we'll be preallocating them.
btw, the blocked read from the file whose backtrace I sent seems to have finally
started (after maybe an hour) and runs at 8-20MB/s
nik
> -Eric
>
* Re: XFS / xfs_repair - problem reading very large sparse files on very large filesystem
2021-11-05 15:59 ` Nikola Ciprich
@ 2021-11-05 16:11 ` Eric Sandeen
2021-11-05 16:19 ` Nikola Ciprich
0 siblings, 1 reply; 10+ messages in thread
From: Eric Sandeen @ 2021-11-05 16:11 UTC (permalink / raw)
To: Nikola Ciprich, Eric Sandeen; +Cc: Eric Sandeen, linux-xfs
On 11/5/21 10:59 AM, Nikola Ciprich wrote:
>>>
>>> here's the output if it is of some use:
>>>
>>> https://storage.linuxbox.cz/index.php/s/AyZGW5Xdfxg47f6
>>
>> Just to be clear - I think Dave and I interpreted your original email slightly
>> differently.
>>
>> Are these large files on the 1.5PB filesystem themselves filesystem images,
>> or some other type of file?
>>
>> And - the repair you were running was against the 1.5PB filesystem?
>>
>> (see also Dave's reply in this thread)
>>
> Hello Eric,
>
> I was running fsck on the 1.5PB fs (I interrupted it, as it doesn't seem
> to be the main problem now). The large files are archives of video files from
> camera streaming software. I don't know much about them; I was told at the
> beginning that all writes would be sequential, which apparently they are not,
> so for new files we'll be preallocating them.
ok, thanks for the clarification.
Though I've never heard of streaming video writes that weren't sequential ...
have you actually observed that via strace or whatnot?
What might be happening is that if you are streaming multiple files into a single
directory at the same time, they compete for the allocator, and the allocations will interleave.
XFS has an allocator mode called "filestreams" which was designed just for this
(video ingest).
If you set the "S" attribute on the target directory, IIRC it should enable this
mode. You can do that with the xfs_io "chattr" command.
Might be worth a test, or wait for dchinner to chime in on whether this is a
reasonable suggestion...
-Eric
> btw, the blocked read from the file whose backtrace I sent seems to have finally
> started (after maybe an hour) and runs at 8-20MB/s
* Re: XFS / xfs_repair - problem reading very large sparse files on very large filesystem
2021-11-05 16:11 ` Eric Sandeen
@ 2021-11-05 16:19 ` Nikola Ciprich
2021-11-07 22:25 ` Dave Chinner
0 siblings, 1 reply; 10+ messages in thread
From: Nikola Ciprich @ 2021-11-05 16:19 UTC (permalink / raw)
To: Eric Sandeen; +Cc: Eric Sandeen, linux-xfs, Nikola Ciprich
>
> ok, thanks for the clarification.
no problem... in the meantime, xfs_bmap finished as well;
the resulting output is 1.5GB, showing a total of 25354643 extents :-O
>
> Though I've never heard of streaming video writes that weren't sequential ...
> have you actually observed that via strace or whatnot?
those are streams from many cameras, somehow multiplexed by the processing software.
The guy I communicate with, who is responsible for it, unfortunately does not know
many details
>
> What might be happening is that if you are streaming multiple files into a single
> directory at the same time, they compete for the allocator, and the allocations will interleave.
>
> XFS has an allocator mode called "filestreams" which was designed just for this
> (video ingest).
thanks for the tip, I'll check that!
anyways, I'll rather preallocate the files fully for now; it takes a lot of time, but
it should be the safest way until we know what exactly is wrong.. and I'll also
avoid creating such huge filesystems, as they lead to more trouble (like needing huge
amounts of RAM for fs repair)
>
> If you set the "S" attribute on the target directory, IIRC it should enable this
> mode. You can do that with the xfs_io "chattr" command.
>
> Might be worth a test, or wait for dchinner to chime in on whether this is a
> reasonable suggestion...
OK
BR
nik
>
> -Eric
>
> >btw, the blocked read from the file whose backtrace I sent seems to have finally
> >started (after maybe an hour) and runs at 8-20MB/s
>
* Re: XFS / xfs_repair - problem reading very large sparse files on very large filesystem
2021-11-05 16:19 ` Nikola Ciprich
@ 2021-11-07 22:25 ` Dave Chinner
0 siblings, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2021-11-07 22:25 UTC (permalink / raw)
To: Nikola Ciprich; +Cc: Eric Sandeen, Eric Sandeen, linux-xfs
On Fri, Nov 05, 2021 at 05:19:47PM +0100, Nikola Ciprich wrote:
> >
> > ok, thanks for the clarification.
>
> no problem... in the meantime, xfs_bmap finished as well;
> the resulting output is 1.5GB, showing a total of 25354643 extents :-O
Yeah, that'll do it. If you are on spinning disks, at ~250 extents
per btree block you're talking about a hundred thousand IOs to read
in the extent list on first access to the file after mount.
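The back-of-envelope, using the extent count you posted:

```shell
# 25,354,643 extent records at ~250 records per bmbt leaf block
echo $((25354643 / 250))    # leaf blocks to read in, i.e. on the order of 1e5 IOs
```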
> > Though I've never heard of streaming video writes that weren't sequential ...
> > have you actually observed that via strace or whatnot?
> those are streams from many cameras, somehow multiplexed by the processing software.
> The guy I communicate with, who is responsible for it, unfortunately does not know
> many details
The multiplexing is the problem here. Look at the allocation pattern
in the trace.
680367: [872751104..872759863]: 870787280..870796039
680368: [872759864..872760423]: 870799440..870799999
680369: [872760424..872761527]: 870921888..870922991
680370: [872761528..872762079]: 870959584..870960135
680371: [872762080..872763631]: 871192144..871193695
680372: [872763632..872763647]: 871183760..871183775
680373: [872763648..872767487]: hole
680374: [872767488..872768687]: 870796040..870797239
680375: [872768688..872769887]: 870800000..870801199
680376: [872769888..872772367]: 870922992..870925471
680377: [872772368..872773559]: 870989000..870990191
680378: [872773560..872775639]: 871193696..871195775
680379: [872775640..872775679]: hole
680380: [872775680..872776231]: 870797240..870797791
680381: [872776232..872776775]: 870801200..870801743
680382: [872776776..872777847]: 870870440..870871511
680383: [872777848..872778383]: 870990192..870990727
680384: [872778384..872779727]: 871195776..871197119
680385: [872779728..872779791]: 871175064..871175127
680386: [872779792..872783871]: hole
680387: [872783872..872785519]: 870797792..870799439
680388: [872785520..872786927]: 870801744..870803151
680389: [872786928..872789671]: 870925472..870928215
680390: [872789672..872791087]: 870990728..870992143
680391: [872791088..872791991]: 871197120..871198023
680392: [872791992..872792063]: hole
Lets lay that out into sequential blocks:
Stream 1:
680367: [872751104..872759863]: 870787280..870796039
680374: [872767488..872768687]: 870796040..870797239
680380: [872775680..872776231]: 870797240..870797791
680387: [872783872..872785519]: 870797792..870799439
Stream 2:
680368: [872759864..872760423]: 870799440..870799999
680375: [872768688..872769887]: 870800000..870801199
680381: [872776232..872776775]: 870801200..870801743
680388: [872785520..872786927]: 870801744..870803151
Stream 3:
680369: [872760424..872761527]: 870921888..870922991
680376: [872769888..872772367]: 870922992..870925471
680382: [872776776..872777847]: 870870440..870871511 (discontig)
680389: [872786928..872789671]: 870925472..870928215
Stream 4:
680370: [872761528..872762079]: 870959584..870960135
680377: [872772368..872773559]: 870989000..870990191
680383: [872777848..872778383]: 870990192..870990727
680390: [872789672..872791087]: 870990728..870992143
Stream 5:
680371: [872762080..872763631]: 871192144..871193695
680378: [872773560..872775639]: 871193696..871195775
680384: [872778384..872779727]: 871195776..871197119
680391: [872791088..872791991]: 871197120..871198023
Stream 6:
680372: [872763632..872763647]: 871183760..871183775
680373: [872763648..872767487]: hole (contig with 680372)
680379: [872775640..872775679]: hole
680385: [872779728..872779791]: 871175064..871175127
680386: [872779792..872783871]: hole (contig with 680385)
680392: [872791992..872792063]: hole
The reason I point this out is that the way the XFS allocator works
is that it peels off a chunk of the longest free extent on every
new physical allocation for non-contiguous file offsets.
Hence when we see this physical allocation pattern:
680367: [872751104..872759863]: 870787280..870796039
680374: [872767488..872768687]: 870796040..870797239
680380: [872775680..872776231]: 870797240..870797791
680387: [872783872..872785519]: 870797792..870799439
It indicates the order in which the writes are occurring. Hence it
would appear that the application is doing sparse writes for chunks
in the file, then goes back and partially fills the holes later
with another run of sparse writes. Eventually all holes are filled,
but you end up with a fragmented file.
This is actually by design - the XFS allocator is optimised for
efficient write IO (i.e. sequentialises writes as much as possible)
rather than optimal read IO.
From the allocation pattern, I suspect there are 6 cameras in this
multiplexer setup: each sample time it needs to store an image it
has a frame from each camera, and a series of frames is written per
camera before writing the next set of frames from the next camera.
Hence the allocation pattern on disk is effectively sequential for
each camera stream as they are written, but when viewed as a
multiplexed file, it's extremely fragmented because the individual
camera streams are interleaved..
> > What might be happening is that if you are streaming multiple
> > files into a single directory at the same time, they compete for
> > the allocator, and the allocations will interleave.
> >
> > XFS has an allocator mode called "filestreams" which was
> > designed just for this (video ingest).
Won't do anything - that's for ensuring "file per frame" video ingest
places all the files for a given video stream contiguously in an AG.
This looks like "multiple cameras and many frames per file", which
means the filestreams code will not trigger or do anything different
here.
> anyways, I'll rather preallocate the files fully for now; it takes a
> lot of time, but it should be the safest way until we know what
> exactly is wrong..
That may well cause serious problems for camera data ingest, because
it forces the ingest write IO pattern to be non-contiguous rather
than sequential. Hence instead of the larger, sequentialised writes per
incoming data set that the above pattern suggests, preallocation will
change them into many more small, sparse write IOs that cannot merge.
This will increase write IO latency and reduce the amount of data
that can be written to disk. The likely result of this is that it
will reduce the number of cameras that can be supported per spinning
disk.
I would suggest that the best solution is to rotate camera data
files at a much smaller size so that the extent list doesn't get too
large. e.g. max file size is 1TB, keep historic records in 500x1TB
files instead of one single 500TB file...
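Roughly this (names made up, sizes shrunk massively for illustration; in
production each chunk would be ~1TB, preallocated or grown as it fills):

```shell
base=/tmp/cam_archive           # stand-in for the real archive path
for i in $(seq -w 0 4); do
    truncate -s 1M "$base.$i"   # each file a stand-in for a ~1TB chunk
done
ls "$base".*                    # several small chunk files instead of one huge one
rm -f "$base".*
```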
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com