* Speeding up xfs_repair on filesystem with millions of inodes
From: Michael Weissenbacher @ 2015-10-27 12:10 UTC (permalink / raw)
To: xfs
Hi List!
I have an xfs filesystem which probably suffered a corruption due to a
bad UPS (even though the RAID controller has a good BBU). At the time
the power loss occurred the filesystem was mounted with the "nobarrier"
option.
We noticed the problem several weeks later, when some rsync-based backup
jobs started to hang for days without progress when doing a simple "rm".
This was accompanied by some messages in dmesg like this one:
Oct 15 21:53:14 mojave kernel: [4976164.170021] INFO: task kswapd0:38 blocked for more than 120 seconds.
Oct 15 21:53:14 mojave kernel: [4976164.170100] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 15 21:53:14 mojave kernel: [4976164.170180] kswapd0 D ffffffff8180bea0 0 38 2 0x00000000
Oct 15 21:53:14 mojave kernel: [4976164.170185] ffff880225f73968 0000000000000046 ffff880225c42e20 0000000000013180
Oct 15 21:53:14 mojave kernel: [4976164.170188] ffff880225f73fd8 ffff880225f72010 0000000000013180 0000000000013180
Oct 15 21:53:14 mojave kernel: [4976164.170191] ffff880225f73fd8 0000000000013180 ffff880225c42e20 ffff88022611dc40
Oct 15 21:53:14 mojave kernel: [4976164.170194] Call Trace:
Oct 15 21:53:14 mojave kernel: [4976164.170204] [<ffffffff8166a8e9>] schedule+0x29/0x70
Oct 15 21:53:14 mojave kernel: [4976164.170207] [<ffffffff8166a9bc>] io_schedule+0x8c/0xd0
Oct 15 21:53:14 mojave kernel: [4976164.170211] [<ffffffff8126c5ef>] __xfs_iflock+0xdf/0x110
Oct 15 21:53:14 mojave kernel: [4976164.170216] [<ffffffff8106b070>] ? autoremove_wake_function+0x40/0x40
Oct 15 21:53:14 mojave kernel: [4976164.170219] [<ffffffff812273b4>] xfs_reclaim_inode+0xc4/0x330
Oct 15 21:53:14 mojave kernel: [4976164.170222] [<ffffffff81227816>] xfs_reclaim_inodes_ag+0x1f6/0x330
Oct 15 21:53:14 mojave kernel: [4976164.170225] [<ffffffff81227983>] xfs_reclaim_inodes_nr+0x33/0x40
Oct 15 21:53:14 mojave kernel: [4976164.170228] [<ffffffff81230085>] xfs_fs_free_cached_objects+0x15/0x20
Oct 15 21:53:14 mojave kernel: [4976164.170233] [<ffffffff8117943e>] prune_super+0x11e/0x1a0
Oct 15 21:53:14 mojave kernel: [4976164.170237] [<ffffffff8112903f>] shrink_slab+0x19f/0x2d0
Oct 15 21:53:14 mojave kernel: [4976164.170240] [<ffffffff8112c3c8>] kswapd+0x698/0xae0
Oct 15 21:53:14 mojave kernel: [4976164.170243] [<ffffffff8106b030>] ? wake_up_bit+0x40/0x40
Oct 15 21:53:14 mojave kernel: [4976164.170246] [<ffffffff8112bd30>] ? zone_reclaim+0x410/0x410
Oct 15 21:53:14 mojave kernel: [4976164.170249] [<ffffffff8106a97e>] kthread+0xce/0xe0
Oct 15 21:53:14 mojave kernel: [4976164.170252] [<ffffffff8106a8b0>] ? kthread_freezable_should_stop+0x70/0x70
Oct 15 21:53:14 mojave kernel: [4976164.170256] [<ffffffff8167475c>] ret_from_fork+0x7c/0xb0
Oct 15 21:53:14 mojave kernel: [4976164.170258] [<ffffffff8106a8b0>] ? kthread_freezable_should_stop+0x70/0x70
So I decided to unmount the fs and run xfs_repair on it. Unfortunately,
after almost a week, it hasn't finished yet. It seems to do so much
swapping that it hardly makes any progress. Currently it has been in
Phase 6 (traversing filesystem) for several days.
I found a thread suggesting to add an SSD as a swap drive, which I did
yesterday. I also added the "-P" option to xfs_repair since it helped in
similar cases in the past.
I am using the latest xfs_repair version 3.2.4, compiled myself.
The filesystem is 16TB in size and contains about 150 million inodes.
The machine has 8GB of RAM available.
The kernel version at the time of the power loss was 3.10.44 and was
upgraded to 3.10.90 afterwards.
My questions are the following:
- Is there anything else I could try to speed up the process besides
beefing up the RAM of the machine? Currently it has 8GB, which is not
very much for this task, I suppose. I read about the "-m" option and
about "-o bhash=" but I am unsure whether they could help in this case.
- Are there any rough guidelines on how much RAM is needed for
xfs_repair on a given filesystem? How does it depend on the number of
inodes or on the size of the file system?
- How long could the quota check on mount take once the repair is
finished? (The filesystem is mounted with usrquota and grpquota.)
thanks in advance,
Michael
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: Speeding up xfs_repair on filesystem with millions of inodes
From: Dave Chinner @ 2015-10-27 19:38 UTC (permalink / raw)
To: Michael Weissenbacher; +Cc: xfs
On Tue, Oct 27, 2015 at 01:10:06PM +0100, Michael Weissenbacher wrote:
> Hi List!
> I have an xfs filesystem which probably suffered a corruption due to a
> bad UPS (even though the RAID controller has a good BBU). At the time
> the power loss occurred the filesystem was mounted with the "nobarrier"
> option.
>
> We noticed the problem several weeks later, when some rsync-based backup
> jobs started to hang for days without progress when doing a simple "rm".
> This was accompanied by some messages in dmesg like this one:
[cleanup line-wrapped paste mess]
> INFO: task kswapd0:38 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kswapd0 D ffffffff8180bea0 0 38 2 0x00000000
> ffff880225f73968 0000000000000046 ffff880225c42e20 0000000000013180
> ffff880225f73fd8 ffff880225f72010 0000000000013180 0000000000013180
> ffff880225f73fd8 0000000000013180 ffff880225c42e20 ffff88022611dc40
> Call Trace:
> [<ffffffff8166a8e9>] schedule+0x29/0x70
> [<ffffffff8166a9bc>] io_schedule+0x8c/0xd0
> [<ffffffff8126c5ef>] __xfs_iflock+0xdf/0x110
> [<ffffffff8106b070>] ? autoremove_wake_function+0x40/0x40
> [<ffffffff812273b4>] xfs_reclaim_inode+0xc4/0x330
> [<ffffffff81227816>] xfs_reclaim_inodes_ag+0x1f6/0x330
> [<ffffffff81227983>] xfs_reclaim_inodes_nr+0x33/0x40
> [<ffffffff81230085>] xfs_fs_free_cached_objects+0x15/0x20
> [<ffffffff8117943e>] prune_super+0x11e/0x1a0
> [<ffffffff8112903f>] shrink_slab+0x19f/0x2d0
> [<ffffffff8112c3c8>] kswapd+0x698/0xae0
> [<ffffffff8106b030>] ? wake_up_bit+0x40/0x40
> [<ffffffff8112bd30>] ? zone_reclaim+0x410/0x410
> [<ffffffff8106a97e>] kthread+0xce/0xe0
> [<ffffffff8106a8b0>] ? kthread_freezable_should_stop+0x70/0x70
> [<ffffffff8167475c>] ret_from_fork+0x7c/0xb0
> [<ffffffff8106a8b0>] ? kthread_freezable_should_stop+0x70/0x70
It's waiting on inode IO to complete in memory reclaim. I'd say you
have a problem with lots of dirty inodes in memory and very slow
writeback due to using something like RAID5/6 (this can be
*seriously* slow as mentioned recently here:
http://oss.sgi.com/archives/xfs/2015-10/msg00560.html).
> So i decided to unmount the fs and run xfs_repair on it. Unfortunately,
> after almost a week, this hasn't finished yet. It seems to do so much
> swapping that it hardly makes any progress. Currently it has been in
> Phase 6 (traversing filesystem) for several days.
Was it making progress, just burning CPU, or was it just hung?
Attaching the actual output of repair is also helpful, as are all
the things here:
http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
> I found a thread suggesting to add an ssd as swap drive, which i did
> yesterday. I also added the "-P" option to xfs_repair since it helped in
> some cases similar in the past.
"-P" slows xfs_repair down greatly.
> I am using the latest xfs_repair version 3.2.4, compiled myself.
>
> The filesystem is 16TB in size and contains about 150 million inodes.
> The machine has 8GB of RAM available.
http://xfs.org/index.php/XFS_FAQ#Q:_Which_factors_influence_the_memory_usage_of_xfs_repair.3F
> The kernel version at the time of the power loss was 3.10.44 and was
> upgraded to 3.10.90 afterwards.
>
> My questions are the following:
> - Is there anything else i could try to speed up the progress besides
> beefing up the RAM of the machine? Currently it has 8GB which is not
> very much for the task i suppose. I read about the "-m" option and about
If repair is swapping, then adding more RAM and/or faster swap space
will help. There is nothing that you can tweak that changes the
runtime or behaviour of phase 6 - it is single threaded and requires
traversal of the entire filesystem directory hierarchy to find all
the disconnected inodes so they can be moved to lost+found. And it
does write inodes, so if you have a slow SATA RAID5/6...
> "-o bhash=" but i am unsure if they could help in this case.
It can, but increasing it makes repair use more memory. You might
like to try "-o ag_stride=-1" to reduce phase 2-5 memory usage, but
that does not affect phase 6 behaviour...
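[Editorial sketch, not from the thread: combining the options discussed above into one invocation. The device name and bhash value are placeholders; the command is built as a string so the sketch is safe to run anywhere.]

```shell
# -o ag_stride=-1 : process allocation groups serially (reduces phase 2-5 memory)
# -o bhash=N      : larger buffer cache hash table (more memory, fewer re-reads)
dev=/dev/sdb1   # placeholder device
cmd="xfs_repair -v -o ag_stride=-1 -o bhash=16384 $dev"
echo "$cmd"
```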
> - Are there any rough guidelines on how much RAM is needed for
> xfs_repair on a given filesystem? How does it depend on the number of
> inodes or on the size of the file system?
See above. Those numbers don't include reclaimable memory like the
buffer cache footprint, which is affected by bhash and concurrency....
> - How long could the quota check on mount take when the repair is
> finished (the filesystem is mounted with usrquota, grpquota).
As long as it takes to read all the inodes.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Speeding up xfs_repair on filesystem with millions of inodes
From: Michael Weissenbacher @ 2015-10-27 22:51 UTC (permalink / raw)
To: xfs
Hi Dave!
First of all, today I cancelled the running xfs_repair (CTRL-C) and
upped the system RAM from 8GB to 16GB - the maximum possible with this
hardware.
Dave Chinner wrote:
> It's waiting on inode IO to complete in memory reclaim. I'd say you
> have a problem with lots of dirty inodes in memory and very slow
> writeback due to using something like RAID5/6 (this can be
> *seriously* slow as mentioned recently here:
> http://oss.sgi.com/archives/xfs/2015-10/msg00560.html).
Unfortunately, this is a rather slow RAID-6 setup with 7200RPM disks.
However, before the power loss occurred it performed quite OK for our
use case and without any hiccups. But some time after the power loss
some "rm" commands hung and didn't proceed at all. There was no CPU
usage and there was hardly any I/O on the file system. That's why I
suspected some sort of corruption.
Dave Chinner wrote:
> Was it (xfs_repair) making progress, just burning CPU, or was it just hung?
> Attaching the actual output of repair is also helpful, as are all
> the things here:
> ...
The xfs_repair seemed to be making progress, albeit very, very slowly.
In iotop I saw about 99% I/O usage on kswapd0. Looking at the HDD LEDs
of the array, I could see that there was hardly any access to it at all
(only once every 10-15 seconds).
I didn't include xfs_repair output, since it showed nothing unusual.
---snip---
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
- scan filesystem freespace and inode maps...
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
...
- agno = 14
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- check for inodes claiming duplicate blocks...
- agno = 0
...
- agno = 14
Phase 5 - rebuild AG headers and trees...
- reset superblock...
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
---snip---
(and sitting there for about 72 hours)
Dave Chinner wrote:
> "-P" slows xfs_repair down greatly.
Ok, I removed the "-P" option now.
Dave Chinner wrote:
> If repair is swapping, then adding more RAM and/or faster swap space
> will help. There is nothing that you can tweak that changes the
> runtime or behaviour of phase 6 - it is single threaded and requires
> traversal of the entire filesystem directory hierarchy to find all
> the disconnected inodes so they can be moved to lost+found. And it
> does write inodes, so if you have a slow SATA RAID5/6...
Ok, so if I understand you correctly, none of the parameters will help
for phase 6? I know that RAID-6 has slow write characteristics. But in
fact I didn't see any writes at all with iotop and iostat.
Dave Chinner wrote:
>
> See above. Those numbers don't include reclaimable memory like the
> buffer cache footprint, which is affected by bhash and concurrency....
>
As said above, I have now doubled the RAM of the machine from 8GB to
16GB, and I have started xfs_repair again with the following options. I
hope the verbose output will help to better understand what's going on.
# xfs_repair -m 8192 -vv /dev/sdb1
Besides, is it wise to limit the memory with "-m" to keep the system
from swapping, or would I be better off using the defaults (which would
use 75% of RAM)?
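[Editorial sketch: the 75% default can be approximated by hand on Linux like this. This is purely illustrative; the actual value is computed inside xfs_repair, and the device is a placeholder.]

```shell
# Read total RAM from /proc/meminfo and take ~75% as a -m value in MB.
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
mem_mb=$(( mem_kb * 3 / 4 / 1024 ))
echo "xfs_repair -m ${mem_mb} -vv /dev/sdb1"
```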
Thank you very much for your insight, I will keep the list posted about
any progress.
Michael
* Re: Speeding up xfs_repair on filesystem with millions of inodes
From: Dave Chinner @ 2015-10-28 0:17 UTC (permalink / raw)
To: Michael Weissenbacher; +Cc: xfs
On Tue, Oct 27, 2015 at 11:51:35PM +0100, Michael Weissenbacher wrote:
> Hi Dave!
> First of all, today i cancelled the running xfs_repair (CTRL-C) and
> upped the system RAM from 8GB to 16GB - the maximum possible with this
> hardware.
>
> Dave Chinner wrote:
> > It's waiting on inode IO to complete in memory reclaim. I'd say you
> > have a problem with lots of dirty inodes in memory and very slow
> > writeback due to using something like RAID5/6 (this can be
> > *seriously* slow as mentioned recently here:
> > http://oss.sgi.com/archives/xfs/2015-10/msg00560.html).
> Unfortunately, this is a rather slow RAID-6 setup with 7200RPM disks.
> However, before the power loss occurred it performed quite OK for our
> use case and without any hiccups. But some time after the power loss
> some "rm" commands hung and didn't proceed at all. There was no CPU
> usage and there was hardly any I/O on the file system. That's why I
> suspected some sort of corruption.
Maybe you have a disk that is dying. Do your drives have TLER
enabled on them?
> Dave Chinner wrote:
> > Was it (xfs_repair) making progress, just burning CPU, or was it just hung?
> > Attaching the actual output of repair is also helpful, as are all
> > the things here:
> > ...
> The xfs_repair seemed to be making progress, albeit very very slowly. In
> iotop i saw about 99% I/O usage on kswapd0. Looking at the HDD LED's of
> the array, i could see that there was hardly any access to it at all
> (only once about every 10-15 seconds).
kswapd is trying to reclaim kernel memory, which has nothing directly
to do with xfs_repair IO or CPU usage. Unless, of course, it is doing
reclaim to grab more memory for xfs_repair...
> I didn't include xfs_repair output, since it showed nothing unusual.
> ---snip---
> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
> - zero log...
> - scan filesystem freespace and inode maps...
> - found root inode chunk
> Phase 3 - for each AG...
> - scan and clear agi unlinked lists...
> - process known inodes and perform inode discovery...
> - agno = 0
> ...
> - agno = 14
> - process newly discovered inodes...
> Phase 4 - check for duplicate blocks...
> - setting up duplicate extent list...
> - check for inodes claiming duplicate blocks...
> - agno = 0
> ...
> - agno = 14
> Phase 5 - rebuild AG headers and trees...
> - reset superblock...
> Phase 6 - check inode connectivity...
> - resetting contents of realtime bitmap and summary inodes
> - traversing filesystem ...
> ---snip---
> (and sitting there for about 72 hours)
It really hasn't made much progress if it's still traversing the fs
after 72 hours.
> Dave Chinner wrote:
> > If repair is swapping, then adding more RAM and/or faster swap space
> > will help. There is nothing that you can tweak that changes the
> > runtime or behaviour of phase 6 - it is single threaded and requires
> > traversal of the entire filesystem directory hierarchy to find all
> > the disconnected inodes so they can be moved to lost+found. And it
> > does write inodes, so if you have a slow SATA RAID5/6...
> Ok, so if i understand you correctly, none of the parameters will help
> for phase 6? I know that RAID-6 has slow write characteristics. But in
> fact I didn't see any writes at all with iotop and iostat.
If kswapd is doing all the work, then it's essentially got no memory
available. I would add significantly more swap space as well (e.g.
add swap files to the root filesystem - you can do this while repair
is running, too). If there's sufficient swap space, then repair
should use it fairly efficiently - it doesn't tend to thrash swap
because most of its memory usage is for information that is only
accessed once per phase or is parked until it is needed in a later
phase so it doesn't need to be read from disk again...
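[Editorial sketch of the swap-file approach suggested above. Path and size are examples; this needs root, and dd is used rather than fallocate because swapon may reject files with unallocated blocks on some filesystems.]

```shell
# Hypothetical helper: create and enable a swap file without stopping
# a running xfs_repair.
add_swapfile() {
    local path=$1 size_mb=$2
    dd if=/dev/zero of="$path" bs=1M count="$size_mb" status=none
    chmod 600 "$path"
    mkswap "$path"
    swapon "$path"
}
# e.g. (as root):  add_swapfile /swapfile2 16384   # adds 16GB of swap
```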
> Dave Chinner wrote:
> >
> > See above. Those numbers don't include reclaimable memory like the
> > buffer cache footprint, which is affected by bhash and concurrency....
> >
> As said above, i did now double the RAM of the machine from 8GB to 16GB.
> Now I started xfs_repair again with the following options. I hope that
> the verbose output will help to understand better what's actually going on.
> # xfs_repair -m 8192 -vv /dev/sdb1
>
> Besides, is it wise to limit the memory with "-m" to keep the system
> from swapping or should I be better using the defaults (which would use
> 75% of RAM)?
Defaults, but it's really only a guideline for cache sizing. If
repair needs more memory to store metadata it is validating (like
the directory structure) then it will consume as much as it needs.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Speeding up xfs_repair on filesystem with millions of inodes
From: Michael Weissenbacher @ 2015-10-28 17:31 UTC (permalink / raw)
To: xfs
Hi Dave!
Everything is in good shape again. This time xfs_repair finished without
detecting any problems. So I suppose the only problem was that there
wasn't enough RAM.
---snip---
XFS_REPAIR Summary    Wed Oct 28 15:02:30 2015

Phase       Start             End               Duration
Phase 1:    10/27 23:19:34    10/27 23:19:34
Phase 2:    10/27 23:19:34    10/27 23:19:57    23 seconds
Phase 3:    10/27 23:19:57    10/28 04:10:50    4 hours, 50 minutes, 53 seconds
Phase 4:    10/28 04:10:50    10/28 09:03:00    4 hours, 52 minutes, 10 seconds
Phase 5:    10/28 09:03:00    10/28 09:03:16    16 seconds
Phase 6:    10/28 09:03:16    10/28 15:02:29    5 hours, 59 minutes, 13 seconds
Phase 7:    10/28 15:02:29    10/28 15:02:29

Total run time: 15 hours, 42 minutes, 55 seconds
---snip---
On 28.10.2015 01:17, Dave Chinner wrote:
>
> Maybe you have a disk that is dying. Do your drives have TLER
> enabled on them?
>
Thanks for the hint. These are all enterprise-grade Nearline-SAS drives
(SEAGATE ST32000444SS) attached to a Dell PERC 6/i controller. I think
it isn't even possible to turn TLER on or off on them. They should all
be in good shape since the controller automatically does periodic patrol
reads.
On 28.10.2015 01:17, Dave Chinner wrote:
>
> If kswapd is doing all the work, then it's essentially got no memory
> available. I would add significantly more swap space as well (e.g.
> add swap files to the root filesystem - you can do this while repair
> is running, too). If there's sufficient swap space, then repair
> should use it fairly efficiently - it doesn't tend to thrash swap
> because most of its memory usage is for information that is only
> accessed once per phase or is parked until it is needed in a later
> phase so it doesn't need to be read from disk again...
>
Good to know. However, the system was never low on swap. It has 40GB of
swap available and never used more than 10GB during the repair (with
8GB of RAM). On the second run, with 16GB of RAM, xfs_repair never used
any swap at all.
On 28.10.2015 01:17, Dave Chinner wrote:
>
> Defaults, but it's really only a guideline for cache sizing. If
> repair needs more memory to store metadata it is validating (like
> the directory structure) then it will consume as much as it needs.
>
Will keep that in mind.
Thanks again for your help.
with kind regards,
Michael