* Speeding up xfs_repair on filesystem with millions of inodes
@ 2015-10-27 12:10 Michael Weissenbacher
  2015-10-27 19:38 ` Dave Chinner
  0 siblings, 1 reply; 5+ messages in thread

From: Michael Weissenbacher @ 2015-10-27 12:10 UTC (permalink / raw)
To: xfs

Hi List!
I have an XFS filesystem which probably suffered corruption due to a
bad UPS (even though the RAID controller has a good BBU). At the time
the power loss occurred the filesystem was mounted with the "nobarrier"
option.

We noticed the problem several weeks later, when some rsync-based backup
jobs started to hang for days without progress when doing a simple "rm".
This was accompanied by messages in dmesg like this one:

Oct 15 21:53:14 mojave kernel: [4976164.170021] INFO: task kswapd0:38 blocked for more than 120 seconds.
Oct 15 21:53:14 mojave kernel: [4976164.170100] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 15 21:53:14 mojave kernel: [4976164.170180] kswapd0 D ffffffff8180bea0 0 38 2 0x00000000
Oct 15 21:53:14 mojave kernel: [4976164.170185] ffff880225f73968 0000000000000046 ffff880225c42e20 0000000000013180
Oct 15 21:53:14 mojave kernel: [4976164.170188] ffff880225f73fd8 ffff880225f72010 0000000000013180 0000000000013180
Oct 15 21:53:14 mojave kernel: [4976164.170191] ffff880225f73fd8 0000000000013180 ffff880225c42e20 ffff88022611dc40
Oct 15 21:53:14 mojave kernel: [4976164.170194] Call Trace:
Oct 15 21:53:14 mojave kernel: [4976164.170204] [<ffffffff8166a8e9>] schedule+0x29/0x70
Oct 15 21:53:14 mojave kernel: [4976164.170207] [<ffffffff8166a9bc>] io_schedule+0x8c/0xd0
Oct 15 21:53:14 mojave kernel: [4976164.170211] [<ffffffff8126c5ef>] __xfs_iflock+0xdf/0x110
Oct 15 21:53:14 mojave kernel: [4976164.170216] [<ffffffff8106b070>] ? autoremove_wake_function+0x40/0x40
Oct 15 21:53:14 mojave kernel: [4976164.170219] [<ffffffff812273b4>] xfs_reclaim_inode+0xc4/0x330
Oct 15 21:53:14 mojave kernel: [4976164.170222] [<ffffffff81227816>] xfs_reclaim_inodes_ag+0x1f6/0x330
Oct 15 21:53:14 mojave kernel: [4976164.170225] [<ffffffff81227983>] xfs_reclaim_inodes_nr+0x33/0x40
Oct 15 21:53:14 mojave kernel: [4976164.170228] [<ffffffff81230085>] xfs_fs_free_cached_objects+0x15/0x20
Oct 15 21:53:14 mojave kernel: [4976164.170233] [<ffffffff8117943e>] prune_super+0x11e/0x1a0
Oct 15 21:53:14 mojave kernel: [4976164.170237] [<ffffffff8112903f>] shrink_slab+0x19f/0x2d0
Oct 15 21:53:14 mojave kernel: [4976164.170240] [<ffffffff8112c3c8>] kswapd+0x698/0xae0
Oct 15 21:53:14 mojave kernel: [4976164.170243] [<ffffffff8106b030>] ? wake_up_bit+0x40/0x40
Oct 15 21:53:14 mojave kernel: [4976164.170246] [<ffffffff8112bd30>] ? zone_reclaim+0x410/0x410
Oct 15 21:53:14 mojave kernel: [4976164.170249] [<ffffffff8106a97e>] kthread+0xce/0xe0
Oct 15 21:53:14 mojave kernel: [4976164.170252] [<ffffffff8106a8b0>] ? kthread_freezable_should_stop+0x70/0x70
Oct 15 21:53:14 mojave kernel: [4976164.170256] [<ffffffff8167475c>] ret_from_fork+0x7c/0xb0
Oct 15 21:53:14 mojave kernel: [4976164.170258] [<ffffffff8106a8b0>] ? kthread_freezable_should_stop+0x70/0x70

So I decided to unmount the fs and run xfs_repair on it. Unfortunately,
after almost a week, this hasn't finished yet. It seems to do so much
swapping that it hardly makes any progress. Currently it has been in
Phase 6 (traversing filesystem) for several days.

I found a thread suggesting to add an SSD as a swap drive, which I did
yesterday. I also added the "-P" option to xfs_repair since it has
helped in some similar cases in the past. I am using the latest
xfs_repair version 3.2.4, compiled myself.

The filesystem is 16TB in size and contains about 150 million inodes.
The machine has 8GB of RAM available. The kernel version at the time of
the power loss was 3.10.44 and was upgraded to 3.10.90 afterwards.

My questions are the following:
- Is there anything else I could try to speed up the progress besides
  beefing up the RAM of the machine? Currently it has 8GB, which is not
  very much for the task, I suppose. I read about the "-m" option and
  about "-o bhash=" but I am unsure if they could help in this case.
- Are there any rough guidelines on how much RAM is needed for
  xfs_repair on a given filesystem? How does it depend on the number of
  inodes or on the size of the filesystem?
- How long could the quota check on mount take when the repair is
  finished (the filesystem is mounted with usrquota, grpquota)?

thanks in advance,
Michael

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
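The RAM question above can be probed before committing to another
multi-day run: xfs_repair has a no-modify mode, and (per the XFS FAQ)
combining it with a deliberately tiny -m makes the tool report how much
memory it would actually want instead of running with it. A sketch,
assuming an xfsprogs recent enough to have -m, and using /dev/sdb1 from
this thread (substitute your own unmounted device):

```shell
# No-modify dry run: -n makes no changes, -vv is verbose, and -m 1
# pretends only 1MB of RAM is available, so repair prints its own
# minimum-memory estimate and stops rather than attempting the repair.
dev=/dev/sdb1          # device from this thread; adjust to your setup
if [ -b "$dev" ]; then
    xfs_repair -n -vv -m 1 "$dev" 2>&1 | grep -i mem
else
    echo "no such block device: $dev"
fi
```

The guard keeps the snippet harmless on machines where the device does
not exist; the grep just trims the estimate out of the verbose output.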
* Re: Speeding up xfs_repair on filesystem with millions of inodes
  2015-10-27 19:38 ` Dave Chinner
  2015-10-27 22:51   ` Michael Weissenbacher
  0 siblings, 1 reply; 5+ messages in thread

From: Dave Chinner @ 2015-10-27 19:38 UTC (permalink / raw)
To: Michael Weissenbacher; +Cc: xfs

On Tue, Oct 27, 2015 at 01:10:06PM +0100, Michael Weissenbacher wrote:
> Hi List!
> I have an xfs filesystem which probably suffered a corruption due to a
> bad UPS (even though the RAID controller has a good BBU). At the time
> the power loss occurred the filesystem was mounted with the "nobarrier"
> option.
>
> We noticed the problem several weeks later, when some rsync-based backup
> jobs started to hang for days without progress when doing a simple "rm".
> This was accompanied by some messages in dmesg like this one:

[cleanup line-wrapped paste mess]

> INFO: task kswapd0:38 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kswapd0 D ffffffff8180bea0 0 38 2 0x00000000
> ffff880225f73968 0000000000000046 ffff880225c42e20 0000000000013180
> ffff880225f73fd8 ffff880225f72010 0000000000013180 0000000000013180
> ffff880225f73fd8 0000000000013180 ffff880225c42e20 ffff88022611dc40
> Call Trace:
> [<ffffffff8166a8e9>] schedule+0x29/0x70
> [<ffffffff8166a9bc>] io_schedule+0x8c/0xd0
> [<ffffffff8126c5ef>] __xfs_iflock+0xdf/0x110
> [<ffffffff8106b070>] ? autoremove_wake_function+0x40/0x40
> [<ffffffff812273b4>] xfs_reclaim_inode+0xc4/0x330
> [<ffffffff81227816>] xfs_reclaim_inodes_ag+0x1f6/0x330
> [<ffffffff81227983>] xfs_reclaim_inodes_nr+0x33/0x40
> [<ffffffff81230085>] xfs_fs_free_cached_objects+0x15/0x20
> [<ffffffff8117943e>] prune_super+0x11e/0x1a0
> [<ffffffff8112903f>] shrink_slab+0x19f/0x2d0
> [<ffffffff8112c3c8>] kswapd+0x698/0xae0
> [<ffffffff8106b030>] ? wake_up_bit+0x40/0x40
> [<ffffffff8112bd30>] ? zone_reclaim+0x410/0x410
> [<ffffffff8106a97e>] kthread+0xce/0xe0
> [<ffffffff8106a8b0>] ? kthread_freezable_should_stop+0x70/0x70
> [<ffffffff8167475c>] ret_from_fork+0x7c/0xb0
> [<ffffffff8106a8b0>] ? kthread_freezable_should_stop+0x70/0x70

It's waiting on inode IO to complete in memory reclaim. I'd say you
have a problem with lots of dirty inodes in memory and very slow
writeback due to using something like RAID5/6 (this can be
*seriously* slow as mentioned recently here:
http://oss.sgi.com/archives/xfs/2015-10/msg00560.html).

> So i decided to unmount the fs and run xfs_repair on it. Unfortunately,
> after almost a week, this hasn't finished yet. It seems to do so much
> swapping that it hardly makes any progress. Currently it has been in
> Phase 6 (traversing filesystem) for several days.

Was it making progress, just burning CPU, or was it just hung?
Attaching the actual output of repair is also helpful, as are all
the things here:

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

> I found a thread suggesting to add an ssd as swap drive, which i did
> yesterday. I also added the "-P" option to xfs_repair since it helped in
> some cases similar in the past.

"-P" slows xfs_repair down greatly.

> I am using the latest xfs_repair version 3.2.4, compiled myself.
>
> The filesystem is 16TB in size and contains about 150 million inodes.
> The machine has 8GB of RAM available.

http://xfs.org/index.php/XFS_FAQ#Q:_Which_factors_influence_the_memory_usage_of_xfs_repair.3F

> The kernel version at the time of the power loss was 3.10.44 and was
> upgraded to 3.10.90 afterwards.
>
> My questions are the following:
> - Is there anything else i could try to speed up the progress besides
> beefing up the RAM of the machine? Currently it has 8GB which is not
> very much for the task i suppose. I read about the "-m" option and about

If repair is swapping, then adding more RAM and/or faster swap space
will help. There is nothing that you can tweak that changes the
runtime or behaviour of phase 6 - it is single threaded and requires
traversal of the entire filesystem directory hierarchy to find all
the disconnected inodes so they can be moved to lost+found. And it
does write inodes, so if you have a slow SATA RAID5/6...

> "-o bhash=" but i am unsure if they could help in this case.

It can, but increasing it makes repair use more memory. You might
like to try "-o ag_stride=-1" to reduce phase 2-5 memory usage, but
that does not affect phase 6 behaviour...

> - Are there any rough guidelines on how much RAM is needed for
> xfs_repair on a given filesystem? How does it depend on the number of
> inodes or on the size of the file system?

See above. Those numbers don't include reclaimable memory like the
buffer cache footprint, which is affected by bhash and concurrency....

> - How long could the quota check on mount take when the repair is
> finished (the filesystem is mounted with usrquota, grpquota).

As long as it takes to read all the inodes.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
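Putting the knobs from this reply together, one hedged example of how
such an invocation could look for the phases where they do matter; the
values are purely illustrative, and although ag_stride and bhash are
documented xfs_repair suboptions, check the xfs_repair(8) man page for
your xfsprogs version before relying on them:

```shell
# ag_stride=-1 disables concurrent AG processing, which lowers peak
# memory use in phases 2-5; bhash sizes the buffer cache (larger means
# faster but more RAM). Neither changes phase 6, which stays
# single-threaded by design.
xfs_repair -vv -o ag_stride=-1 -o bhash=16384 /dev/sdb1
```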
* Re: Speeding up xfs_repair on filesystem with millions of inodes
  2015-10-27 22:51 ` Michael Weissenbacher
  2015-10-28  0:17   ` Dave Chinner
  0 siblings, 1 reply; 5+ messages in thread

From: Michael Weissenbacher @ 2015-10-27 22:51 UTC (permalink / raw)
To: xfs

Hi Dave!
First of all, today I cancelled the running xfs_repair (CTRL-C) and
upped the system RAM from 8GB to 16GB - the maximum possible with this
hardware.

Dave Chinner wrote:
> It's waiting on inode IO to complete in memory reclaim. I'd say you
> have a problem with lots of dirty inodes in memory and very slow
> writeback due to using something like RAID5/6 (this can be
> *seriously* slow as mentioned recently here:
> http://oss.sgi.com/archives/xfs/2015-10/msg00560.html).

Unfortunately, this is a rather slow RAID-6 setup with 7200RPM disks.
However, before the power loss occurred it performed quite OK for our
use case and without any hiccups. But some time after the power loss
some "rm" commands hung and didn't proceed at all. There was no CPU
usage and there was hardly any I/O on the file system. That's why I
suspected some sort of corruption.

Dave Chinner wrote:
> Was it (xfs_repair) making progress, just burning CPU, or was it just hung?
> Attaching the actual output of repair is also helpful, as are all
> the things here:
> ...

The xfs_repair seemed to be making progress, albeit very very slowly.
In iotop I saw about 99% I/O usage on kswapd0. Looking at the HDD LEDs
of the array, I could see that there was hardly any access to it at all
(only once about every 10-15 seconds). I didn't include the xfs_repair
output, since it showed nothing unusual.

---snip---
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        ...
        - agno = 14
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        ...
        - agno = 14
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
---snip---
(and sitting there for about 72 hours)

Dave Chinner wrote:
> "-P" slows xfs_repair down greatly.

Ok, I removed the "-P" option now.

Dave Chinner wrote:
> If repair is swapping, then adding more RAM and/or faster swap space
> will help. There is nothing that you can tweak that changes the
> runtime or behaviour of phase 6 - it is single threaded and requires
> traversal of the entire filesystem directory hierarchy to find all
> the disconnected inodes so they can be moved to lost+found. And it
> does write inodes, so if you have a slow SATA RAID5/6...

Ok, so if I understand you correctly, none of the parameters will help
for phase 6? I know that RAID-6 has slow write characteristics. But in
fact I didn't see any writes at all with iotop and iostat.

Dave Chinner wrote:
>
> See above. Those numbers don't include reclaimable memory like the
> buffer cache footprint, which is affected by bhash and concurrency....
>

As said above, I did now double the RAM of the machine from 8GB to
16GB. Now I started xfs_repair again with the following options. I hope
that the verbose output will help to understand better what's actually
going on.

# xfs_repair -m 8192 -vv /dev/sdb1

Besides, is it wise to limit the memory with "-m" to keep the system
from swapping, or should I be better using the defaults (which would
use 75% of RAM)?

Thank you very much for your insight, I will keep the list posted about
any progress.

Michael
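The "kswapd0 at 99% I/O with idle disks" observation above can be
quantified without iotop: the kernel's own swap counters distinguish
real swap traffic from ordinary reclaim. A minimal Linux-only sample
(the counters are cumulative page counts since boot, so take two
readings and compare):

```shell
# pswpin/pswpout count pages swapped in and out since boot; if the
# numbers climb between the two samples, repair is actively swapping.
grep -E '^pswp(in|out) ' /proc/vmstat
sleep 5
grep -E '^pswp(in|out) ' /proc/vmstat
```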
* Re: Speeding up xfs_repair on filesystem with millions of inodes
  2015-10-28  0:17 ` Dave Chinner
  2015-10-28 17:31   ` Michael Weissenbacher
  0 siblings, 1 reply; 5+ messages in thread

From: Dave Chinner @ 2015-10-28 0:17 UTC (permalink / raw)
To: Michael Weissenbacher; +Cc: xfs

On Tue, Oct 27, 2015 at 11:51:35PM +0100, Michael Weissenbacher wrote:
> Hi Dave!
> First of all, today i cancelled the running xfs_repair (CTRL-C) and
> upped the system RAM from 8GB to 16GB - the maximum possible with this
> hardware.
>
> Dave Chinner wrote:
> > It's waiting on inode IO to complete in memory reclaim. I'd say you
> > have a problem with lots of dirty inodes in memory and very slow
> > writeback due to using something like RAID5/6 (this can be
> > *seriously* slow as mentioned recently here:
> > http://oss.sgi.com/archives/xfs/2015-10/msg00560.html).
> Unfortunately, this is a rather slow RAID-6 setup with 7200RPM disks.
> However, before the power loss occurred it performed quite OK for our
> use case and without any hiccups. But some time after the power loss
> some "rm" commands hung and didn't proceed at all. There was no CPU
> usage and there was hardly any I/O on the file system. That's why I
> suspected some sort of corruption.

Maybe you have a disk that is dying. Do your drives have TLER
enabled on them?

> Dave Chinner wrote:
> > Was it (xfs_repair) making progress, just burning CPU, or was it just hung?
> > Attaching the actual output of repair is also helpful, as are all
> > the things here:
> > ...
> The xfs_repair seemed to be making progress, albeit very very slowly. In
> iotop i saw about 99% I/O usage on kswapd0. Looking at the HDD LED's of
> the array, i could see that there was hardly any access to it at all
> (only once about every 10-15 seconds).

kswapd is trying to reclaim kernel memory, which has nothing directly
to do with xfs_repair IO or cpu usage. Unless, of course, it is
trying to do reclaim to grab more memory for xfs_repair...

> I didn't include xfs_repair output, since it showed nothing unusual.
> ---snip---
> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
>         - zero log...
>         - scan filesystem freespace and inode maps...
>         - found root inode chunk
> Phase 3 - for each AG...
>         - scan and clear agi unlinked lists...
>         - process known inodes and perform inode discovery...
>         - agno = 0
>         ...
>         - agno = 14
>         - process newly discovered inodes...
> Phase 4 - check for duplicate blocks...
>         - setting up duplicate extent list...
>         - check for inodes claiming duplicate blocks...
>         - agno = 0
>         ...
>         - agno = 14
> Phase 5 - rebuild AG headers and trees...
>         - reset superblock...
> Phase 6 - check inode connectivity...
>         - resetting contents of realtime bitmap and summary inodes
>         - traversing filesystem ...
> ---snip---
> (and sitting there for about 72 hours)

It really hasn't made much progress if it's still traversing the fs
after 72 hours.

> Dave Chinner wrote:
> > If repair is swapping, then adding more RAM and/or faster swap space
> > will help. There is nothing that you can tweak that changes the
> > runtime or behaviour of phase 6 - it is single threaded and requires
> > traversal of the entire filesystem directory heirarchy to find all
> > the disconnected inodes so they can be moved to lost+found. And it
> > does write inodes, so if you have a slow SATA RAID5/6...
> Ok, so if i understand you correctly, none of the parameters will help
> for phase 6? I know that RAID-6 has slow write characteristics. But in
> fact I didn't see any writes at all with iotop and iostat.

If kswapd is doing all the work, then it's essentially got no memory
available. I would add significantly more swap space as well (e.g.
add swap files to the root filesystem - you can do this while repair
is running, too).

If there's sufficient swap space, then repair should use it fairly
efficiently - it doesn't tend to thrash swap, because most of its
memory usage is for information that is only accessed once per phase
or is parked until it is needed in a later phase, so it doesn't need
to be read from disk again...

> Dave Chinner wrote:
> >
> > See above. Those numbers don't include reclaimable memory like the
> > buffer cache footprint, which is affected by bhash and concurrency....
> >
> As said above, i did now double the RAM of the machine from 8GB to 16GB.
> Now I started xfs_repair again with the following options. I hope that
> the verbose output will help to understand better what's actually going on.
> # xfs_repair -m 8192 -vv /dev/sdb1
>
> Besides, is it wise to limit the memory with "-m" to keep the system
> from swapping or should I be better using the defaults (which would use
> 75% of RAM)?

Defaults, but it's really only a guideline for cache sizing. If
repair needs more memory to store metadata it is validating (like
the directory structure) then it will consume as much as it needs.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
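The "add swap files while repair is running" suggestion above can be
sketched as follows; the path and size are illustrative (1 GiB here
only to keep the example small - for a repair like this you would want
substantially more), and swapon itself requires root:

```shell
# mkswap/swapon often live in sbin directories not on a user's PATH.
export PATH="$PATH:/usr/sbin:/sbin"
swapfile=/var/tmp/repair.swap
# Allocate the file fully with dd (swapon refuses files with holes).
dd if=/dev/zero of="$swapfile" bs=1M count=1024 status=none
chmod 600 "$swapfile"            # swap files must not be world-readable
mkswap "$swapfile"               # write the swap signature
swapon "$swapfile" || echo "swapon needs root; re-run it with sudo"
```

Repeat with additional files to grow swap further; `swapon -s` shows
what is active.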
* Re: Speeding up xfs_repair on filesystem with millions of inodes
  2015-10-28 17:31 ` Michael Weissenbacher
  0 siblings, 0 replies; 5+ messages in thread

From: Michael Weissenbacher @ 2015-10-28 17:31 UTC (permalink / raw)
To: xfs

Hi Dave!
Everything is in good shape again. This time xfs_repair finished
without detecting any problems. So I suppose the only problem was that
there wasn't enough RAM.

---snip---
        XFS_REPAIR Summary    Wed Oct 28 15:02:30 2015

Phase           Start           End             Duration
Phase 1:        10/27 23:19:34  10/27 23:19:34
Phase 2:        10/27 23:19:34  10/27 23:19:57  23 seconds
Phase 3:        10/27 23:19:57  10/28 04:10:50  4 hours, 50 minutes, 53 seconds
Phase 4:        10/28 04:10:50  10/28 09:03:00  4 hours, 52 minutes, 10 seconds
Phase 5:        10/28 09:03:00  10/28 09:03:16  16 seconds
Phase 6:        10/28 09:03:16  10/28 15:02:29  5 hours, 59 minutes, 13 seconds
Phase 7:        10/28 15:02:29  10/28 15:02:29

Total run time: 15 hours, 42 minutes, 55 seconds
---snip---

On 28.10.2015 01:17, Dave Chinner wrote:
>
> Maybe you have a disk that is dying. Do your drives have TLER
> enabled on them?
>

Thanks for the hint. These are all enterprise-grade Nearline-SAS drives
(SEAGATE ST32000444SS) attached to a Dell PERC 6/i controller. I think
it isn't even possible to turn TLER on or off on them. They should all
be in good shape since the controller automatically does periodic
patrol reads.

On 28.10.2015 01:17, Dave Chinner wrote:
>
> If kswapd is doing all the work, then it's essentially got no memory
> available. I would add significantly more swap space as well (e.g.
> add swap files to the root filesystem - you can do this while repair
> is running, too). If there's sufficient swap space, then repair
> should use it fairly efficiently - it doesn't tend to thrash swap
> because most of it's memory usage is for information that is only
> accessed once per phase or is parked until it is needed in a later
> phase so it doesn't need to be read from disk again...
>

Good to know. However, the system was never low on swap. It has 40GB
swap available and never used more than 10GB during the repair (with
8GB RAM). On the second run, with 16GB RAM, xfs_repair never used any
swap at all.

On 28.10.2015 01:17, Dave Chinner wrote:
>
> Defaults, but it's really only a guideline for cache sizing. If
> repair needs more memory to store metadata it is validating (like
> the directory structure) then it will consume as much as it needs.
>

Will keep that in mind. Thanks again for your help.

with kind regards,
Michael
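For reference on the TLER point: where the drive and controller expose
it, the setting is queried as SCT ERC with smartmontools; behind a
MegaRAID-family controller such as the PERC 6/i, passthrough
addressing is usually required. A sketch, assuming smartctl is
installed (device paths and the drive number are illustrative, and
note SCT ERC is an ATA feature - SAS drives expose equivalent recovery
limits via mode pages instead):

```shell
# Directly attached drive: report the error-recovery timeouts, if any.
smartctl -l scterc /dev/sda || true
# Drive number 0 behind a MegaRAID/PERC controller:
smartctl -l scterc -d megaraid,0 /dev/sda || true
```

The trailing `|| true` keeps the snippet harmless on machines without
smartctl or without such drives.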