public inbox for linux-xfs@vger.kernel.org
* [xfs_check Out of memory: ]
@ 2013-12-27  6:48 Stor??
  2013-12-27  7:41 ` Jeff Liu
  0 siblings, 1 reply; 19+ messages in thread
From: Stor?? @ 2013-12-27  6:48 UTC (permalink / raw)
  To: xfs



Hey:

20T xfs file system

/usr/sbin/xfs_check: line 28: 14447 Killed                  xfs_db$DBOPTS -i -p xfs_check -c "check$OPTS" $1

snmpd invoked oom-killer: gfp_mask=0x1201d2, order=0, oomkilladj=0
Pid: 4753, comm: snmpd Tainted: G          2.6.27.19-5-default #95

Call Trace:
 [<ffffffff8020d899>] show_trace_log_lvl+0x41/0x58
 [<ffffffff8020deff>] dump_stack+0x69/0x6f
 [<ffffffff8027fc54>] oom_kill_process+0x5c/0x1fe
 [<ffffffff8028023a>] out_of_memory+0x169/0x1ff
 [<ffffffff802831d0>] __alloc_pages_internal+0x2e4/0x3ce
 [<ffffffff8028504a>] __do_page_cache_readahead+0x79/0x183
 [<ffffffff8027f46a>] filemap_fault+0x15d/0x337
 [<ffffffff8028bae2>] __do_fault+0x52/0x37a
 [<ffffffff8028d765>] handle_mm_fault+0x382/0x75e
 [<ffffffff80482b1d>] do_page_fault+0x45a/0x81c
 [<ffffffff80480a99>] error_exit+0x0/0x51
 [<00007fccea2b1d00>] 0x7fccea2b1d00

Mem-Info:
Node 0 DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
CPU    1: hi:    0, btch:   1 usd:   0
CPU    2: hi:    0, btch:   1 usd:   0
CPU    3: hi:    0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
CPU    0: hi:  186, btch:  31 usd: 160
CPU    1: hi:  186, btch:  31 usd: 173
CPU    2: hi:  186, btch:  31 usd:  50
CPU    3: hi:  186, btch:  31 usd: 176
Node 0 Normal per-cpu:
CPU    0: hi:  186, btch:  31 usd: 116
CPU    1: hi:  186, btch:  31 usd:  95
CPU    2: hi:  186, btch:  31 usd: 162
CPU    3: hi:  186, btch:  31 usd: 161
Active:3818904 inactive:3847 dirty:2 writeback:0 unstable:0
 free:19191 slab:25747 mapped:1530 pagetables:12518 bounce:0
Node 0 DMA free:7460kB min:4kB low:4kB high:4kB active:0kB inactive:0kB present:6160kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2716 15820 15820
Node 0 DMA32 free:55240kB min:2760kB low:3448kB high:4140kB active:2386964kB inactive:4kB present:2782036kB pages_scanned:4026093 all_unreclaimable? yes
lowmem_reserve[]: 0 0 13104 13104
Node 0 Normal free:14064kB min:13328kB low:16660kB high:19992kB active:12888652kB inactive:15384kB present:13418496kB pages_scanned:22792092 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 5*4kB 2*8kB 0*16kB 2*32kB 3*64kB 2*128kB 3*256kB 0*512kB 2*1024kB 0*2048kB 1*4096kB = 7460kB
Node 0 DMA32: 76*4kB 179*8kB 164*16kB 122*32kB 78*64kB 62*128kB 33*256kB 24*512kB 7*1024kB 1*2048kB 1*4096kB = 55240kB
Node 0 Normal: 192*4kB 58*8kB 12*16kB 3*32kB 0*64kB 0*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 3*4096kB = 14064kB
5354 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap  = 0kB
Total swap = 0kB
4161536 pages RAM
139058 pages reserved
24037 pages shared
3992120 pages non-shared
Out of memory: kill process 14447 (xfs_db) score 670225 or a child
Killed process 14447 (xfs_db)


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [xfs_check Out of memory: ]
  2013-12-27  6:48 [xfs_check Out of memory: ] Stor??
@ 2013-12-27  7:41 ` Jeff Liu
  2013-12-27  8:07   ` Arkadiusz Miśkiewicz
  0 siblings, 1 reply; 19+ messages in thread
From: Jeff Liu @ 2013-12-27  7:41 UTC (permalink / raw)
  To: Stor??, xfs

On 12/27 2013 14:48 PM, Stor?? wrote:
> Hey:
> 
> 20T xfs file system
> 
>  
> 
> /usr/sbin/xfs_check: line 28: 14447 Killed                 
> xfs_db$DBOPTS -i -p xfs_check -c "check$OPTS" $1
xfs_check is deprecated; please use xfs_repair -n instead.

The back traces show that your system ran out of memory while executing
xfs_check; snmpd tripped the OOM killer, which then killed xfs_db.
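Concretely, the difference looks like this (the device name is a
placeholder; the filesystem must be unmounted before either tool runs):

```shell
# What the deprecated /usr/sbin/xfs_check wrapper effectively runs:
#   xfs_db $DBOPTS -i -p xfs_check -c "check$OPTS" <device>

# The supported replacement: a read-only ("no modify") repair pass.
umount /dev/sdXN
xfs_repair -n /dev/sdXN    # -n: report inconsistencies, change nothing
```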

Thanks,
-Jeff 


* Re: [xfs_check Out of memory: ]
  2013-12-27  7:41 ` Jeff Liu
@ 2013-12-27  8:07   ` Arkadiusz Miśkiewicz
  2013-12-27 22:42     ` Dave Chinner
  0 siblings, 1 reply; 19+ messages in thread
From: Arkadiusz Miśkiewicz @ 2013-12-27  8:07 UTC (permalink / raw)
  To: xfs; +Cc: Stor??, Jeff Liu

On Friday 27 of December 2013, Jeff Liu wrote:
> On 12/27 2013 14:48 PM, Stor?? wrote:
> > Hey:
> > 
> > 20T xfs file system
> > 
> > 
> > 
> > /usr/sbin/xfs_check: line 28: 14447 Killed
> > xfs_db$DBOPTS -i -p xfs_check -c "check$OPTS" $1
> 
> xfs_check is deprecated and please use xfs_repair -n instead.
> 
> The following back traces show us that it seems your system is run out
> memory when executing xfs_check, thus, snmp daemon/xfs_db were killed.

This reminds me of a question...

Could xfs_repair store its temporary data (some of that data, the biggest
part) on disk instead of in memory?

I don't know if that would make sense, hence the question. I'm not sure
whether xfs_repair needs to access that data frequently (in which case
keeping it on disk makes no sense) or only needs it for iteration in some
later phase (in which case on-disk storage should work).

Anyway, memory usage of xfs_repair has always been a problem for me (e.g.
16GB was not enough for a 7TB fs due to the huge number of files stored).
With the parallel scan it's obviously even worse.

> Thanks,
> -Jeff

-- 
Arkadiusz Miśkiewicz, arekm / maven.pl


* Re: [xfs_check Out of memory: ]
  2013-12-27  8:07   ` Arkadiusz Miśkiewicz
@ 2013-12-27 22:42     ` Dave Chinner
  2013-12-27 23:20       ` Arkadiusz Miśkiewicz
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2013-12-27 22:42 UTC (permalink / raw)
  To: Arkadiusz Miśkiewicz; +Cc: Stor??, Jeff Liu, xfs

On Fri, Dec 27, 2013 at 09:07:22AM +0100, Arkadiusz Miśkiewicz wrote:
> On Friday 27 of December 2013, Jeff Liu wrote:
> > On 12/27 2013 14:48 PM, Stor?? wrote:
> > > Hey:
> > > 
> > > 20T xfs file system
> > > 
> > > 
> > > 
> > > /usr/sbin/xfs_check: line 28: 14447 Killed
> > > xfs_db$DBOPTS -i -p xfs_check -c "check$OPTS" $1
> > 
> > xfs_check is deprecated and please use xfs_repair -n instead.
> > 
> > The following back traces show us that it seems your system is run out
> > memory when executing xfs_check, thus, snmp daemon/xfs_db were killed.
> 
> This reminds me a question...
> 
> Could xfs_repair store its temporary data (some of that data, the biggest 
> parte) on disk instead of in memory?

Where on disk? We can't write to the disk until we've verified all
the free space is really free space, and guess what uses all the
memory? Besides, if the information is not being referenced
regularly (and it usually isn't), then swap space is about as
efficient as any database we might come up with...

> I don't know it that would make sense, so asking. Not sure if xfs_repair needs 
> to access that data frequently (so on disk makes no sense) or maybe it needs 
> only for iteration purposes in some later phase (so on disk should work).
> 
> Anyway memory usage of xfs_repair was always a problem for me (like 16GB not 
> enough for 7TB fs due to huge amount of fies being stored). With parallel scan 
> it's even worse obviously.

Yes, your problem is that the filesystem you are checking contains
40+GB of metadata and a large amount of that needs to be kept in
memory from phase 3 through to phase 6. If you really want to add
some kind of database interface to store this information somewhere
else, then I'll review the patches. ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [xfs_check Out of memory: ]
  2013-12-27 22:42     ` Dave Chinner
@ 2013-12-27 23:20       ` Arkadiusz Miśkiewicz
  2013-12-28 16:55         ` Stan Hoeppner
  2013-12-29  9:50         ` Dave Chinner
  0 siblings, 2 replies; 19+ messages in thread
From: Arkadiusz Miśkiewicz @ 2013-12-27 23:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Stor??, Jeff Liu, xfs

On Friday 27 of December 2013, Dave Chinner wrote:
> On Fri, Dec 27, 2013 at 09:07:22AM +0100, Arkadiusz Miśkiewicz wrote:
> > On Friday 27 of December 2013, Jeff Liu wrote:
> > > On 12/27 2013 14:48 PM, Stor?? wrote:
> > > > Hey:
> > > > 
> > > > 20T xfs file system
> > > > 
> > > > 
> > > > 
> > > > /usr/sbin/xfs_check: line 28: 14447 Killed
> > > > xfs_db$DBOPTS -i -p xfs_check -c "check$OPTS" $1
> > > 
> > > xfs_check is deprecated and please use xfs_repair -n instead.
> > > 
> > > The following back traces show us that it seems your system is run out
> > > memory when executing xfs_check, thus, snmp daemon/xfs_db were killed.
> > 
> > This reminds me a question...
> > 
> > Could xfs_repair store its temporary data (some of that data, the biggest
> > parte) on disk instead of in memory?
> 
> Where on disk? 

In a directory/file that I tell it to use (since I usually have a few xfs
filesystems on a single server, and so far only one breaks at a time).

> We can't write to the disk until we've verified all
> the free space is really free space, and guess what uses all the
> memory? Besides, if the information is not being referenced
> regularly (and it usually isn't), then swap space is about as
> efficient as any database we might come up with...

It's not about efficiency. It's about not killing the system (by not
eating all memory and triggering the OOM killer). If I can (optionally)
trade repair speed for not eating RAM, that is sometimes desirable.
Better a slow repair than no repair 8)

Could xfs_repair perhaps tell the kernel that this data should always end
up on swap first (allowing other programs/daemons to use regular memory)?
(I don't know of a kernel interface that would allow that, though.) That
would be a half-baked solution.
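There is no mainline "swap this process's data first" knob that I know
of, but a crude user-space stand-in (a sketch only, not anything
xfs_repair itself supports) is to cap the repair's address space with
ulimit, so its allocations fail with ENOMEM on their own instead of
dragging the whole box into the OOM killer:

```shell
# Cap the repair's address space; the subshell keeps the limit from
# leaking into the rest of the session.  The device is a placeholder:
#
#   ( ulimit -v $((40 * 1024 * 1024)); xfs_repair /dev/sdXN )  # 40 GiB cap
#
# Demo of the mechanism with a stand-in memory hog under a ~100 MB cap:
( ulimit -v 102400
  awk 'BEGIN { hog = sprintf("%200000000s", "") }' 2>/dev/null ) \
  && echo "allocation unexpectedly succeeded" \
  || echo "allocation refused under the cap"
```

With such a cap the repair dies cleanly when it hits the limit, rather
than the OOM killer picking arbitrary victims among the other daemons.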

> > I don't know it that would make sense, so asking. Not sure if xfs_repair
> > needs to access that data frequently (so on disk makes no sense) or
> > maybe it needs only for iteration purposes in some later phase (so on
> > disk should work).
> > 
> > Anyway memory usage of xfs_repair was always a problem for me (like 16GB
> > not enough for 7TB fs due to huge amount of fies being stored). With
> > parallel scan it's even worse obviously.
> 
> Yes, your problem is that the filesystem you are checking contains
> 40+GB of metadata and a large amount of that needs to be kept in
> memory from phase 3 through to phase 6.

Is that data (or most of it) accessed frequently? Or is it something
that's iterated over, let's say, once in each phase?


Anyway, the current "fun" with repair and huge filesystems looks like this:
- 16GB of memory; run xfs_repair; the system becomes unusable because the
whole RAM is eaten (ends in OOM); wait several hours
- reboot, add 20GB of swap, run xfs_repair; the same happens again; wait
half a day
- reboot, add another 20GB of swap space, run xfs_repair - success!; wait
another day
- in all of these steps the system is simply unusable for other services.
Nothing else will work, since the entire RAM gets eaten by repair. So it
doesn't help me to have 4 xfs filesystems with only one of them broken -
I have to shut down all services just for that one repair to work
- with the parallel repair from git it is obviously even worse (OOM
happens sooner rather than later)
- I can't add more RAM easily: the machine is at a remote location, uses
obsolete DDR2, has no free RAM slots, and so on
- the total repair time for all these steps is several times longer than
necessary (the successful repair took 7.5h, while all the steps together
took 2 days)
- worse, the tools give no estimate of the RAM needed, but that's AFAIK
unfixable. This means it is not known in advance how much memory will be
needed; you have to run repair and see. Also, if more files get stored,
the next repair in a few months could require twice as much RAM. You
never know what to expect.
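For what it's worth, the "add swap and retry" dance above can at least be
scripted; a minimal sketch (file name and sizes are examples, and
swapon/swapoff need root):

```shell
# Create and format a swap file for xfs_repair to spill into.
# 16 MiB here only as illustration; real use would be e.g. count=20480
# for a 20 GiB file.
dd if=/dev/zero of=./swapfile.demo bs=1M count=16 status=none
chmod 600 ./swapfile.demo
command -v mkswap >/dev/null && mkswap ./swapfile.demo
# swapon ./swapfile.demo    # activate; swapoff + rm when the repair is done
ls -l ./swapfile.demo
```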

Now how to prevent these problems? Currently I see only one "solution":
add more RAM.

Unfortunately that's not really a solution - it won't work in many of the
cases described above.

So it looks like my future backup servers will need to have 64GB, 128GB
or maybe even more RAM that is there only for xfs_repair's use. That's a
gigantic waste of resources. And there are modern processors that don't
support more than 32GB of RAM - like the "Intel Xeon E3-1220v2" (
http://tnij.org/tkqas9e ). So adding RAM means replacing the CPU, and
likely the mainboard too. Fun :)

> If you really want to add
> some kind of database interface to store this information somewhere
> else, then I'll review the patches. ;)

Right. So the only "easy" task left is finding someone who understands
the code and can write such an interface. Anyone?

IMO RAM usage is a real problem for xfs_repair, and there has to be some
upstream solution other than the "buy more" (and waste more) approach.

> Cheers,
> 
> Dave.

-- 
Arkadiusz Miśkiewicz, arekm / maven.pl


* Re: [xfs_check Out of memory: ]
  2013-12-27 23:20       ` Arkadiusz Miśkiewicz
@ 2013-12-28 16:55         ` Stan Hoeppner
  2013-12-28 17:35           ` Jay Ashworth
  2013-12-28 23:39           ` Arkadiusz Miśkiewicz
  2013-12-29  9:50         ` Dave Chinner
  1 sibling, 2 replies; 19+ messages in thread
From: Stan Hoeppner @ 2013-12-28 16:55 UTC (permalink / raw)
  To: Arkadiusz Miśkiewicz, Dave Chinner; +Cc: Stor??, Jeff Liu, xfs

On 12/27/2013 5:20 PM, Arkadiusz Miśkiewicz wrote:
...
> - can't add more RAM easily, machine is at remote location, uses obsolete 
> DDR2, have no more ram slots and so on
...
> So looks like my future backup servers will need to have 64GB, 128GB or maybe 
> even more ram that will be there only for xfs_repair usage. That's gigantic 
> waste of resources. And there are modern processors that don't work with more 
> than 32GB of ram - like "Intel Xeon E3-1220v2" ( http://tnij.org/tkqas9e ). So 
> adding ram means replacing CPU, likely replacing mainboard. Fun :)
..
> IMO ram usage is a real problem for xfs_repair and there has to be some 
> upstream solution other than "buy more" (and waste more) approach.

The problem isn't xfs_repair.  The problem is that you expect this tool
to handle an infinite number of inodes while using a finite amount of
memory, or at least somewhat less memory than you have installed.  We
don't see your problem reported very often which seems to indicate your
situation is a corner case, or that others simply size their systems
properly without complaint.

If you'd actually like advice on how to solve this, today, with
realistic solutions, in lieu of the devs recoding xfs_repair for the
single goal of using less memory, then here are your options:

1.  Rewrite or redo your workload to not create so many small files,
    so many inodes, i.e. use a database
2.  Add more RAM to the system
3.  Add an SSD of sufficient size/speed for swap duty to handle
    xfs_repair requirements for filesystems with arbitrarily high
    inode counts

Your quickest, cheapest, and all encompassing solution to this problem
today is #3.  This prevents the need to size the RAM on each machine to
meet the needs of xfs_repair given an arbitrary number of inodes, as
you'll always have more than enough swap.  And it is likely less
expensive than adding/replacing DIMMs.  The fastest random read/write
IOPS SSD on the market is the Samsung 840 Pro which is ~$1/GB in the
States, a 128GB unit for $130.  This unit has a 5 year warranty and
sustained ~90K read/write 4KB IOPS.

Create a 100GB swap partition and leave the remainder unallocated.  The
unallocated space will automatically be used for GC and wear leveling,
increasing the life of all cells in the drive.

The facts that the systems are remote and that you have no more DIMM
slots are not good arguments to make in this context.  Every system
will require some type of hardware addition/replacement/maintenance.
And this is not the first software "problem" that requires more hardware
to solve.  If your application that creates these millions of files
needed twice as much RAM, forcing an upgrade, would you be complaining
this way on their mailing list?  If so, I'd suggest the problem lies
somewhere other than xfs_repair and that application.

-- 
Stan


* Re: [xfs_check Out of memory: ]
  2013-12-28 16:55         ` Stan Hoeppner
@ 2013-12-28 17:35           ` Jay Ashworth
  2013-12-28 22:01             ` Stan Hoeppner
  2013-12-28 23:39           ` Arkadiusz Miśkiewicz
  1 sibling, 1 reply; 19+ messages in thread
From: Jay Ashworth @ 2013-12-28 17:35 UTC (permalink / raw)
  To: xfs

----- Original Message -----
> From: "Stan Hoeppner" <stan@hardwarefreak.com>

> Create a 100GB swap partition and leave the remainder unallocated. The
> unallocated space will automatically be used for GC and wear leveling,
> increasing the life of all cells in the drive.

*Great* tip.  :-)

Cheers, 
-- jra
-- 
Jay R. Ashworth                  Baylink                       jra@baylink.com
Designer                     The Things I Think                       RFC 2100
Ashworth & Associates     http://baylink.pitas.com         2000 Land Rover DII
St Petersburg FL USA               #natog                      +1 727 647 1274


* Re: [xfs_check Out of memory: ]
  2013-12-28 17:35           ` Jay Ashworth
@ 2013-12-28 22:01             ` Stan Hoeppner
  0 siblings, 0 replies; 19+ messages in thread
From: Stan Hoeppner @ 2013-12-28 22:01 UTC (permalink / raw)
  To: Jay Ashworth, xfs

On 12/28/2013 11:35 AM, Jay Ashworth wrote:
> ----- Original Message -----
>> From: "Stan Hoeppner" <stan@hardwarefreak.com>
> 
>> Create a 100GB swap partition and leave the remainder unallocated. The
>> unallocated space will automatically be used for GC and wear leveling,
>> increasing the life of all cells in the drive.
> 
> *Great* tip.  :-)

It's not so much a tip as a restatement of common knowledge, or of what
should be common knowledge by now.  Over-provisioning of SSDs has been
written about pretty extensively.

One benefit of over provisioning I failed to mention above is that it
also typically decreases the latency and increases the throughput of
random writes significantly.  Linux swap writes tend to be sequential,
so this aspect of over provisioning won't pay dividends for the OP's use
case.

-- 
Stan


* Re: [xfs_check Out of memory: ]
  2013-12-28 16:55         ` Stan Hoeppner
  2013-12-28 17:35           ` Jay Ashworth
@ 2013-12-28 23:39           ` Arkadiusz Miśkiewicz
  2013-12-29  0:54             ` Stan Hoeppner
  1 sibling, 1 reply; 19+ messages in thread
From: Arkadiusz Miśkiewicz @ 2013-12-28 23:39 UTC (permalink / raw)
  To: stan; +Cc: Stor??, Jeff Liu, xfs

On Saturday 28 of December 2013, Stan Hoeppner wrote:
> On 12/27/2013 5:20 PM, Arkadiusz Miśkiewicz wrote:
> ...
> 
> > - can't add more RAM easily, machine is at remote location, uses obsolete
> > DDR2, have no more ram slots and so on
> 
> ...
> 
> > So looks like my future backup servers will need to have 64GB, 128GB or
> > maybe even more ram that will be there only for xfs_repair usage. That's
> > gigantic waste of resources. And there are modern processors that don't
> > work with more than 32GB of ram - like "Intel Xeon E3-1220v2" (
> > http://tnij.org/tkqas9e ). So adding ram means replacing CPU, likely
> > replacing mainboard. Fun :)
> 
> ..
> 
> > IMO ram usage is a real problem for xfs_repair and there has to be some
> > upstream solution other than "buy more" (and waste more) approach.
> 
> The problem isn't xfs_repair.  

This problem is fully solvable on the xfs_repair side (provided disk
space outside the broken xfs filesystem is available).

> The problem is that you expect this tool
> to handle an infinite number of inodes while using a finite amount of
> memory, or at least somewhat less memory than you have installed.  We
> don't see your problem reported very often which seems to indicate your
> situation is a corner case, or that others simply

It's not common, but it happens from time to time, judging by the
questions on #xfs.

> size their systems
> properly without complaint.

I guess having millions of tiny files (a few kB each) is simply not
common, rather than this being a matter of "properly sizing systems".

> If you'd actually like advice on how to solve this, today, with
> realistic solutions, in lieu of the devs recoding xfs_repair for the
> single goal of using less memory, then here are your options:
> 
> 1.  Rewrite or redo your workload to not create so many small files,
>     so many inodes, i.e. use a database

It's a backup copy that needs to be directly accessible (so that, for
example, you could run production directly from the backup server). That
solution won't work.

> 2.  Add more RAM to the system

> 3.  Add an SSD of sufficient size/speed for swap duty to handle
>     xfs_repair requirements for filesystems with arbitrarily high
>     inode counts

That would work... if the server were locally available.

Right now my working "solution" is:
- add 40GB of swap space
- stop all other services
- run xfs_repair, leave it for 1-2 days

Adding SSD is my only long term option it seems.

> The fact that the systems are remote, that you have no more DIMM slots,
> are not good arguments for you to make in this context.  Every system
> will require some type of hardware addition/replacement/maintenance.
> And this is not the first software "problem" that requires more hardware
> to solve.  If your application that creates these millions of files
> needed twice as much RAM, forcing an upgrade, would you be complaining
> this way on their mailing list?

If that application could do its job without requiring twice the RAM,
then I would surely write about it on their mailing list.

> If so I'd suggest the problem lay
> somewhere other than xfs_repair and that application.

IMO this problem could be solved on the xfs_repair side, but well...
someone would have to write the patches, and that's unlikely to happen.

So now the more important question: how do you actually estimate these
things? Example: a 10TB xfs filesystem fully written with files of 10kB
each (html pages, images, etc.) - a web server. How much RAM would my
server need for the repair to succeed?

-- 
Arkadiusz Miśkiewicz, arekm / maven.pl


* Re: [xfs_check Out of memory: ]
  2013-12-28 23:39           ` Arkadiusz Miśkiewicz
@ 2013-12-29  0:54             ` Stan Hoeppner
  2013-12-29 11:23               ` Arkadiusz Miśkiewicz
  0 siblings, 1 reply; 19+ messages in thread
From: Stan Hoeppner @ 2013-12-29  0:54 UTC (permalink / raw)
  To: Arkadiusz Miśkiewicz; +Cc: Stor??, Jeff Liu, xfs

On 12/28/2013 5:39 PM, Arkadiusz Miśkiewicz wrote:
> On Saturday 28 of December 2013, Stan Hoeppner wrote:
>> On 12/27/2013 5:20 PM, Arkadiusz Miśkiewicz wrote:
>> ...
>>
>>> - can't add more RAM easily, machine is at remote location, uses obsolete
>>> DDR2, have no more ram slots and so on
>>
>> ...
>>
>>> So looks like my future backup servers will need to have 64GB, 128GB or
>>> maybe even more ram that will be there only for xfs_repair usage. That's
>>> gigantic waste of resources. And there are modern processors that don't
>>> work with more than 32GB of ram - like "Intel Xeon E3-1220v2" (
>>> http://tnij.org/tkqas9e ). So adding ram means replacing CPU, likely
>>> replacing mainboard. Fun :)
>>
>> ..
>>
>>> IMO ram usage is a real problem for xfs_repair and there has to be some
>>> upstream solution other than "buy more" (and waste more) approach.
>>
>> The problem isn't xfs_repair.  
> 
> This problem is fully solvable on xfs_repair side (if disk space outside of 
> broken xfs fs is available).
> 
>> The problem is that you expect this tool
>> to handle an infinite number of inodes while using a finite amount of
>> memory, or at least somewhat less memory than you have installed.  We
>> don't see your problem reported very often which seems to indicate your
>> situation is a corner case, or that others simply
> 
> It's not something common. Happens from time to time judging based on #xfs 
> questions.
> 
>> size their systems
>> properly without complaint.
> 
> I guess having milions of tiny files (few kb each file) in simply not 
> something common rather than "properly sizing systems".
> 
>> If you'd actually like advice on how to solve this, today, with
>> realistic solutions, in lieu of the devs recoding xfs_repair for the
>> single goal of using less memory, then here are your options:
>>
>> 1.  Rewrite or redo your workload to not create so many small files,
>>     so many inodes, i.e. use a database
> 
> It's a backup copy that needs to be directly accessible (so you could run 
> production directly from backup server for example).  That solution won't 
> work.

So it's an rsnapshot server and you have many millions of hardlinks.
The obvious solution here is to simply use a greater number of smaller
XFS filesystems with fewer hardlinks in each.  This is by far the best
way to avoid the xfs_repair memory consumption issue due to massive
inode count.

You might even be able to accomplish this using sparse files.  This
would preclude the need to repartition your storage for more
filesystems, and would allow better utilization of your storage.  Dave
is the sparse filesystem expert so I'll defer to him on whether this is
possible, or applicable to your workload.
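A loopback sketch of that idea (illustrative only: the paths are
examples, mkfs.xfs and mount need root, and whether it fits this
workload is exactly the question deferred to Dave above):

```shell
# A sparse backing file consumes real disk space only as data is written.
truncate -s 1T /srv/backup-a.img
mkfs.xfs /srv/backup-a.img
mkdir -p /srv/backup-a
mount -o loop /srv/backup-a.img /srv/backup-a
# Each such filesystem is repaired independently, which bounds the inode
# count - and therefore xfs_repair's memory use - per repair run.
```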

>> 2.  Add more RAM to the system
> 
>> 3.  Add an SSD of sufficient size/speed for swap duty to handle
>>     xfs_repair requirements for filesystems with arbitrarily high
>>     inode counts
> 
> That would work... if the server was locally available.
> 
> Right now my working "solution" is:
> - add 40GB of swap space
> - stop all other services
> - run xfs_repair, leave it for 1-2 days
> 
> Adding SSD is my only long term option it seems.

It's not a perfect solution by any means, and the SSD you choose matters
greatly, which is why I recommended the Samsung 840 Pro.  More RAM is the
best option with your current setup, but is not available for your
system.  Using more filesystems with fewer inodes in each is by far the
best option, WRT xfs_repair and limited memory.

>> The fact that the systems are remote, that you have no more DIMM slots,
>> are not good arguments for you to make in this context.  Every system
>> will require some type of hardware addition/replacement/maintenance.
>> And this is not the first software "problem" that requires more hardware
>> to solve.  If your application that creates these millions of files
>> needed twice as much RAM, forcing an upgrade, would you be complaining
>> this way on their mailing list?
> 
> If that application could do its job without requiring 2xRAM then surely I 
> would write about this to ml.
> 
>> If so I'd suggest the problem lay
>> somewhere other than xfs_repair and that application.
> 
> IMO this problem could be solved on xfs_repair side but well... someone would 
> have to write patches and that's unlikely to happen.
> 
> So now more important question. How to actually estimate these things? 
> Example: 10TB xfs filesystem fully written with files - 10kb each file (html 
> pages, images etc) - web server. How much ram my server would need for repair 
> to succeed?

One method is to simply ask xfs_repair how much memory it needs to
repair the filesystem.  Usage:

$ umount /mount/point
$ xfs_repair -n -m 1 -vv /mount/point
$ mount /mount/point

e.g.

$ umount /dev/sda7
$ xfs_repair -n -m 1 -vv /dev/sda7
Phase 1 - find and verify superblock...
        - max_mem = 1024, icount = 85440, imem = 333, dblock =
          24414775, dmem = 11921
Required memory for repair is greater that the maximum specified with
the -m option. Please increase it to at least 60.
$ mount /dev/sda7

This is a 100GB inode32 test filesystem with 83K inodes.  xfs_repair
tells us it requires a minimum of 60MB of memory for this filesystem.  This
is a minimum.  The actual repair may require more, but the figure given
should be pretty close.

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [xfs_check Out of memory: ]
  2013-12-27 23:20       ` Arkadiusz Miśkiewicz
  2013-12-28 16:55         ` Stan Hoeppner
@ 2013-12-29  9:50         ` Dave Chinner
  2013-12-29 11:57           ` Arkadiusz Miśkiewicz
  2013-12-30  1:55           ` Stan Hoeppner
  1 sibling, 2 replies; 19+ messages in thread
From: Dave Chinner @ 2013-12-29  9:50 UTC (permalink / raw)
  To: Arkadiusz Miśkiewicz; +Cc: Stor??, Jeff Liu, xfs

On Sat, Dec 28, 2013 at 12:20:39AM +0100, Arkadiusz Miśkiewicz wrote:
> On Friday 27 of December 2013, Dave Chinner wrote:
> > On Fri, Dec 27, 2013 at 09:07:22AM +0100, Arkadiusz Miśkiewicz wrote:
> > > On Friday 27 of December 2013, Jeff Liu wrote:
> > > > On 12/27 2013 14:48 PM, Stor?? wrote:
> > > > > Hey:
> > > > > 
> > > > > 20T xfs file system
> > > > > 
> > > > > 
> > > > > 
> > > > > /usr/sbin/xfs_check: line 28: 14447 Killed
> > > > > xfs_db$DBOPTS -i -p xfs_check -c "check$OPTS" $1
> > > > 
> > > > xfs_check is deprecated; please use xfs_repair -n instead.
> > > > 
> > > > The following back traces show that your system ran out of
> > > > memory when executing xfs_check; thus, the snmp daemon/xfs_db were killed.
> > > 
> > > This reminds me a question...
> > > 
> > > Could xfs_repair store its temporary data (some of that data, the biggest
> > > part) on disk instead of in memory?
> > 
> > Where on disk? 
> 
> In a directory/file that I'll tell it to use (since I usually have a few xfs 
> filesystems on a single server and so far only one at a time breaks).

How is that any different from just adding swap space to the server?

> Could xfs_repair tell kernel that this data should always end up on swap first 
> (allowing other programs/daemons to use regular memory) perhaps? (Don't know 
> interface that would allow to do that in kernel though). That would be some 
> half baked solution.

It's up to the kernel to manage what gets swapped and what doesn't.
I suppose you could use control groups to constrict the RAM
xfs_repair uses, but how to configure such a policy is way outside my
area of expertise.

> > > I don't know if that would make sense, so asking. Not sure if xfs_repair
> > > needs to access that data frequently (so on disk makes no sense) or
> > > maybe it needs only for iteration purposes in some later phase (so on
> > > disk should work).
> > > 
> > > Anyway memory usage of xfs_repair was always a problem for me (like 16GB
> > > not enough for 7TB fs due to huge amount of files being stored). With
> > > parallel scan it's even worse obviously.
> > 
> > Yes, your problem is that the filesystem you are checking contains
> > 40+GB of metadata and a large amount of that needs to be kept in
> > memory from phase 3 through to phase 6.
> 
> Is that data (or most of that data) frequenly accessed? Or something that's 
> iterated over let say once in each phase? 

free/used space is tracked in a btree. It gets set up, for example,
in phase 3, then iterated in phase 4 where inode bmap btrees are
validated, and then phase 5 rebuilds the on disk free space trees
from what is validated as used/free space in phase 4.

So, the free space information is used in each phase it is required,
and then it is discarded from memory.

Inodes are tracked in an AVL tree. They get set up and validated
against the AGI inode btrees in phase 3, then validated against the
directory structure in phase 6. Most get tossed out of memory during
phase 6, but those with multiple link counts are held on to until
phase 7, where the link counts are validated.

So, the data that is pulled into memory during phases 2 and 3 (i.e.
all the metadata in the filesystem) cannot be fully validated and
freed until later phases complete. The indexes are regularly
traversed, so they should not get swapped. The leaves should only get hit
once per phase, so they should be swapped in and out only once per phase
that uses the information.

[snip trial and error xfs_repair OOM complaints]

Basically, you don't know how much metadata is in your filesystem,
so you don't know how much swap space to add up front. Simple: add
100GB of swap file on a fast drive (e.g. an SSD) and that will make
repair run to completion faster than any amount of work I could do
to make it faster.
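
As a concrete sketch of that approach (the path, size and device names
below are illustrative only, not from this thread):

```shell
# Add a large temporary swap file on a fast drive for the duration of
# the repair. dd (rather than a sparse file) guarantees the file has
# no holes, which swapon requires.
dd if=/dev/zero of=/mnt/ssd/repair.swap bs=1M count=102400
chmod 600 /mnt/ssd/repair.swap
mkswap /mnt/ssd/repair.swap
swapon /mnt/ssd/repair.swap

xfs_repair /dev/sdX        # the filesystem being repaired

# Remove the extra swap once the repair completes
swapoff /mnt/ssd/repair.swap
rm /mnt/ssd/repair.swap
```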

Basically, you are asking us to make xfs_repair omniscient so it
always either succeeds or fails immediately so that you don't have
to plan for disaster recovery...

> - what's worse tools give no estimations of ram needed etc but that's afaik 

# xfs_repair -vv -m 1 -n /dev/<foo>

> > If you really want to add
> > some kind of database interface to store this information somewhere
> > else, then I'll review the patches. ;)
> 
> Right. So the only "easy" task left is finding someone who understands the code 
> and can write such an interface. Anyone?
>
> IMO ram usage is a real problem for xfs_repair and there has to be some 
> upstream solution other than "buy more" (and waste more) approach.

I think you are forgetting that developer time is *expensive* and
*scarce*. This is essentially a solved problem: An SSD in a USB3
enclosure as a temporary swap device is by far the most cost
effective way to make repair scale to arbitrary amounts of metadata.
It certainly scales far better than developer time and testing
resources...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [xfs_check Out of memory: ]
  2013-12-29  0:54             ` Stan Hoeppner
@ 2013-12-29 11:23               ` Arkadiusz Miśkiewicz
  0 siblings, 0 replies; 19+ messages in thread
From: Arkadiusz Miśkiewicz @ 2013-12-29 11:23 UTC (permalink / raw)
  To: stan; +Cc: Stor??, Jeff Liu, xfs

On Sunday 29 of December 2013, Stan Hoeppner wrote:
> On 12/28/2013 5:39 PM, Arkadiusz Miśkiewicz wrote:
> > On Saturday 28 of December 2013, Stan Hoeppner wrote:
> >> On 12/27/2013 5:20 PM, Arkadiusz Miśkiewicz wrote:

> > It's a backup copy that needs to be directly accessible (so you could run
> > production directly from backup server for example).  That solution won't
> > work.
> 
> So it's an rsnapshot server and you have many millions of hardlinks.

Something like that (initially it was just a copy of a few other servers but now 
hardlinks are also in use).

> The obvious solution here is to simply use a greater number of smaller
> XFS filesystems with fewer hardlinks in each.  This is by far the best
> way to avoid the xfs_repair memory consumption issue due to massive
> inode count. 
> You might even be able to accomplish this using sparse files.  This
> would preclude the need to repartition your storage for more
> filesystems, and would allow better utilization of your storage.  Dave
> is the sparse filesystem expert so I'll defer to him on whether this is
> possible, or applicable to your workload.

I'll go the SSD way since making things more complicated just for xfs_repair isn't 
sane.

[...]
> > Adding SSD is my only long term option it seems.
> 
> It's not a perfect solution by any means, and the SSD you choose matters
> greatly, which is why I recommended the Samsung 840 Pro.  More RAM is the
> best option with your current setup, but is not available for your
> system.  Using more filesystems with fewer inodes in each is by far the
> best option, WRT xfs_repair and limited memory.

The server is over 30TB but I used 7TB partitions. Unfortunately it's not 
possible to go lower with these since hardlinks need to be on the same 
partition etc.

[...]
> > So now the more important question. How to actually estimate these things?
> > Example: 10TB xfs filesystem fully written with files - 10kb each file
> > (html pages, images etc) - web server. How much RAM would my server need
> > for repair to succeed?
> 
> One method is to simply ask xfs_repair how much memory it needs to
> repair the filesystem.  Usage:

Assume I'm planning a new server and I need to figure that out without actually 
having the hardware or fs. How do I estimate this?

If there is a way I'll gladly describe it and add to xfs faq.

xfs_repair's estimate doesn't work either - see below.

> $ umount /mount/point
> $ xfs_repair -n -m 1 -vv /mount/point
> $ mount /mount/point
> 
> e.g.
> 
> $ umount /dev/sda7
> $ xfs_repair -n -m 1 -vv /dev/sda7
> Phase 1 - find and verify superblock...
>         - max_mem = 1024, icount = 85440, imem = 333, dblock =
>           24414775, dmem = 11921
> Required memory for repair is greater that the maximum specified with
> the -m option. Please increase it to at least 60.
> $ mount /dev/sda7

Phase 1 - find and verify superblock...
        - max_mem = 1024, icount = 124489792, imem = 486288, dblock = 
1953509376, dmem = 953862
Required memory for repair is greater that the maximum specified
with the -m option. Please increase it to at least 1455.

So a minimum of 1.5GB, but the real usage was nowhere near that minimal estimate. 
xfs_repair needed somewhere around 30-40GB for this fs.
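
Incidentally, both "-m 1 -vv" outputs in this thread fit a simple
model: the reported minimum is roughly icount/256 KB plus dblock/2048 KB
plus about 48MB of fixed overhead. That is reverse-engineered from the
two samples, not a documented xfs_repair formula, but it lets you
ballpark the tool's minimum when planning a filesystem that doesn't
exist yet - keeping in mind that, as above, real usage on a badly
damaged fs can be an order of magnitude higher:

```shell
# Unofficial estimate of xfs_repair's reported minimum memory (MB).
# The constants (256, 2048, 48) are inferred from the two sample
# outputs in this thread, not from any documented interface.
estimate_repair_mem_mb() {
    awk -v icount="$1" -v dblock="$2" \
        'BEGIN { printf "%.0f\n", (icount/256 + dblock/2048)/1024 + 48 }'
}

estimate_repair_mem_mb 85440 24414775        # -> 60
estimate_repair_mem_mb 124489792 1953509376  # -> 1454
```

Both results match the figures xfs_repair printed (60 and 1455) to
within a MB.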

So 2x64GB SSD (raid1) for swap should be ok for now, but in the long term 2x128GB 
is the way to go it seems.

-- 
Arkadiusz Miśkiewicz, arekm / maven.pl


* Re: [xfs_check Out of memory: ]
  2013-12-29  9:50         ` Dave Chinner
@ 2013-12-29 11:57           ` Arkadiusz Miśkiewicz
  2013-12-29 23:27             ` Dave Chinner
  2013-12-30  1:55           ` Stan Hoeppner
  1 sibling, 1 reply; 19+ messages in thread
From: Arkadiusz Miśkiewicz @ 2013-12-29 11:57 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Stor??, Jeff Liu, xfs

On Sunday 29 of December 2013, Dave Chinner wrote:
> On Sat, Dec 28, 2013 at 12:20:39AM +0100, Arkadiusz Miśkiewicz wrote:
> > On Friday 27 of December 2013, Dave Chinner wrote:
> > > On Fri, Dec 27, 2013 at 09:07:22AM +0100, Arkadiusz Miśkiewicz wrote:
> > > > On Friday 27 of December 2013, Jeff Liu wrote:
> > > > > On 12/27 2013 14:48 PM, Stor?? wrote:
[...]
> > > > This reminds me a question...
> > > > 
> > > > Could xfs_repair store its temporary data (some of that data, the
> > > > biggest part) on disk instead of in memory?
> > > 
> > > Where on disk?
> > 
> > In a directory/file that I'll tell it to use (since I usually have a few xfs
> > filesystems on a single server and so far only one at a time breaks).
> 
> How is that any different from just adding swap space to the server?

It's different by allowing other services to work while repair is in progress. 
If swap gets eaten then the entire server goes down on its knees. Keeping things on 
disk would mean that other services work uninterrupted and repair gets slow 
(but works).

> > Could xfs_repair tell kernel that this data should always end up on swap
> > first (allowing other programs/daemons to use regular memory) perhaps?
> > (Don't know interface that would allow to do that in kernel though).
> > That would be some half baked solution.
> 
> It's up to the kernel to manage what gets swapped and what doesn't.

I was hoping for some interface like fadvise() FADV_DONTNEED but there is no 
similar thing for malloc'd memory I guess.

> I suppose you could use control groups to constrict the RAM
> xfs_repair uses, but how to configure such a policy is way outside my
> area of expertise.

Hmm, I'll have to try; maybe that would work. Like setting up a cgroup with an 8GB RAM 
limit and 40GB of swap. Other services would have their RAM available. Good 
hint.
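
With the cgroup v1 memory controller available at the time, that setup
might look roughly like this (the group name and limits are examples;
memory.memsw.* requires swap accounting, i.e. booting with
swapaccount=1):

```shell
# Confine xfs_repair to 8G of RAM, letting the rest spill to swap
# (cgroup v1; all names and sizes are illustrative).
mkdir -p /sys/fs/cgroup/memory/xfs_repair
echo 8G  > /sys/fs/cgroup/memory/xfs_repair/memory.limit_in_bytes
# RAM + swap ceiling for the group
echo 48G > /sys/fs/cgroup/memory/xfs_repair/memory.memsw.limit_in_bytes

# Move the current shell into the group, then run the repair
echo $$ > /sys/fs/cgroup/memory/xfs_repair/tasks
xfs_repair /dev/sdX
```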

> > Right. So the only "easy" task left is finding someone who understands the
> > code and can write such an interface. Anyone?
> > 
> > IMO ram usage is a real problem for xfs_repair and there has to be some
> > upstream solution other than "buy more" (and waste more) approach.
> 
> I think you are forgetting that developer time is *expensive* and
> *scarce*.

I'm aware of that and not expecting any developer to implement this (unless 
some developer hits the same problems and has hw constraints ;)

> This is essentially a solved problem: An SSD in a USB3
> enclosure as a temporary swap device is by far the most cost
> effective way to make repair scale to arbitrary amounts of metadata.
> It certainly scales far better than developer time and testing
> resources...

Ok.

I'm not saying that everyone should now start adding "on disk" db for 
xfs_repair. I just think that that solution would work, regardless of 
hardware and would make it possible to repair huge filesystems (with tons of 
metadata) even on low memory machines (without having to change hardware).

Whether there is interest among developers to implement this (obviously not) is 
another matter and shouldn't affect discussing the approach.

What is more interesting to me is talking about possible problems with the on-disk 
approach, not looking for a solution to my particular case.

> Cheers,
> 
> Dave.

ps. I'll go with 2x64GB or 2x128GB SSD in raid1 for swap space approach for my 
case.

-- 
Arkadiusz Miśkiewicz, arekm / maven.pl


* Re: [xfs_check Out of memory: ]
  2013-12-29 11:57           ` Arkadiusz Miśkiewicz
@ 2013-12-29 23:27             ` Dave Chinner
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2013-12-29 23:27 UTC (permalink / raw)
  To: Arkadiusz Miśkiewicz; +Cc: Stor??, Jeff Liu, xfs

On Sun, Dec 29, 2013 at 12:57:13PM +0100, Arkadiusz Miśkiewicz wrote:
> On Sunday 29 of December 2013, Dave Chinner wrote:
> > On Sat, Dec 28, 2013 at 12:20:39AM +0100, Arkadiusz Miśkiewicz wrote:
> > > On Friday 27 of December 2013, Dave Chinner wrote:
> > > > On Fri, Dec 27, 2013 at 09:07:22AM +0100, Arkadiusz Miśkiewicz wrote:
> > > > > On Friday 27 of December 2013, Jeff Liu wrote:
> > > > > > On 12/27 2013 14:48 PM, Stor?? wrote:
> [...]
> > > > > This reminds me a question...
> > > > > 
> > > > > Could xfs_repair store its temporary data (some of that data, the
> > > > > biggest part) on disk instead of in memory?
> > > > 
> > > > Where on disk?
> > > 
> > > In a directory/file that I'll tell it to use (since I usually have a few xfs
> > > filesystems on a single server and so far only one at a time breaks).
> > 
> > How is that any different from just adding swap space to the server?
> 
> It's different by allowing other services to work while repair is in progress. 
> If swap gets eaten then the entire server goes down on its knees. Keeping things on 
> disk would mean that other services work uninterrupted and repair gets slow 
> (but works).

Well, that depends on what disk you put the external db on. If that
is shared, then you're going to have problems with IO latency
causing service degradation....

> > > Right. So the only "easy" task left is finding someone who understands the
> > > code and can write such an interface. Anyone?
> > > 
> > > IMO ram usage is a real problem for xfs_repair and there has to be some
> > > upstream solution other than "buy more" (and waste more) approach.
> > 
> > I think you are forgetting that developer time is *expensive* and
> > *scarce*.
> 
> I'm aware of that and not expecting any developer to implement this (unless 
> some developer hits the same problems and will have hw constrains ;)

The main issue here is that your filesystem usage is well outside
the 95th percentile, and so you are in the realm of custom solutions
that require significant engineering effort to resolve. That's not
to say they can't be solved, just that solving them is an expensive
undertaking...

> > This is essentially a solved problem: An SSD in a USB3 enclosure
> > as a temporary swap device is by far the most cost effective way
> > to make repair scale to arbitrary amounts of metadata.  It
> > certainly scales far better than developer time and testing
> > resources...
> 
> Ok.
> 
> I'm not saying that everyone should now start adding "on disk" db
> for xfs_repair. I just think that that solution would work,
> regardless of hardware and would make it possible to repair huge
> filesystems (with tons of metadata) even on low memory machines
> (without having to change hardware).

It's always been the case that you can create a filesystem that a
specific machine does not have the resources to be able to repair. We
can't prevent that from occurring. e.g. no amount of on-disk
database work will make repair complete on an embedded NAS box with
512MB of RAM, a 2GB system disk with a filesystem that spans 2x4TB
drives....

> Whether there is interest among developers to implement this (obviously
> not) is another matter and shouldn't affect discussing the
> approach.
> 
> What is more interesting to me is talking about possible problems
> with the on-disk approach, not looking for a solution to my
> particular case.

The problem with adding a database interface is that we have to
re-engineer all the internal structures that xfs_repair uses and the
indexes we use to track them. They need to be abstracted in a data
base friendly manner, and then new code has to be written to manage
the database and insert/modify/remove the information in the
database.  Then there is work to find the most suitable database, as
simple key/value pair databases won't scale to tracking hundreds of
millions of records. That is likely to create significant
dependencies for xfsprogs, which we can't pull into things like
the debian udeb builds which are used for building the recovery disk
images that contain xfs_repair. So we have to make it all build time
conditional, and then we'll have different capabilities from
xfs-repair depending on where you run it from. Then we've got to
test it all, document it, etc. 

And there's still no guarantee that it solves your problem. Not
enough disk space for the database? ENOSPC causes failure instead of
ENOMEM. How do we know how much disk space is needed? We can't
predict that exactly without running repair, same as for memory
usage prediction. And even if we are using a DB rather than RAM,
there's still the possibility of ENOMEM.

These are all solvable issues, but they take time and resources and
expertise we don't currently have to solve. When compared to the
simplicity of "add a usb SSD for swap", it just doesn't make sense
to spend time trying to solve this problem....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [xfs_check Out of memory: ]
  2013-12-29  9:50         ` Dave Chinner
  2013-12-29 11:57           ` Arkadiusz Miśkiewicz
@ 2013-12-30  1:55           ` Stan Hoeppner
  2013-12-30 11:27             ` Matthias Schniedermeyer
                               ` (2 more replies)
  1 sibling, 3 replies; 19+ messages in thread
From: Stan Hoeppner @ 2013-12-30  1:55 UTC (permalink / raw)
  To: Dave Chinner, Arkadiusz Miśkiewicz; +Cc: Stor??, Jeff Liu, xfs

On 12/29/2013 3:50 AM, Dave Chinner wrote:
...
> I think you are forgetting that developer time is *expensive* and
> *scarce*. This is essentially a solved problem: An SSD in a USB3
> enclosure as a temporary swap device is by far the most cost
> effective way to make repair scale to arbitrary amounts of metadata.
> It certainly scales far better than developer time and testing
> resources...

Now this is an interesting idea Dave.  I hadn't considered temporary
swap.  Would USB be reliable enough for this?  I've seen lots of problem
reports from folks using USB storage with Linux, random disconnections
and what not.

-- 
Stan


* Re: [xfs_check Out of memory: ]
  2013-12-30  1:55           ` Stan Hoeppner
@ 2013-12-30 11:27             ` Matthias Schniedermeyer
  2013-12-30 13:19             ` Roger Willcocks
  2013-12-30 17:19             ` Stefan Ring
  2 siblings, 0 replies; 19+ messages in thread
From: Matthias Schniedermeyer @ 2013-12-30 11:27 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Stor??, Jeff Liu, xfs

On 29.12.2013 19:55, Stan Hoeppner wrote:
> On 12/29/2013 3:50 AM, Dave Chinner wrote:
> ...
> > I think you are forgetting that developer time is *expensive* and
> > *scarce*. This is essentially a solved problem: An SSD in a USB3
> > enclosure as a temporary swap device is by far the most cost
> > effective way to make repair scale to arbitrary amounts of metadata.
> > It certainly scales far better than developer time and testing
> > resources...
> 
> Now this is an interesting idea Dave.  I hadn't considered temporary
> swap.  Would USB be reliable enough for this?  I've seen lots of problem
> reports from folks using USB storage with Linux, random disconnections
> and what not.

It's certainly a problem with several variables.
- Quality of USB-Stack (should be quite good nowadays, but there can 
always be (new) bugs)
- Quality of SATA <-> USB(3) Chip
- Quality of SSD itself

And by quality I mean everything from the physical chip up to the firmware.

I'm not quite sure what happens when swap craps out. I think it can be 
everything from "machine goes dead" down to "all programs with pages in 
swap are terminated".

I've transferred quite a few TB over USB3 and I can say it mostly works. 
But random disconnects happen and you can't really be sure which part is 
the problem as it only happens rarely.

E.g. currently I have an HDD that randomly but seldomly craps out in 
a USB3 enclosure, after copying a few hundred GB. The drive works 
flawlessly when connected directly by SATA to a (different) computer; at 
least I haven't had a failure since I moved the drive. Is it the drive, 
the chip in the enclosure, the firmware between HDD & enclosure not playing 
nice (like too high command timeouts on the HDD side and too low on the 
enclosure side), or the USB3 stack? Can't really tell.

So:
I would consider swap on a device connected via USB3 to be on the risky 
side.

I would validate beforehand whether that specific combination of 
xhci/enclosure chip/SSD survives a prolonged time of "high I/O stress",
like several days of full-bandwidth/random I/O.




-- 

Matthias


* Re: [xfs_check Out of memory: ]
  2013-12-30  1:55           ` Stan Hoeppner
  2013-12-30 11:27             ` Matthias Schniedermeyer
@ 2013-12-30 13:19             ` Roger Willcocks
  2013-12-30 16:25               ` Stan Hoeppner
  2013-12-30 17:19             ` Stefan Ring
  2 siblings, 1 reply; 19+ messages in thread
From: Roger Willcocks @ 2013-12-30 13:19 UTC (permalink / raw)
  To: stan; +Cc: Stor??, xfs, Jeff Liu, Roger Willcocks


On 30 Dec 2013, at 01:55, Stan Hoeppner <stan@hardwarefreak.com> wrote:

> On 12/29/2013 3:50 AM, Dave Chinner wrote:
> ...
>> I think you are forgetting that developer time is *expensive* and
>> *scarce*. This is essentially a solved problem: An SSD in a USB3
>> enclosure as a temporary swap device is by far the most cost
>> effective way to make repair scale to arbitrary amounts of metadata.
>> It certainly scales far better than developer time and testing
>> resources...
> 
> Now this is an interesting idea Dave.  I hadn't considered temporary
> swap.  Would USB be reliable enough for this?  I've seen lots of problem
> reports from folks using USB storage with Linux, random disconnections
> and what not.
> 

I'll just chip in here and mention that we get around this problem by
exporting the broken xfs volume over iSCSI and running xfs_repair on another
machine with more memory / swap space.
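
For reference, with a Linux iSCSI target the export side might look
roughly like this (targetcli/LIO syntax; the device names, IQNs and
addresses are examples, not from our setup):

```shell
# On the server holding the broken volume: export it as an iSCSI LUN
targetcli /backstores/block create broken_xfs /dev/vg0/broken_lv
targetcli /iscsi create iqn.2013-12.com.example:broken-xfs
targetcli /iscsi/iqn.2013-12.com.example:broken-xfs/tpg1/luns \
    create /backstores/block/broken_xfs

# On the big-memory machine: log in and repair
iscsiadm -m discovery -t sendtargets -p storage.example.com
iscsiadm -m node -T iqn.2013-12.com.example:broken-xfs --login
xfs_repair /dev/sdX        # whichever device node the LUN appears as
```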

--
Roger



* Re: [xfs_check Out of memory: ]
  2013-12-30 13:19             ` Roger Willcocks
@ 2013-12-30 16:25               ` Stan Hoeppner
  0 siblings, 0 replies; 19+ messages in thread
From: Stan Hoeppner @ 2013-12-30 16:25 UTC (permalink / raw)
  To: Roger Willcocks; +Cc: Stor??, Jeff Liu, xfs

On 12/30/2013 7:19 AM, Roger Willcocks wrote:
> 
> On 30 Dec 2013, at 01:55, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> 
>> On 12/29/2013 3:50 AM, Dave Chinner wrote:
>> ...
>>> I think you are forgetting that developer time is *expensive* and
>>> *scarce*. This is essentially a solved problem: An SSD in a USB3
>>> enclosure as a temporary swap device is by far the most cost
>>> effective way to make repair scale to arbitrary amounts of metadata.
>>> It certainly scales far better than developer time and testing
>>> resources...
>>
>> Now this is an interesting idea Dave.  I hadn't considered temporary
>> swap.  Would USB be reliable enough for this?  I've seen lots of problem
>> reports from folks using USB storage with Linux, random disconnections
>> and what not.
>>
> 
> I'll just chip in here and mention that we get around this problem by
> exporting the broken xfs volume over iSCSI and running xfs_repair on another
> machine with more memory / swap space.

Another interesting, actually excellent idea Roger.  So Arkadiusz could
get by with just one set of SSDs.  Pulling ~40 GB of metadata over GbE
iSCSI should take only about 7 minutes of wire time, assuming his
hosts/net can sustain 100 MB/s.

-- 
Stan


* Re: [xfs_check Out of memory: ]
  2013-12-30  1:55           ` Stan Hoeppner
  2013-12-30 11:27             ` Matthias Schniedermeyer
  2013-12-30 13:19             ` Roger Willcocks
@ 2013-12-30 17:19             ` Stefan Ring
  2 siblings, 0 replies; 19+ messages in thread
From: Stefan Ring @ 2013-12-30 17:19 UTC (permalink / raw)
  To: Linux fs XFS

> Now this is an interesting idea Dave.  I hadn't considered temporary
> swap.  Would USB be reliable enough for this?  I've seen lots of problem
> reports from folks using USB storage with Linux, random disconnections
> and what not.

I have had two plug computers running Linux from external USB disk drive
enclosures for about 3 years, and I've never had the slightest
problem. So it cannot be a fundamental problem.

But exporting over iSCSI is likely the better option anyway.


end of thread, other threads:[~2013-12-30 17:19 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-27  6:48 [xfs_check Out of memory: ] Stor??
2013-12-27  7:41 ` Jeff Liu
2013-12-27  8:07   ` Arkadiusz Miśkiewicz
2013-12-27 22:42     ` Dave Chinner
2013-12-27 23:20       ` Arkadiusz Miśkiewicz
2013-12-28 16:55         ` Stan Hoeppner
2013-12-28 17:35           ` Jay Ashworth
2013-12-28 22:01             ` Stan Hoeppner
2013-12-28 23:39           ` Arkadiusz Miśkiewicz
2013-12-29  0:54             ` Stan Hoeppner
2013-12-29 11:23               ` Arkadiusz Miśkiewicz
2013-12-29  9:50         ` Dave Chinner
2013-12-29 11:57           ` Arkadiusz Miśkiewicz
2013-12-29 23:27             ` Dave Chinner
2013-12-30  1:55           ` Stan Hoeppner
2013-12-30 11:27             ` Matthias Schniedermeyer
2013-12-30 13:19             ` Roger Willcocks
2013-12-30 16:25               ` Stan Hoeppner
2013-12-30 17:19             ` Stefan Ring
