public inbox for linux-xfs@vger.kernel.org
* "XFS: possible memory allocation deadlock in kmem_alloc" on high memory machine
@ 2015-06-01 14:57 Anders Ossowicki
  2015-06-01 21:01 ` Dave Chinner
  0 siblings, 1 reply; 6+ messages in thread
From: Anders Ossowicki @ 2015-06-01 14:57 UTC (permalink / raw)
  To: xfs

Hi,

We've started seeing a slew of these messages in dmesg:

XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)

First question: Is this cause for alarm at all? Should we expect the
disk to blow up in our faces? Should we expect loss of performance?

This is from a machine under heavy load (database server, large dataset,
lots of I/O). It seems to happen only when we hit 15k-20k+ iops on the
disk.

We're running on 3.18.13, built from kernel.org git.

The machine has 3TB of memory and after googling the message for a
while, I guess memory fragmentation could be a likely cause. Looking at
/proc/buddyinfo when these messages show up, we see that there are
almost no fragments of order 1 and none of higher orders.
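
For anyone wanting to watch this live, the per-order availability can be
summarised straight from /proc/buddyinfo; a small sketch (field layout as
on stock kernels, where the counts start in column 5):

```shell
# Sum free blocks per order across all nodes/zones from /proc/buddyinfo.
# Fields 5..NF are the free-block counts for order 0 .. MAX_ORDER-1.
awk '{ for (i = 5; i <= NF; i++) sum[i-5] += $i }
     END { for (o = 0; o in sum; o++) printf "order %d: %d\n", o, sum[o] }' /proc/buddyinfo
```

Near-zero totals for order 1 and above match what we see when the messages
appear.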

My completely uneducated guess would be that the kernel can't reap pages
fast enough, so XFS gets impatient waiting for them. That seems like an
issue for mm, though; I'd like to confirm that my understanding of what
XFS does is correct.

Most of the memory is used by disk cache:
$ free -g
       total   used   free   shared   buffers   cached
Mem:    3023   3001     22        0         0     2840

Let me know if there is any more info I should provide.

-- 
Anders Ossowicki

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: "XFS: possible memory allocation deadlock in kmem_alloc" on high memory machine
  2015-06-01 14:57 "XFS: possible memory allocation deadlock in kmem_alloc" on high memory machine Anders Ossowicki
@ 2015-06-01 21:01 ` Dave Chinner
  2015-06-02 12:06   ` Anders Ossowicki
  0 siblings, 1 reply; 6+ messages in thread
From: Dave Chinner @ 2015-06-01 21:01 UTC (permalink / raw)
  To: Anders Ossowicki; +Cc: xfs

On Mon, Jun 01, 2015 at 04:57:41PM +0200, Anders Ossowicki wrote:
> Hi,
> 
> We've started seeing a slew of these messages in dmesg:
> 
> XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
> 
> First question: Is this cause for alarm at all? Should we expect the
> disk to blow up in our faces? Should we expect loss of performance?

Nothing should go wrong - XFS will essentially block until it gets
the memory it requires.

> This is from a machine under heavy load (database server, large dataset,
> lots of I/O). It seems to happen only when we hit 15k-20k+ iops on the
> disk.
> 
> We're running on 3.18.13, built from kernel.org git.

Right around the time that I was seeing all sorts of regressions
relating to low memory behaviour and the OOM killer....

> The machine has 3TB of memory and after googling the message for a
> while, I guess memory fragmentation could be a likely cause. Looking at
> /proc/buddyinfo when these messages show up, we see that there are
> almost no fragments of order 1 and none of higher orders.

Ouch. 3TB of memory, and no higher order pages left? Do you have
memory compaction turned on? That should be reforming large pages in
this situation. What type of machine is it?
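
For reference, compaction can be checked and prodded by hand; a rough
sketch (paths assume CONFIG_COMPACTION; the config file location varies
by distro):

```shell
# Was compaction built into the running kernel?
grep CONFIG_COMPACTION "/boot/config-$(uname -r)" 2>/dev/null || true

# Compaction activity counters: compact_fail climbing while
# compact_success stays flat means compaction is running but
# cannot reform higher-order pages.
grep '^compact_' /proc/vmstat || true

# Force a full compaction pass over all zones (root required),
# then re-check /proc/buddyinfo for recovered higher orders.
if [ -w /proc/sys/vm/compact_memory ]; then
    echo 1 > /proc/sys/vm/compact_memory
fi
```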

> My completely uneducated guess would be that the kernel can't reap pages
> fast enough, so XFS gets impatient waiting for them. That seems like an
> issue for mm, though; I'd like to confirm that my understanding of what
> XFS does is correct.

Yes, memory fragmentation tends to be an MM problem; there's nothing
XFS can do about it.

> Most of the memory is used by disk cache:
> $ free -g
>        total   used   free   shared   buffers   cached
> Mem:    3023   3001     22        0         0     2840

Especially as it appears that 2.8TB of your memory is in the page
cache and should be reclaimable.

> Let me know if there is any more info I should provide.

The info asked for here:

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

will give us more insight into the memory usage, storage and
filesystem, and help us determine the next step...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: "XFS: possible memory allocation deadlock in kmem_alloc" on high memory machine
  2015-06-01 21:01 ` Dave Chinner
@ 2015-06-02 12:06   ` Anders Ossowicki
  2015-06-03  1:52     ` Dave Chinner
  0 siblings, 1 reply; 6+ messages in thread
From: Anders Ossowicki @ 2015-06-02 12:06 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs@oss.sgi.com

On Mon, Jun 01, 2015 at 11:01:13PM +0200, Dave Chinner wrote:
> Nothing should go wrong - XFS will essentially block until it gets
> the memory it requires.

Good to know, thanks!

> > We're running on 3.18.13, built from kernel.org git.
> 
> Right around the time that I was seeing all sorts of regressions
> relating to low memory behaviour and the OOM killer....

We fought with some high CPU load issues back in March, related to
memory management, and we ended up on a recent longterm kernel.
http://thread.gmane.org/gmane.linux.kernel.mm/129858

> Ouch. 3TB of memory, and no higher order pages left? Do you have
> memory compaction turned on? That should be reforming large pages in
> this situation. What type of machine is it?

Memory compaction is turned on. It's an off-the-shelf Dell server with
four 12-core Xeon processors.

> Yes, memory fragmentation tends to be a MM problem; nothing XFS can
> do about it.

Ya, knowing we're not in immediate danger of a filesystem meltdown, I
think we'll tackle the fragmentation issue next.

> Especially as it appears that 2.8TB of your memory is in the page
> cache and should be reclaimable.

Indeed. I haven't been able to catch the issue while it was ongoing
since upgrading to 3.18.13, but my guess is that we're not reclaiming
the cache fast enough for some reason, possibly because it takes too
long to find the best reclaimable regions with so many fragments to sift
through.

As for the pertinent system info:

Linux 3.18.13 (we also saw the issue with 3.18.9)
xfs_repair version 3.1.7

4x Intel Xeon E7-8857 v2

$ cat /proc/meminfo
MemTotal:       3170749444 kB
MemFree:        18947564 kB
MemAvailable:   2968870324 kB
Buffers:          270704 kB
Cached:         3008702200 kB
SwapCached:            0 kB
Active:         1617534420 kB
Inactive:       1415684856 kB
Active(anon):   156973416 kB
Inactive(anon):  4856264 kB
Active(file):   1460561004 kB
Inactive(file): 1410828592 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:      25353212 kB
SwapFree:       25353212 kB
Dirty:           1228056 kB
Writeback:        348024 kB
AnonPages:      24244728 kB
Mapped:         137738148 kB
Shmem:          137578880 kB
Slab:           79729144 kB
SReclaimable:   79040008 kB
SUnreclaim:       689136 kB
KernelStack:       22976 kB
PageTables:     19203180 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    1610727932 kB
Committed_AS:   178507488 kB
VmallocTotal:   34359738367 kB
VmallocUsed:     6628972 kB
VmallocChunk:   31937036032 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      172736 kB
DirectMap2M:    13412352 kB
DirectMap1G:    3207593984 kB

We have three hardware raid'ed disks with XFS on them, one of which receives
the bulk of the load. This is a raid 50 volume on SSDs with the raid controller
running in writethrough mode.

$ xfs_info /dev/sdb
meta-data=/dev/sdb               isize=256    agcount=32, agsize=97640448 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=3124494336, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

-- 
Anders Ossowicki


* Re: "XFS: possible memory allocation deadlock in kmem_alloc" on high memory machine
  2015-06-02 12:06   ` Anders Ossowicki
@ 2015-06-03  1:52     ` Dave Chinner
  2015-06-03  7:07       ` Anders Ossowicki
  0 siblings, 1 reply; 6+ messages in thread
From: Dave Chinner @ 2015-06-03  1:52 UTC (permalink / raw)
  To: Anders Ossowicki; +Cc: xfs@oss.sgi.com

On Tue, Jun 02, 2015 at 02:06:48PM +0200, Anders Ossowicki wrote:
> On Mon, Jun 01, 2015 at 11:01:13PM +0200, Dave Chinner wrote:
> > Nothing should go wrong - XFS will essentially block until it gets
> > the memory it requires.
> 
> Good to know, thanks!
> 
> > > We're running on 3.18.13, built from kernel.org git.
> > 
> > Right around the time that I was seeing all sorts of regressions
> > relating to low memory behaviour and the OOM killer....
> 
> We fought with some high CPU load issues back in March, related to
> memory management, and we ended up on a recent longterm kernel.
> http://thread.gmane.org/gmane.linux.kernel.mm/129858
> 
> > Ouch. 3TB of memory, and no higher order pages left? Do you have
> > memory compaction turned on? That should be reforming large pages in
> > this situation. What type of machine is it?
> 
> Memory compaction is turned on. It's an off-the-shelf Dell server with
> four 12-core Xeon processors.
> 
> > Yes, memory fragmentation tends to be an MM problem; there's nothing
> > XFS can do about it.
> 
> Ya, knowing we're not in immediate danger of a filesystem meltdown, I
> think we'll tackle the fragmentation issue next.
> 
> > Especially as it appears that 2.8TB of your memory is in the page
> > cache and should be reclaimable.
> 
> Indeed. I haven't been able to catch the issue while it was ongoing
> since upgrading to 3.18.13, but my guess is that we're not reclaiming
> the cache fast enough for some reason, possibly because it takes too
> long to find the best reclaimable regions with so many fragments to sift
> through.

You can always try to drop the page cache to see if that solves the
problem...
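
For reference, the drop goes through procfs (root required); a sketch:

```shell
# Write back dirty pages first so clean page cache can actually be freed.
sync

# 1 = page cache, 2 = reclaimable slab (dentries, inodes), 3 = both.
# Non-destructive, but the next reads will be cold-cache.
if [ -w /proc/sys/vm/drop_caches ]; then
    echo 1 > /proc/sys/vm/drop_caches
fi
```

If the allocation-deadlock messages stop right after a drop, that points
squarely at reclaim rather than at XFS.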

> As for the pertinent system info:
> 
> Linux 3.18.13 (we also saw the issue with 3.18.9)
> xfs_repair version 3.1.7
> 
> 4x Intel Xeon E7-8857 v2
> 
> $ cat /proc/meminfo
> MemTotal:       3170749444 kB
> MemFree:        18947564 kB
> MemAvailable:   2968870324 kB
> Buffers:          270704 kB
> Cached:         3008702200 kB
> SwapCached:            0 kB
> Active:         1617534420 kB
> Inactive:       1415684856 kB
> Active(anon):   156973416 kB
> Inactive(anon):  4856264 kB
> Active(file):   1460561004 kB
> Inactive(file): 1410828592 kB

This. You've got 2.8TB of reclaimable page cache there.

> Unevictable:           0 kB
> Mlocked:               0 kB
> SwapTotal:      25353212 kB
> SwapFree:       25353212 kB
> Dirty:           1228056 kB
> Writeback:        348024 kB

And very little of it is dirty, so it should all be immediately
reclaimable or compactable.

> Slab:           79729144 kB
> SReclaimable:   79040008 kB

80GB of slab caches as well - what is the output of /proc/slabinfo?

> We have three hardware raid'ed disks with XFS on them, one of which receives
> the bulk of the load. This is a raid 50 volume on SSDs with the raid controller
> running in writethrough mode.

It doesn't seem like writeback of dirty pages is the problem; more
the case that the page cache is ridiculously huge and not being
reclaimed in a sane manner. Do you really need 2.8TB of cached file
data in memory for performance?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: "XFS: possible memory allocation deadlock in kmem_alloc" on high memory machine
  2015-06-03  1:52     ` Dave Chinner
@ 2015-06-03  7:07       ` Anders Ossowicki
  2015-06-03 23:11         ` Dave Chinner
  0 siblings, 1 reply; 6+ messages in thread
From: Anders Ossowicki @ 2015-06-03  7:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs@oss.sgi.com

On Wed, Jun 03, 2015 at 03:52:45AM +0200, Dave Chinner wrote:
> On Tue, Jun 02, 2015 at 02:06:48PM +0200, Anders Ossowicki wrote:
>
> > Slab:           79729144 kB
> > SReclaimable:   79040008 kB
>
> 80GB of slab caches as well - what is the output of /proc/slabinfo?

slabinfo - version: 2.1
# name         <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
btrfs_prelim_ref           0      0     80   51    1 : tunables    0    0    0 : slabdata      0      0      0
btrfs_delayed_data_ref     0      0     96   42    1 : tunables    0    0    0 : slabdata      0      0      0
btrfs_delayed_ref_head     0      0    160   51    2 : tunables    0    0    0 : slabdata      0      0      0
btrfs_delayed_node         0      0    304   53    4 : tunables    0    0    0 : slabdata      0      0      0
btrfs_ordered_extent       0      0    424   38    4 : tunables    0    0    0 : slabdata      0      0      0
btrfs_extent_buffer        0      0    280   58    4 : tunables    0    0    0 : slabdata      0      0      0
btrfs_extent_state         0      0     88   46    1 : tunables    0    0    0 : slabdata      0      0      0
btrfs_delalloc_work        0      0    152   53    2 : tunables    0    0    0 : slabdata      0      0      0
btrfs_trans_handle         0      0    176   46    2 : tunables    0    0    0 : slabdata      0      0      0
btrfs_inode                0      0   1000   32    8 : tunables    0    0    0 : slabdata      0      0      0
ufs_inode_cache            0      0    744   44    8 : tunables    0    0    0 : slabdata      0      0      0
qnx4_inode_cache           0      0    656   49    8 : tunables    0    0    0 : slabdata      0      0      0
hfsplus_attr_cache         0      0   3840    8    8 : tunables    0    0    0 : slabdata      0      0      0
hfsplus_icache             0      0    896   36    8 : tunables    0    0    0 : slabdata      0      0      0
hfs_inode_cache            0      0    768   42    8 : tunables    0    0    0 : slabdata      0      0      0
minix_inode_cache          0      0    648   50    8 : tunables    0    0    0 : slabdata      0      0      0
ntfs_big_inode_cache       0      0    896   36    8 : tunables    0    0    0 : slabdata      0      0      0
ntfs_inode_cache           0      0    312   52    4 : tunables    0    0    0 : slabdata      0      0      0
jfs_mp                    32     32    128   32    1 : tunables    0    0    0 : slabdata      1      1      0
jfs_ip                     0      0   1240   26    8 : tunables    0    0    0 : slabdata      0      0      0
reiser_inode_cache         0      0    744   44    8 : tunables    0    0    0 : slabdata      0      0      0
ext2_inode_cache           0      0    792   41    8 : tunables    0    0    0 : slabdata      0      0      0
nfsd4_openowners       13394  13838    440   37    4 : tunables    0    0    0 : slabdata    374    374      0
nfs_direct_cache           0      0    208   39    2 : tunables    0    0    0 : slabdata      0      0      0
nfs_commit_data         2167   2167    704   46    8 : tunables    0    0    0 : slabdata     48     48      0
nfs_inode_cache       174843 175088   1040   31    8 : tunables    0    0    0 : slabdata   5648   5648      0
fscache_cookie_jar      1702   1702     88   46    1 : tunables    0    0    0 : slabdata     37     37      0
rpc_inode_cache         2448   2448    640   51    8 : tunables    0    0    0 : slabdata     48     48      0
xfs_dquot                  0      0    472   34    4 : tunables    0    0    0 : slabdata      0      0      0
xfs_icr                    0      0    144   56    2 : tunables    0    0    0 : slabdata      0      0      0
xfs_ili              1066228 1066625   152   53    2 : tunables    0    0    0 : slabdata  20125  20125      0
xfs_inode            2522728 2523172  1024   32    8 : tunables    0    0    0 : slabdata  78857  78857      0
xfs_efd_item            8320   8920    400   40    4 : tunables    0    0    0 : slabdata    223    223      0
xfs_da_state            1632   1632    480   34    4 : tunables    0    0    0 : slabdata     48     48      0
xfs_btree_cur           1872   1872    208   39    2 : tunables    0    0    0 : slabdata     48     48      0
ext4_groupinfo_4k        896    896    144   56    2 : tunables    0    0    0 : slabdata     16     16      0
ip6-frags                  0      0    216   37    2 : tunables    0    0    0 : slabdata      0      0      0
UDPLITEv6                  0      0   1088   30    8 : tunables    0    0    0 : slabdata      0      0      0
UDPv6                   1440   1440   1088   30    8 : tunables    0    0    0 : slabdata     48     48      0
tw_sock_TCPv6           1312   1312    256   32    2 : tunables    0    0    0 : slabdata     41     41      0
TCPv6                    768    768   1984   16    8 : tunables    0    0    0 : slabdata     48     48      0
kcopyd_job                 0      0   3312    9    8 : tunables    0    0    0 : slabdata      0      0      0
dm_uevent                  0      0   2632   12    8 : tunables    0    0    0 : slabdata      0      0      0
dm_rq_target_io            0      0    408   40    4 : tunables    0    0    0 : slabdata      0      0      0
scsi_cmd_cache         93743 107688    384   42    4 : tunables    0    0    0 : slabdata   2564   2564      0
cfq_queue              21842  21910    232   35    2 : tunables    0    0    0 : slabdata    626    626      0
bsg_cmd                    0      0    312   52    4 : tunables    0    0    0 : slabdata      0      0      0
mqueue_inode_cache        36     36    896   36    8 : tunables    0    0    0 : slabdata      1      1      0
fuse_request               0      0    416   39    4 : tunables    0    0    0 : slabdata      0      0      0
fuse_inode                 0      0    768   42    8 : tunables    0    0    0 : slabdata      0      0      0
ecryptfs_key_record_cache  0      0    576   56    8 : tunables    0    0    0 : slabdata      0      0      0
ecryptfs_inode_cache       0      0   1024   32    8 : tunables    0    0    0 : slabdata      0      0      0
fat_inode_cache            0      0    720   45    8 : tunables    0    0    0 : slabdata      0      0      0
fat_cache                  0      0     40  102    1 : tunables    0    0    0 : slabdata      0      0      0
hugetlbfs_inode_cache   2592   2592    600   54    8 : tunables    0    0    0 : slabdata     48     48      0
jbd2_journal_handle     4080   4080     48   85    1 : tunables    0    0    0 : slabdata     48     48      0
journal_handle             0      0     24  170    1 : tunables    0    0    0 : slabdata      0      0      0
journal_head           17208  17712    112   36    1 : tunables    0    0    0 : slabdata    492    492      0
revoke_table             256    256     16  256    1 : tunables    0    0    0 : slabdata      1      1      0
revoke_record           6400   6400     32  128    1 : tunables    0    0    0 : slabdata     50     50      0
ext4_inode_cache      642098 775776   1008   32    8 : tunables    0    0    0 : slabdata  24243  24243      0
ext4_free_data         11328  11328     64   64    1 : tunables    0    0    0 : slabdata    177    177      0
ext4_allocation_context 1536   1536    128   32    1 : tunables    0    0    0 : slabdata     48     48      0
ext4_io_end             3024   3024     72   56    1 : tunables    0    0    0 : slabdata     54     54      0
ext4_extent_status     68877 105672     40  102    1 : tunables    0    0    0 : slabdata   1036   1036      0
ext3_inode_cache           0      0    816   40    8 : tunables    0    0    0 : slabdata      0      0      0
dquot                   1760   1760    256   32    2 : tunables    0    0    0 : slabdata     55     55      0
fsnotify_mark              0      0    112   36    1 : tunables    0    0    0 : slabdata      0      0      0
pid_namespace              0      0   2200   14    8 : tunables    0    0    0 : slabdata      0      0      0
posix_timers_cache     12541  12705    248   33    2 : tunables    0    0    0 : slabdata    385    385      0
UDP-Lite                   0      0    960   34    8 : tunables    0    0    0 : slabdata      0      0      0
xfrm_dst_cache             0      0    448   36    4 : tunables    0    0    0 : slabdata      0      0      0
ip_fib_trie              292    292     56   73    1 : tunables    0    0    0 : slabdata      4      4      0
UDP                     1632   1632    960   34    8 : tunables    0    0    0 : slabdata     48     48      0
tw_sock_TCP             2358   2624    256   32    2 : tunables    0    0    0 : slabdata     82     82      0
TCP                     3539   3808   1856   17    8 : tunables    0    0    0 : slabdata    224    224      0
blkdev_queue             323    464   1928   16    8 : tunables    0    0    0 : slabdata     29     29      0
blkdev_requests        13992  14740    368   44    4 : tunables    0    0    0 : slabdata    335    335      0
blkdev_ioc             23201  24141    104   39    1 : tunables    0    0    0 : slabdata    619    619      0
dmaengine-unmap-256       15     15   2112   15    8 : tunables    0    0    0 : slabdata      1      1      0
dmaengine-unmap-128      180    180   1088   30    8 : tunables    0    0    0 : slabdata      6      6      0
sock_inode_cache        6273   6273    640   51    8 : tunables    0    0    0 : slabdata    123    123      0
net_namespace              0      0   4352    7    8 : tunables    0    0    0 : slabdata      0      0      0
shmem_inode_cache       2784   2784    672   48    8 : tunables    0    0    0 : slabdata     58     58      0
ftrace_event_file       1702   1702     88   46    1 : tunables    0    0    0 : slabdata     37     37      0
taskstats               2352   2352    328   49    4 : tunables    0    0    0 : slabdata     48     48      0
proc_inode_cache       19293  24750    648   50    8 : tunables    0    0    0 : slabdata    495    495      0
sigqueue                3111   3111    160   51    2 : tunables    0    0    0 : slabdata     61     61      0
bdev_cache              1014   1014    832   39    8 : tunables    0    0    0 : slabdata     26     26      0
kernfs_node_cache     342112 342244    120   34    1 : tunables    0    0    0 : slabdata  10066  10066      0
mnt_cache               2448   2448    320   51    4 : tunables    0    0    0 : slabdata     48     48      0
inode_cache            36637  45080    584   56    8 : tunables    0    0    0 : slabdata    805    805      0
dentry               3217866 3702384   192   42    2 : tunables    0    0    0 : slabdata  88152  88152      0
iint_cache                 0      0     72   56    1 : tunables    0    0    0 : slabdata      0      0      0
buffer_head        370050715 400741536 104   39    1 : tunables    0    0    0 : slabdata 10275424 10275424  0
vm_area_struct        147654 150128    184   44    2 : tunables    0    0    0 : slabdata   3412   3412      0
mm_struct               9550  10404    896   36    8 : tunables    0    0    0 : slabdata    289    289      0
files_cache             4029   4029    640   51    8 : tunables    0    0    0 : slabdata     79     79      0
signal_cache            4383   5180   1152   28    8 : tunables    0    0    0 : slabdata    185    185      0
sighand_cache           3081   3255   2112   15    8 : tunables    0    0    0 : slabdata    217    217      0
task_xstate            10016  10881    832   39    8 : tunables    0    0    0 : slabdata    279    279      0
task_struct             1913   2070   6432    5    8 : tunables    0    0    0 : slabdata    414    414      0
Acpi-ParseExt           6048   6048     72   56    1 : tunables    0    0    0 : slabdata    108    108      0
Acpi-State               306    306     80   51    1 : tunables    0    0    0 : slabdata      6      6      0
Acpi-Namespace          2040   2040     40  102    1 : tunables    0    0    0 : slabdata     20     20      0
anon_vma               69129  72318     80   51    1 : tunables    0    0    0 : slabdata   1418   1418      0
shared_policy_node     27030  27285     48   85    1 : tunables    0    0    0 : slabdata    321    321      0
numa_policy             8160   8160     24  170    1 : tunables    0    0    0 : slabdata     48     48      0
radix_tree_node     64025078 64148728  584   56    8 : tunables    0    0    0 : slabdata 1145751 1145751    0
idr_layer_cache         3456   3709   2096   15    8 : tunables    0    0    0 : slabdata    251    251      0
dma-kmalloc-8192           0      0   8192    4    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-4096           0      0   4096    8    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-2048           0      0   2048   16    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-1024           0      0   1024   32    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-512           32     32    512   32    4 : tunables    0    0    0 : slabdata      1      1      0
dma-kmalloc-256            0      0    256   32    2 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-128            0      0    128   32    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-64             0      0     64   64    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-32             0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-16             0      0     16  256    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-8              0      0      8  512    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-192            0      0    192   42    2 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-96             0      0     96   42    1 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-8192             779    832   8192    4    8 : tunables    0    0    0 : slabdata    208    208      0
kmalloc-4096            2918   3304   4096    8    8 : tunables    0    0    0 : slabdata    413    413      0
kmalloc-2048            3530   4224   2048   16    8 : tunables    0    0    0 : slabdata    264    264      0
kmalloc-1024           23980  27264   1024   32    8 : tunables    0    0    0 : slabdata    852    852      0
kmalloc-512           179639 222048    512   32    4 : tunables    0    0    0 : slabdata   6939   6939      0
kmalloc-256           488489 521296    256   32    2 : tunables    0    0    0 : slabdata  16291  16291      0
kmalloc-192            41454  53508    192   42    2 : tunables    0    0    0 : slabdata   1274   1274      0
kmalloc-128            54683  67168    128   32    1 : tunables    0    0    0 : slabdata   2099   2099      0
kmalloc-96             20274  37044     96   42    1 : tunables    0    0    0 : slabdata    882    882      0
kmalloc-64            329220 1136832    64   64    1 : tunables    0    0    0 : slabdata  17763  17763      0
kmalloc-32             76222  90112     32  128    1 : tunables    0    0    0 : slabdata    704    704      0
kmalloc-16            245496 247552     16  256    1 : tunables    0    0    0 : slabdata    967    967      0
kmalloc-8             570025 915456      8  512    1 : tunables    0    0    0 : slabdata   1788   1788      0
kmem_cache_node         1310   1472     64   64    1 : tunables    0    0    0 : slabdata     23     23      0
kmem_cache               512    512    256   32    2 : tunables    0    0    0 : slabdata     16     16      0

And just for good measure, this was meminfo at the same time:

MemTotal:       3170749444 kB
MemFree:        27109596 kB
MemAvailable:   2853545664 kB
Buffers:          442216 kB
Cached:         2882173504 kB
SwapCached:            0 kB
Active:         1636112224 kB
Inactive:       1303578276 kB
Active(anon):   188466588 kB
Inactive(anon):  6202588 kB
Active(file):   1447645636 kB
Inactive(file): 1297375688 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:      25353212 kB
SwapFree:       25353212 kB
Dirty:            924980 kB
Writeback:             0 kB
AnonPages:      57090800 kB
Mapped:         137637500 kB
Shmem:          137578904 kB
Slab:           82699516 kB
SReclaimable:   81921588 kB
SUnreclaim:       777928 kB
KernelStack:       27968 kB
PageTables:     101008148 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    1610727932 kB
Committed_AS:   205791344 kB
VmallocTotal:   34359738367 kB
VmallocUsed:     6629132 kB
VmallocChunk:   31937035472 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      172736 kB
DirectMap2M:    13412352 kB
DirectMap1G:    3207593984 kB


> > We have three hardware raid'ed disks with XFS on them, one of which receives
> > the bulk of the load. This is a raid 50 volume on SSDs with the raid controller
> > running in writethrough mode.
> 
> It doesn't seem like writeback of dirty pages is the problem; more
> the case that the page cache is ridiculously huge and not being
> reclaimed in a sane manner. Do you really need 2.8TB of cached file
> data in memory for performance?

Yeah, disk cache is the primary reason for stuffing memory into that machine.

-- 
Anders Ossowicki


* Re: "XFS: possible memory allocation deadlock in kmem_alloc" on high memory machine
  2015-06-03  7:07       ` Anders Ossowicki
@ 2015-06-03 23:11         ` Dave Chinner
  0 siblings, 0 replies; 6+ messages in thread
From: Dave Chinner @ 2015-06-03 23:11 UTC (permalink / raw)
  To: Anders Ossowicki; +Cc: xfs@oss.sgi.com

On Wed, Jun 03, 2015 at 09:07:25AM +0200, Anders Ossowicki wrote:
> On Wed, Jun 03, 2015 at 03:52:45AM +0200, Dave Chinner wrote:
> > On Tue, Jun 02, 2015 at 02:06:48PM +0200, Anders Ossowicki wrote:
> >
> > > Slab:           79729144 kB
> > > SReclaimable:   79040008 kB
> >
> > 80GB of slab caches as well - what is the output of /proc/slabinfo?
> 
> slabinfo - version: 2.1
> # name         <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
...
> xfs_ili              1066228 1066625   152   53    2 : tunables    0    0    0 : slabdata  20125  20125      0
> xfs_inode            2522728 2523172  1024   32    8 : tunables    0    0    0 : slabdata  78857  78857      0
> dentry               3217866 3702384   192   42    2 : tunables    0    0    0 : slabdata  88152  88152      0
> buffer_head        370050715 400741536 104   39    1 : tunables    0    0    0 : slabdata 10275424 10275424  0
> radix_tree_node     64025078 64148728  584   56    8 : tunables    0    0    0 : slabdata 1145751 1145751    0
.....
> Slab:           82699516 kB
> SReclaimable:   81921588 kB
....

So 400 million bufferheads (consuming ~40GB of RAM) and 64 million radix
tree nodes (consuming ~35GB of RAM) is where all that memory is.  That's
being used to track the 2.8TB of page cache data (roughly 3% memory
overhead).

Ok, nothing unusual there, but it demonstrates why I want to get rid
of bufferheads.....
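
The arithmetic, straight from the num_objs and objsize columns quoted
above (a rough estimate that ignores per-slab padding):

```shell
# num_objs * objsize for the two big slab caches, vs. "Cached" from meminfo.
awk 'BEGIN {
    bh  = 400741536 * 104       # buffer_head objects, ~41.7e9 bytes
    rtn = 64148728  * 584       # radix_tree_node objects, ~37.5e9 bytes
    pc  = 2882173504 * 1024     # "Cached" from meminfo, in bytes
    gib = 1073741824            # 2^30
    printf "bufferheads: %.1f GiB\n", bh / gib
    printf "radix nodes: %.1f GiB\n", rtn / gib
    printf "tracking overhead: %.1f%% of page cache\n", 100 * (bh + rtn) / pc
}'
```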

> > > We have three hardware raid'ed disks with XFS on them, one of which receives
> > > the bulk of the load. This is a raid 50 volume on SSDs with the raid controller
> > > running in writethrough mode.
> > 
> > It doesn't seem like writeback of dirty pages is the problem; more
> > the case that the page cache is rediculously huge and not being
> > reclaimed in a sane manner. Do you really need 2.8TB of cached file
> > data in memory for performance?
> 
> Yeah, disk cache is the primary reason for stuffing memory into that machine.

Hmmmm. I don't think anyone has considered the page cache to be used
at this scale for caching before. Normally this amount of memory is
needed by applications in their process space, not as a disk buffer
to avoid disk IO. You've only got a 12TB filesystem, so you're
keeping 25% of it in the page cache at any given time; I'm not
surprised that the page cache reclaim algorithms are having trouble....

I don't think there's anything on the XFS side we can do here to
improve the situation you are in - it appears that it's memory
reclaim and compaction that aren't working well enough to sustain
your workload on that platform....

OTOH, have you considered using something like dm-cache with a huge
ramdisk as the cache device and running it in write-through mode so
that power failure doesn't result in data loss or filesystem
corruption?
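
A rough sketch of such a setup (device names, ramdisk sizes and the
"default" policy are all placeholders; the table syntax follows the
kernel's device-mapper cache documentation):

```shell
# Hypothetical layout: two ramdisks (metadata + cache) in front of /dev/sdb.
# rd_size is in KiB; adjust to taste. Needs root and the brd + dm-cache modules.
modprobe brd rd_nr=2 rd_size=67108864   # creates /dev/ram0 and /dev/ram1 (64GiB each)

ORIGIN=/dev/sdb
ORIGIN_SECTORS=$(blockdev --getsz "$ORIGIN")

# Table: start len cache <metadata dev> <cache dev> <origin dev>
#        <block size in 512B sectors> <#features> <features> <policy> <#policy args>
dmsetup create sdb-cached --table \
  "0 $ORIGIN_SECTORS cache /dev/ram0 /dev/ram1 $ORIGIN 512 1 writethrough default 0"
```

With writethrough, every write hits the origin device before completing,
so losing the ramdisk on power failure loses no data.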

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

