public inbox for linux-kernel@vger.kernel.org
* [2.6.22.6] nfsd: fh_verify() `malloc failure' with lots of free memory leads to NFS hang
@ 2007-09-17 22:23 Nix
  2007-09-17 22:36 ` J. Bruce Fields
  0 siblings, 1 reply; 7+ messages in thread
From: Nix @ 2007-09-17 22:23 UTC (permalink / raw)
  To: linux-kernel

Back in early 2006 I reported persistent hangs on the NFS server,
whereby all of a sudden about ten minutes after boot my primary NFS
server would cease responding to NFS requests until it was rebooted.
<http://www.ussg.iu.edu/hypermail/linux/kernel/0601.3/1631.html>

That time, the problem vanished when I switched to NFS-over-TCP:
<http://www.ussg.iu.edu/hypermail/linux/kernel/0601.3/1773.html>

Well, I just rebooted --- post-glibc-upgrade from 2.5 to 2.6.1, no
kernel upgrade or anything, so this bug has been latent during at
least my last three weeks of uptime. And it's back. (I've been
seeing strange long pauses doing things like ls, and I suspect
they are related: see below.)

/proc/sys/nfsd/exports on the freezing server:

# Version 1.1
# Path Client(Flags) # IPs
/usr/packages.bin/non-free      hades.wkstn.nix(rw,no_root_squash,async,wdelay,no_subtree_check,uuid=90a98d8a:a8be4806:aea3a4e1:fe3437a0)
/home/.loki.wkstn.nix   esperi.srvr.nix(rw,root_squash,async,wdelay,no_subtree_check,uuid=8ff32fd3:a6114840:ab968da8:25b41721)
/home/.loki.wkstn.nix   hades.wkstn.nix(rw,no_root_squash,async,wdelay,no_subtree_check,uuid=8ff32fd3:a6114840:ab968da8:25b41721)
/usr/lib/X11/fonts      hades.wkstn.nix(ro,root_squash,async,wdelay,uuid=87553c5b:d84740fc:b7d9f7e8:4f749689)
/usr/share/xplanet      hades.wkstn.nix(ro,root_squash,async,wdelay,uuid=87553c5b:d84740fc:b7d9f7e8:4f749689)
/usr/share/xemacs       hades.wkstn.nix(rw,no_root_squash,async,wdelay,uuid=87553c5b:d84740fc:b7d9f7e8:4f749689)
/usr/packages   hades.wkstn.nix(rw,no_root_squash,async,wdelay,no_subtree_check,uuid=2a35f82a:cca144df:a1123587:23527f53)

I turned on ALL-class nfsd debugging and here's what I see as it freezes
up:

Sep 17 22:57:00 loki warning: kernel: nfsd_dispatch: vers 3 proc 1
Sep 17 22:57:00 loki warning: kernel: nfsd: GETATTR(3)  36: 01070001 000fb001 00000000 d32ff38f 404811a6 a88d96ab
Sep 17 22:57:00 loki warning: kernel: nfsd: fh_verify(36: 01070001 000fb001 00000000 d32ff38f 404811a6 a88d96ab)
Sep 17 22:57:44 loki warning: kernel: nfsd_dispatch: vers 3 proc 4
Sep 17 22:57:44 loki warning: kernel: nfsd: ACCESS(3)   36: 01070001 000fb001 00000000 d32ff38f 404811a6 a88d96ab 0x1f
Sep 17 22:57:44 loki warning: kernel: nfsd: fh_verify(36: 01070001 000fb001 00000000 d32ff38f 404811a6 a88d96ab)
Sep 17 22:57:45 loki warning: kernel: nfsd: Dropping request due to malloc failure!
Sep 17 22:57:52 loki warning: kernel: nfsd_dispatch: vers 3 proc 4
Sep 17 22:57:52 loki warning: kernel: nfsd: ACCESS(3)   36: 01070001 000fb001 00000000 d32ff38f 404811a6 a88d96ab 0x1f
Sep 17 22:57:52 loki warning: kernel: nfsd: fh_verify(36: 01070001 000fb001 00000000 d32ff38f 404811a6 a88d96ab)
Sep 17 22:57:52 loki warning: kernel: nfsd: Dropping request due to malloc failure!
Sep 17 22:57:55 loki warning: kernel: nfsd_dispatch: vers 3 proc 4
Sep 17 22:57:55 loki warning: kernel: nfsd: ACCESS(3)   36: 01070001 000fb001 00000000 d32ff38f 404811a6 a88d96ab 0x1f
Sep 17 22:57:55 loki warning: kernel: nfsd: fh_verify(36: 01070001 000fb001 00000000 d32ff38f 404811a6 a88d96ab)
Sep 17 22:57:55 loki warning: kernel: nfsd: Dropping request due to malloc failure!
Sep 17 22:58:50 hades notice: kernel: nfs: server loki not responding, still trying
Sep 17 22:58:50 hades notice: kernel: nfs: server loki not responding, still trying
Sep 17 22:58:55 hades notice: kernel: nfs: server loki not responding, still trying
Sep 17 22:59:40 hades notice: kernel: nfs: server loki not responding, still trying


From then on, *every* fh_verify() request fails the same way, and
obviously if you can't verify any file handles you can't do much with NFS.

Looking back in the log I see intermittent malloc failures starting
almost as soon as I've booted (allowing a couple of minutes for me to
turn debugging on):

Sep 17 22:25:50 hades notice: kernel: nfs: server loki OK
[...]
Sep 17 22:28:09 loki warning: kernel: nfsd_dispatch: vers 3 proc 19
Sep 17 22:28:09 loki warning: kernel: nfsd: FSINFO(3)   28: 00070001 000fb001 00000000 d32ff38f 404811a6 a88d96ab
Sep 17 22:28:09 loki warning: kernel: nfsd: fh_verify(28: 00070001 000fb001 00000000 d32ff38f 404811a6 a88d96ab)
Sep 17 22:28:09 loki warning: kernel: nfsd: Dropping request due to malloc failure!

A while later we start seeing runs of malloc failures, which I think
correlated with the unexplained pauses in NFS response:

Sep 17 22:33:59 loki warning: kernel: nfsd_dispatch: vers 3 proc 6
Sep 17 22:33:59 loki warning: kernel: nfsd: READ(3) 44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7 20480 bytes at 4096
Sep 17 22:33:59 loki warning: kernel: nfsd: fh_verify(44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7)
Sep 17 22:33:59 loki warning: kernel: nfsd: Dropping request due to malloc failure!
Sep 17 22:33:59 loki warning: kernel: nfsd_dispatch: vers 3 proc 6
Sep 17 22:33:59 loki warning: kernel: nfsd: READ(3) 44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7 28672 bytes at 53248
Sep 17 22:33:59 loki warning: kernel: nfsd: fh_verify(44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7)
Sep 17 22:33:59 loki warning: kernel: nfsd: Dropping request due to malloc failure!
Sep 17 22:33:59 loki warning: kernel: nfsd_dispatch: vers 3 proc 6
Sep 17 22:33:59 loki warning: kernel: nfsd: READ(3) 44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7 8192 bytes at 86016
Sep 17 22:33:59 loki warning: kernel: nfsd: fh_verify(44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7)
Sep 17 22:33:59 loki warning: kernel: nfsd: Dropping request due to malloc failure!
Sep 17 22:33:59 loki warning: kernel: nfsd_dispatch: vers 3 proc 6
Sep 17 22:33:59 loki warning: kernel: nfsd: READ(3) 44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7 8192 bytes at 98304
Sep 17 22:33:59 loki warning: kernel: nfsd: fh_verify(44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7)
Sep 17 22:33:59 loki warning: kernel: nfsd: Dropping request due to malloc failure!
Sep 17 22:33:59 loki warning: kernel: nfsd_dispatch: vers 3 proc 6
Sep 17 22:33:59 loki warning: kernel: nfsd: READ(3) 44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7 65536 bytes at 110592
Sep 17 22:33:59 loki warning: kernel: nfsd: fh_verify(44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7)
Sep 17 22:33:59 loki warning: kernel: nfsd: Dropping request due to malloc failure!
Sep 17 22:33:59 loki warning: kernel: nfsd_dispatch: vers 3 proc 6
Sep 17 22:33:59 loki warning: kernel: nfsd: READ(3) 44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7 65536 bytes at 176128
Sep 17 22:33:59 loki warning: kernel: nfsd: fh_verify(44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7)
Sep 17 22:33:59 loki warning: kernel: nfsd: Dropping request due to malloc failure!
Sep 17 22:33:59 loki warning: kernel: nfsd_dispatch: vers 3 proc 6
Sep 17 22:33:59 loki warning: kernel: nfsd: READ(3) 44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7 65536 bytes at 241664
Sep 17 22:33:59 loki warning: kernel: nfsd: fh_verify(44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7)
Sep 17 22:33:59 loki warning: kernel: nfsd: Dropping request due to malloc failure!
Sep 17 22:33:59 loki warning: kernel: nfsd_dispatch: vers 3 proc 6
Sep 17 22:33:59 loki warning: kernel: nfsd: READ(3) 44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7 53248 bytes at 307200
Sep 17 22:33:59 loki warning: kernel: nfsd: fh_verify(44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7)
Sep 17 22:33:59 loki warning: kernel: nfsd: Dropping request due to malloc failure!
Sep 17 22:33:59 loki warning: kernel: nfsd_dispatch: vers 3 proc 6
Sep 17 22:33:59 loki warning: kernel: nfsd: READ(3) 44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7 8192 bytes at 397312
Sep 17 22:33:59 loki warning: kernel: nfsd: fh_verify(44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7)
Sep 17 22:33:59 loki warning: kernel: nfsd: Dropping request due to malloc failure!
Sep 17 22:33:59 loki warning: kernel: nfsd_dispatch: vers 3 proc 6
Sep 17 22:33:59 loki warning: kernel: nfsd: READ(3) 44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7 12288 bytes at 413696
Sep 17 22:33:59 loki warning: kernel: nfsd: fh_verify(44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7)
Sep 17 22:33:59 loki warning: kernel: nfsd: Dropping request due to malloc failure!
Sep 17 22:33:59 loki warning: kernel: nfsd_dispatch: vers 3 proc 6
Sep 17 22:33:59 loki warning: kernel: nfsd: READ(3) 44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7 42652 bytes at 430080
Sep 17 22:33:59 loki warning: kernel: nfsd: fh_verify(44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7)
Sep 17 22:33:59 loki warning: kernel: nfsd: Dropping request due to malloc failure!
Sep 17 22:34:02 loki warning: kernel: found domain hades.wkstn.nix
Sep 17 22:34:02 loki warning: kernel: found fsidtype 7
Sep 17 22:34:02 loki warning: kernel: found fsid length 24
Sep 17 22:34:02 loki warning: kernel: Path seems to be </usr/lib/X11/fonts>
Sep 17 22:34:02 loki warning: kernel: Found the path /usr/lib/X11/fonts
Sep 17 22:34:02 loki warning: kernel: nfsd_dispatch: vers 3 proc 6
Sep 17 22:34:02 loki warning: kernel: nfsd: READ(3) 44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7 42652 bytes at 430080
Sep 17 22:34:02 loki warning: kernel: nfsd: fh_verify(44: 02070001 0001ce75 00000000 5b3c5587 fc4047d8 e8f7d9b7)
Sep 17 22:34:02 loki warning: kernel: nfsd_dispatch: vers 3 proc 6

Fifteen minutes after that the whole thing seized up with malloc
failures as the only response.

We've obviously got some sort of resource leak (although why it's
suddenly appeared I have no idea). There's buckets of memory, though.
The system's not swapping or under especially high memory pressure
(indeed as I've just rebooted it and hardly started anything but the
usual daemon swarm and LVM/MD stuff, it's under rather *low* pressure by
its standards). (Dumps of /proc/meminfo, /proc/zoneinfo and
/proc/slabinfo below.)


I note that this error can be caused by both ENOMEM and EAGAIN, and that
there are numerous points in fh_verify() whose failure can lead to this
error. I'm just instrumenting them now so I can tell which is failing.



MemTotal:       321172 kB
MemFree:          6152 kB
Buffers:          2540 kB
Cached:         106840 kB
SwapCached:          0 kB
Active:         277004 kB
Inactive:         4692 kB
SwapTotal:     1851368 kB
SwapFree:      1851188 kB
Dirty:             160 kB
Writeback:           0 kB
AnonPages:      172340 kB
Mapped:          72692 kB
Slab:            12764 kB
SReclaimable:     4852 kB
SUnreclaim:       7912 kB
PageTables:       2616 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:   2011952 kB
Committed_AS:   492284 kB
VmallocTotal:   712684 kB
VmallocUsed:      2648 kB
VmallocChunk:   710008 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
Hugepagesize:     4096 kB

Node 0, zone      DMA
  pages free     337
        min      28
        low      35
        high     42
        scanned  0 (a: 23 i: 24)
        spanned  4096
        present  4064
    nr_free_pages 337
    nr_inactive  0
    nr_active    2811
    nr_anon_pages 1788
    nr_mapped    885
    nr_file_pages 1023
    nr_dirty     0
    nr_writeback 0
    nr_slab_reclaimable 47
    nr_slab_unreclaimable 32
    nr_page_table_pages 11
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 421
        protection: (0, 301)
  pagesets
    cpu: 0 pcp: 0
              count: 0
              high:  0
              batch: 1
    cpu: 0 pcp: 1
              count: 0
              high:  0
              batch: 1
  all_unreclaimable: 0
  prev_priority:     12
  start_pfn:         0
Node 0, zone   Normal
  pages free     1227
        min      541
        low      676
        high     811
        scanned  0 (a: 0 i: 0)
        spanned  77804
        present  77197
    nr_free_pages 1227
    nr_inactive  5698
    nr_active    61811
    nr_anon_pages 40341
    nr_mapped    19329
    nr_file_pages 27174
    nr_dirty     13
    nr_writeback 0
    nr_slab_reclaimable 1049
    nr_slab_unreclaimable 2047
    nr_page_table_pages 698
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 4904
        protection: (0, 0)
  pagesets
    cpu: 0 pcp: 0
              count: 66
              high:  90
              batch: 15
    cpu: 0 pcp: 1
              count: 5
              high:  30
              batch: 7
  all_unreclaimable: 0
  prev_priority:     12
  start_pfn:         4096

# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
jbd_4k                 0      0   4096    1    1 : tunables   24   12    0 : slabdata      0      0      0
jbd_1k                 0      0   1024    4    1 : tunables   54   27    0 : slabdata      0      0      0
jbd_2k                 0      0   2048    2    1 : tunables   24   12    0 : slabdata      0      0      0
raid5-md2            512    517    344   11    1 : tunables   54   27    0 : slabdata     47     47      0
raid5-md1            512    517    344   11    1 : tunables   54   27    0 : slabdata     47     47      0
rpc_buffers            8      8   2048    2    1 : tunables   24   12    0 : slabdata      4      4      0
rpc_tasks              8     24    160   24    1 : tunables  120   60    0 : slabdata      1      1      0
rpc_inode_cache       20     20    384   10    1 : tunables   54   27    0 : slabdata      2      2      0
bridge_fdb_cache       8     59     64   59    1 : tunables  120   60    0 : slabdata      1      1      0
UNIX                  91    100    384   10    1 : tunables   54   27    0 : slabdata     10     10      0
dm-snapshot-in       128    134     56   67    1 : tunables  120   60    0 : slabdata      2      2      0
dm-snapshot-ex         0      0     16  203    1 : tunables  120   60    0 : slabdata      0      0      0
dm-crypt_io            0      0     36  101    1 : tunables  120   60    0 : slabdata      0      0      0
dm_tio              4366   4669     16  203    1 : tunables  120   60    0 : slabdata     23     23      0
dm_io               4366   4563     20  169    1 : tunables  120   60    0 : slabdata     27     27      0
uhci_urb_priv          4    127     28  127    1 : tunables  120   60    0 : slabdata      1      1      0
scsi_cmd_cache        39     39    288   13    1 : tunables   54   27    0 : slabdata      3      3      0
cfq_io_context       250    288     80   48    1 : tunables  120   60    0 : slabdata      6      6      0
cfq_queue            247    264     88   44    1 : tunables  120   60    0 : slabdata      6      6      0
mqueue_inode_cache      1      8    480    8    1 : tunables   54   27    0 : slabdata      1      1      0
fuse_request           0      0    360   11    1 : tunables   54   27    0 : slabdata      0      0      0
fuse_inode             0      0    352   11    1 : tunables   54   27    0 : slabdata      0      0      0
nfs_write_data        36     36    448    9    1 : tunables   54   27    0 : slabdata      4      4      0
nfs_read_data         32     36    416    9    1 : tunables   54   27    0 : slabdata      4      4      0
nfs_inode_cache       15     21    560    7    1 : tunables   54   27    0 : slabdata      3      3      0
nfs_page               0      0     64   59    1 : tunables  120   60    0 : slabdata      0      0      0
isofs_inode_cache      0      0    324   12    1 : tunables   54   27    0 : slabdata      0      0      0
hugetlbfs_inode_cache      1     13    296   13    1 : tunables   54   27    0 : slabdata      1      1      0
ext2_inode_cache       5      9    424    9    1 : tunables   54   27    0 : slabdata      1      1      0
ext2_xattr             0      0     44   84    1 : tunables  120   60    0 : slabdata      0      0      0
journal_handle         1    169     20  169    1 : tunables  120   60    0 : slabdata      1      1      0
journal_head          98    360     52   72    1 : tunables  120   60    0 : slabdata      5      5      0
revoke_table          20    254     12  254    1 : tunables  120   60    0 : slabdata      1      1      0
revoke_record          0      0     16  203    1 : tunables  120   60    0 : slabdata      0      0      0
ext3_inode_cache    2392   3429    440    9    1 : tunables   54   27    0 : slabdata    381    381      0
ext3_xattr             0      0     44   84    1 : tunables  120   60    0 : slabdata      0      0      0
reiser_inode_cache    172    180    384   10    1 : tunables   54   27    0 : slabdata     18     18      0
configfs_dir_cache      0      0     48   78    1 : tunables  120   60    0 : slabdata      0      0      0
dnotify_cache          1    169     20  169    1 : tunables  120   60    0 : slabdata      1      1      0
dquot                 17     30    128   30    1 : tunables  120   60    0 : slabdata      1      1      0
inotify_event_cache      0      0     28  127    1 : tunables  120   60    0 : slabdata      0      0      0
inotify_watch_cache      4     92     40   92    1 : tunables  120   60    0 : slabdata      1      1      0
kioctx                 1     24    160   24    1 : tunables  120   60    0 : slabdata      1      1      0
kiocb                  0      0    160   24    1 : tunables  120   60    0 : slabdata      0      0      0
fasync_cache           3    203     16  203    1 : tunables  120   60    0 : slabdata      1      1      0
shmem_inode_cache    950    950    396   10    1 : tunables   54   27    0 : slabdata     95     95      0
posix_timers_cache      0      0    100   39    1 : tunables  120   60    0 : slabdata      0      0      0
uid_cache             15     59     64   59    1 : tunables  120   60    0 : slabdata      1      1      0
UDP-Lite               0      0    480    8    1 : tunables   54   27    0 : slabdata      0      0      0
tcp_bind_bucket       37    203     16  203    1 : tunables  120   60    0 : slabdata      1      1      0
inet_peer_cache        1     59     64   59    1 : tunables  120   60    0 : slabdata      1      1      0
ip_fib_alias          12    113     32  113    1 : tunables  120   60    0 : slabdata      1      1      0
ip_fib_hash           11    113     32  113    1 : tunables  120   60    0 : slabdata      1      1      0
ip_dst_cache          31     90    256   15    1 : tunables  120   60    0 : slabdata      6      6      0
arp_cache              5     30    128   30    1 : tunables  120   60    0 : slabdata      1      1      0
RAW                    2      9    448    9    1 : tunables   54   27    0 : slabdata      1      1      0
UDP                   29     32    480    8    1 : tunables   54   27    0 : slabdata      4      4      0
tw_sock_TCP            0      0     96   40    1 : tunables  120   60    0 : slabdata      0      0      0
request_sock_TCP       0      0     64   59    1 : tunables  120   60    0 : slabdata      0      0      0
TCP                   45     49   1056    7    2 : tunables   24   12    0 : slabdata      7      7      0
eventpoll_pwq          2    101     36  101    1 : tunables  120   60    0 : slabdata      1      1      0
eventpoll_epi          2     40     96   40    1 : tunables  120   60    0 : slabdata      1      1      0
sgpool-128             2      2   2048    2    1 : tunables   24   12    0 : slabdata      1      1      0
sgpool-64              2      4   1024    4    1 : tunables   54   27    0 : slabdata      1      1      0
sgpool-32              2      8    512    8    1 : tunables   54   27    0 : slabdata      1      1      0
sgpool-16              2     15    256   15    1 : tunables  120   60    0 : slabdata      1      1      0
sgpool-8              30     30    128   30    1 : tunables  120   60    0 : slabdata      1      1      0
scsi_io_context        0      0    104   37    1 : tunables  120   60    0 : slabdata      0      0      0
blkdev_ioc            96    113     32  113    1 : tunables  120   60    0 : slabdata      1      1      0
blkdev_queue          26     27    892    9    2 : tunables   54   27    0 : slabdata      3      3      0
blkdev_requests      142    242    176   22    1 : tunables  120   60    0 : slabdata     11     11      0
biovec-256           274    274   3072    2    2 : tunables   24   12    0 : slabdata    137    137      0
biovec-128           274    275   1536    5    2 : tunables   24   12    0 : slabdata     55     55      0
biovec-64            274    275    768    5    1 : tunables   54   27    0 : slabdata     55     55      0
biovec-16            274    280    192   20    1 : tunables  120   60    0 : slabdata     14     14      0
biovec-4             274    295     64   59    1 : tunables  120   60    0 : slabdata      5      5      0
biovec-1             321    609     16  203    1 : tunables  120   60    0 : slabdata      3      3      0
bio                  319    354     64   59    1 : tunables  120   60    0 : slabdata      6      6      0
sock_inode_cache     178    187    352   11    1 : tunables   54   27    0 : slabdata     17     17      0
skbuff_fclone_cache      9     12    320   12    1 : tunables   54   27    0 : slabdata      1      1      0
skbuff_head_cache     75    120    160   24    1 : tunables  120   60    0 : slabdata      5      5      0
file_lock_cache        4     42     92   42    1 : tunables  120   60    0 : slabdata      1      1      0
proc_inode_cache     168    168    312   12    1 : tunables   54   27    0 : slabdata     14     14      0
sigqueue               8     27    144   27    1 : tunables  120   60    0 : slabdata      1      1      0
radix_tree_node     1734   3250    288   13    1 : tunables   54   27    0 : slabdata    250    250      0
bdev_cache            44     45    416    9    1 : tunables   54   27    0 : slabdata      5      5      0
sysfs_dir_cache     3254   3276     48   78    1 : tunables  120   60    0 : slabdata     42     42      0
mnt_cache             67     90    128   30    1 : tunables  120   60    0 : slabdata      3      3      0
inode_cache          801    871    296   13    1 : tunables   54   27    0 : slabdata     67     67      0
dentry              4131   6300    128   30    1 : tunables  120   60    0 : slabdata    210    210      0
filp                1332   1392    160   24    1 : tunables  120   60    0 : slabdata     58     58      0
names_cache            4      4   4096    1    1 : tunables   24   12    0 : slabdata      4      4      0
idr_layer_cache      152    174    136   29    1 : tunables  120   60    0 : slabdata      6      6      0
buffer_head         4991  27144     52   72    1 : tunables  120   60    0 : slabdata    377    377      0
mm_struct            126    126    416    9    1 : tunables   54   27    0 : slabdata     14     14      0
vm_area_struct     12759  13570     84   46    1 : tunables  120   60    0 : slabdata    295    295      0
fs_cache             108    113     32  113    1 : tunables  120   60    0 : slabdata      1      1      0
files_cache           93    100    192   20    1 : tunables  120   60    0 : slabdata      5      5      0
signal_cache         135    135    416    9    1 : tunables   54   27    0 : slabdata     15     15      0
sighand_cache        129    129   1312    3    1 : tunables   24   12    0 : slabdata     43     43      0
task_struct          135    135   1280    3    1 : tunables   24   12    0 : slabdata     45     45      0
anon_vma            1179   1356      8  339    1 : tunables  120   60    0 : slabdata      4      4      0
pid                  150    202     36  101    1 : tunables  120   60    0 : slabdata      2      2      0
size-4194304(DMA)      0      0 4194304    1 1024 : tunables    1    1    0 : slabdata      0      0      0
size-4194304           0      0 4194304    1 1024 : tunables    1    1    0 : slabdata      0      0      0
size-2097152(DMA)      0      0 2097152    1  512 : tunables    1    1    0 : slabdata      0      0      0
size-2097152           0      0 2097152    1  512 : tunables    1    1    0 : slabdata      0      0      0
size-1048576(DMA)      0      0 1048576    1  256 : tunables    1    1    0 : slabdata      0      0      0
size-1048576           0      0 1048576    1  256 : tunables    1    1    0 : slabdata      0      0      0
size-524288(DMA)       0      0 524288    1  128 : tunables    1    1    0 : slabdata      0      0      0
size-524288            0      0 524288    1  128 : tunables    1    1    0 : slabdata      0      0      0
size-262144(DMA)       0      0 262144    1   64 : tunables    1    1    0 : slabdata      0      0      0
size-262144            0      0 262144    1   64 : tunables    1    1    0 : slabdata      0      0      0
size-131072(DMA)       0      0 131072    1   32 : tunables    8    4    0 : slabdata      0      0      0
size-131072            0      0 131072    1   32 : tunables    8    4    0 : slabdata      0      0      0
size-65536(DMA)        0      0  65536    1   16 : tunables    8    4    0 : slabdata      0      0      0
size-65536             1      1  65536    1   16 : tunables    8    4    0 : slabdata      1      1      0
size-32768(DMA)        0      0  32768    1    8 : tunables    8    4    0 : slabdata      0      0      0
size-32768             0      0  32768    1    8 : tunables    8    4    0 : slabdata      0      0      0
size-16384(DMA)        0      0  16384    1    4 : tunables    8    4    0 : slabdata      0      0      0
size-16384             0      0  16384    1    4 : tunables    8    4    0 : slabdata      0      0      0
size-8192(DMA)         0      0   8192    1    2 : tunables    8    4    0 : slabdata      0      0      0
size-8192              5      5   8192    1    2 : tunables    8    4    0 : slabdata      5      5      0
size-4096(DMA)         0      0   4096    1    1 : tunables   24   12    0 : slabdata      0      0      0
size-4096             96     96   4096    1    1 : tunables   24   12    0 : slabdata     96     96      0
size-2048(DMA)         0      0   2048    2    1 : tunables   24   12    0 : slabdata      0      0      0
size-2048            126    134   2048    2    1 : tunables   24   12    0 : slabdata     67     67      0
size-1024(DMA)         0      0   1024    4    1 : tunables   54   27    0 : slabdata      0      0      0
size-1024            128    128   1024    4    1 : tunables   54   27    0 : slabdata     32     32      0
size-512(DMA)          0      0    512    8    1 : tunables   54   27    0 : slabdata      0      0      0
size-512             416    416    512    8    1 : tunables   54   27    0 : slabdata     52     52      0
size-256(DMA)          0      0    256   15    1 : tunables  120   60    0 : slabdata      0      0      0
size-256             446    465    256   15    1 : tunables  120   60    0 : slabdata     31     31      0
size-192(DMA)          0      0    192   20    1 : tunables  120   60    0 : slabdata      0      0      0
size-192             133    140    192   20    1 : tunables  120   60    0 : slabdata      7      7      0
size-128(DMA)          0      0    128   30    1 : tunables  120   60    0 : slabdata      0      0      0
size-128             707   1140    128   30    1 : tunables  120   60    0 : slabdata     38     38      0
size-96(DMA)           0      0     96   40    1 : tunables  120   60    0 : slabdata      0      0      0
size-96              517    800     96   40    1 : tunables  120   60    0 : slabdata     20     20      0
size-64(DMA)           0      0     64   59    1 : tunables  120   60    0 : slabdata      0      0      0
size-32(DMA)           0      0     32  113    1 : tunables  120   60    0 : slabdata      0      0      0
size-64             3140   3776     64   59    1 : tunables  120   60    0 : slabdata     64     64      0
size-32             2339   2825     32  113    1 : tunables  120   60    0 : slabdata     25     25      0
kmem_cache           143    160     96   40    1 : tunables  120   60    0 : slabdata      4      4      0


* Re: [2.6.22.6] nfsd: fh_verify() `malloc failure' with lots of free memory leads to NFS hang
  2007-09-17 22:23 [2.6.22.6] nfsd: fh_verify() `malloc failure' with lots of free memory leads to NFS hang Nix
@ 2007-09-17 22:36 ` J. Bruce Fields
  2007-09-17 23:54   ` Nix
  0 siblings, 1 reply; 7+ messages in thread
From: J. Bruce Fields @ 2007-09-17 22:36 UTC (permalink / raw)
  To: Nix; +Cc: linux-kernel

On Mon, Sep 17, 2007 at 11:23:46PM +0100, Nix wrote:
> Sep 17 22:57:55 loki warning: kernel: nfsd_dispatch: vers 3 proc 4
> Sep 17 22:57:55 loki warning: kernel: nfsd: ACCESS(3)   36: 01070001 000fb001 00000000 d32ff38f 404811a6 a88d96ab 0x1f
> Sep 17 22:57:55 loki warning: kernel: nfsd: fh_verify(36: 01070001 000fb001 00000000 d32ff38f 404811a6 a88d96ab)
> Sep 17 22:57:55 loki warning: kernel: nfsd: Dropping request due to malloc failure!
> Sep 17 22:58:50 hades notice: kernel: nfs: server loki not responding, still trying
> Sep 17 22:58:50 hades notice: kernel: nfs: server loki not responding, still trying
> Sep 17 22:58:55 hades notice: kernel: nfs: server loki not responding, still trying
> Sep 17 22:59:40 hades notice: kernel: nfs: server loki not responding, still trying
> 
> 
> From then on, *every* fh_verify() request fails the same way, and
> obviously if you can't verify any file handles you can't do much with NFS.
> 
> Looking back in the log I see intermittent malloc failures starting
> almost as soon as I've booted (allowing a couple of minutes for me to
> turn debugging on):
> 
> Sep 17 22:25:50 hades notice: kernel: nfs: server loki OK
> [...]
> Sep 17 22:28:09 loki warning: kernel: nfsd_dispatch: vers 3 proc 19
> Sep 17 22:28:09 loki warning: kernel: nfsd: FSINFO(3)   28: 00070001 000fb001 00000000 d32ff38f 404811a6 a88d96ab
> Sep 17 22:28:09 loki warning: kernel: nfsd: fh_verify(28: 00070001 000fb001 00000000 d32ff38f 404811a6 a88d96ab)
> Sep 17 22:28:09 loki warning: kernel: nfsd: Dropping request due to malloc failure!
> 
> A while later we start seeing runs of malloc failures, which I think
> correlated with the unexplained pauses in NFS response:

Actually, they're nothing to do with malloc failures--the message
printed here is misleading, and isn't even an error; it gets printed
whenever an upcall to mountd is made.  The problem is almost certainly
in kernel<->mountd communication--the kernel depends on mountd to
answer questions about exported filesystems as part of the fh_verify
code.

It's just a shot in the dark, but you might try the latest nfs-utils
(get the latest out of git://linux-nfs.org/nfs-utils if you're already
on the most recent your distro will give you).  Or just apply the
following--which did fix a problem whose symptoms varied depending on
libc behavior.

If that doesn't work, I'd try

	strace -s0 `pidof rpc.mountd`

and also look at the contents of /proc/net/rpc/nfsd.fh/contents.

--b.

commit dd087896285da9e160e13ee9f7d75381b67895e3
Author: J. Bruce Fields <bfields@citi.umich.edu>
Date:   Thu Jul 26 16:30:46 2007 -0400

    Use __fpurge to ensure single-line writes to cache files
    
    On a recent Debian/Sid machine, I saw libc retrying stdio writes that
    returned write errors.  The result is that if an export downcall returns
    an error (which it can in normal operation, since it currently
    (incorrectly) returns -ENOENT on any negative downcall), then subsequent
    downcalls will write multiple lines (including the original line that
    received the error).
    
    The result is that the server fails to respond to any rpc call that
    refers to an unexported mount point (such as a readdir of a directory
    containing such a mountpoint), so client commands hang.
    
    I don't know whether this libc behavior is correct or expected, but it
    seems safest to add the __fpurge() (suggested by Neil) to ensure data is
    thrown away.
    
    Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu>
    Signed-off-by: Neil Brown <neilb@suse.de>

diff --git a/support/nfs/cacheio.c b/support/nfs/cacheio.c
index a76915b..9d271cd 100644
--- a/support/nfs/cacheio.c
+++ b/support/nfs/cacheio.c
@@ -17,6 +17,7 @@
 
 #include <nfslib.h>
 #include <stdio.h>
+#include <stdio_ext.h>
 #include <ctype.h>
 #include <unistd.h>
 #include <sys/types.h>
@@ -111,7 +112,18 @@ void qword_printint(FILE *f, int num)
 
 int qword_eol(FILE *f)
 {
+	int err;
+
 	fprintf(f,"\n");
+	err = fflush(f);
+	/*
+	 * We must send one line (and one line only) in a single write
+	 * call.  In case of a write error, libc may accumulate the
+	 * unwritten data and try to write it again later, resulting in a
+	 * multi-line write.  So we must explicitly ask it to throw away
+	 * any such cached data:
+	 */
+	__fpurge(f);
 	return fflush(f);
 }
 

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [2.6.22.6] nfsd: fh_verify() `malloc failure' with lots of free memory leads to NFS hang
  2007-09-17 22:36 ` J. Bruce Fields
@ 2007-09-17 23:54   ` Nix
  2007-09-18  1:12     ` J. Bruce Fields
  0 siblings, 1 reply; 7+ messages in thread
From: Nix @ 2007-09-17 23:54 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-kernel

On 17 Sep 2007, J. Bruce Fields stated:
> On Mon, Sep 17, 2007 at 11:23:46PM +0100, Nix wrote:
>> A while later we start seeing runs of malloc failures, which I think
>> correlated with the unexplained pauses in NFS response:
>
> Actually, they're nothing to do with malloc failures--the message
> printed here is misleading, and isn't even an error; it gets printed
> whenever an upcall to mountd is made.

Indeed, with more debugging, all the failures I see come from the call
to exp_find(), which is digging out exports...

>                                        The problem is almost certainly a
> problem with kernel<->mountd communication--the kernel depends on mountd
> to answer questions about exported filesystems as part of the fh_verify
> code.

Ah! I keep forgetting that mountd isn't just used at mount time: damned
misleading names, grumble.

Restarting mountd clears the problem up temporarily, so you are
definitely right.

> commit dd087896285da9e160e13ee9f7d75381b67895e3
> Author: J. Bruce Fields <bfields@citi.umich.edu>
> Date:   Thu Jul 26 16:30:46 2007 -0400

Aha! I'm on 3b55934b9baefecee17aefc3ea139e261a4b03b8, over a month older.

>     On a recent Debian/Sid machine, I saw libc retrying stdio writes that
>     returned write errors.

Debian Sid recently upgraded to glibc 2.6.x, as did I... earlier
versions of glibc will have had this behaviour too, but it may have been
less frequent.

>     I don't know whether this libc behavior is correct or expected, but it
>     seems safest to add the __fpurge() (suggested by Neil) to ensure data is
>     thrown away.

It is expected, judging from my reading of the
code. stdio-common/vfprintf.c emits single chars using the outchar()
macro, and strings using the outstring() macro, using functions in libio
to do the job. The string output routine then calls _IO_file_xsputn(),
which, tracing through libio's jump tables and symbol aliases, ends up
calling _IO_new_file_xsputn() in libio/fileops.c. (I've only just
started to understand libio. It's basically undocumented as far as I can
tell, but it's deeply nifty. Think of stdio, only made entirely out of
hookable components. :) )

(Actual writing then thunks down through _IO_new_do_write() and
new_do_write() in the same file, which finally calls __write().  If
there's any kind of error this returns EOF after some opaque messing
about with a _cur_column value, which is as far as I can tell never
used!)

The code which calls new_do_write() looks like this:

,----[ libio/fileops.c:_IO_new_file_xsputn() ]
|  if (do_write)
|    {
|      count = new_do_write (f, s, do_write);
|      to_do -= count;
|      if (count < do_write)
|        return n - to_do;
|    }
`----

This code handles partial writes followed by errors by returning a
suitable nonzero value, and immediate errors by returning -1.

In either case the buffer will have been filled as much as possible by
that point, and will still be filled when (vf)printf() is next called.


This behaviour is, IIRC, mandated by the C Standard: I can find no
reference in the Standard to streams being flushed on error, only
on fclose(), fflush(), or program termination.


I'm upgrading now: thank you!

-- 
`Some people don't think performance issues are "real bugs", and I think 
such people shouldn't be allowed to program.' --- Linus Torvalds


* Re: [2.6.22.6] nfsd: fh_verify() `malloc failure' with lots of free memory leads to NFS hang
  2007-09-17 23:54   ` Nix
@ 2007-09-18  1:12     ` J. Bruce Fields
  2007-09-18  6:18       ` Nix
  2007-09-21 18:46       ` Nix
  0 siblings, 2 replies; 7+ messages in thread
From: J. Bruce Fields @ 2007-09-18  1:12 UTC (permalink / raw)
  To: Nix; +Cc: linux-kernel

On Tue, Sep 18, 2007 at 12:54:07AM +0100, Nix wrote:
> The code which calls new_do_write() looks like this:
> 
> ,----[ libio/fileops.c:_IO_new_file_xsputn() ]
> |  if (do_write)
> |    {
> |      count = new_do_write (f, s, do_write);
> |      to_do -= count;
> |      if (count < do_write)
> |        return n - to_do;
> |    }
> `----
> 
> This code handles partial writes followed by errors by returning a
> suitable nonzero value, and immediate errors by returning -1.
> 
> In either case the buffer will have been filled as much as possible by
> that point, and will still be filled when (vf)printf() is next called.

OK, I'm a little lost at this point (what's n?  What's to_do?), but I'll
take your word for it.

I'd be kinda curious when exactly the behavior changed and why.

Also I suppose we should check which version of nfs-utils that fix is in
and make sure distributions are getting the fixed nfs-utils before they
get the new libc, or we're going to see this bug a lot....

> This behaviour is, IIRC, mandated by the C Standard: I can find no
> reference in the Standard to streams being flushed on error, only
> on fclose(), fflush(), or program termination.

OK!

Let me know if the problem's fixed.

--b.


* Re: [2.6.22.6] nfsd: fh_verify() `malloc failure' with lots of free memory leads to NFS hang
  2007-09-18  1:12     ` J. Bruce Fields
@ 2007-09-18  6:18       ` Nix
  2007-09-21 18:46       ` Nix
  1 sibling, 0 replies; 7+ messages in thread
From: Nix @ 2007-09-18  6:18 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-kernel

On 18 Sep 2007, J. Bruce Fields stated:

> On Tue, Sep 18, 2007 at 12:54:07AM +0100, Nix wrote:
>> The code which calls new_do_write() looks like this:
>> 
>> ,----[ libio/fileops.c:_IO_new_file_xsputn() ]
>> |  if (do_write)
>> |    {
>> |      count = new_do_write (f, s, do_write);
>> |      to_do -= count;
>> |      if (count < do_write)
>> |        return n - to_do;
>> |    }
>> `----
>> 
>> This code handles partial writes followed by errors by returning a
>> suitable nonzero value, and immediate errors by returning -1.
>> 
>> In either case the buffer will have been filled as much as possible by
>> that point, and will still be filled when (vf)printf() is next called.
>
> OK, I'm a little lost at this point (what's n?  What's to_do?), but I'll
> take your word for it.

n is the total amount to write: to_do is the amount still unwritten. The
rest is buffered but written.

> I'd be kinda curious when exactly the behavior changed and why.

Same here. I'm sort of surprised that it *did* change. The last change
to anything in that function was in 2005, and the fragment shown is
ten years old.

I suspect some other change made it easier to see this pre-existing
behaviour. I wonder if something in glibc used to call __fpurge() for
you?

> Also I suppose we should check which version of nfs-utils that fix is in
> and make sure distributions are getting the fixed nfs-utils before they
> get the new libc, or we're going to see this bug a lot....

And since it looks like a kernel bug, idiots like me are going to keep on
bugging the l-k list with a non-kernel bug.

> Let me know if the problem's fixed.

Well, the machine's still running after more than six hours, where before
it would freeze solid in half an hour or less.

So I'd say it's fixed :)

-- 
`Some people don't think performance issues are "real bugs", and I think 
such people shouldn't be allowed to program.' --- Linus Torvalds


* Re: [2.6.22.6] nfsd: fh_verify() `malloc failure' with lots of free memory leads to NFS hang
  2007-09-18  1:12     ` J. Bruce Fields
  2007-09-18  6:18       ` Nix
@ 2007-09-21 18:46       ` Nix
  2007-09-21 21:13         ` J. Bruce Fields
  1 sibling, 1 reply; 7+ messages in thread
From: Nix @ 2007-09-21 18:46 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-kernel

On 18 Sep 2007, J. Bruce Fields told this:
> Also I suppose we should check which version of nfs-utils that fix is in
> and make sure distributions are getting the fixed nfs-utils before they
> get the new libc, or we're going to see this bug a lot....

Further info. This behaviour, although it is allowed by POSIX, is not
*specified*, as such, by it (ah! ambiguity! life would be so dull
without you!), and is not replicated by any other C library on the face
of the planet.

Further, it's not even the behaviour of glibc, as such.

It's the behaviour of glibc 2.6.x as patched with this Debian-specific
patch:

,----[ any/submitted-fileops-and-signals.diff ]
| Index: glibc-2.6/libio/fileops.c
| ===================================================================
| --- glibc-2.6.orig/libio/fileops.c
| +++ glibc-2.6/libio/fileops.c
| @@ -517,10 +517,19 @@
|      fp->_cur_column = INTUSE(_IO_adjust_column) (fp->_cur_column - 1, data,
|  						 count) + 1;
|    _IO_setg (fp, fp->_IO_buf_base, fp->_IO_buf_base, fp->_IO_buf_base);
| -  fp->_IO_write_base = fp->_IO_write_ptr = fp->_IO_buf_base;
| -  fp->_IO_write_end = (fp->_mode <= 0
| +  if (__builtin_expect (count == to_do, 1))
| +    {
| +      fp->_IO_write_base = fp->_IO_write_ptr = fp->_IO_buf_base;
| +      fp->_IO_write_end = (fp->_mode <= 0
|  		       && (fp->_flags & (_IO_LINE_BUF+_IO_UNBUFFERED))
|  		       ? fp->_IO_buf_base : fp->_IO_buf_end);
| +    }
| +  else if (count > 0 && fp->_IO_write_base == data)
| +    {
| +      memmove (data, data+count, 
| +               fp->_IO_write_ptr - fp->_IO_write_base - count);
| +      fp->_IO_write_ptr -= count;
| +    }
|    return count;
|  }
|  
`----

This patch also breaks at least bash and zsh as well as nfs-utils (see
<http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=429021>) and I sort of
suspect it's not long for this world as a result.

So this is much less nasty than we feared (well, and more nasty if you
happen to be a Debian unstable or testing user).

-- 
`Some people don't think performance issues are "real bugs", and I think 
such people shouldn't be allowed to program.' --- Linus Torvalds


* Re: [2.6.22.6] nfsd: fh_verify() `malloc failure' with lots of free memory leads to NFS hang
  2007-09-21 18:46       ` Nix
@ 2007-09-21 21:13         ` J. Bruce Fields
  0 siblings, 0 replies; 7+ messages in thread
From: J. Bruce Fields @ 2007-09-21 21:13 UTC (permalink / raw)
  To: Nix; +Cc: linux-kernel

On Fri, Sep 21, 2007 at 07:46:32PM +0100, Nix wrote:
> On 18 Sep 2007, J. Bruce Fields told this:
> > Also I suppose we should check which version of nfs-utils that fix is in
> > and make sure distributions are getting the fixed nfs-utils before they
> > get the new libc, or we're going to see this bug a lot....
> 
> Further info. This behaviour, although it is allowed by POSIX, is not
> *specified*, as such, by it (ah! ambiguity! life would be so dull
> without you!), and is not replicated by any other C library on the face
> of the planet.
> 
> Further, it's not even the behaviour of glibc, as such.
> 
> It's the behaviour of glibc 2.6.x as patched with this Debian-specific
> patch:

Good grief.  Well, I'd still count it as a bug in nfs-utils, then, but
nfs-utils' share of the blame is much less than we'd previously
thought....

> This patch also breaks at least bash and zsh as well as nfs-utils (see
> <http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=429021>) and I sort of
> suspect it's not long for this world as a result.
> 
> So this is much less nasty than we feared (well, and more nasty if you
> happen to be a Debian unstable or testing user).

OK!  Sounds like good news.

--b.


end of thread, other threads:[~2007-09-21 21:13 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-09-17 22:23 [2.6.22.6] nfsd: fh_verify() `malloc failure' with lots of free memory leads to NFS hang Nix
2007-09-17 22:36 ` J. Bruce Fields
2007-09-17 23:54   ` Nix
2007-09-18  1:12     ` J. Bruce Fields
2007-09-18  6:18       ` Nix
2007-09-21 18:46       ` Nix
2007-09-21 21:13         ` J. Bruce Fields
