From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 7 Jul 2015 10:35:42 +1000
From: Dave Chinner
Subject: Re: Failing XFS filesystem underlying Ceph OSDs
Message-ID: <20150707003542.GW7943@dastard>
References: <20150703235141.GQ7943@dastard> <20150704233802.GS7943@dastard> <20150705232443.GA3902@dastard>
List-Id: XFS Filesystem from SGI
To: Alex Gorbachev
Cc: xfs@oss.sgi.com

On Mon, Jul 06, 2015 at 03:20:19PM -0400, Alex Gorbachev wrote:
> On Sun, Jul 5, 2015 at 7:24 PM, Dave Chinner wrote:
> > On Sun, Jul 05, 2015 at 12:25:47AM -0400, Alex Gorbachev wrote:
> > > > > sysctl vm.swappiness=20 (can probably be 1 as per article)
> > > > >
> > > > > sysctl vm.min_free_kbytes=262144
> > >
> > > [...]
> > >
> > > We have experienced the problem in various guises with kernels 3.14, 3.19,
> > > 4.1-rc2 and now 4.1, so it's not new to us, just different error stack.
> > > Below are some other stack dumps of what manifested as the same error.
> > >
> > > [] schedule+0x29/0x70
> > > [] _xfs_log_force+0x187/0x280 [xfs]
> > > [] ? try_to_wake_up+0x2a0/0x2a0
> > > [] xfs_log_force+0x39/0xc0 [xfs]
> > > [] xfsaild_push+0x552/0x5a0 [xfs]
> > > [] ? schedule_timeout+0x124/0x210
> > > [] xfsaild+0x9f/0x140 [xfs]
> > > [] ? xfsaild_push+0x5a0/0x5a0 [xfs]
> > > [] kthread+0xc9/0xe0
> > > [] ? flush_kthread_worker+0x90/0x90
> > > [] ret_from_fork+0x58/0x90
> > > [] ? flush_kthread_worker+0x90/0x90
> > > INFO: task xfsaild/sdg1:2606 blocked for more than 120 seconds.
> > > Not tainted 3.19.4-031904-generic #201504131440
> > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >
> > That's indicative of IO completion problems, but not a crash.
> >
> > > BUG: unable to handle kernel NULL pointer dereference at (null)
> > > IP: [] xfs_count_page_state+0x3f/0x70 [xfs]
> > ....
> > > [] xfs_vm_releasepage+0x40/0x120 [xfs]
> > > [] try_to_release_page+0x32/0x50
> > > [] shrink_page_list+0x69d/0x720
> > > [] shrink_inactive_list+0x1dd/0x5d0
> > ....
> >
> > Again, this is indicative of a page cache issue: a page without
> > buffers has been passed to xfs_vm_releasepage(), which implies the
> > page flags are not correct. i.e PAGE_FLAGS_PRIVATE is set but
> > page->private is null...
> >
> > Again, this is unlikely to be an XFS issue.
> >
>
> Sorry for my ignorance, but would this likely come from Ceph code or a
> hardware issue of some kind, such as a disk drive? I have reached out to
> RedHat and Ceph community on that as well.

More likely a kernel bug somewhere in the page cache or memory reclaim
paths. The issue is that we only notice the problem long after it has
occurred, i.e. when XFS goes to tear down the page it has been handed,
the page is already in a bad state, so it doesn't really tell us
anything about the cause of the problem.

Realistically, we need a script that reproduces the problem (that
doesn't require a Ceph cluster) to be able to isolate the cause.
In the meantime, you can always try running a kernel built with
CONFIG_XFS_WARN=y to see if that catches problems earlier, and you
might also want to turn on memory poisoning and other kernel debugging
options to try to isolate the cause of the issue....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
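
[Editorial sketch, not part of the original mail.] Before rebuilding
anything, it may help to check which of the debugging options mentioned
above the running kernel already has. The Python below is a minimal
sketch that parses kernel build config text and reports a few options.
Only CONFIG_XFS_WARN is named in the mail; the other option names are
assumed examples of "memory poisoning and other kernel debugging
options", and the helper names and the /boot/config-<release> path are
likewise assumptions (not every distribution installs the config there).

```python
import platform
import re

# Only CONFIG_XFS_WARN comes from the mail. The remaining names are
# assumed examples of memory-poisoning / debugging options.
DEBUG_OPTIONS = (
    "CONFIG_XFS_WARN",
    "CONFIG_PAGE_POISONING",
    "CONFIG_DEBUG_PAGEALLOC",
    "CONFIG_DEBUG_VM",
)

# Matches lines such as "CONFIG_XFS_WARN=y". Lines of the form
# "# CONFIG_FOO is not set" start with "#" and therefore do not match.
_SET_RE = re.compile(r"^(CONFIG_[A-Za-z0-9_]+)=(.*)$")


def parse_kernel_config(text):
    """Return {option: value} for every option that is set in the text."""
    settings = {}
    for line in text.splitlines():
        match = _SET_RE.match(line.strip())
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


def report_debug_options(text, wanted=DEBUG_OPTIONS):
    """Map each wanted option to its value, or None when it is not set."""
    settings = parse_kernel_config(text)
    return {name: settings.get(name) for name in wanted}


def read_running_kernel_config():
    """Read the build config of the running kernel (assumed location)."""
    path = "/boot/config-" + platform.release()
    with open(path) as handle:
        return handle.read()
```

Calling report_debug_options(read_running_kernel_config()) on the
affected host returns a dict in which None marks an option the kernel
was not built with. Any such option needs a rebuilt kernel; it cannot
be switched on through sysctl.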