From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay1.corp.sgi.com [137.38.102.111]) by oss.sgi.com (Postfix) with ESMTP id 5A1B57F52 for ; Thu, 26 Sep 2013 02:36:49 -0500 (CDT) Received: from cuda.sgi.com (cuda3.sgi.com [192.48.176.15]) by relay1.corp.sgi.com (Postfix) with ESMTP id 43DA18F8039 for ; Thu, 26 Sep 2013 00:36:46 -0700 (PDT) Received: from userp1040.oracle.com (userp1040.oracle.com [156.151.31.81]) by cuda.sgi.com with ESMTP id SlvxOjFu09EZk0dk (version=TLSv1 cipher=AES256-SHA bits=256 verify=NO) for ; Thu, 26 Sep 2013 00:36:45 -0700 (PDT) Received: from acsinet21.oracle.com (acsinet21.oracle.com [141.146.126.237]) by userp1040.oracle.com (Sentrion-MTA-4.3.1/Sentrion-MTA-4.3.1) with ESMTP id r8Q7ahJB029860 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK) for ; Thu, 26 Sep 2013 07:36:44 GMT Received: from userz7022.oracle.com (userz7022.oracle.com [156.151.31.86]) by acsinet21.oracle.com (8.14.4+Sun/8.14.4) with ESMTP id r8Q7agIG016953 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO) for ; Thu, 26 Sep 2013 07:36:43 GMT Received: from abhmt119.oracle.com (abhmt119.oracle.com [141.146.116.71]) by userz7022.oracle.com (8.14.4+Sun/8.14.4) with ESMTP id r8Q7ag4v021080 for ; Thu, 26 Sep 2013 07:36:42 GMT Message-ID: <5243E43A.6010101@oracle.com> Date: Thu, 26 Sep 2013 15:37:30 +0800 From: Jeff Liu MIME-Version: 1.0 Subject: Re: [PATCH v3] xfs: fix dead loop at xfs_vm_writepage() on 32bit machine References: <5243D8DE.8090003@oracle.com> In-Reply-To: <5243D8DE.8090003@oracle.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: xfs-bounces@oss.sgi.com Sender: xfs-bounces@oss.sgi.com To: "xfs@oss.sgi.com" On 09/26/2013 02:49 PM, Jeff Liu wrote: > From: Jie Liu > > Write a file with an offset greater than 16TB on 32-bit system and > then trigger page write-back via sync(1) as below will cause the > task hang in a little while: > > INFO: task sync:2590 blocked for more than 120 seconds. > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > sync D c1064a28 0 2590 2097 0x00000000 > ..... > Call Trace: > [] ? ttwu_do_wakeup+0x18/0x130 > [] ? try_to_wake_up+0x1ce/0x220 > [] ? wake_up_process+0x1f/0x40 > [] ? wake_up_worker+0x1e/0x30 > [] schedule+0x23/0x60 > [] schedule_timeout+0x18d/0x1f0 > [] ? do_raw_spin_unlock+0x4e/0x90 > [] ? __queue_delayed_work+0x91/0x150 > [] ? do_raw_spin_lock+0x3f/0x100 > [] ? do_raw_spin_unlock+0x4e/0x90 > [] wait_for_completion+0x7d/0xc0 > [] ? try_to_wake_up+0x220/0x220 > [] sync_inodes_sb+0x92/0x180 > [] sync_inodes_one_sb+0x15/0x20 > [] iterate_supers+0xb8/0xc0 > [] ? fdatawrite_one_bdev+0x20/0x20 > [] sys_sync+0x31/0x80 > [] sysenter_do_call+0x12/0x28 > > The reason is that the end_index is unsigned long with maximum value > '2^32-1=4294967295' on 32-bit platform, and the given offset cause it > wrapped to 0, so that the following codes will repeat again and again > until the task schedule time out: > > end_index = offset >> PAGE_CACHE_SHIFT; > last_index = (offset - 1) >> PAGE_CACHE_SHIFT; > if (page->index >= end_index) { > unsigned offset_into_page = offset & (PAGE_CACHE_SIZE - 1); > /* > * Just skip the page if it is fully outside i_size, e.g. due > * to a truncate operation that is in progress. > */ > if (page->index >= end_index + 1 || offset_into_page == 0) { > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > unlock_page(page); > return 0; > } > > To check a page is fully outsids i_size or not, we can change the > logic to: > if (page->index > end_index || > (page->index == end_index && offset_into_page == 0)) > > Secondly, there still has another similar issue when calculating the > end offset for mapping the filesystem blocks to the file blocks for > delalloc. With the same tests to above, run unmount(8) will cause > kernel panic if CONFIG_XFS_DEBUG is enabled: > > XFS: Assertion failed: XFS_FORCED_SHUTDOWN(ip->i_mount) || \ > ip->i_delayed_blks == 0, file: fs/xfs/xfs_super.c, line: 964 > > kernel BUG at fs/xfs/xfs_message.c:108! > invalid opcode: 0000 [#1] SMP > task: edddc100 ti: ec6ee000 task.ti: ec6ee000 > EIP: 0060:[] EFLAGS: 00010296 CPU: 1 > EIP is at assfail+0x2b/0x30 [xfs] > .............. > Call Trace: > [] xfs_fs_destroy_inode+0x74/0x120 [xfs] > [] destroy_inode+0x31/0x50 > [] evict+0xef/0x170 > [] dispose_list+0x32/0x40 > [] evict_inodes+0xca/0xe0 > [] generic_shutdown_super+0x46/0xd0 > [] kill_block_super+0x29/0x70 > [] deactivate_locked_super+0x44/0x70 > [] deactivate_super+0x47/0x60 > [] mntput_no_expire+0xcd/0x120 > [] SyS_umount+0xa8/0x370 > [] SyS_oldumount+0x1e/0x20 > [] sysenter_do_call+0x12/0x28 > > That because the end_offset is evaluated to 0 same to above, hence the > mapping and covertion for dealloc file blocks to file system blocks did > not happened. > > This patch just fixed both issues. > > Reported-by: Michael L. Semon > Signed-off-by: Jie Liu > --- > v3: Add code comments to reflect this change. > v2: don't reset the s_max_bytes to MAX_LFS_FILESIZE, instead, revise the page offset > check up strategy to avoid the potential overflow. > v1: http://oss.sgi.com/archives/xfs/2013-07/msg00154.html > > fs/xfs/xfs_aops.c | 49 +++++++++++++++++++++++++++++++++++++++++++------ > 1 file changed, 43 insertions(+), 6 deletions(-) > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c > index e51e581..541c837 100644 > --- a/fs/xfs/xfs_aops.c > +++ b/fs/xfs/xfs_aops.c > @@ -960,7 +960,32 @@ xfs_vm_writepage( > offset = i_size_read(inode); > end_index = offset >> PAGE_CACHE_SHIFT; > last_index = (offset - 1) >> PAGE_CACHE_SHIFT; > - if (page->index >= end_index) { > + > + /* > + * The page index is less than the end_index, adjust the end_offset > + * to the highest offset that this page should represent. > + * ----------------------------------------------------- > + * | file mapping | | > + * ----------------------------------------------------- > + * | Page ... | Page N-2 | Page N-1 | Page N | | > + * ^--------------------------------^----------|-------- > + * | desired writeback range | see else | > + * ---------------------------------^------------------| > + */ > + if (page->index < end_index) > + end_offset = (xfs_off_t)(page->index + 1) << PAGE_CACHE_SHIFT; > + else { > + /* > + * Check whether the page to write out is beyond or straddles > + * i_size or not. > + * ------------------------------------------------------- > + * | file mapping | | > + * ------------------------------------------------------- > + * | Page ... | Page N-2 | Page N-1 | Page N | Beyond | > + * ^--------------------------------^-----------|--------- > + * | | Straddles | > + * ---------------------------------^-----------|--------| > + */ > unsigned offset_into_page = offset & (PAGE_CACHE_SIZE - 1); > > /* > @@ -968,24 +993,36 @@ xfs_vm_writepage( > * truncate operation that is in progress. We must redirty the > * page so that reclaim stops reclaiming it. Otherwise > * xfs_vm_releasepage() is called on it and gets confused. > + * > + * Note that the end_index is unsigned long, it would cause it > + * overflow if the given offset is greater than 16TB on 32-bit > + * system if we do check the page is fully outside i_size or not > + * via "if (page->index >= end_index + 1)" as "end_idex + 1" > + * will be evaluated to 0. Hence this page will be redirtied > + * and be written out again and again which would result in > + * dead loop, the user program that perform this operation will ^^^^^^^^^^^^^ Should I say this is an infinite or endless loop rather than dead loop, although I was intended to comments the same behavior, but someone was kindly reminded me of that the infinite loop is more suitable in English world just now. Thanks, -Jeff > + * hang. Instead, we can verify this situation by checking if > + * the page to write if totally beyond the i_size or if it's > + * offset just equal to the EOF. > */ > - if (page->index >= end_index + 1 || offset_into_page == 0) > + if (page->index > end_index || > + (page->index == end_index && offset_into_page == 0)) > goto redirty; > > /* > * The page straddles i_size. It must be zeroed out on each > * and every writepage invocation because it may be mmapped. > * "A file is mapped in multiples of the page size. For a file > - * that is not a multiple of the page size, the remaining > + * that is not a multiple of the page size, the remaining > * memory is zeroed when mapped, and writes to that region are > * not written out to the file." > */ > zero_user_segment(page, offset_into_page, PAGE_CACHE_SIZE); > + > + /* Adjust the end_offset to the end of file */ > + end_offset = offset; > } > > - end_offset = min_t(unsigned long long, > - (xfs_off_t)(page->index + 1) << PAGE_CACHE_SHIFT, > - offset); > len = 1 << inode->i_blkbits; > > bh = head = page_buffers(page); _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs