Message-ID: <50FE5500.7040300@cn.fujitsu.com>
Date: Tue, 22 Jan 2013 16:59:44 +0800
From: Miao Xie <miaox@cn.fujitsu.com>
To: Alex Lyakas
CC: Josef Bacik, linux-btrfs
Subject: Re: btrfs_start_delalloc_inodes livelocks when creating snapshot under IO

On Mon, 21 Jan 2013 20:37:57 +0200, Alex Lyakas wrote:
> Greetings all,
>
> I see the following issue during snapshot creation under IO:
> Transaction commit calls btrfs_start_delalloc_inodes(), which locks the
> delalloc_inodes list, fetches the first inode, unlocks the list,
> triggers btrfs_alloc_delalloc_work/btrfs_queue_worker for this inode,
> and then locks the list again. Then it checks the head of the list
> again. In my case, this is always exactly the same inode. As a result,
> this function allocates a huge amount of btrfs_delalloc_work
> structures, and I start seeing OOM messages in the kernel log, killing
> processes etc.

OK, I will deal with this problem.

> During that time this transaction commit is stuck, so, for example,
> other requests to create a snapshot (which must wait for this
> transaction to commit first) get stuck. The transaction_kthread also
> gets stuck trying to commit the transaction.
>
> Is this the intended behavior? Shouldn't we ensure that every inode in
> the delalloc list gets handled at most once? If the delalloc work is
> processed asynchronously, maybe the delalloc list can be locked once
> and traversed once?

1st question: It is the intended behavior. We flush the delalloc inodes
when committing the transaction in order to guarantee that the data in
the snapshot is the same as in the source subvolume.
For example:

# dd if=/dev/zero of=/file0 bs=1M count=1
# btrfs subvolume snapshot / /snap0

If we don't flush the delalloc inode (file0), we will find that the data
in /snap0/file0 is different from /file0. That is why we have to flush
the delalloc inodes.

2nd question: I think we needn't flush all the delalloc inodes in the fs
when creating a snapshot; we just need to flush the delalloc inodes of
the source subvolume. If so, we must split the delalloc list and
introduce a per-subvolume delalloc list. (If no one rejects this idea,
I can implement it.)

3rd question: Traversing the delalloc list once is reasonable and safe,
because we can guarantee the following sequence is safe:

	Task
	write data into a file
	make a snapshot

Maybe somebody will worry about this case:

	Task0				Task1
	make a snapshot
	  commit transaction
	    flush delalloc inodes
					write data into a file
	    commit all trees
	    update super block
	  transaction commit end

because we cannot get the data that Task1 wrote from the snapshot. I
think this case can be ignored, because that data was written into the
file after we started the process of the snapshot creation.

> Josef, I see in commit 996d282c7ff470f150a467eb4815b90159d04c47 that
> you mention that "btrfs_start_delalloc_inodes will just walk the list
> of delalloc inodes and start writing them out, but it doesn't splice
> the list or anything so as long as somebody is doing work on the box
> you could end up in this section _forever_." I guess I am hitting this
> here also.
>
> Miao, I tested the behavior before your commit
> 8ccf6f19b67f7e0921063cc309f4672a6afcb528 "Btrfs: make delalloc inodes
> be flushed by multi-task", on kernel 3.6. I see the same issue there as
> well, but OOM doesn't happen, because before your change
> btrfs_start_delalloc_inodes() was calling filemap_flush() directly.
> But I still see that btrfs_start_delalloc_inodes() handles the same
> inode more than once, and in some cases never returns in 15 minutes or
> more, delaying all other transactions. And snapshot creation gets
> stuck for all this time.

You are right. I will fix this problem.

> (The stack I see on kernel 3.6 is like this:
> [] get_request_wait+0xf6/0x1d0
> [] blk_queue_bio+0x7f/0x380
> [] generic_make_request.part.50+0x74/0xb0
> [] generic_make_request+0x68/0x70
> [] submit_bio+0x85/0x110
> [] btrfs_map_bio+0x165/0x2f0 [btrfs]
> [] btrfs_submit_bio_hook+0x7f/0x170 [btrfs]
> [] submit_one_bio+0x6a/0xa0 [btrfs]
> [] submit_extent_page.isra.34+0xe4/0x230 [btrfs]
> [] __extent_writepage+0x5ec/0x810 [btrfs]
> [] extent_write_cache_pages.isra.26.constprop.40+0x2b2/0x410 [btrfs]
> [] extent_writepages+0x45/0x60 [btrfs]
> [] btrfs_writepages+0x28/0x30 [btrfs]
> [] do_writepages+0x21/0x40
> [] __filemap_fdatawrite_range+0x5b/0x60
> [] filemap_flush+0x1c/0x20
> [] btrfs_start_delalloc_inodes+0xc9/0x1f0 [btrfs]
> [] btrfs_commit_transaction+0x44d/0xaf0 [btrfs]
> [] btrfs_mksubvol.isra.53+0x37d/0x440 [btrfs]
> [] btrfs_ioctl_snap_create_transid+0xfa/0x190 [btrfs]
> [] btrfs_ioctl_snap_create_v2+0x103/0x140 [btrfs]
> [] btrfs_ioctl+0x80f/0x1bf0 [btrfs]
> [] do_vfs_ioctl+0x8a/0x340
> [] sys_ioctl+0x91/0xa0
> [] system_call_fastpath+0x16/0x1b
> [] 0xffffffffffffffff
>
> Somehow the request queue of the block device gets empty and the
> transaction waits for a long time to allocate a request.)

I think it is because there is not enough memory and the allocation
must wait for memory reclaim. The kernel comment on the request
allocation path says:

1038  *
1039  * Get a free request from @q.  If %__GFP_WAIT is set in @gfp_mask, this
1040  * function keeps retrying under memory pressure and fails iff @q is dead.
1041  *

Thanks
Miao

> Some details about my setup:
> I am testing the for-linus branch of Chris's tree.
> I have one subvolume with 8 large files (10GB each).
> I am running two fio processes (one per file, so only 2 out of 8 files
> are involved) with 8 threads each, like this:
>
> fio --thread --directory=/btrfs/subvol1 --rw=randwrite --randrepeat=1
> --fadvise_hint=0 --fallocate=posix --size=1000m --filesize=10737418240
> --bsrange=512b-64k --scramble_buffers=1 --nrfiles=1 --overwrite=1
> --ioengine=sync --filename=file-1 --name=job0 --name=job1 --name=job2
> --name=job3 --name=job4 --name=job5 --name=job6 --name=job7
>
> The files are preallocated with fallocate before the fio run.
> Mount options: noatime,nodatasum,nodatacow,nospace_cache
>
> Can somebody please advise on how to address this issue, and, if
> possible, how to solve it on kernel 3.6.
>
> Thanks,
> Alex.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html