* Re: iomap infrastructure and multipage writes V5 @ 2017-02-13 22:32 xfs-owner 0 siblings, 0 replies; 14+ messages in thread From: xfs-owner @ 2017-02-13 22:32 UTC (permalink / raw) To: sandeen [-- Attachment #1: Type: text/plain, Size: 268 bytes --] This list has been closed. Please subscribe to the linux-xfs@vger.kernel.org mailing list and send any future messages there. You can subscribe to the linux-xfs list at http://vger.kernel.org/vger-lists.html#linux-xfs For any questions please post to the new list. [-- Attachment #2: Type: message/rfc822, Size: 4391 bytes --] From: Eric Sandeen <sandeen@sandeen.net> To: Dave Chinner <david@fromorbit.com>, Christoph Hellwig <hch@lst.de> Cc: rpeterso@redhat.com, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com Subject: Re: iomap infrastructure and multipage writes V5 Date: Mon, 13 Feb 2017 16:31:55 -0600 Message-ID: <75a139c8-5e49-3a89-10d1-20caeef3f69a@sandeen.net> On 8/2/16 6:42 PM, Dave Chinner wrote: > On Sun, Jul 31, 2016 at 09:19:00PM +0200, Christoph Hellwig wrote: >> Now after spending this much time I've started wondering why we even >> reserve blocks in xfs_iomap_write_allocate - after all we've reserved >> space for the actual data blocks and the indlen worst case in >> xfs_bmapi_reserve_delalloc. And in fact a little hack to drop that >> reservation seems to solve both the root cause (depleted reserved pool) >> and the cleanup mess. I just haven't spend enought time to convince >> myself that it's actually safe, and in fact looking at the allocator >> makes me thing it only works by accident currently despite generally >> postive test results. >> >> Here is the quick patch if anyone wants to chime in: >> >> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c >> index 620fc91..67c317f 100644 >> --- a/fs/xfs/xfs_iomap.c >> +++ b/fs/xfs/xfs_iomap.c >> @@ -717,7 +717,7 @@ xfs_iomap_write_allocate( >> >> nimaps = 0; >> while (nimaps == 0) { >> - nres = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK); >> + nres = 0; // XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK); >> >> error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, nres, >> 0, XFS_TRANS_RESERVE, &tp); >> > > This solves the problem for me, and from history appears to be the > right thing to do. Christoph, can you send a proper patch for this? Did anything ever come of this? I don't think I saw a patch, and it looks like it is not upstream. Thanks, -Eric > Cheers, > > Dave. > ^ permalink raw reply [flat|nested] 14+ messages in thread
* iomap infrastructure and multipage writes V5 @ 2016-06-01 14:44 Christoph Hellwig 2016-06-01 14:46 ` Christoph Hellwig 2016-06-28 0:26 ` Dave Chinner 0 siblings, 2 replies; 14+ messages in thread From: Christoph Hellwig @ 2016-06-01 14:44 UTC (permalink / raw) To: xfs; +Cc: rpeterso, linux-fsdevel This series adds a new file system I/O path that uses the iomap structure introduced for the pNFS support and supports multi-page buffered writes. This was first started by Dave Chinner a long time ago, then I beat it into shape for production runs in a very constrained ARM NAS environment for Tuxera almost as long ago, and now half a dozen rewrites later it's back. The basic idea is to avoid the per-block get_blocks overhead and make use of extents in the buffered write path by iterating over them instead. Note that patch 1 conflicts with Vishal's DAX error handling series. It would be great to have a stable branch with it so that both the XFS and nvdimm trees could pull it in before the other changes in this area. Changes since V4: - rebase to Linux 4.7-rc1 - fixed an incorrect BUG_ON statement Changes since V3: - fix DAX based zeroing - Reviews and trivial fixes from Bob Changes since V2: - fix the range for delalloc punches after failed writes - updated some changelogs Changes since V1: - add support for fiemap - fix a test failure on 1k block sizes - prepare for 64-bit length, this will be used in a follow-on patchset _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 14+ messages in thread
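To make the cover letter's "iterate over extents instead of calling get_blocks per block" point concrete, here is a rough userspace sketch of the two write-path shapes. Everything in it (struct toy_map, map_block(), map_extent(), the 1024-block write) is an invented stand-in for illustration, not the kernel or XFS API; it only shows why the number of mapping callbacks drops when the filesystem can hand back a whole extent at once.

/*
 * Illustrative userspace sketch only: the names and structures are invented
 * stand-ins, not kernel or XFS APIs. It contrasts a per-block mapping
 * callback with an extent-based one for the same 1024-block write.
 */
#include <stdio.h>

struct toy_map { long start; long len; };

/* stand-in for a get_blocks()-style callback: maps a single block per call */
static struct toy_map map_block(long block)
{
    return (struct toy_map){ .start = block, .len = 1 };
}

/* stand-in for an iomap-style callback: maps a whole extent per call */
static struct toy_map map_extent(long block, long max_len)
{
    return (struct toy_map){ .start = block, .len = max_len };
}

int main(void)
{
    long nr_blocks = 1024;  /* blocks covered by one buffered write */
    long calls, b;

    /* old path: one mapping lookup for every block of every page written */
    for (calls = 0, b = 0; b < nr_blocks; b++, calls++)
        map_block(b);
    printf("per-block mapping:    %ld callbacks\n", calls);

    /* iomap path: one lookup for the extent, then iterate pages inside it */
    for (calls = 0, b = 0; b < nr_blocks; calls++)
        b += map_extent(b, nr_blocks - b).len;
    printf("extent-based mapping: %ld callbacks\n", calls);
    return 0;
}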
* Re: iomap infrastructure and multipage writes V5 2016-06-01 14:44 Christoph Hellwig @ 2016-06-01 14:46 ` Christoph Hellwig 2016-06-28 0:26 ` Dave Chinner 1 sibling, 0 replies; 14+ messages in thread From: Christoph Hellwig @ 2016-06-01 14:46 UTC (permalink / raw) To: xfs; +Cc: rpeterso, linux-fsdevel On Wed, Jun 01, 2016 at 04:44:43PM +0200, Christoph Hellwig wrote: > Note that patch 1 conflicts with Vishals dax error handling series. > It would be great to have a stable branch with it so that both the > XFS and nvdimm tree could pull it in before the other changes in this > area. Please ignore this note - that former patch 1 has been merged in 4.7-rc1. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: iomap infrastructure and multipage writes V5 2016-06-01 14:44 Christoph Hellwig 2016-06-01 14:46 ` Christoph Hellwig @ 2016-06-28 0:26 ` Dave Chinner 2016-06-28 13:28 ` Christoph Hellwig 2016-06-30 17:22 ` Christoph Hellwig 1 sibling, 2 replies; 14+ messages in thread From: Dave Chinner @ 2016-06-28 0:26 UTC (permalink / raw) To: Christoph Hellwig; +Cc: rpeterso, linux-fsdevel, xfs On Wed, Jun 01, 2016 at 04:44:43PM +0200, Christoph Hellwig wrote: > This series add a new file system I/O path that uses the iomap structure > introduced for the pNFS support and support multi-page buffered writes. > > This was first started by Dave Chinner a long time ago, then I did beat > it into shape for production runs in a very constrained ARM NAS > enviroment for Tuxera almost as long ago, and now half a dozen rewrites > later it's back. > > The basic idea is to avoid the per-block get_blocks overhead > and make use of extents in the buffered write path by iterating over > them instead. Christoph, it looks like there's an ENOSPC+ENOMEM behavioural regression here. generic/224 on my 1p/1GB RAM VM using a 1k block size filesystem has significantly different behaviour once ENOSPC is hit with this patchset. It ends up with an endless stream of errors like this: [ 687.530641] XFS (sdc): page discard on page ffffea0000197700, inode 0xb52c, offset 400338944. [ 687.539828] XFS (sdc): page discard on page ffffea0000197740, inode 0xb52c, offset 400343040. [ 687.549035] XFS (sdc): page discard on page ffffea0000197780, inode 0xb52c, offset 400347136. [ 687.558222] XFS (sdc): page discard on page ffffea00001977c0, inode 0xb52c, offset 400351232. [ 687.567391] XFS (sdc): page discard on page ffffea0000846000, inode 0xb52c, offset 400355328. [ 687.576602] XFS (sdc): page discard on page ffffea0000846040, inode 0xb52c, offset 400359424. [ 687.585794] XFS (sdc): page discard on page ffffea0000846080, inode 0xb52c, offset 400363520. [ 687.595005] XFS (sdc): page discard on page ffffea00008460c0, inode 0xb52c, offset 400367616. Yeah, it's been going for ten minutes already, and it's reporting this for every page on inode 0xb52c. df reports: Filesystem 1K-blocks Used Available Use% Mounted on /dev/sdc 1038336 1038334 2 100% /mnt/scratch So it's looking very much like the iomap write is allowing pages through the write side rather than giving ENOSPC. I.e. the initial alloc fails, it triggers a flush, which triggers a page discard, which then allows the next write to proceed because it's freed up blocks. 
The trace looks like this dd-12284 [000] 983.936583: xfs_file_buffered_write: dev 8:32 ino 0xb52c size 0xa00000 offset 0x1f37a000 count 0x1000 ioflags dd-12284 [000] 983.936584: xfs_ilock: dev 8:32 ino 0xb52c flags ILOCK_EXCL caller 0xffffffff8150bb47s dd-12284 [000] 983.936585: xfs_iomap_prealloc_size: dev 8:32 ino 0xb52c prealloc blocks 64 shift 6 m_writeio_blocks 64 dd-12284 [000] 983.936585: xfs_delalloc_enospc: dev 8:32 ino 0xb52c isize 0x1f37a000 disize 0xa00000 offset 0x1f37a000 count 4096 dd-12284 [000] 983.936586: xfs_delalloc_enospc: dev 8:32 ino 0xb52c isize 0x1f37a000 disize 0xa00000 offset 0x1f37a000 count 4096 dd-12284 [000] 983.936586: xfs_iunlock: dev 8:32 ino 0xb52c flags ILOCK_EXCL caller 0xffffffff8150bc61s kworker/u2:0-6 [000] 983.946649: xfs_writepage: dev 8:32 ino 0xb52c pgoff 0x1f379000 size 0x1f37a000 offset 0 length 0 delalloc 1 unwritten 0 kworker/u2:0-6 [000] 983.946650: xfs_ilock: dev 8:32 ino 0xb52c flags ILOCK_SHARED caller 0xffffffff814f4b49s kworker/u2:0-6 [000] 983.946651: xfs_iunlock: dev 8:32 ino 0xb52c flags ILOCK_SHARED caller 0xffffffff814f4bf5s kworker/u2:0-6 [000] 983.948093: xfs_ilock: dev 8:32 ino 0xb52c flags ILOCK_EXCL caller 0xffffffff814f555fs kworker/u2:0-6 [000] 983.948095: xfs_bunmap: dev 8:32 ino 0xb52c size 0xa00000 bno 0x7cde4 len 0x1flags caller 0xffffffff814f9123s kworker/u2:0-6 [000] 983.948096: xfs_bmap_pre_update: dev 8:32 ino 0xb52c state idx 0 offset 511460 block 4503599627239431 count 4 flag 0 caller 0xffffffff814c6ea3s kworker/u2:0-6 [000] 983.948097: xfs_bmap_post_update: dev 8:32 ino 0xb52c state idx 0 offset 511461 block 4503599627239431 count 3 flag 0 caller 0xffffffff814c6f1es kworker/u2:0-6 [000] 983.948097: xfs_bunmap: dev 8:32 ino 0xb52c size 0xa00000 bno 0x7cde5 len 0x1flags caller 0xffffffff814f9123s kworker/u2:0-6 [000] 983.948098: xfs_bmap_pre_update: dev 8:32 ino 0xb52c state idx 0 offset 511461 block 4503599627239431 count 3 flag 0 caller 0xffffffff814c6ea3s kworker/u2:0-6 [000] 983.948098: xfs_bmap_post_update: dev 8:32 ino 0xb52c state idx 0 offset 511462 block 4503599627239431 count 2 flag 0 caller 0xffffffff814c6f1es kworker/u2:0-6 [000] 983.948099: xfs_bunmap: dev 8:32 ino 0xb52c size 0xa00000 bno 0x7cde6 len 0x1flags caller 0xffffffff814f9123s kworker/u2:0-6 [000] 983.948099: xfs_bmap_pre_update: dev 8:32 ino 0xb52c state idx 0 offset 511462 block 4503599627239431 count 2 flag 0 caller 0xffffffff814c6ea3s kworker/u2:0-6 [000] 983.948100: xfs_bmap_post_update: dev 8:32 ino 0xb52c state idx 0 offset 511463 block 4503599627239431 count 1 flag 0 caller 0xffffffff814c6f1es kworker/u2:0-6 [000] 983.948100: xfs_bunmap: dev 8:32 ino 0xb52c size 0xa00000 bno 0x7cde7 len 0x1flags caller 0xffffffff814f9123s kworker/u2:0-6 [000] 983.948101: xfs_iext_remove: dev 8:32 ino 0xb52c state idx 0 offset 511463 block 4503599627239431 count 1 flag 0 caller 0xffffffff814c6d2as kworker/u2:0-6 [000] 983.948101: xfs_iunlock: dev 8:32 ino 0xb52c flags ILOCK_EXCL caller 0xffffffff814f55e8s kworker/u2:0-6 [000] 983.948102: xfs_invalidatepage: dev 8:32 ino 0xb52c pgoff 0x1f379000 size 0x1f37a000 offset 0 length 1000 delalloc 1 unwritten 0 kworker/u2:0-6 [000] 983.948102: xfs_releasepage: dev 8:32 ino 0xb52c pgoff 0x1f379000 size 0x1f37a000 offset 0 length 0 delalloc 0 unwritten 0 [snip eof block scan locking] dd-12284 [000] 983.948239: xfs_file_buffered_write: dev 8:32 ino 0xb52c size 0xa00000 offset 0x1f37a000 count 0x1000 ioflags dd-12284 [000] 983.948239: xfs_ilock: dev 8:32 ino 0xb52c flags ILOCK_EXCL caller 0xffffffff8150bb47s 
dd-12284 [000] 983.948240: xfs_iomap_prealloc_size: dev 8:32 ino 0xb52c prealloc blocks 64 shift 0 m_writeio_blocks 64 dd-12284 [000] 983.948242: xfs_delalloc_enospc: dev 8:32 ino 0xb52c isize 0x1f37a000 disize 0xa00000 offset 0x1f37a000 count 4096 dd-12284 [000] 983.948243: xfs_iext_insert: dev 8:32 ino 0xb52c state idx 0 offset 511464 block 4503599627239431 count 4 flag 0 caller 0xffffffff814c265as dd-12284 [000] 983.948243: xfs_iunlock: dev 8:32 ino 0xb52c flags ILOCK_EXCL caller 0xffffffff8150bc61s dd-12284 [000] 983.948244: xfs_iomap_alloc: dev 8:32 ino 0xb52c size 0xa00000 offset 0x1f37a000 count 4096 type invalid startoff 0x7cde8 startblock -1 blockcount 0x4 dd-12284 [000] 983.948250: xfs_iunlock: dev 8:32 ino 0xb52c flags IOLOCK_EXCL caller 0xffffffff81502378s dd-12284 [000] 983.948254: xfs_ilock: dev 8:32 ino 0xb52c flags IOLOCK_EXCL caller 0xffffffff81502352s dd-12284 [000] 983.948256: xfs_update_time: dev 8:32 ino 0xb52c dd-12284 [000] 983.948257: xfs_log_reserve: dev 8:32 t_ocnt 0 t_cnt 0 t_curr_res 2860 t_unit_res 2860 t_flags XLOG_TIC_INITED reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 956212 grant_write_cycle 1 grant_write_bytes 956212 curr_cycle 1 curr_block 1863 tail_cycle 1 tail_block 1861 dd-12284 [000] 983.948257: xfs_log_reserve_exit: dev 8:32 t_ocnt 0 t_cnt 0 t_curr_res 2860 t_unit_res 2860 t_flags XLOG_TIC_INITED reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 959072 grant_write_cycle 1 grant_write_bytes 959072 curr_cycle 1 curr_block 1863 tail_cycle 1 tail_block 1861 dd-12284 [000] 983.948258: xfs_ilock: dev 8:32 ino 0xb52c flags ILOCK_EXCL caller 0xffffffff8150c4dfs dd-12284 [000] 983.948260: xfs_log_done_nonperm: dev 8:32 t_ocnt 0 t_cnt 0 t_curr_res 2860 t_unit_res 2860 t_flags XLOG_TIC_INITED reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 959072 grant_write_cycle 1 grant_write_bytes 959072 curr_cycle 1 curr_block 1863 tail_cycle 1 tail_block 1861 dd-12284 [000] 983.948260: xfs_log_ungrant_enter: dev 8:32 t_ocnt 0 t_cnt 0 t_curr_res 2860 t_unit_res 2860 t_flags XLOG_TIC_INITED reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 959072 grant_write_cycle 1 grant_write_bytes 959072 curr_cycle 1 curr_block 1863 tail_cycle 1 tail_block 1861 dd-12284 [000] 983.948260: xfs_log_ungrant_sub: dev 8:32 t_ocnt 0 t_cnt 0 t_curr_res 2860 t_unit_res 2860 t_flags XLOG_TIC_INITED reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 959072 grant_write_cycle 1 grant_write_bytes 959072 curr_cycle 1 curr_block 1863 tail_cycle 1 tail_block 1861 dd-12284 [000] 983.948261: xfs_log_ungrant_exit: dev 8:32 t_ocnt 0 t_cnt 0 t_curr_res 2860 t_unit_res 2860 t_flags XLOG_TIC_INITED reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 956212 grant_write_cycle 1 grant_write_bytes 956212 curr_cycle 1 curr_block 1863 tail_cycle 1 tail_block 1861 dd-12284 [000] 983.948261: xfs_iunlock: dev 8:32 ino 0xb52c flags ILOCK_EXCL caller 0xffffffff81526d6cs Ad so the cycle goes. The next page at offset 0x1f37b000 fails allocation, triggers a flush, which results in the write at 0x1f37a000 failing and freeing it's delalloc blocks, which then allows the retry of the write at 0x1f37b000 to succeed.... I can see that mapping errors (i.e. the AS_ENOSPC error) ar enot propagated into the write() path, but that's the same as the old code. 
What I don't quite understand is how delalloc has blown through the XFS_ALLOC_SET_ASIDE() limits which are supposed to trigger ENOSPC on the write() side much earlier than just one or two blocks free.... Something doesn't quite add up here, and I haven't been able to put my finger on it yet. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 14+ messages in thread
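As a reference point for the question above, the set-aside behaviour Dave is describing works roughly like the stand-alone sketch below: a delalloc reservation is only supposed to be granted while it leaves the set-aside blocks untouched, so buffered writes should start seeing ENOSPC while some free space still remains. The numbers and function names here are made up for illustration; this is not the actual XFS free-space counter code.

/* Simplified illustration of a set-aside check; not the actual XFS code. */
#include <errno.h>
#include <stdio.h>

static long fdblocks = 100;             /* free blocks left in the toy fs */
static const long set_aside = 8;        /* blocks held back from user data */

/* grant a delalloc reservation only if it leaves the set-aside pool intact */
static int reserve_delalloc(long nblocks)
{
    if (fdblocks - nblocks < set_aside)
        return -ENOSPC;                 /* fires before free space reaches 0 */
    fdblocks -= nblocks;
    return 0;
}

int main(void)
{
    long granted = 0;

    /* 4 blocks ~= one 4k page of 1k blocks, as in the generic/224 setup */
    while (reserve_delalloc(4) == 0)
        granted += 4;
    printf("granted %ld blocks, ENOSPC with %ld blocks still free\n",
           granted, fdblocks);
    return 0;
}

Run as-is, the toy stops granting space with the set-aside blocks still free, which is the behaviour the write() side is expected to show and apparently did not here.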
* Re: iomap infrastructure and multipage writes V5 2016-06-28 0:26 ` Dave Chinner @ 2016-06-28 13:28 ` Christoph Hellwig 2016-06-28 13:38 ` Christoph Hellwig 2016-06-30 17:22 ` Christoph Hellwig 1 sibling, 1 reply; 14+ messages in thread From: Christoph Hellwig @ 2016-06-28 13:28 UTC (permalink / raw) To: Dave Chinner; +Cc: rpeterso, linux-fsdevel, Christoph Hellwig, xfs On Tue, Jun 28, 2016 at 10:26:49AM +1000, Dave Chinner wrote: > Christoph, it look slike there's an ENOSPC+ENOMEM behavioural regression here. > generic/224 on my 1p/1GB RAM VM using a 1k lock size filesystem has > significantly different behaviour once ENOSPC is hit withi this patchset. Works fine on my 1k test setup with 4 CPUs and 2GB RAM. 1 CPU and 1GB RAM runs into the OOM killer, although I haven't checked if that was the case with the old code as well. I'll look into this more later today or tomorrow. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: iomap infrastructure and multipage writes V5 2016-06-28 13:28 ` Christoph Hellwig @ 2016-06-28 13:38 ` Christoph Hellwig 0 siblings, 0 replies; 14+ messages in thread From: Christoph Hellwig @ 2016-06-28 13:38 UTC (permalink / raw) To: Dave Chinner; +Cc: rpeterso, linux-fsdevel, Christoph Hellwig, xfs On Tue, Jun 28, 2016 at 03:28:39PM +0200, Christoph Hellwig wrote: > > generic/224 on my 1p/1GB RAM VM using a 1k lock size filesystem has > > significantly different behaviour once ENOSPC is hit withi this patchset. > > Works fine on my 1k test setup with 4 CPUs and 2GB RAM. 1 CPU and 1GB > RAM runs into the OOM killer, although I haven't checked if that was > the case with the old code as well. I'll look into this more later > today or tomorrow. It was doing fine before my changes - I'm going to investigate what caused the change in behavior. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: iomap infrastructure and multipage writes V5 2016-06-28 0:26 ` Dave Chinner 2016-06-28 13:28 ` Christoph Hellwig @ 2016-06-30 17:22 ` Christoph Hellwig 2016-06-30 23:16 ` Dave Chinner 2016-07-18 11:14 ` Dave Chinner 1 sibling, 2 replies; 14+ messages in thread From: Christoph Hellwig @ 2016-06-30 17:22 UTC (permalink / raw) To: Dave Chinner; +Cc: rpeterso, linux-fsdevel, xfs On Tue, Jun 28, 2016 at 10:26:49AM +1000, Dave Chinner wrote: > Christoph, it look slike there's an ENOSPC+ENOMEM behavioural regression here. > generic/224 on my 1p/1GB RAM VM using a 1k lock size filesystem has > significantly different behaviour once ENOSPC is hit withi this patchset. > > It ends up with an endless stream of errors like this: I've spent some time trying to reproduce this. I'm actually getting the OOM killer almost reproducibly for for-next without the iomap patches as well when just using 1GB of mem. 1400 MB is the minimum at which I can reproducibly finish the test with either code base. But with the 1400 MB setup I see a few interesting things. Even with the baseline, no-iomap case I see a few errors in the log: [ 70.407465] Filesystem "vdc": reserve blocks depleted! Consider increasing reserve pool size. [ 70.195645] XFS (vdc): page discard on page ffff88005682a988, inode 0xd3, offset 761856. [ 70.408079] Buffer I/O error on dev vdc, logical block 1048513, lost async page write [ 70.408598] Buffer I/O error on dev vdc, logical block 1048514, lost async page write 27s With iomap I also see the spew of page discard errors you see, but while I see a lot of them, the test still finishes after a reasonable time, just a few seconds more than the pre-iomap baseline. I also see the reserve block depleted message in this case. Digging into the reserve block depleted message - it seems we have too many parallel iomap_allocate transactions going on. I suspect this might be because the writeback code will not finish a writeback context if we have multiple blocks inside a page, which can happen easily for this 1k ENOSPC setup. I've not had time to fully check if this is what really happens, but I did a quick hack (see below) to only allocate 1k at a time in iomap_begin, and with that generic/224 finishes without the warning spew. Of course this isn't a real fix, and I need to fully understand what's going on in writeback due to different allocation / dirtying patterns from the iomap change. diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index 620fc91..d9afba2 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -1018,7 +1018,7 @@ xfs_file_iomap_begin( * Note that the values needs to be less than 32-bits wide until * the lower level functions are updated. */ - length = min_t(loff_t, length, 1024 * PAGE_SIZE); + length = min_t(loff_t, length, 1024); if (xfs_get_extsz_hint(ip)) { /* * xfs_iomap_write_direct() expects the shared lock. It _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: iomap infrastructure and multipage writes V5 2016-06-30 17:22 ` Christoph Hellwig @ 2016-06-30 23:16 ` Dave Chinner 2016-07-18 11:14 ` Dave Chinner 1 sibling, 0 replies; 14+ messages in thread From: Dave Chinner @ 2016-06-30 23:16 UTC (permalink / raw) To: Christoph Hellwig; +Cc: rpeterso, linux-fsdevel, xfs On Thu, Jun 30, 2016 at 07:22:39PM +0200, Christoph Hellwig wrote: > On Tue, Jun 28, 2016 at 10:26:49AM +1000, Dave Chinner wrote: > > Christoph, it look slike there's an ENOSPC+ENOMEM behavioural regression here. > > generic/224 on my 1p/1GB RAM VM using a 1k lock size filesystem has > > significantly different behaviour once ENOSPC is hit withi this patchset. > > > > It ends up with an endless stream of errors like this: > > I've spent some time trying to reproduce this. I'm actually getting > the OOM killer almost reproducible for for-next without the iomap > patches as well when just using 1GB of mem. 1400 MB is the minimum > I can reproducibly finish the test with either code base. > > But with the 1400 MB setup I see a few interesting things. Even > with the baseline, no-iomap case I see a few errors in the log: > > [ 70.407465] Filesystem "vdc": reserve blocks depleted! Consider increasing > reserve pool > size. > [ 70.195645] XFS (vdc): page discard on page ffff88005682a988, inode 0xd3, offset 761856. > [ 70.408079] Buffer I/O error on dev vdc, logical block 1048513, lost async > page write > [ 70.408598] Buffer I/O error on dev vdc, logical block 1048514, lost async > page write > 27s > > With iomap I also see the spew of page discard errors your see, but while > I see a lot of them, the rest still finishes after a reasonable time, > just a few seconds more than the pre-iomap baseline. I also see the > reserve block depleted message in this case. The reserve block pool depleted message is normal for me in this test. We're throwing a thousand concurrent processes at the filesystem at ENOSPC, and so the metadata reservation for the delayed allocation totals quite a lot. We only reserve 8192 blocks for the reserve pool, so a delalloc reservation for one page on each file (4 blocks per page, which means a couple of blocks for the metadata reservation via the indlen calculation) is going to consume the reserve pool quite quickly if the up front reservation overshoots the XFS_ALLOC_SET_ASIDE() ENOSPC threshold. > Digging into the reserve block depleted message - it seems we have > too many parallel iomap_allocate transactions going on. I suspect > this might be because the writeback code will not finish a writeback > context if we have multiple blocks inside a page, which can > happen easily for this 1k ENOSPC setup. Right - this test has regularly triggered that warning on this particular test setup for me - it's not something new to the iomap patchset. > I've not had time to fully > check if this is what really happens, but I did a quick hack (see below) > to only allocate 1k at a time in iomap_begin, and with that generic/224 > finishes without the warning spew. Of course this isn't a real fix, > and I need to fully understand what's going on in writeback due to > different allocation / dirtying patterns from the iomap change. Which tends to indicate that the multi-block allocation has a larger indlen reservation, and that's what is causing the code to hit whatever edge-case that is leading to it not recovering. However, I'm still wondering how we are not throwing ENOSPC back to userspace at XFS_ALLOC_SET_ASIDE limits. Cheers, Dave. 
-- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: iomap infrastructure and multipage writes V5 2016-06-30 17:22 ` Christoph Hellwig 2016-06-30 23:16 ` Dave Chinner @ 2016-07-18 11:14 ` Dave Chinner 2016-07-18 11:18 ` Dave Chinner 2016-07-19 3:50 ` Christoph Hellwig 1 sibling, 2 replies; 14+ messages in thread From: Dave Chinner @ 2016-07-18 11:14 UTC (permalink / raw) To: Christoph Hellwig; +Cc: rpeterso, linux-fsdevel, xfs On Thu, Jun 30, 2016 at 07:22:39PM +0200, Christoph Hellwig wrote: > On Tue, Jun 28, 2016 at 10:26:49AM +1000, Dave Chinner wrote: > > Christoph, it look slike there's an ENOSPC+ENOMEM behavioural regression here. > > generic/224 on my 1p/1GB RAM VM using a 1k lock size filesystem has > > significantly different behaviour once ENOSPC is hit withi this patchset. > > > > It ends up with an endless stream of errors like this: > > I've spent some time trying to reproduce this. I'm actually getting > the OOM killer almost reproducible for for-next without the iomap > patches as well when just using 1GB of mem. 1400 MB is the minimum > I can reproducibly finish the test with either code base. > > But with the 1400 MB setup I see a few interesting things. Even > with the baseline, no-iomap case I see a few errors in the log: > > [ 70.407465] Filesystem "vdc": reserve blocks depleted! Consider increasing > reserve pool > size. > [ 70.195645] XFS (vdc): page discard on page ffff88005682a988, inode 0xd3, offset 761856. > [ 70.408079] Buffer I/O error on dev vdc, logical block 1048513, lost async > page write > [ 70.408598] Buffer I/O error on dev vdc, logical block 1048514, lost async > page write > 27s > > With iomap I also see the spew of page discard errors your see, but while > I see a lot of them, the rest still finishes after a reasonable time, > just a few seconds more than the pre-iomap baseline. I also see the > reserve block depleted message in this case. > > Digging into the reserve block depleted message - it seems we have > too many parallel iomap_allocate transactions going on. I suspect > this might be because the writeback code will not finish a writeback > context if we have multiple blocks inside a page, which can > happen easily for this 1k ENOSPC setup. I've not had time to fully > check if this is what really happens, but I did a quick hack (see below) > to only allocate 1k at a time in iomap_begin, and with that generic/224 > finishes without the warning spew. Of course this isn't a real fix, > and I need to fully understand what's going on in writeback due to > different allocation / dirtying patterns from the iomap change. Any progress here, Christoph? The current test run has been running generic/224 on the 1GB mem test Vm for almost 6 hours now, and it's still discarding pages. This doesn't always happen - sometimes it takes the normal amount of time to run, but every so often it falls into this "discard every page" loop and it takes hours to complete... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: iomap infrastructure and multipage writes V5 2016-07-18 11:14 ` Dave Chinner @ 2016-07-18 11:18 ` Dave Chinner 2016-07-31 19:19 ` Christoph Hellwig 2016-07-19 3:50 ` Christoph Hellwig 1 sibling, 1 reply; 14+ messages in thread From: Dave Chinner @ 2016-07-18 11:18 UTC (permalink / raw) To: Christoph Hellwig; +Cc: rpeterso, linux-fsdevel, xfs On Mon, Jul 18, 2016 at 09:14:00PM +1000, Dave Chinner wrote: > On Thu, Jun 30, 2016 at 07:22:39PM +0200, Christoph Hellwig wrote: > > On Tue, Jun 28, 2016 at 10:26:49AM +1000, Dave Chinner wrote: > > > Christoph, it look slike there's an ENOSPC+ENOMEM behavioural regression here. > > > generic/224 on my 1p/1GB RAM VM using a 1k lock size filesystem has > > > significantly different behaviour once ENOSPC is hit withi this patchset. > > > > > > It ends up with an endless stream of errors like this: > > > > I've spent some time trying to reproduce this. I'm actually getting > > the OOM killer almost reproducible for for-next without the iomap > > patches as well when just using 1GB of mem. 1400 MB is the minimum > > I can reproducibly finish the test with either code base. > > > > But with the 1400 MB setup I see a few interesting things. Even > > with the baseline, no-iomap case I see a few errors in the log: > > > > [ 70.407465] Filesystem "vdc": reserve blocks depleted! Consider increasing > > reserve pool > > size. > > [ 70.195645] XFS (vdc): page discard on page ffff88005682a988, inode 0xd3, offset 761856. > > [ 70.408079] Buffer I/O error on dev vdc, logical block 1048513, lost async > > page write > > [ 70.408598] Buffer I/O error on dev vdc, logical block 1048514, lost async > > page write > > 27s > > > > With iomap I also see the spew of page discard errors your see, but while > > I see a lot of them, the rest still finishes after a reasonable time, > > just a few seconds more than the pre-iomap baseline. I also see the > > reserve block depleted message in this case. > > > > Digging into the reserve block depleted message - it seems we have > > too many parallel iomap_allocate transactions going on. I suspect > > this might be because the writeback code will not finish a writeback > > context if we have multiple blocks inside a page, which can > > happen easily for this 1k ENOSPC setup. I've not had time to fully > > check if this is what really happens, but I did a quick hack (see below) > > to only allocate 1k at a time in iomap_begin, and with that generic/224 > > finishes without the warning spew. Of course this isn't a real fix, > > and I need to fully understand what's going on in writeback due to > > different allocation / dirtying patterns from the iomap change. > > Any progress here, Christoph? The current test run has been running > generic/224 on the 1GB mem test Vm for almost 6 hours now, and it's > still discarding pages. This doesn't always happen - sometimes it > takes the normal amount of time to run, but every so often it falls > into this "discard every page" loop and it takes hours to > complete... .... and I've now got a 16p/16GB RAM VM stuck in this loop in generic/224, so it's not limited to low memory machines.... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: iomap infrastructure and multipage writes V5 2016-07-18 11:18 ` Dave Chinner @ 2016-07-31 19:19 ` Christoph Hellwig 2016-08-01 0:16 ` Dave Chinner 2016-08-02 23:42 ` Dave Chinner 0 siblings, 2 replies; 14+ messages in thread From: Christoph Hellwig @ 2016-07-31 19:19 UTC (permalink / raw) To: Dave Chinner; +Cc: rpeterso, linux-fsdevel, Christoph Hellwig, xfs Another quiet weekend trying to debug this, and only minor progress. The biggest difference in traces of the old vs new code is that we manage to allocate much bigger delalloc reservations at a time in xfs_bmapi_delay -> xfs_bmapi_reserve_delalloc. The old code always went for a single FSB, which also meant allocating an indlen of 7 FSBs. With the iomap code we always allocate at least 4 FSBs (aka a page), and sometimes 8 or 12. All of these still need 7 FSBs for the worst case indirect blocks. So what happens here is that in an ENOSPC case we manage to allocate more actual delalloc blocks before hitting ENOSPC - notwithstanding that the old case would immediately release them a little later in xfs_bmap_add_extent_hole_delay after merging the delalloc extents. On the writeback side I don't see too many changes either. We'll eventually run out of blocks when allocating the transaction in xfs_iomap_write_allocate because the reserved pool is too small. The only real difference to before is that under the ENOSPC / out of memory case we have allocated between 4 and 12 times more blocks, so we have to clean up 4 to 12 times as much while write_cache_pages continues iterating over these dirty delalloc blocks. For me this happens ~6 times as much as before, but I still don't manage to hit an endless loop. Now after spending this much time I've started wondering why we even reserve blocks in xfs_iomap_write_allocate - after all we've reserved space for the actual data blocks and the indlen worst case in xfs_bmapi_reserve_delalloc. And in fact a little hack to drop that reservation seems to solve both the root cause (depleted reserved pool) and the cleanup mess. I just haven't spent enough time to convince myself that it's actually safe, and in fact looking at the allocator makes me think it only works by accident currently despite generally positive test results. Here is the quick patch if anyone wants to chime in: diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index 620fc91..67c317f 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -717,7 +717,7 @@ xfs_iomap_write_allocate( nimaps = 0; while (nimaps == 0) { - nres = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK); + nres = 0; // XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK); error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, nres, 0, XFS_TRANS_RESERVE, &tp); _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 14+ messages in thread
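One way to put rough numbers on "more actual delalloc blocks before hitting ENOSPC" is to compare how much of each reservation is real dirty data under the two schemes described above (1 FSB vs 4-12 FSBs of data, each with a worst-case indlen of 7 FSBs). The sketch below is back-of-the-envelope arithmetic only; it ignores the delalloc extent merging mentioned above, which lets unused indlen be handed back later.

/* Back-of-the-envelope arithmetic only, using the figures quoted above. */
#include <stdio.h>

static void dirty_share(const char *label, long data_fsbs, long indlen_fsbs)
{
    /* of every FSB reserved at write() time, how much is dirty data? */
    printf("%s: %2ld data + %ld indlen FSBs -> %2.0f%% of the reservation is delalloc data\n",
           label, data_fsbs, indlen_fsbs,
           100.0 * data_fsbs / (data_fsbs + indlen_fsbs));
}

int main(void)
{
    dirty_share("old code, 1 FSB at a time  ", 1, 7);
    dirty_share("iomap code, 4 FSBs (a page)", 4, 7);
    dirty_share("iomap code, 12 FSBs        ", 12, 7);
    return 0;
}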
* Re: iomap infrastructure and multipage writes V5 2016-07-31 19:19 ` Christoph Hellwig @ 2016-08-01 0:16 ` Dave Chinner 2016-08-02 23:42 ` Dave Chinner 1 sibling, 0 replies; 14+ messages in thread From: Dave Chinner @ 2016-08-01 0:16 UTC (permalink / raw) To: Christoph Hellwig; +Cc: rpeterso, linux-fsdevel, xfs On Sun, Jul 31, 2016 at 09:19:00PM +0200, Christoph Hellwig wrote: > Another quiet weekend trying to debug this, and only minor progress. > > The biggest different in traces of the old vs new code is that we manage > to allocate much bigger delalloc reservations at a time in xfs_bmapi_delay > -> xfs_bmapi_reserve_delalloc. The old code always went for a single FSB, > which also meant allocating an indlen of 7 FSBs. With the iomap code > we always allocate at least 4FSB (aka a page), and sometimes 8 or 12. > All of these still need 7 FSBs for the worst case indirect blocks. So > what happens here is that in an ENOSPC case we manage to allocate more > actual delalloc blocks before hitting ENOSPC - notwithstanding that the > old case would immediately release them a little later in > xfs_bmap_add_extent_hole_delay after merging the delalloc extents. > > On the writeback side I don't see to many changes either. We'll > eventually run out of blocks when allocating the transaction in > xfs_iomap_write_allocate because the reserved pool is too small. Yup, that's exactly what generic/224 is testing - it sets the reserve pool to 4 blocks so it does get exhausted very quickly and then it exposes the underlying ENOSPC issue. Most users won't ever see reserve pool exhaustion, which is why I didn't worry too much about solving this before merging. > The > only real difference to before is that under the ENOSPC / out of memory > case we have allocated between 4 to 12 times more blocks, so we have > to clean up 4 to 12 times as much while write_cache_pages continues > iterating over these dirty delalloc blocks. For me this happens > ~6 times as much as before, but I still don't manage to hit an > endless loop. Ok, I'd kind of got that far myself, but then never really got much further than that - I suspected some kind of "split a delalloc extent too many times, run out of reservation" type of issue, but couldn't isolate such a problem in any of the traces. > Now after spending this much time I've started wondering why we even > reserve blocks in xfs_iomap_write_allocate - after all we've reserved > space for the actual data blocks and the indlen worst case in > xfs_bmapi_reserve_delalloc. And in fact a little hack to drop that > reservation seems to solve both the root cause (depleted reserved pool) > and the cleanup mess. I just haven't spend enought time to convince > myself that it's actually safe, and in fact looking at the allocator > makes me thing it only works by accident currently despite generally > postive test results. Hmmm, interesting. I didn't think about that. I have been looking at this exact code as a result of rmap ENOSPC problems, and now that you mention this, I can't see why we'd need a block reservation here for delalloc conversion, either. Time for more <reverb on> Adventures in Code Archeology! <reverb off> /me digs First stop - just after we removed the behaviour layer. Only caller of xfs_iomap_write_allocate was: writepage xfs_map_block(BMAPI_ALLOCATE) // only for delalloc xfs_iomap xfs_iomap_write_allocate Which is essentially the same single caller we have now, just with much less indirection. 
Looking at the code before the behaviour layer removal, there was also an "xfs_iocore" abstraction, which abstracted inode locking, block mapping and allocation and a few other miscellaneous IO functions. This was so CXFS server could plug into the XFS IO path and intercept allocation requests on the CXFS client side. This leads me to think that the CXFS server could call xfs_iomap_write_allocate() directly. Whether or not the server kept the delalloc reservation or not, I'm not sure. So, let's go back to before this abstraction was in place. Takes us back to before the linux port was started, back to pure Irix code.... .... and there's no block reservation done for delalloc conversion. Ok, so here's the commit that introduced the block reservation for delalloc conversion: commit 6706e96e324a2fa9636e93facfd4b7fbbf5b85f8 Author: Glen Overby <overby@sgi.com> Date: Tue Mar 4 20:15:43 2003 +0000 Add error reporting calls in error paths that return EFSCORRUPTED Yup, hidden deep inside the commit that added the XFS_CORRUPTION_ERROR and XFS_ERROR_REPORT macros for better error reporting was this unexplained, uncommented hunk: @@ -562,9 +563,19 @@ xfs_iomap_write_allocate( nimaps = 0; while (nimaps == 0) { tp = xfs_trans_alloc(mp, XFS_TRANS_STRAT_WRITE); - error = xfs_trans_reserve(tp, 0, XFS_WRITE_LOG_RES(mp), + nres = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK); + error = xfs_trans_reserve(tp, nres, + XFS_WRITE_LOG_RES(mp), 0, XFS_TRANS_PERM_LOG_RES, XFS_WRITE_LOG_COUNT); + + if (error == ENOSPC) { + error = xfs_trans_reserve(tp, 0, + XFS_WRITE_LOG_RES(mp), + 0, + XFS_TRANS_PERM_LOG_RES, + XFS_WRITE_LOG_COUNT); + } if (error) { xfs_trans_cancel(tp, 0); return XFS_ERROR(error); It's completely out of place compared to the rest of the patch which didn't change any code logic or algorithms - it only added error reporting macros. Hence THIS looks like it may have been an accidental/unintended change in the commit. The ENOSPC check here went away in 2007 when I expanded the reserve block pool and added XFS_TRANS_RESERVE to this function to allow it dip into the reserve pool (commit bdebc6a4 "Prevent ENOSPC from aborting transactions that need to succeed"). I didn't pick up on the history back then, I bet I was just focussed on fixing the ENOSPC issue.... So, essentially, by emptying the reserve block pool, we've opened this code back up to whatever underlying ENOSPC issue it had prior to bdebc6a4. And looking back at 6706e96e, I can only guess that the block reservation was added for a CXFS use case, because XFS still only called this from a single place - delalloc conversion. Christoph - it does look like you've found the problem - I agree with your analysis that the delalloc already reserves space for the bmbt blocks in the indlen reservation, and that adding another reservation for bmbt blocks at transaction allocation makes no obvious sense. The code history doesn't explain it - it only raises more questions as to why this was done - it may even have been an accidental change in the first place... 
> Here is the quick patch if anyone wants to chime in: > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > index 620fc91..67c317f 100644 > --- a/fs/xfs/xfs_iomap.c > +++ b/fs/xfs/xfs_iomap.c > @@ -717,7 +717,7 @@ xfs_iomap_write_allocate( > > nimaps = 0; > while (nimaps == 0) { > - nres = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK); > + nres = 0; // XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK); > > error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, nres, > 0, XFS_TRANS_RESERVE, &tp); Let me go test it, see what comes up. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 14+ messages in thread
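To spell out the double accounting being argued above in toy form: the delalloc reservation already includes a worst-case allowance for bmbt blocks (the indlen), and the conversion-time transaction then asks for bmbt space again. The figures below are made up for illustration; they are not the real indlen or XFS_EXTENTADD_SPACE_RES() values.

/* Toy accounting of the reservations being discussed; figures are made up. */
#include <stdio.h>

int main(void)
{
    long data_fsbs   = 4;   /* delalloc data blocks for one dirty page */
    long indlen_fsbs = 7;   /* worst-case bmbt allowance taken at write() time */
    long extadd_fsbs = 4;   /* bmbt space asked for again at conversion time */

    printf("reserved up front (data + indlen): %ld FSBs\n",
           data_fsbs + indlen_fsbs);
    printf("reserved again for the conversion: %ld FSBs\n", extadd_fsbs);
    printf("total held for one page:           %ld FSBs (bmbt space counted twice)\n",
           data_fsbs + indlen_fsbs + extadd_fsbs);
    return 0;
}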
* Re: iomap infrastructure and multipage writes V5 2016-07-31 19:19 ` Christoph Hellwig 2016-08-01 0:16 ` Dave Chinner @ 2016-08-02 23:42 ` Dave Chinner 1 sibling, 0 replies; 14+ messages in thread From: Dave Chinner @ 2016-08-02 23:42 UTC (permalink / raw) To: Christoph Hellwig; +Cc: rpeterso, linux-fsdevel, xfs On Sun, Jul 31, 2016 at 09:19:00PM +0200, Christoph Hellwig wrote: > Now after spending this much time I've started wondering why we even > reserve blocks in xfs_iomap_write_allocate - after all we've reserved > space for the actual data blocks and the indlen worst case in > xfs_bmapi_reserve_delalloc. And in fact a little hack to drop that > reservation seems to solve both the root cause (depleted reserved pool) > and the cleanup mess. I just haven't spend enought time to convince > myself that it's actually safe, and in fact looking at the allocator > makes me thing it only works by accident currently despite generally > postive test results. > > Here is the quick patch if anyone wants to chime in: > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > index 620fc91..67c317f 100644 > --- a/fs/xfs/xfs_iomap.c > +++ b/fs/xfs/xfs_iomap.c > @@ -717,7 +717,7 @@ xfs_iomap_write_allocate( > > nimaps = 0; > while (nimaps == 0) { > - nres = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK); > + nres = 0; // XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK); > > error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, nres, > 0, XFS_TRANS_RESERVE, &tp); > This solves the problem for me, and from history appears to be the right thing to do. Christoph, can you send a proper patch for this? Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: iomap infrastructure and multipage writes V5 2016-07-18 11:14 ` Dave Chinner 2016-07-18 11:18 ` Dave Chinner @ 2016-07-19 3:50 ` Christoph Hellwig 1 sibling, 0 replies; 14+ messages in thread From: Christoph Hellwig @ 2016-07-19 3:50 UTC (permalink / raw) To: Dave Chinner; +Cc: rpeterso, linux-fsdevel, Christoph Hellwig, xfs On Mon, Jul 18, 2016 at 09:14:00PM +1000, Dave Chinner wrote: > Any progress here, Christoph? Nothing after the last posting yet. Shortly after that I left for Japan for about two weeks of hiking and a conference, which doesn't help my ability to spend some quiet hours with the code. It's on top of my priority list for non-trivial things and I should be able to get back to it tomorrow after finishing the next conference today. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 14+ messages in thread