From: Filipe Manana <fdmanana@kernel.org>
To: Naohiro Aota <Naohiro.Aota@wdc.com>
Cc: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>,
"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>,
David Sterba <dsterba@suse.com>,
Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>,
Filipe Manana <fdmanana@suse.com>,
Josef Bacik <josef@toxicpanda.com>
Subject: Re: [PATCH] btrfs: ensure pages are unlocked on cow_file_range() failure
Date: Mon, 13 Jun 2022 16:33:18 +0100 [thread overview]
Message-ID: <20220613153318.GA3829367@falcondesktop> (raw)
In-Reply-To: <20220613122110.6pg6q274wy4er7ri@naota-xeon>
On Mon, Jun 13, 2022 at 12:21:11PM +0000, Naohiro Aota wrote:
> On Thu, May 19, 2022 at 02:38:50PM +0100, Filipe Manana wrote:
> > On Thu, May 19, 2022 at 12:24:00PM +0000, Johannes Thumshirn wrote:
> > > What's the status of this patch? It fixes actual errors
> > > (hung_tasks) for me.
> >
> > Well, there was previous review about it, and nothing was addressed in the
> > meanwhile.
> >
> > It may happen to fix the hang for some test case for fstests on zoned mode.
> > But there's a fundamental problem with the error handling as Josef pointed
> > before - this is an existing problem, and not a problem with this patch or
> > exclusive to zoned mode.
> >
> > The problem is if we fail on the second iteration of the while loop, when
> > reserving an extent for example, we leave the loop and with an ordered
> > extent created in the previous iteration, and return an error to the caller.
> > After that we end up never submitting a bio for that ordered extent's range,
> > which means we end up with an ordered extent that will never be completed.
> > So something like an fsync after that will hang forever for example, or
> > anything else calling btrfs_wait_ordered_range().
>
> I'm recently revisiting this patch and I think this is not true. The
> ordered extents instantiated in the previous loops (before an error by
> btrfs_reserve_extent) are then cleaned up by
> btrfs_cleanup_ordered_extents() calling in btrfs_run_delalloc_range(), no?
> So, btrfs_wait_ordered_range() never hang for the ordered extents.
You are right, we are doing the proper cleanup of ordered extents.
I missed that we had this function btrfs_cleanup_ordered_extents() that is
run for all delalloc cases. In older days we had that only for the error
path of direct IO writes, but it was later refactored and added for the
dealloc cases too. And I had the very old days in my mind, sorry.
>
> Well, there is a path not calling btrfs_cleanup_ordered_extents(), which is
> submit_uncompressed_range(). So, this path needs to call it.
Sounds right.
>
> Or should we do the clean-up within cow_file_range()? It is more
> understandable to clean the extents where it is created in the error
> case. But then, run_delalloc_nocow() should also clean the ordered extents
> created by itself (BTRFS_ORDERED_PREALLOC or BTRFS_ORDERED_NOCOW extents).
I think it's fine as it is.
In case it confuses someone, we could just leave a comment at cow_file_range(),
etc, telling we do cleanup of previously created ordered extents at
btrfs_run_delalloc_range().
>
> > So on error we need to go through previously created ordered extents, set
> > the IOERR flag on them, complete them to wake up any waiters and remove it,
> > which also takes care or adding the reserved extent back to the free space
> > cache/tree.
>
> I think this is exactly done by btrfs_cleanup_ordered_extents() +
> end_extent_writepage() in __extent_writepage(). No?
>
> So, in the end, we just need to unlock the pages except locked_page in the
> error case.
So I'm back to my initial review, where everything seems fine except for
the small mistake in the comment.
Thanks.
>
> >
> >
> > >
> > > On 13/12/2021 04:43, Naohiro Aota wrote:
> > > > There is a hung_task report regarding page lock on zoned btrfs like below.
> > > >
> > > > https://github.com/naota/linux/issues/59
> > > >
> > > > [ 726.328648] INFO: task rocksdb:high0:11085 blocked for more than 241 seconds.
> > > > [ 726.329839] Not tainted 5.16.0-rc1+ #1
> > > > [ 726.330484] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > [ 726.331603] task:rocksdb:high0 state:D stack: 0 pid:11085 ppid: 11082 flags:0x00000000
> > > > [ 726.331608] Call Trace:
> > > > [ 726.331611] <TASK>
> > > > [ 726.331614] __schedule+0x2e5/0x9d0
> > > > [ 726.331622] schedule+0x58/0xd0
> > > > [ 726.331626] io_schedule+0x3f/0x70
> > > > [ 726.331629] __folio_lock+0x125/0x200
> > > > [ 726.331634] ? find_get_entries+0x1bc/0x240
> > > > [ 726.331638] ? filemap_invalidate_unlock_two+0x40/0x40
> > > > [ 726.331642] truncate_inode_pages_range+0x5b2/0x770
> > > > [ 726.331649] truncate_inode_pages_final+0x44/0x50
> > > > [ 726.331653] btrfs_evict_inode+0x67/0x480
> > > > [ 726.331658] evict+0xd0/0x180
> > > > [ 726.331661] iput+0x13f/0x200
> > > > [ 726.331664] do_unlinkat+0x1c0/0x2b0
> > > > [ 726.331668] __x64_sys_unlink+0x23/0x30
> > > > [ 726.331670] do_syscall_64+0x3b/0xc0
> > > > [ 726.331674] entry_SYSCALL_64_after_hwframe+0x44/0xae
> > > > [ 726.331677] RIP: 0033:0x7fb9490a171b
> > > > [ 726.331681] RSP: 002b:00007fb943ffac68 EFLAGS: 00000246 ORIG_RAX: 0000000000000057
> > > > [ 726.331684] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb9490a171b
> > > > [ 726.331686] RDX: 00007fb943ffb040 RSI: 000055a6bbe6ec20 RDI: 00007fb94400d300
> > > > [ 726.331687] RBP: 00007fb943ffad00 R08: 0000000000000000 R09: 0000000000000000
> > > > [ 726.331688] R10: 0000000000000031 R11: 0000000000000246 R12: 00007fb943ffb000
> > > > [ 726.331690] R13: 00007fb943ffb040 R14: 0000000000000000 R15: 00007fb943ffd260
> > > > [ 726.331693] </TASK>
> > > >
> > > > While we debug the issue, we found running fstests generic/551 on 5GB
> > > > non-zoned null_blk device in the emulated zoned mode also had a
> > > > similar hung issue.
> > > >
> > > > The hang occurs when cow_file_range() fails in the middle of
> > > > allocation. cow_file_range() called from do_allocation_zoned() can
> > > > split the give region ([start, end]) for allocation depending on
> > > > current block group usages. When btrfs can allocate bytes for one part
> > > > of the split regions but fails for the other region (e.g. because of
> > > > -ENOSPC), we return the error leaving the pages in the succeeded regions
> > > > locked. Technically, this occurs only when @unlock == 0. Otherwise, we
> > > > unlock the pages in an allocated region after creating an ordered
> > > > extent.
> > > >
> > > > Theoretically, the same issue can happen on
> > > > submit_uncompressed_range(). However, I could not make it happen even
> > > > if I modified the code to go always through
> > > > submit_uncompressed_range().
> > > >
> > > > Considering the callers of cow_file_range(unlock=0) won't write out
> > > > the pages, we can unlock the pages on error exit from
> > > > cow_file_range(). So, we can ensure all the pages except @locked_page
> > > > are unlocked on error case.
> > > >
> > > > In summary, cow_file_range now behaves like this:
> > > >
> > > > - page_started == 1 (return value)
> > > > - All the pages are unlocked. IO is started.
> > > > - unlock == 0
> > > > - All the pages except @locked_page are unlocked in any case
> > > > - unlock == 1
> > > > - On success, all the pages are locked for writing out them
> > > > - On failure, all the pages except @locked_page are unlocked
> > > >
next prev parent reply other threads:[~2022-06-13 18:40 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-12-13 3:43 [PATCH] btrfs: ensure pages are unlocked on cow_file_range() failure Naohiro Aota
2021-12-13 10:14 ` Filipe Manana
2021-12-14 12:21 ` Naohiro Aota
2021-12-13 16:08 ` Josef Bacik
2021-12-14 13:04 ` Naohiro Aota
2022-05-19 12:24 ` Johannes Thumshirn
2022-05-19 13:38 ` Filipe Manana
2022-05-19 13:51 ` Johannes Thumshirn
2022-05-19 16:18 ` David Sterba
2022-06-13 12:21 ` Naohiro Aota
2022-06-13 15:33 ` Filipe Manana [this message]
2022-06-14 3:00 ` Naohiro Aota
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20220613153318.GA3829367@falcondesktop \
--to=fdmanana@kernel.org \
--cc=Johannes.Thumshirn@wdc.com \
--cc=Naohiro.Aota@wdc.com \
--cc=dsterba@suse.com \
--cc=fdmanana@suse.com \
--cc=josef@toxicpanda.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=shinichiro.kawasaki@wdc.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox