From: Anand Jain <anand.jain@oracle.com>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>, Qu Wenruo <wqu@suse.com>,
linux-btrfs@vger.kernel.org, fstests@vger.kernel.org
Cc: axboe@kernel.dk
Subject: Re: [PATCH v2] fstests: fsstress: wait interrupted aio to finish
Date: Wed, 23 Aug 2023 10:37:30 +0800 [thread overview]
Message-ID: <617bdb2b-7220-80f3-d31d-580a8901763c@oracle.com> (raw)
In-Reply-To: <e1741eb5-db9e-4da6-9d0d-dbc09cb2b66d@gmx.com>
On 22/8/23 13:16, Qu Wenruo wrote:
>
>
> On 2023/8/22 12:05, Anand Jain wrote:
>> On 22/08/2023 07:01, Qu Wenruo wrote:
>>> [BUG]
>>> There is a very low chance to hit data csum mismatch (caught by scrub)
>>> during test case btrfs/06[234567].
>>>
>>> After some extra digging, it turns out that plain fsstress itself is
>>> enough to cause the problem:
>>>
>>> ```
>>> workload()
>>> {
>>> mkfs.btrfs -f -m single -d single --csum sha256 $dev1 > /dev/null
>>> mount $dev1 $mnt
>>>
>>> #$fsstress -p 10 -n 1000 -w -d $mnt
>>> umount $mnt
>>> btrfs check --check-data-csum $dev1 || fail
>>> }
>>>
>>> runtime=1024
>>> for (( i = 0; i < $runtime; i++ )); do
>>> echo "=== $i / $runtime ==="
>>> workload
>>> done
>>> ```
>>>
>>> Inside a VM which has only 6 cores, above script can trigger with 1/20
>>> possibility.
>>>
>>> [CAUSE]
>>> Locally I got a much smaller workload to reproduce:
>>>
>>> $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt -v > /tmp/fsstress
>>>
>>> With extra kernel trace_prinkt() on the buffered/direct writes.
>>>
>>> It turns out that the following direct write is always the cause:
>>>
>>> btrfs_do_write_iter: r/i=5/283 buffered fileoff=708608(709121)
>>> len=12288(7712)
>>>
>>> btrfs_do_write_iter: r/i=5/283 direct fileoff=8192(8192)
>>> len=73728(73728) <<<<<
>>>
>>> btrfs_do_write_iter: r/i=5/283 direct fileoff=589824(589824)
>>> len=16384(16384)
>>>
>>> With the involved byte number, it's easy to pin down the fsstress
>>> opeartion:
>>>
>>> 0/31: writev d0/f3[285 2 0 0 296 1457078] [709121,8,964] 0
>>> 0/32: chown d0/f2 308134/1763236 0
>>>
>>> 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[285 2 308134
>>> 1763236 320 1457078] return 25, fallback to stat()
>>> 0/33: awrite - io_getevents failed -4 <<<<
>>>
>>> 0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[285 2 308134 1763236
>>> 320 1457078] return 25, fallback to stat()
>>>
>>> Note the 0/33, when the data csum mismatch triggered, it always fail
>>> with -4 (-EINTR).
>>>
>>> It looks like with lucky enough concurrency, we can get to the following
>>> situation inside fsstress:
>>>
>>> Process A | Process B
>>> -----------------------------------+---------------------------------------
>>> do_aio_rw() |
>>> |- io_sumit(); |
>>> |- io_get_events(); |
>>> | Returned -EINTR, but IO hasn't |
>>> | finished. |
>>> `- free(buf); | malloc();
>>> | Got the same memory of @buf from
>>> | thread A.
>>> | Modify the memory
>>> | Now the buffer is changed while
>>> | still under IO
>>>
>>> This is the typical buffer modification during direct IO, which is going
>>> to cause csum mismatch for btrfs, and btrfs properly detects it.
>>>
>>> This is the direct cause of the problem.
>>>
>>> The root cause is that, io_uring would use signals to handle
>>> submission/completion of IOs.
>>> Thus io_uring operations would interrupt AIO operations, thus causing
>>> the above problem.
>>>
>>> [FIX]
>>> To fix the problem, we can just retry io_getevents() so that we can
>>> properly wait for the IO.
>>>
>>> This prevents us to modify the IO buffer before writeback really
>>> finishes.
>>>
>>> With this fixes, I can no longer reproduce the data corruption.
>>>
>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>> ---
>>> Changelog:
>>> v2:
>>> - Fix all call sites of io_getevents()
>>
>> Should io_getevents() in aio-stress.c and fsx.c also be using
>> io_get_single_event()?
>
> Nope, this problem is caused by the fact that io uring is using signal
> to notify the completion, which would interrupt io_getevents().
>
> For aio-stress.c, there is no io uring utilized at all, thus the signals
> are real signals provided by users.
> Although it's still possible that user provided signals interrupt the
> operation and cause the corruption, it's not really a bit concern AFAIK.
>
> For fsx, io uring and aio are exclusive to each other, thus it's the
> same as aio-stress.c.
>
Okay, thanks.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
next prev parent reply other threads:[~2023-08-23 2:38 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-08-21 23:01 [PATCH v2] fstests: fsstress: wait interrupted aio to finish Qu Wenruo
2023-08-22 4:05 ` Anand Jain
2023-08-22 5:16 ` Qu Wenruo
2023-08-23 2:37 ` Anand Jain [this message]
2023-08-22 17:14 ` Jens Axboe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=617bdb2b-7220-80f3-d31d-580a8901763c@oracle.com \
--to=anand.jain@oracle.com \
--cc=axboe@kernel.dk \
--cc=fstests@vger.kernel.org \
--cc=linux-btrfs@vger.kernel.org \
--cc=quwenruo.btrfs@gmx.com \
--cc=wqu@suse.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox