Linux Btrfs filesystem development
 help / color / mirror / Atom feed
From: Anand Jain <anand.jain@oracle.com>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>, Qu Wenruo <wqu@suse.com>,
	linux-btrfs@vger.kernel.org, fstests@vger.kernel.org
Cc: axboe@kernel.dk
Subject: Re: [PATCH v2] fstests: fsstress: wait interrupted aio to finish
Date: Wed, 23 Aug 2023 10:37:30 +0800	[thread overview]
Message-ID: <617bdb2b-7220-80f3-d31d-580a8901763c@oracle.com> (raw)
In-Reply-To: <e1741eb5-db9e-4da6-9d0d-dbc09cb2b66d@gmx.com>

On 22/8/23 13:16, Qu Wenruo wrote:
> 
> 
> On 2023/8/22 12:05, Anand Jain wrote:
>> On 22/08/2023 07:01, Qu Wenruo wrote:
>>> [BUG]
>>> There is a very low chance to hit data csum mismatch (caught by scrub)
>>> during test case btrfs/06[234567].
>>>
>>> After some extra digging, it turns out that plain fsstress itself is
>>> enough to cause the problem:
>>>
>>> ```
>>> workload()
>>> {
>>>     mkfs.btrfs -f -m single -d single --csum sha256 $dev1 > /dev/null
>>>     mount $dev1 $mnt
>>>
>>>     #$fsstress -p 10 -n 1000 -w -d $mnt
>>>     umount $mnt
>>>     btrfs check --check-data-csum $dev1 || fail
>>> }
>>>
>>> runtime=1024
>>> for (( i = 0; i < $runtime; i++ )); do
>>>     echo "=== $i / $runtime ==="
>>>     workload
>>> done
>>> ```
>>>
>>> Inside a VM which has only 6 cores, above script can trigger with 1/20
>>> possibility.
>>>
>>> [CAUSE]
>>> Locally I got a much smaller workload to reproduce:
>>>
>>>     $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt -v > /tmp/fsstress
>>>
>>> With extra kernel trace_prinkt() on the buffered/direct writes.
>>>
>>> It turns out that the following direct write is always the cause:
>>>
>>>    btrfs_do_write_iter: r/i=5/283 buffered fileoff=708608(709121) 
>>> len=12288(7712)
>>>
>>>    btrfs_do_write_iter: r/i=5/283 direct fileoff=8192(8192) 
>>> len=73728(73728) <<<<<
>>>
>>>    btrfs_do_write_iter: r/i=5/283 direct fileoff=589824(589824) 
>>> len=16384(16384)
>>>
>>> With the involved byte number, it's easy to pin down the fsstress
>>> opeartion:
>>>
>>>   0/31: writev d0/f3[285 2 0 0 296 1457078] [709121,8,964] 0
>>>   0/32: chown d0/f2 308134/1763236 0
>>>
>>>   0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[285 2 308134 
>>> 1763236 320 1457078] return 25, fallback to stat()
>>>   0/33: awrite - io_getevents failed -4 <<<<
>>>
>>>   0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[285 2 308134 1763236 
>>> 320 1457078] return 25, fallback to stat()
>>>
>>> Note the 0/33, when the data csum mismatch triggered, it always fail
>>> with -4 (-EINTR).
>>>
>>> It looks like with lucky enough concurrency, we can get to the following
>>> situation inside fsstress:
>>>
>>>            Process A                 |               Process B
>>> -----------------------------------+---------------------------------------
>>>   do_aio_rw()                        |
>>>   |- io_sumit();                     |
>>>   |- io_get_events();                |
>>>   |  Returned -EINTR, but IO hasn't  |
>>>   |  finished.                       |
>>>   `- free(buf);                      | malloc();
>>>                                      |  Got the same memory of @buf from
>>>                                      |  thread A.
>>>                                      | Modify the memory
>>>                                      | Now the buffer is changed while
>>>                                      | still under IO
>>>
>>> This is the typical buffer modification during direct IO, which is going
>>> to cause csum mismatch for btrfs, and btrfs properly detects it.
>>>
>>> This is the direct cause of the problem.
>>>
>>> The root cause is that, io_uring would use signals to handle
>>> submission/completion of IOs.
>>> Thus io_uring operations would interrupt AIO operations, thus causing
>>> the above problem.
>>>
>>> [FIX]
>>> To fix the problem, we can just retry io_getevents() so that we can
>>> properly wait for the IO.
>>>
>>> This prevents us to modify the IO buffer before writeback really
>>> finishes.
>>>
>>> With this fixes, I can no longer reproduce the data corruption.
>>>
>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>> ---
>>> Changelog:
>>> v2:
>>> - Fix all call sites of io_getevents()
>>
>> Should io_getevents() in aio-stress.c and fsx.c also be using 
>> io_get_single_event()?
> 
> Nope, this problem is caused by the fact that io uring is using signal 
> to notify the completion, which would interrupt io_getevents().
> 
> For aio-stress.c, there is no io uring utilized at all, thus the signals 
> are real signals provided by users.
> Although it's still possible that user provided signals interrupt the 
> operation and cause the corruption, it's not really a bit concern AFAIK.
> 
> For fsx, io uring and aio are exclusive to each other, thus it's the 
> same as aio-stress.c.
> 

Okay, thanks.

Reviewed-by: Anand Jain <anand.jain@oracle.com>



  reply	other threads:[~2023-08-23  2:38 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-08-21 23:01 [PATCH v2] fstests: fsstress: wait interrupted aio to finish Qu Wenruo
2023-08-22  4:05 ` Anand Jain
2023-08-22  5:16   ` Qu Wenruo
2023-08-23  2:37     ` Anand Jain [this message]
2023-08-22 17:14 ` Jens Axboe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=617bdb2b-7220-80f3-d31d-580a8901763c@oracle.com \
    --to=anand.jain@oracle.com \
    --cc=axboe@kernel.dk \
    --cc=fstests@vger.kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=quwenruo.btrfs@gmx.com \
    --cc=wqu@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox