[PATCH v3 0/3] Fix data corruption within preallocation

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v3 0/3] Fix data corruption within preallocation
@ 2024-07-16 14:41 Andrey Drobyshev
  2024-07-16 14:41 ` [PATCH v3 1/3] block: zero data data corruption using prealloc-filter Andrey Drobyshev
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Andrey Drobyshev @ 2024-07-16 14:41 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, hreitz, kwolf, vsementsov, pbonzini, eesposit,
	andrey.drobyshev, den

v2 -> v3:

  * Patch 2: modify test case.  Increase number of requests from 1024 to
    2048; make odd requests write actual data, while even requests cause
    write_zeroes operation;

  * Add patch 3: add scripts/filev2p.py for mapping of virtual file
    offsets to physical block device offsets.  The script was used to
    initially track down the data corruption problem, so it's included
    here.

v2: https://lists.nongnu.org/archive/html/qemu-block/2024-07/msg00413.html

Andrey Drobyshev (2):
  iotests/298: add testcase for async writes with preallocation filter
  scripts: add filev2p.py script for mapping virtual file offsets
    mapping

Denis V. Lunev (1):
  block: zero data data corruption using prealloc-filter

 block/preallocate.c        |   8 +-
 scripts/filev2p.py         | 311 +++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/298     |  49 ++++++
 tests/qemu-iotests/298.out |   4 +-
 4 files changed, 369 insertions(+), 3 deletions(-)
 create mode 100755 scripts/filev2p.py

-- 
2.39.3



^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v3 1/3] block: zero data data corruption using prealloc-filter
  2024-07-16 14:41 [PATCH v3 0/3] Fix data corruption within preallocation Andrey Drobyshev
@ 2024-07-16 14:41 ` Andrey Drobyshev
  2024-07-18 15:51   ` Kevin Wolf
  2024-07-16 14:41 ` [PATCH v3 2/3] iotests/298: add testcase for async writes with preallocation filter Andrey Drobyshev
  2024-07-16 14:41 ` [PATCH v3 3/3] scripts: add filev2p.py script for mapping virtual file offsets mapping Andrey Drobyshev
  2 siblings, 1 reply; 14+ messages in thread
From: Andrey Drobyshev @ 2024-07-16 14:41 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, hreitz, kwolf, vsementsov, pbonzini, eesposit,
	andrey.drobyshev, den

From: "Denis V. Lunev" <den@openvz.org>

We have observed that some clusters in the QCOW2 files are zeroed
while preallocation filter is used.

We are able to trace down the following sequence when prealloc-filter
is used:
    co=0x55e7cbed7680 qcow2_co_pwritev_task()
    co=0x55e7cbed7680 preallocate_co_pwritev_part()
    co=0x55e7cbed7680 handle_write()
    co=0x55e7cbed7680 bdrv_co_do_pwrite_zeroes()
    co=0x55e7cbed7680 raw_do_pwrite_zeroes()
    co=0x7f9edb7fe500 do_fallocate()

Here coroutine 0x55e7cbed7680 is being blocked waiting while coroutine
0x7f9edb7fe500 will finish with fallocate of the file area. OK. It is
time to handle next coroutine, which
    co=0x55e7cbee91b0 qcow2_co_pwritev_task()
    co=0x55e7cbee91b0 preallocate_co_pwritev_part()
    co=0x55e7cbee91b0 handle_write()
    co=0x55e7cbee91b0 bdrv_co_do_pwrite_zeroes()
    co=0x55e7cbee91b0 raw_do_pwrite_zeroes()
    co=0x7f9edb7deb00 do_fallocate()

The trouble comes here. Coroutine 0x55e7cbed7680 has not advanced
file_end yet and coroutine 0x55e7cbee91b0 will start fallocate() for
the same area. This means that if (once fallocate is started inside
0x7f9edb7deb00) original fallocate could end and the real write will
be executed. In that case write() request is handled at the same time
as fallocate().

The patch moves s->file_lock assignment before fallocate and that is
crucial. The idea is that all subsequent requests into the area
being preallocation will be issued as just writes without fallocate
to this area and they will not proceed thanks to overlapping
requests mechanics. If preallocation will fail, we will just switch
to the normal expand-by-write behavior and that is not a problem
except performance.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Tested-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
---
 block/preallocate.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/block/preallocate.c b/block/preallocate.c
index d215bc5d6d..ecf0aa4baa 100644
--- a/block/preallocate.c
+++ b/block/preallocate.c
@@ -383,6 +383,13 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
 
     want_merge_zero = want_merge_zero && (prealloc_start <= offset);
 
+    /*
+     * Assign file_end before making actual preallocation. This will ensure
+     * that next request performed while preallocation is in progress will
+     * be passed without preallocation.
+     */
+    s->file_end = prealloc_end;
+
     ret = bdrv_co_pwrite_zeroes(
             bs->file, prealloc_start, prealloc_end - prealloc_start,
             BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING | BDRV_REQ_NO_WAIT);
@@ -391,7 +398,6 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
         return false;
     }
 
-    s->file_end = prealloc_end;
     return want_merge_zero;
 }
 
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v3 2/3] iotests/298: add testcase for async writes with preallocation filter
  2024-07-16 14:41 [PATCH v3 0/3] Fix data corruption within preallocation Andrey Drobyshev
  2024-07-16 14:41 ` [PATCH v3 1/3] block: zero data data corruption using prealloc-filter Andrey Drobyshev
@ 2024-07-16 14:41 ` Andrey Drobyshev
  2024-08-05 12:04   ` Kevin Wolf
  2024-07-16 14:41 ` [PATCH v3 3/3] scripts: add filev2p.py script for mapping virtual file offsets mapping Andrey Drobyshev
  2 siblings, 1 reply; 14+ messages in thread
From: Andrey Drobyshev @ 2024-07-16 14:41 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, hreitz, kwolf, vsementsov, pbonzini, eesposit,
	andrey.drobyshev, den

The testcase simply creates a 64G image with 1M clusters, generates a list
of 1M aligned offsets and feeds aio_write commands with those offsets to
qemu-io run with '--aio native --nocache'.  Then we check the data
written at each of the offsets.  Before the previous commit this could
result into a race within the preallocation filter which would zeroize
some clusters after actually writing data to them.

Note: the test doesn't fail in 100% cases as there's a race involved,
but the failures are pretty consistent so it should be good enough for
detecting the problem.

Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
---
 tests/qemu-iotests/298     | 49 ++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/298.out |  4 ++--
 2 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/tests/qemu-iotests/298 b/tests/qemu-iotests/298
index 09c9290711..b7126e9e15 100755
--- a/tests/qemu-iotests/298
+++ b/tests/qemu-iotests/298
@@ -20,8 +20,10 @@
 
 import os
 import iotests
+import random
 
 MiB = 1024 * 1024
+GiB = MiB * 1024
 disk = os.path.join(iotests.test_dir, 'disk')
 overlay = os.path.join(iotests.test_dir, 'overlay')
 refdisk = os.path.join(iotests.test_dir, 'refdisk')
@@ -176,5 +178,52 @@ class TestTruncate(iotests.QMPTestCase):
         self.do_test('off', '150M')
 
 
+class TestPreallocAsyncWrites(iotests.QMPTestCase):
+    def setUp(self):
+        # Make sure we get reproducible write patterns on each run
+        random.seed(42)
+        iotests.qemu_img_create('-f', iotests.imgfmt, disk, '-o',
+                                f'cluster_size={MiB},lazy_refcounts=on',
+                                str(64 * GiB))
+
+    def tearDown(self):
+        os.remove(disk)
+
+    def test_prealloc_async_writes(self):
+        def gen_write_pattern():
+            n = 0
+            while True:
+                yield '-P 0xaa' if n else '-z'
+                n = 1 - n
+
+        def gen_read_pattern():
+            n = 0
+            while True:
+                yield '-P 0xaa' if n else '-P 0x00'
+                n = 1 - n
+
+        requests = 2048 # Number of write/read requests to feed to qemu-io
+        total_clusters = 64 * 1024 # 64G / 1M
+
+        wpgen = gen_write_pattern()
+        rpgen = gen_read_pattern()
+
+        offsets = random.sample(range(0, total_clusters), requests)
+        aio_write_cmds = [f'aio_write {next(wpgen)} {off}M 1M' for off in offsets]
+        read_cmds = [f'read {next(rpgen)} {off}M 1M' for off in offsets]
+
+        proc = iotests.QemuIoInteractive('--aio', 'native', '--nocache',
+                                         '--image-opts', drive_opts)
+        for cmd in aio_write_cmds:
+            proc.cmd(cmd)
+        proc.close()
+
+        proc = iotests.QemuIoInteractive('-f', iotests.imgfmt, disk)
+        for cmd in read_cmds:
+            out = proc.cmd(cmd)
+            self.assertFalse('Pattern verification failed' in str(out))
+        proc.close()
+
+
 if __name__ == '__main__':
     iotests.main(supported_fmts=['qcow2'], required_fmts=['preallocate'])
diff --git a/tests/qemu-iotests/298.out b/tests/qemu-iotests/298.out
index fa16b5ccef..6323079e08 100644
--- a/tests/qemu-iotests/298.out
+++ b/tests/qemu-iotests/298.out
@@ -1,5 +1,5 @@
-.............
+..............
 ----------------------------------------------------------------------
-Ran 13 tests
+Ran 14 tests
 
 OK
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v3 3/3] scripts: add filev2p.py script for mapping virtual file offsets mapping
  2024-07-16 14:41 [PATCH v3 0/3] Fix data corruption within preallocation Andrey Drobyshev
  2024-07-16 14:41 ` [PATCH v3 1/3] block: zero data data corruption using prealloc-filter Andrey Drobyshev
  2024-07-16 14:41 ` [PATCH v3 2/3] iotests/298: add testcase for async writes with preallocation filter Andrey Drobyshev
@ 2024-07-16 14:41 ` Andrey Drobyshev
  2024-08-05 12:29   ` Kevin Wolf
  2 siblings, 1 reply; 14+ messages in thread
From: Andrey Drobyshev @ 2024-07-16 14:41 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, hreitz, kwolf, vsementsov, pbonzini, eesposit,
	andrey.drobyshev, den

The script is basically a wrapper around "filefrag" utility.  This might
be used to map virtual offsets within the file to the underlying block
device offsets.  In addition, a chunk size might be specified, in which
case a list of such mappings will be obtained:

$ scripts/filev2p.py -s 100M /sparsefile 1768M
1853882368..1895825407 (file)  ->  16332619776..16374562815 (/dev/sda4)  ->  84492156928..84534099967 (/dev/sda)
1895825408..1958739967 (file)  ->  17213591552..17276506111 (/dev/sda4)  ->  85373128704..85436043263 (/dev/sda)

This could come in handy when we need to map a certain piece of data
within a file inside VM to the same data within the image on the host
(e.g. physical offset on VM's /dev/sda would be the virtual offset
within QCOW2 image).

Note: as of now the script only works with the files located on plain
partitions, i.e. it doesn't work with partitions built on top of LVM.
Partitions on LVM would require another level of mapping.

Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
---
 scripts/filev2p.py | 311 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 311 insertions(+)
 create mode 100755 scripts/filev2p.py

diff --git a/scripts/filev2p.py b/scripts/filev2p.py
new file mode 100755
index 0000000000..3bd7d18b5e
--- /dev/null
+++ b/scripts/filev2p.py
@@ -0,0 +1,311 @@
+#!/usr/bin/env python3
+#
+# Map file virtual offset to the offset on the underlying block device.
+# Works by parsing 'filefrag' output.
+#
+# Copyright (c) 2024 Virtuozzo International GmbH.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+import argparse
+import os
+import subprocess
+import re
+import sys
+
+from bisect import bisect_right
+from collections import namedtuple
+from dataclasses import dataclass
+from shutil import which
+from stat import S_ISBLK
+
+
+Partition = namedtuple('Partition', ['partpath', 'diskpath', 'part_offt'])
+
+
+@dataclass
+class Extent:
+    '''Class representing an individual file extent.
+
+    This is basically a piece of data within the file which is located
+    consecutively (i.e. not sparsely) on the underlying block device.
+    '''
+
+    log_start:  int
+    log_end:    int
+    phys_start: int
+    phys_end:   int
+    length:     int
+    partition:  Partition
+
+    @property
+    def disk_start(self):
+        'Number of the first byte of this extent on the whole disk (/dev/sda)'
+        return self.partition.part_offt + self.phys_start
+
+    @property
+    def disk_end(self):
+        'Number of the last byte of this extent on the whole disk (/dev/sda)'
+        return self.partition.part_offt + self.phys_end
+
+    def __str__(self):
+        ischunk = self.log_end > self.log_start
+        maybe_end = lambda s: f'..{s}' if ischunk else ''
+        return '%s%s (file)  ->  %s%s (%s)  ->  %s%s (%s)' % (
+            self.log_start, maybe_end(self.log_end),
+            self.phys_start, maybe_end(self.phys_end), self.partition.partpath,
+            self.disk_start, maybe_end(self.disk_end), self.partition.diskpath
+        )
+
+    @classmethod
+    def ext_slice(cls, bigger_ext, start, end):
+        '''Constructor for the Extent class from a bigger extent.
+
+        Return Extent instance which is a slice of @bigger_ext contained
+        within the range [start, end].
+        '''
+
+        assert start >= bigger_ext.log_start
+        assert end <= bigger_ext.log_end
+
+        if start == bigger_ext.log_start and end == bigger_ext.log_end:
+            return bigger_ext
+
+        phys_start = bigger_ext.phys_start + (start - bigger_ext.log_start)
+        phys_end = bigger_ext.phys_end - (bigger_ext.log_end - end)
+        length = end - start + 1
+
+        return cls(start, end, phys_start, phys_end, length,
+                   bigger_ext.partition)
+
+
+def run_cmd(cmd: str) -> str:
+    '''Wrapper around subprocess.run.
+
+    Returns stdout in case of success, emits en error and exits in case
+    of failure.
+    '''
+
+    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
+                          check=False, shell=True)
+    if proc.stderr is not None:
+        stderr = f'\n{proc.stderr.decode().strip()}'
+    else:
+        stderr = ''
+
+    if proc.returncode:
+        sys.exit(f'Error: Command "{cmd}" returned {proc.returncode}:{stderr}')
+
+    return proc.stdout.decode().strip()
+
+
+def parse_size(offset: str) -> int:
+    'Convert human readable size to bytes'
+
+    suffixes = {
+        **dict.fromkeys(['k', 'K', 'Kb', 'KB', 'KiB'], 2 ** 10),
+        **dict.fromkeys(['m', 'M', 'Mb', 'MB', 'MiB'], 2 ** 20),
+        **dict.fromkeys(['g', 'G', 'Gb', 'GB', 'GiB'], 2 ** 30),
+        **dict.fromkeys(     ['T', 'Tb', 'TB', 'TiB'], 2 ** 40),
+        **dict.fromkeys([''],                          1)
+    }
+
+    sizematch = re.match(r'^([0-9]+)\s*([a-zA-Z]*)$', offset)
+    if not bool(sizematch):
+        sys.exit(f'Error: Couldn\'t parse size "{offset}". Pass offset '
+                  'either in bytes or in format 1K, 2M, 3G')
+
+    num, suff = sizematch.groups()
+    num = int(num)
+
+    mult = suffixes.get(suff)
+    if mult is None:
+        sys.exit(f'Error: Couldn\'t parse size "{offset}": '
+                 f'unknown suffix {suff}')
+
+    return num * mult
+
+
+def fpath2part(filename: str) -> str:
+    'Get partition on which @filename is located (i.e. /dev/sda1).'
+
+    partpath = run_cmd(f'df --output=source {filename} | tail -n+2')
+    if not os.path.exists(partpath) or not S_ISBLK(os.stat(partpath).st_mode):
+        sys.exit(f'Error: file {filename} is located on {partpath} which '
+                 'isn\'t a block device')
+    return partpath
+
+
+def part2dev(partpath: str, filename: str) -> str:
+    'Get block device on which @partpath is located (i.e. /dev/sda).'
+    dev = run_cmd(f'lsblk -no PKNAME {partpath}')
+    diskpath = f'/dev/{dev}'
+    if not os.path.exists(diskpath) or not S_ISBLK(os.stat(diskpath).st_mode):
+        sys.exit(f'Error: file {filename} is located on {diskpath} which '
+                 'isn\'t a block device')
+    return diskpath
+
+
+def part2disktype(partpath: str) -> str:
+    'Parse /proc/devices and get block device type for @partpath'
+
+    major = os.major(os.stat(partpath).st_rdev)
+    assert major
+    with open('/proc/devices', encoding='utf-8') as devf:
+        for line in reversed(list(devf)):
+            # Our major cannot be absent among block devs
+            if line.startswith('Block'):
+                break
+            devmajor, devtype = line.strip().split()
+            if int(devmajor) == major:
+                return devtype
+
+    sys.exit('Error: We haven\'t found major {major} in /proc/devices, '
+             'and that can\'t be')
+
+
+def get_part_offset(part: str, disk: str) -> int:
+    'Get offset in bytes of the partition @part on the block device @disk.'
+
+    lines = run_cmd(f'fdisk -l {disk} | egrep "^(Units|{part})"').splitlines()
+
+    unitmatch = re.match('^.* = ([0-9]+) bytes$', lines[0])
+    if not bool(unitmatch):
+        sys.exit(f'Error: Couldn\'t parse "fdisk -l" output:\n{lines[0]}')
+    secsize = int(unitmatch.group(1))
+
+    part_offt = int(lines[1].split()[1])
+    return part_offt * secsize
+
+
+def parse_frag_line(line: str, partition: Partition) -> Extent:
+    'Construct Extent instance from a "filefrag" output line.'
+
+    nums = [int(n) for n in re.findall(r'[0-9]+', line)]
+
+    log_start  = nums[1]
+    log_end    = nums[2]
+    phys_start = nums[3]
+    phys_end   = nums[4]
+    length     = nums[5]
+
+    assert log_start < log_end
+    assert phys_start < phys_end
+    assert (log_end - log_start + 1) == (phys_end - phys_start + 1) == length
+
+    return Extent(log_start, log_end, phys_start, phys_end, length, partition)
+
+
+def preliminary_checks(args: argparse.Namespace) -> None:
+    'A bunch of checks to emit an error and exit at the earlier stage.'
+
+    if which('filefrag') is None:
+        sys.exit('Error: Program "filefrag" doesn\'t exist')
+
+    if not os.path.exists(args.filename):
+        sys.exit(f'Error: File {args.filename} doesn\'t exist')
+
+    args.filesize = os.path.getsize(args.filename)
+    if args.offset >= args.filesize:
+        sys.exit(f'Error: Specified offset {args.offset} exceeds '
+                 f'file size {args.filesize}')
+    if args.size and (args.offset + args.size > args.filesize):
+        sys.exit(f'Error: Chunk of size {args.size} at offset '
+                 f'{args.offset} exceeds file size {args.filesize}')
+
+    args.partpath = fpath2part(args.filename)
+    args.disktype = part2disktype(args.partpath)
+    if args.disktype not in ('sd', 'virtblk'):
+        sys.exit(f'Error: Cannot analyze files on {args.disktype} disks')
+    args.diskpath = part2dev(args.partpath, args.filename)
+    args.part_offt = get_part_offset(args.partpath, args.diskpath)
+
+
+def get_extent_maps(args: argparse.Namespace) -> list[Extent]:
+    'Run "filefrag", parse its output and return a list of Extent instances.'
+
+    lines = run_cmd(f'filefrag -b1 -v {args.filename}').splitlines()
+
+    ffinfo_re = re.compile('.* is ([0-9]+) .*of ([0-9]+) bytes')
+    ff_size, ff_block = re.match(ffinfo_re, lines[1]).groups()
+
+    # Paranoia checks
+    if int(ff_size) != args.filesize:
+        sys.exit('Error: filefrag and os.path.getsize() report different '
+                 f'sizes: {ff_size} and {args.filesize}')
+    if int(ff_block) != 1:
+        sys.exit(f'Error: "filefrag -b1" invoked, but block size is {ff_block}')
+
+    partition = Partition(args.partpath, args.diskpath, args.part_offt)
+
+    # Fill extents list from the output
+    extents = []
+    for line in lines:
+        if not re.match(r'^\s*[0-9]+:', line):
+            continue
+        extents += [parse_frag_line(line, partition)]
+
+    chunk_start = args.offset
+    chunk_end = args.offset + args.size - 1
+    ext_offsets = [ext.log_start for ext in extents]
+    start_ind = bisect_right(ext_offsets, chunk_start) - 1
+    end_ind = bisect_right(ext_offsets, chunk_end) - 1
+
+    res_extents = extents[start_ind : end_ind + 1]
+    for i, ext in enumerate(res_extents):
+        start = max(chunk_start, ext.log_start)
+        end = min(chunk_end, ext.log_end)
+        res_extents[i] = Extent.ext_slice(ext, start, end)
+
+    return res_extents
+
+
+def parse_args() -> argparse.Namespace:
+    'Define program arguments and parse user input.'
+
+    parser = argparse.ArgumentParser(description='''
+Map file offset to physical offset on the block device
+
+With --size provided get a list of mappings for the chunk''',
+    formatter_class=argparse.RawTextHelpFormatter)
+
+    parser.add_argument('filename', type=str, help='filename to process')
+    parser.add_argument('offset', type=str,
+                        help='logical offset inside the file')
+    parser.add_argument('-s', '--size', required=False, type=str,
+                        help='size of the file chunk to get offsets for')
+    args = parser.parse_args()
+
+    args.offset = parse_size(args.offset)
+    if args.size:
+        args.size = parse_size(args.size)
+    else:
+        # When no chunk size is provided (only offset), it's equivalent to
+        # chunk size == 1
+        args.size = 1
+
+    return args
+
+
+def main() -> int:
+    args = parse_args()
+    preliminary_checks(args)
+    extents = get_extent_maps(args)
+    for ext in extents:
+        print(ext)
+
+
+if __name__ == '__main__':
+    sys.exit(main())
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 1/3] block: zero data data corruption using prealloc-filter
  2024-07-16 14:41 ` [PATCH v3 1/3] block: zero data data corruption using prealloc-filter Andrey Drobyshev
@ 2024-07-18 15:51   ` Kevin Wolf
  2024-07-18 15:52     ` Denis V. Lunev
  2024-07-18 19:46     ` Denis V. Lunev
  0 siblings, 2 replies; 14+ messages in thread
From: Kevin Wolf @ 2024-07-18 15:51 UTC (permalink / raw)
  To: Andrey Drobyshev
  Cc: qemu-block, qemu-devel, hreitz, vsementsov, pbonzini, eesposit,
	den

Am 16.07.2024 um 16:41 hat Andrey Drobyshev geschrieben:
> From: "Denis V. Lunev" <den@openvz.org>
> 
> We have observed that some clusters in the QCOW2 files are zeroed
> while preallocation filter is used.
> 
> We are able to trace down the following sequence when prealloc-filter
> is used:
>     co=0x55e7cbed7680 qcow2_co_pwritev_task()
>     co=0x55e7cbed7680 preallocate_co_pwritev_part()
>     co=0x55e7cbed7680 handle_write()
>     co=0x55e7cbed7680 bdrv_co_do_pwrite_zeroes()
>     co=0x55e7cbed7680 raw_do_pwrite_zeroes()
>     co=0x7f9edb7fe500 do_fallocate()
> 
> Here coroutine 0x55e7cbed7680 is being blocked waiting while coroutine
> 0x7f9edb7fe500 will finish with fallocate of the file area. OK. It is
> time to handle next coroutine, which
>     co=0x55e7cbee91b0 qcow2_co_pwritev_task()
>     co=0x55e7cbee91b0 preallocate_co_pwritev_part()
>     co=0x55e7cbee91b0 handle_write()
>     co=0x55e7cbee91b0 bdrv_co_do_pwrite_zeroes()
>     co=0x55e7cbee91b0 raw_do_pwrite_zeroes()
>     co=0x7f9edb7deb00 do_fallocate()
> 
> The trouble comes here. Coroutine 0x55e7cbed7680 has not advanced
> file_end yet and coroutine 0x55e7cbee91b0 will start fallocate() for
> the same area. This means that if (once fallocate is started inside
> 0x7f9edb7deb00) original fallocate could end and the real write will
> be executed. In that case write() request is handled at the same time
> as fallocate().
> 
> The patch moves s->file_lock assignment before fallocate and that is

s/file_lock/file_end/?

> crucial. The idea is that all subsequent requests into the area
> being preallocation will be issued as just writes without fallocate
> to this area and they will not proceed thanks to overlapping
> requests mechanics. If preallocation will fail, we will just switch
> to the normal expand-by-write behavior and that is not a problem
> except performance.
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> Tested-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
> ---
>  block/preallocate.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/block/preallocate.c b/block/preallocate.c
> index d215bc5d6d..ecf0aa4baa 100644
> --- a/block/preallocate.c
> +++ b/block/preallocate.c
> @@ -383,6 +383,13 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
>  
>      want_merge_zero = want_merge_zero && (prealloc_start <= offset);
>  
> +    /*
> +     * Assign file_end before making actual preallocation. This will ensure
> +     * that next request performed while preallocation is in progress will
> +     * be passed without preallocation.
> +     */
> +    s->file_end = prealloc_end;
> +
>      ret = bdrv_co_pwrite_zeroes(
>              bs->file, prealloc_start, prealloc_end - prealloc_start,
>              BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING | BDRV_REQ_NO_WAIT);
> @@ -391,7 +398,6 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
>          return false;
>      }
>  
> -    s->file_end = prealloc_end;
>      return want_merge_zero;
>  }

Until bdrv_co_pwrite_zeroes() completes successfully, the file size is
unchanged. In other words, if anything calls preallocate_co_getlength()
in the meantime, don't we run into...

    ret = bdrv_co_getlength(bs->file->bs);

    if (has_prealloc_perms(bs)) {
        s->file_end = s->zero_start = s->data_end = ret;
    }

...and reset s->file_end back to the old value, re-enabling the bug
you're trying to fix here?

Or do we know that no such code path can be called concurrently for some
reason?

Kevin



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 1/3] block: zero data data corruption using prealloc-filter
  2024-07-18 15:51   ` Kevin Wolf
@ 2024-07-18 15:52     ` Denis V. Lunev
  2024-07-18 19:46     ` Denis V. Lunev
  1 sibling, 0 replies; 14+ messages in thread
From: Denis V. Lunev @ 2024-07-18 15:52 UTC (permalink / raw)
  To: Kevin Wolf, Andrey Drobyshev
  Cc: qemu-block, qemu-devel, hreitz, vsementsov, pbonzini, eesposit

On 7/18/24 17:51, Kevin Wolf wrote:
> Am 16.07.2024 um 16:41 hat Andrey Drobyshev geschrieben:
>> From: "Denis V. Lunev" <den@openvz.org>
>>
>> We have observed that some clusters in the QCOW2 files are zeroed
>> while preallocation filter is used.
>>
>> We are able to trace down the following sequence when prealloc-filter
>> is used:
>>      co=0x55e7cbed7680 qcow2_co_pwritev_task()
>>      co=0x55e7cbed7680 preallocate_co_pwritev_part()
>>      co=0x55e7cbed7680 handle_write()
>>      co=0x55e7cbed7680 bdrv_co_do_pwrite_zeroes()
>>      co=0x55e7cbed7680 raw_do_pwrite_zeroes()
>>      co=0x7f9edb7fe500 do_fallocate()
>>
>> Here coroutine 0x55e7cbed7680 is being blocked waiting while coroutine
>> 0x7f9edb7fe500 will finish with fallocate of the file area. OK. It is
>> time to handle next coroutine, which
>>      co=0x55e7cbee91b0 qcow2_co_pwritev_task()
>>      co=0x55e7cbee91b0 preallocate_co_pwritev_part()
>>      co=0x55e7cbee91b0 handle_write()
>>      co=0x55e7cbee91b0 bdrv_co_do_pwrite_zeroes()
>>      co=0x55e7cbee91b0 raw_do_pwrite_zeroes()
>>      co=0x7f9edb7deb00 do_fallocate()
>>
>> The trouble comes here. Coroutine 0x55e7cbed7680 has not advanced
>> file_end yet and coroutine 0x55e7cbee91b0 will start fallocate() for
>> the same area. This means that if (once fallocate is started inside
>> 0x7f9edb7deb00) original fallocate could end and the real write will
>> be executed. In that case write() request is handled at the same time
>> as fallocate().
>>
>> The patch moves s->file_lock assignment before fallocate and that is
> s/file_lock/file_end/?
>
>> crucial. The idea is that all subsequent requests into the area
>> being preallocation will be issued as just writes without fallocate
>> to this area and they will not proceed thanks to overlapping
>> requests mechanics. If preallocation will fail, we will just switch
>> to the normal expand-by-write behavior and that is not a problem
>> except performance.
>>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> Tested-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
>> ---
>>   block/preallocate.c | 8 +++++++-
>>   1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/preallocate.c b/block/preallocate.c
>> index d215bc5d6d..ecf0aa4baa 100644
>> --- a/block/preallocate.c
>> +++ b/block/preallocate.c
>> @@ -383,6 +383,13 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
>>   
>>       want_merge_zero = want_merge_zero && (prealloc_start <= offset);
>>   
>> +    /*
>> +     * Assign file_end before making actual preallocation. This will ensure
>> +     * that next request performed while preallocation is in progress will
>> +     * be passed without preallocation.
>> +     */
>> +    s->file_end = prealloc_end;
>> +
>>       ret = bdrv_co_pwrite_zeroes(
>>               bs->file, prealloc_start, prealloc_end - prealloc_start,
>>               BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING | BDRV_REQ_NO_WAIT);
>> @@ -391,7 +398,6 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
>>           return false;
>>       }
>>   
>> -    s->file_end = prealloc_end;
>>       return want_merge_zero;
>>   }
> Until bdrv_co_pwrite_zeroes() completes successfully, the file size is
> unchanged. In other words, if anything calls preallocate_co_getlength()
> in the meantime, don't we run into...
>
>      ret = bdrv_co_getlength(bs->file->bs);
>
>      if (has_prealloc_perms(bs)) {
>          s->file_end = s->zero_start = s->data_end = ret;
>      }
>
> ...and reset s->file_end back to the old value, re-enabling the bug
> you're trying to fix here?
>
> Or do we know that no such code path can be called concurrently for some
> reason?
>
> Kevin
>
Truly amazing catch!

Den


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 1/3] block: zero data data corruption using prealloc-filter
  2024-07-18 15:51   ` Kevin Wolf
  2024-07-18 15:52     ` Denis V. Lunev
@ 2024-07-18 19:46     ` Denis V. Lunev
  2024-08-05 11:59       ` Kevin Wolf
  1 sibling, 1 reply; 14+ messages in thread
From: Denis V. Lunev @ 2024-07-18 19:46 UTC (permalink / raw)
  To: Kevin Wolf, Andrey Drobyshev
  Cc: qemu-block, qemu-devel, hreitz, vsementsov, pbonzini, eesposit

On 7/18/24 17:51, Kevin Wolf wrote:
> Am 16.07.2024 um 16:41 hat Andrey Drobyshev geschrieben:
>> From: "Denis V. Lunev" <den@openvz.org>
>>
>> We have observed that some clusters in the QCOW2 files are zeroed
>> while preallocation filter is used.
>>
>> We are able to trace down the following sequence when prealloc-filter
>> is used:
>>      co=0x55e7cbed7680 qcow2_co_pwritev_task()
>>      co=0x55e7cbed7680 preallocate_co_pwritev_part()
>>      co=0x55e7cbed7680 handle_write()
>>      co=0x55e7cbed7680 bdrv_co_do_pwrite_zeroes()
>>      co=0x55e7cbed7680 raw_do_pwrite_zeroes()
>>      co=0x7f9edb7fe500 do_fallocate()
>>
>> Here coroutine 0x55e7cbed7680 is being blocked waiting while coroutine
>> 0x7f9edb7fe500 will finish with fallocate of the file area. OK. It is
>> time to handle next coroutine, which
>>      co=0x55e7cbee91b0 qcow2_co_pwritev_task()
>>      co=0x55e7cbee91b0 preallocate_co_pwritev_part()
>>      co=0x55e7cbee91b0 handle_write()
>>      co=0x55e7cbee91b0 bdrv_co_do_pwrite_zeroes()
>>      co=0x55e7cbee91b0 raw_do_pwrite_zeroes()
>>      co=0x7f9edb7deb00 do_fallocate()
>>
>> The trouble comes here. Coroutine 0x55e7cbed7680 has not advanced
>> file_end yet and coroutine 0x55e7cbee91b0 will start fallocate() for
>> the same area. This means that if (once fallocate is started inside
>> 0x7f9edb7deb00) original fallocate could end and the real write will
>> be executed. In that case write() request is handled at the same time
>> as fallocate().
>>
>> The patch moves s->file_lock assignment before fallocate and that is
> s/file_lock/file_end/?
>
>> crucial. The idea is that all subsequent requests into the area
>> being preallocation will be issued as just writes without fallocate
>> to this area and they will not proceed thanks to overlapping
>> requests mechanics. If preallocation will fail, we will just switch
>> to the normal expand-by-write behavior and that is not a problem
>> except performance.
>>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> Tested-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
>> ---
>>   block/preallocate.c | 8 +++++++-
>>   1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/preallocate.c b/block/preallocate.c
>> index d215bc5d6d..ecf0aa4baa 100644
>> --- a/block/preallocate.c
>> +++ b/block/preallocate.c
>> @@ -383,6 +383,13 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
>>   
>>       want_merge_zero = want_merge_zero && (prealloc_start <= offset);
>>   
>> +    /*
>> +     * Assign file_end before making actual preallocation. This will ensure
>> +     * that next request performed while preallocation is in progress will
>> +     * be passed without preallocation.
>> +     */
>> +    s->file_end = prealloc_end;
>> +
>>       ret = bdrv_co_pwrite_zeroes(
>>               bs->file, prealloc_start, prealloc_end - prealloc_start,
>>               BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING | BDRV_REQ_NO_WAIT);
>> @@ -391,7 +398,6 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
>>           return false;
>>       }
>>   
>> -    s->file_end = prealloc_end;
>>       return want_merge_zero;
>>   }
> Until bdrv_co_pwrite_zeroes() completes successfully, the file size is
> unchanged. In other words, if anything calls preallocate_co_getlength()
> in the meantime, don't we run into...
>
>      ret = bdrv_co_getlength(bs->file->bs);
>
>      if (has_prealloc_perms(bs)) {
>          s->file_end = s->zero_start = s->data_end = ret;
>      }
>
> ...and reset s->file_end back to the old value, re-enabling the bug
> you're trying to fix here?
>
> Or do we know that no such code path can be called concurrently for some
> reason?
>
> Kevin
>
After more detailed thinking I tend to disagree.
Normally we would not hit the problem. Though
this was not obvious at the beginning :)

The point in handle_write() where we move
file_end assignment is reachable only if the
following code has been executed

     if (s->data_end < 0) {
         s->data_end = bdrv_co_getlength(bs->file->bs);
         if (s->data_end < 0) {
             return false;
         }

         if (s->file_end < 0) {
             s->file_end = s->data_end;
         }
     }

     if (end <= s->data_end) {
         return false;
     }

which means that s->data_end > 0.

Thus

static int64_t coroutine_fn GRAPH_RDLOCK
preallocate_co_getlength(BlockDriverState *bs)
{
     int64_t ret;
     BDRVPreallocateState *s = bs->opaque;

     if (s->data_end >= 0) {
         return s->data_end; <--- we will return here
     }

     ret = bdrv_co_getlength(bs->file->bs);

     if (has_prealloc_perms(bs)) {
         s->file_end = s->zero_start = s->data_end = ret;
     }

     return ret;
}

Den


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 1/3] block: zero data data corruption using prealloc-filter
  2024-07-18 19:46     ` Denis V. Lunev
@ 2024-08-05 11:59       ` Kevin Wolf
  2024-08-05 12:13         ` Denis V. Lunev
  0 siblings, 1 reply; 14+ messages in thread
From: Kevin Wolf @ 2024-08-05 11:59 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: Andrey Drobyshev, qemu-block, qemu-devel, hreitz, vsementsov,
	pbonzini, eesposit

Am 18.07.2024 um 21:46 hat Denis V. Lunev geschrieben:
> On 7/18/24 17:51, Kevin Wolf wrote:
> > Am 16.07.2024 um 16:41 hat Andrey Drobyshev geschrieben:
> > > From: "Denis V. Lunev" <den@openvz.org>
> > > 
> > > We have observed that some clusters in the QCOW2 files are zeroed
> > > while preallocation filter is used.
> > > 
> > > We are able to trace down the following sequence when prealloc-filter
> > > is used:
> > >      co=0x55e7cbed7680 qcow2_co_pwritev_task()
> > >      co=0x55e7cbed7680 preallocate_co_pwritev_part()
> > >      co=0x55e7cbed7680 handle_write()
> > >      co=0x55e7cbed7680 bdrv_co_do_pwrite_zeroes()
> > >      co=0x55e7cbed7680 raw_do_pwrite_zeroes()
> > >      co=0x7f9edb7fe500 do_fallocate()
> > > 
> > > Here coroutine 0x55e7cbed7680 is being blocked waiting while coroutine
> > > 0x7f9edb7fe500 will finish with fallocate of the file area. OK. It is
> > > time to handle next coroutine, which
> > >      co=0x55e7cbee91b0 qcow2_co_pwritev_task()
> > >      co=0x55e7cbee91b0 preallocate_co_pwritev_part()
> > >      co=0x55e7cbee91b0 handle_write()
> > >      co=0x55e7cbee91b0 bdrv_co_do_pwrite_zeroes()
> > >      co=0x55e7cbee91b0 raw_do_pwrite_zeroes()
> > >      co=0x7f9edb7deb00 do_fallocate()
> > > 
> > > The trouble comes here. Coroutine 0x55e7cbed7680 has not advanced
> > > file_end yet and coroutine 0x55e7cbee91b0 will start fallocate() for
> > > the same area. This means that if (once fallocate is started inside
> > > 0x7f9edb7deb00) original fallocate could end and the real write will
> > > be executed. In that case write() request is handled at the same time
> > > as fallocate().
> > > 
> > > The patch moves s->file_lock assignment before fallocate and that is
> > s/file_lock/file_end/?
> > 
> > > crucial. The idea is that all subsequent requests into the area
> > > being preallocation will be issued as just writes without fallocate
> > > to this area and they will not proceed thanks to overlapping
> > > requests mechanics. If preallocation will fail, we will just switch
> > > to the normal expand-by-write behavior and that is not a problem
> > > except performance.
> > > 
> > > Signed-off-by: Denis V. Lunev <den@openvz.org>
> > > Tested-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
> > > ---
> > >   block/preallocate.c | 8 +++++++-
> > >   1 file changed, 7 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/block/preallocate.c b/block/preallocate.c
> > > index d215bc5d6d..ecf0aa4baa 100644
> > > --- a/block/preallocate.c
> > > +++ b/block/preallocate.c
> > > @@ -383,6 +383,13 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
> > >       want_merge_zero = want_merge_zero && (prealloc_start <= offset);
> > > +    /*
> > > +     * Assign file_end before making actual preallocation. This will ensure
> > > +     * that next request performed while preallocation is in progress will
> > > +     * be passed without preallocation.
> > > +     */
> > > +    s->file_end = prealloc_end;
> > > +
> > >       ret = bdrv_co_pwrite_zeroes(
> > >               bs->file, prealloc_start, prealloc_end - prealloc_start,
> > >               BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING | BDRV_REQ_NO_WAIT);
> > > @@ -391,7 +398,6 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
> > >           return false;
> > >       }
> > > -    s->file_end = prealloc_end;
> > >       return want_merge_zero;
> > >   }
> > Until bdrv_co_pwrite_zeroes() completes successfully, the file size is
> > unchanged. In other words, if anything calls preallocate_co_getlength()
> > in the meantime, don't we run into...
> > 
> >      ret = bdrv_co_getlength(bs->file->bs);
> > 
> >      if (has_prealloc_perms(bs)) {
> >          s->file_end = s->zero_start = s->data_end = ret;
> >      }
> > 
> > ...and reset s->file_end back to the old value, re-enabling the bug
> > you're trying to fix here?
> > 
> > Or do we know that no such code path can be called concurrently for some
> > reason?
> > 
> > Kevin
> > 
> After more detailed thinking I tend to disagree.
> Normally we would not hit the problem. Though
> this was not obvious at the beginning :)
> 
> The point in handle_write() where we move
> file_end assignment is reachable only if the
> following code has been executed
> 
>     if (s->data_end < 0) {
>         s->data_end = bdrv_co_getlength(bs->file->bs);
>         if (s->data_end < 0) {
>             return false;
>         }
> 
>         if (s->file_end < 0) {
>             s->file_end = s->data_end;
>         }
>     }
> 
>     if (end <= s->data_end) {
>         return false;
>     }
> 
> which means that s->data_end > 0.
> 
> Thus
> 
> static int64_t coroutine_fn GRAPH_RDLOCK
> preallocate_co_getlength(BlockDriverState *bs)
> {
>     int64_t ret;
>     BDRVPreallocateState *s = bs->opaque;
> 
>     if (s->data_end >= 0) {
>         return s->data_end; <--- we will return here
>     }
> 
>     ret = bdrv_co_getlength(bs->file->bs);
> 
>     if (has_prealloc_perms(bs)) {
>         s->file_end = s->zero_start = s->data_end = ret;
>     }
> 
>     return ret;
> }

I think you're right there. And the other places that update s->file_end
should probably be okay because of the request serialisation.

I'm okay with applying this patch as it seems to fix a corruption, but
the way this whole block driver operates doesn't feel very robust to me.
There seem to be a lot of implicit assumptions that make the code hard
to understand even though it's quite short.

Kevin



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 2/3] iotests/298: add testcase for async writes with preallocation filter
  2024-07-16 14:41 ` [PATCH v3 2/3] iotests/298: add testcase for async writes with preallocation filter Andrey Drobyshev
@ 2024-08-05 12:04   ` Kevin Wolf
  2024-08-05 12:56     ` Andrey Drobyshev
  0 siblings, 1 reply; 14+ messages in thread
From: Kevin Wolf @ 2024-08-05 12:04 UTC (permalink / raw)
  To: Andrey Drobyshev
  Cc: qemu-block, qemu-devel, hreitz, vsementsov, pbonzini, eesposit,
	den

Am 16.07.2024 um 16:41 hat Andrey Drobyshev geschrieben:
> The testcase simply creates a 64G image with 1M clusters, generates a list
> of 1M aligned offsets and feeds aio_write commands with those offsets to
> qemu-io run with '--aio native --nocache'.  Then we check the data
> written at each of the offsets.  Before the previous commit this could
> result into a race within the preallocation filter which would zeroize
> some clusters after actually writing data to them.
> 
> Note: the test doesn't fail in 100% cases as there's a race involved,
> but the failures are pretty consistent so it should be good enough for
> detecting the problem.
> 
> Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>

I left it running in a loop for a while, but couldn't reproduce the bug
with this test.

>  tests/qemu-iotests/298     | 49 ++++++++++++++++++++++++++++++++++++++
>  tests/qemu-iotests/298.out |  4 ++--
>  2 files changed, 51 insertions(+), 2 deletions(-)
> 
> diff --git a/tests/qemu-iotests/298 b/tests/qemu-iotests/298
> index 09c9290711..b7126e9e15 100755
> --- a/tests/qemu-iotests/298
> +++ b/tests/qemu-iotests/298
> @@ -20,8 +20,10 @@
>  
>  import os
>  import iotests
> +import random
>  
>  MiB = 1024 * 1024
> +GiB = MiB * 1024
>  disk = os.path.join(iotests.test_dir, 'disk')
>  overlay = os.path.join(iotests.test_dir, 'overlay')
>  refdisk = os.path.join(iotests.test_dir, 'refdisk')
> @@ -176,5 +178,52 @@ class TestTruncate(iotests.QMPTestCase):
>          self.do_test('off', '150M')
>  
>  
> +class TestPreallocAsyncWrites(iotests.QMPTestCase):
> +    def setUp(self):
> +        # Make sure we get reproducible write patterns on each run
> +        random.seed(42)
> +        iotests.qemu_img_create('-f', iotests.imgfmt, disk, '-o',
> +                                f'cluster_size={MiB},lazy_refcounts=on',
> +                                str(64 * GiB))
> +
> +    def tearDown(self):
> +        os.remove(disk)
> +
> +    def test_prealloc_async_writes(self):
> +        def gen_write_pattern():
> +            n = 0
> +            while True:
> +                yield '-P 0xaa' if n else '-z'
> +                n = 1 - n

This looks like a complicated way to write the following?

    # Alternate between write_zeroes and writing data
    def gen_write_pattern():
        while True:
            yield '-z'
            yield '-P 0xaa'

> +        def gen_read_pattern():
> +            n = 0
> +            while True:
> +                yield '-P 0xaa' if n else '-P 0x00'
> +                n = 1 - n

Same here.

Kevin



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 1/3] block: zero data data corruption using prealloc-filter
  2024-08-05 11:59       ` Kevin Wolf
@ 2024-08-05 12:13         ` Denis V. Lunev
  0 siblings, 0 replies; 14+ messages in thread
From: Denis V. Lunev @ 2024-08-05 12:13 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Andrey Drobyshev, qemu-block, qemu-devel, hreitz, vsementsov,
	pbonzini, eesposit

On 8/5/24 13:59, Kevin Wolf wrote:
> Am 18.07.2024 um 21:46 hat Denis V. Lunev geschrieben:
>> On 7/18/24 17:51, Kevin Wolf wrote:
>>> Am 16.07.2024 um 16:41 hat Andrey Drobyshev geschrieben:
>>>> From: "Denis V. Lunev" <den@openvz.org>
>>>>
>>>> We have observed that some clusters in the QCOW2 files are zeroed
>>>> while preallocation filter is used.
>>>>
>>>> We are able to trace down the following sequence when prealloc-filter
>>>> is used:
>>>>       co=0x55e7cbed7680 qcow2_co_pwritev_task()
>>>>       co=0x55e7cbed7680 preallocate_co_pwritev_part()
>>>>       co=0x55e7cbed7680 handle_write()
>>>>       co=0x55e7cbed7680 bdrv_co_do_pwrite_zeroes()
>>>>       co=0x55e7cbed7680 raw_do_pwrite_zeroes()
>>>>       co=0x7f9edb7fe500 do_fallocate()
>>>>
>>>> Here coroutine 0x55e7cbed7680 is being blocked waiting while coroutine
>>>> 0x7f9edb7fe500 will finish with fallocate of the file area. OK. It is
>>>> time to handle next coroutine, which
>>>>       co=0x55e7cbee91b0 qcow2_co_pwritev_task()
>>>>       co=0x55e7cbee91b0 preallocate_co_pwritev_part()
>>>>       co=0x55e7cbee91b0 handle_write()
>>>>       co=0x55e7cbee91b0 bdrv_co_do_pwrite_zeroes()
>>>>       co=0x55e7cbee91b0 raw_do_pwrite_zeroes()
>>>>       co=0x7f9edb7deb00 do_fallocate()
>>>>
>>>> The trouble comes here. Coroutine 0x55e7cbed7680 has not advanced
>>>> file_end yet and coroutine 0x55e7cbee91b0 will start fallocate() for
>>>> the same area. This means that if (once fallocate is started inside
>>>> 0x7f9edb7deb00) original fallocate could end and the real write will
>>>> be executed. In that case write() request is handled at the same time
>>>> as fallocate().
>>>>
>>>> The patch moves s->file_lock assignment before fallocate and that is
>>> s/file_lock/file_end/?
>>>
>>>> crucial. The idea is that all subsequent requests into the area
>>>> being preallocation will be issued as just writes without fallocate
>>>> to this area and they will not proceed thanks to overlapping
>>>> requests mechanics. If preallocation will fail, we will just switch
>>>> to the normal expand-by-write behavior and that is not a problem
>>>> except performance.
>>>>
>>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>>> Tested-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
>>>> ---
>>>>    block/preallocate.c | 8 +++++++-
>>>>    1 file changed, 7 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/block/preallocate.c b/block/preallocate.c
>>>> index d215bc5d6d..ecf0aa4baa 100644
>>>> --- a/block/preallocate.c
>>>> +++ b/block/preallocate.c
>>>> @@ -383,6 +383,13 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
>>>>        want_merge_zero = want_merge_zero && (prealloc_start <= offset);
>>>> +    /*
>>>> +     * Assign file_end before making actual preallocation. This will ensure
>>>> +     * that next request performed while preallocation is in progress will
>>>> +     * be passed without preallocation.
>>>> +     */
>>>> +    s->file_end = prealloc_end;
>>>> +
>>>>        ret = bdrv_co_pwrite_zeroes(
>>>>                bs->file, prealloc_start, prealloc_end - prealloc_start,
>>>>                BDRV_REQ_NO_FALLBACK | BDRV_REQ_SERIALISING | BDRV_REQ_NO_WAIT);
>>>> @@ -391,7 +398,6 @@ handle_write(BlockDriverState *bs, int64_t offset, int64_t bytes,
>>>>            return false;
>>>>        }
>>>> -    s->file_end = prealloc_end;
>>>>        return want_merge_zero;
>>>>    }
>>> Until bdrv_co_pwrite_zeroes() completes successfully, the file size is
>>> unchanged. In other words, if anything calls preallocate_co_getlength()
>>> in the meantime, don't we run into...
>>>
>>>       ret = bdrv_co_getlength(bs->file->bs);
>>>
>>>       if (has_prealloc_perms(bs)) {
>>>           s->file_end = s->zero_start = s->data_end = ret;
>>>       }
>>>
>>> ...and reset s->file_end back to the old value, re-enabling the bug
>>> you're trying to fix here?
>>>
>>> Or do we know that no such code path can be called concurrently for some
>>> reason?
>>>
>>> Kevin
>>>
>> After more detailed thinking I tend to disagree.
>> Normally we would not hit the problem. Though
>> this was not obvious at the beginning :)
>>
>> The point in handle_write() where we move
>> file_end assignment is reachable only if the
>> following code has been executed
>>
>>      if (s->data_end < 0) {
>>          s->data_end = bdrv_co_getlength(bs->file->bs);
>>          if (s->data_end < 0) {
>>              return false;
>>          }
>>
>>          if (s->file_end < 0) {
>>              s->file_end = s->data_end;
>>          }
>>      }
>>
>>      if (end <= s->data_end) {
>>          return false;
>>      }
>>
>> which means that s->data_end > 0.
>>
>> Thus
>>
>> static int64_t coroutine_fn GRAPH_RDLOCK
>> preallocate_co_getlength(BlockDriverState *bs)
>> {
>>      int64_t ret;
>>      BDRVPreallocateState *s = bs->opaque;
>>
>>      if (s->data_end >= 0) {
>>          return s->data_end; <--- we will return here
>>      }
>>
>>      ret = bdrv_co_getlength(bs->file->bs);
>>
>>      if (has_prealloc_perms(bs)) {
>>          s->file_end = s->zero_start = s->data_end = ret;
>>      }
>>
>>      return ret;
>> }
> I think you're right there. And the other places that update s->file_end
> should probably be okay because of the request serialisation.
>
> I'm okay with applying this patch as it seems to fix a corruption, but
> the way this whole block driver operates doesn't feel very robust to me.
> There seem to be a lot of implicit assumptions that make the code hard
> to understand even though it's quite short.
>
> Kevin
>
There are a lot of things to make a change. I'll definitely
have to address this later. Working on that.

For now the patch is working in downstream and it seems OK.

Thank you in advance,
     Den


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 3/3] scripts: add filev2p.py script for mapping virtual file offsets mapping
  2024-07-16 14:41 ` [PATCH v3 3/3] scripts: add filev2p.py script for mapping virtual file offsets mapping Andrey Drobyshev
@ 2024-08-05 12:29   ` Kevin Wolf
  2024-08-05 13:02     ` Andrey Drobyshev
  0 siblings, 1 reply; 14+ messages in thread
From: Kevin Wolf @ 2024-08-05 12:29 UTC (permalink / raw)
  To: Andrey Drobyshev
  Cc: qemu-block, qemu-devel, hreitz, vsementsov, pbonzini, eesposit,
	den

Am 16.07.2024 um 16:41 hat Andrey Drobyshev geschrieben:
> The script is basically a wrapper around "filefrag" utility.  This might
> be used to map virtual offsets within the file to the underlying block
> device offsets.  In addition, a chunk size might be specified, in which
> case a list of such mappings will be obtained:
> 
> $ scripts/filev2p.py -s 100M /sparsefile 1768M
> 1853882368..1895825407 (file)  ->  16332619776..16374562815 (/dev/sda4)  ->  84492156928..84534099967 (/dev/sda)
> 1895825408..1958739967 (file)  ->  17213591552..17276506111 (/dev/sda4)  ->  85373128704..85436043263 (/dev/sda)
> 
> This could come in handy when we need to map a certain piece of data
> within a file inside VM to the same data within the image on the host
> (e.g. physical offset on VM's /dev/sda would be the virtual offset
> within QCOW2 image).
> 
> Note: as of now the script only works with the files located on plain
> partitions, i.e. it doesn't work with partitions built on top of LVM.
> Partitions on LVM would require another level of mapping.
> 
> Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
> ---
>  scripts/filev2p.py | 311 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 311 insertions(+)
>  create mode 100755 scripts/filev2p.py
> 
> diff --git a/scripts/filev2p.py b/scripts/filev2p.py
> new file mode 100755
> index 0000000000..3bd7d18b5e
> --- /dev/null
> +++ b/scripts/filev2p.py
> @@ -0,0 +1,311 @@
> +#!/usr/bin/env python3
> +#
> +# Map file virtual offset to the offset on the underlying block device.
> +# Works by parsing 'filefrag' output.
> +#
> +# Copyright (c) 2024 Virtuozzo International GmbH.
> +#
> +# This program is free software; you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation; either version 2 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program.  If not, see <http://www.gnu.org/licenses/>.
> +#
> +
> +import argparse
> +import os
> +import subprocess
> +import re
> +import sys
> +
> +from bisect import bisect_right
> +from collections import namedtuple
> +from dataclasses import dataclass
> +from shutil import which
> +from stat import S_ISBLK
> +
> +
> +Partition = namedtuple('Partition', ['partpath', 'diskpath', 'part_offt'])
> +
> +
> +@dataclass
> +class Extent:
> +    '''Class representing an individual file extent.
> +
> +    This is basically a piece of data within the file which is located
> +    consecutively (i.e. not sparsely) on the underlying block device.
> +    '''

Python docstrings should always be triple double quotes """...""" as per
PEP 257.

Some functions below even use a single single quote because they are on
a single line. They should still use the same convention.

> +
> +    log_start:  int
> +    log_end:    int
> +    phys_start: int
> +    phys_end:   int
> +    length:     int
> +    partition:  Partition
> +
> +    @property
> +    def disk_start(self):
> +        'Number of the first byte of this extent on the whole disk (/dev/sda)'
> +        return self.partition.part_offt + self.phys_start
> +
> +    @property
> +    def disk_end(self):
> +        'Number of the last byte of this extent on the whole disk (/dev/sda)'
> +        return self.partition.part_offt + self.phys_end
> +
> +    def __str__(self):
> +        ischunk = self.log_end > self.log_start
> +        maybe_end = lambda s: f'..{s}' if ischunk else ''
> +        return '%s%s (file)  ->  %s%s (%s)  ->  %s%s (%s)' % (
> +            self.log_start, maybe_end(self.log_end),
> +            self.phys_start, maybe_end(self.phys_end), self.partition.partpath,
> +            self.disk_start, maybe_end(self.disk_end), self.partition.diskpath
> +        )
> +
> +    @classmethod
> +    def ext_slice(cls, bigger_ext, start, end):
> +        '''Constructor for the Extent class from a bigger extent.
> +
> +        Return Extent instance which is a slice of @bigger_ext contained
> +        within the range [start, end].
> +        '''
> +
> +        assert start >= bigger_ext.log_start
> +        assert end <= bigger_ext.log_end
> +
> +        if start == bigger_ext.log_start and end == bigger_ext.log_end:
> +            return bigger_ext
> +
> +        phys_start = bigger_ext.phys_start + (start - bigger_ext.log_start)
> +        phys_end = bigger_ext.phys_end - (bigger_ext.log_end - end)
> +        length = end - start + 1
> +
> +        return cls(start, end, phys_start, phys_end, length,
> +                   bigger_ext.partition)
> +
> +
> +def run_cmd(cmd: str) -> str:
> +    '''Wrapper around subprocess.run.
> +
> +    Returns stdout in case of success, emits en error and exits in case
> +    of failure.
> +    '''
> +
> +    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
> +                          check=False, shell=True)
> +    if proc.stderr is not None:
> +        stderr = f'\n{proc.stderr.decode().strip()}'
> +    else:
> +        stderr = ''
> +
> +    if proc.returncode:
> +        sys.exit(f'Error: Command "{cmd}" returned {proc.returncode}:{stderr}')
> +
> +    return proc.stdout.decode().strip()
> +
> +
> +def parse_size(offset: str) -> int:
> +    'Convert human readable size to bytes'
> +
> +    suffixes = {
> +        **dict.fromkeys(['k', 'K', 'Kb', 'KB', 'KiB'], 2 ** 10),
> +        **dict.fromkeys(['m', 'M', 'Mb', 'MB', 'MiB'], 2 ** 20),
> +        **dict.fromkeys(['g', 'G', 'Gb', 'GB', 'GiB'], 2 ** 30),
> +        **dict.fromkeys(     ['T', 'Tb', 'TB', 'TiB'], 2 ** 40),
> +        **dict.fromkeys([''],                          1)
> +    }
> +
> +    sizematch = re.match(r'^([0-9]+)\s*([a-zA-Z]*)$', offset)
> +    if not bool(sizematch):
> +        sys.exit(f'Error: Couldn\'t parse size "{offset}". Pass offset '
> +                  'either in bytes or in format 1K, 2M, 3G')
> +
> +    num, suff = sizematch.groups()
> +    num = int(num)
> +
> +    mult = suffixes.get(suff)
> +    if mult is None:
> +        sys.exit(f'Error: Couldn\'t parse size "{offset}": '
> +                 f'unknown suffix {suff}')
> +
> +    return num * mult
> +
> +
> +def fpath2part(filename: str) -> str:
> +    'Get partition on which @filename is located (i.e. /dev/sda1).'
> +
> +    partpath = run_cmd(f'df --output=source {filename} | tail -n+2')

Anything passed to a shell (like {filename}) certainly must have proper
quoting applied to avoid shell injections?

> +    if not os.path.exists(partpath) or not S_ISBLK(os.stat(partpath).st_mode):
> +        sys.exit(f'Error: file {filename} is located on {partpath} which '
> +                 'isn\'t a block device')
> +    return partpath
> +
> +
> +def part2dev(partpath: str, filename: str) -> str:
> +    'Get block device on which @partpath is located (i.e. /dev/sda).'
> +    dev = run_cmd(f'lsblk -no PKNAME {partpath}')

Missing quoting here, too.

> +    diskpath = f'/dev/{dev}'
> +    if not os.path.exists(diskpath) or not S_ISBLK(os.stat(diskpath).st_mode):
> +        sys.exit(f'Error: file {filename} is located on {diskpath} which '
> +                 'isn\'t a block device')
> +    return diskpath
> +
> +
> +def part2disktype(partpath: str) -> str:
> +    'Parse /proc/devices and get block device type for @partpath'
> +
> +    major = os.major(os.stat(partpath).st_rdev)
> +    assert major
> +    with open('/proc/devices', encoding='utf-8') as devf:
> +        for line in reversed(list(devf)):
> +            # Our major cannot be absent among block devs
> +            if line.startswith('Block'):
> +                break
> +            devmajor, devtype = line.strip().split()
> +            if int(devmajor) == major:
> +                return devtype
> +
> +    sys.exit('Error: We haven\'t found major {major} in /proc/devices, '
> +             'and that can\'t be')
> +
> +
> +def get_part_offset(part: str, disk: str) -> int:
> +    'Get offset in bytes of the partition @part on the block device @disk.'
> +
> +    lines = run_cmd(f'fdisk -l {disk} | egrep "^(Units|{part})"').splitlines()

And here.

We should probably also match a space after {part} to avoid selecting
other partitions that have {part} as a prefix (like partition 10 when we
want partition 1). I think we would actually always get the wanted one
first, but it would be cleaner to not even have the others in the
output.

> +
> +    unitmatch = re.match('^.* = ([0-9]+) bytes$', lines[0])
> +    if not bool(unitmatch):
> +        sys.exit(f'Error: Couldn\'t parse "fdisk -l" output:\n{lines[0]}')
> +    secsize = int(unitmatch.group(1))
> +
> +    part_offt = int(lines[1].split()[1])
> +    return part_offt * secsize
> +
> +
> +def parse_frag_line(line: str, partition: Partition) -> Extent:
> +    'Construct Extent instance from a "filefrag" output line.'
> +
> +    nums = [int(n) for n in re.findall(r'[0-9]+', line)]
> +
> +    log_start  = nums[1]
> +    log_end    = nums[2]
> +    phys_start = nums[3]
> +    phys_end   = nums[4]
> +    length     = nums[5]
> +
> +    assert log_start < log_end
> +    assert phys_start < phys_end
> +    assert (log_end - log_start + 1) == (phys_end - phys_start + 1) == length
> +
> +    return Extent(log_start, log_end, phys_start, phys_end, length, partition)
> +
> +
> +def preliminary_checks(args: argparse.Namespace) -> None:
> +    'A bunch of checks to emit an error and exit at the earlier stage.'
> +
> +    if which('filefrag') is None:
> +        sys.exit('Error: Program "filefrag" doesn\'t exist')
> +
> +    if not os.path.exists(args.filename):
> +        sys.exit(f'Error: File {args.filename} doesn\'t exist')
> +
> +    args.filesize = os.path.getsize(args.filename)
> +    if args.offset >= args.filesize:
> +        sys.exit(f'Error: Specified offset {args.offset} exceeds '
> +                 f'file size {args.filesize}')
> +    if args.size and (args.offset + args.size > args.filesize):
> +        sys.exit(f'Error: Chunk of size {args.size} at offset '
> +                 f'{args.offset} exceeds file size {args.filesize}')
> +
> +    args.partpath = fpath2part(args.filename)
> +    args.disktype = part2disktype(args.partpath)
> +    if args.disktype not in ('sd', 'virtblk'):
> +        sys.exit(f'Error: Cannot analyze files on {args.disktype} disks')
> +    args.diskpath = part2dev(args.partpath, args.filename)
> +    args.part_offt = get_part_offset(args.partpath, args.diskpath)
> +
> +
> +def get_extent_maps(args: argparse.Namespace) -> list[Extent]:
> +    'Run "filefrag", parse its output and return a list of Extent instances.'
> +
> +    lines = run_cmd(f'filefrag -b1 -v {args.filename}').splitlines()

And the final missing quoting.

> +
> +    ffinfo_re = re.compile('.* is ([0-9]+) .*of ([0-9]+) bytes')
> +    ff_size, ff_block = re.match(ffinfo_re, lines[1]).groups()
> +
> +    # Paranoia checks
> +    if int(ff_size) != args.filesize:
> +        sys.exit('Error: filefrag and os.path.getsize() report different '
> +                 f'sizes: {ff_size} and {args.filesize}')
> +    if int(ff_block) != 1:
> +        sys.exit(f'Error: "filefrag -b1" invoked, but block size is {ff_block}')
> +
> +    partition = Partition(args.partpath, args.diskpath, args.part_offt)
> +
> +    # Fill extents list from the output
> +    extents = []
> +    for line in lines:
> +        if not re.match(r'^\s*[0-9]+:', line):
> +            continue
> +        extents += [parse_frag_line(line, partition)]
> +
> +    chunk_start = args.offset
> +    chunk_end = args.offset + args.size - 1
> +    ext_offsets = [ext.log_start for ext in extents]
> +    start_ind = bisect_right(ext_offsets, chunk_start) - 1
> +    end_ind = bisect_right(ext_offsets, chunk_end) - 1
> +
> +    res_extents = extents[start_ind : end_ind + 1]
> +    for i, ext in enumerate(res_extents):
> +        start = max(chunk_start, ext.log_start)
> +        end = min(chunk_end, ext.log_end)
> +        res_extents[i] = Extent.ext_slice(ext, start, end)
> +
> +    return res_extents
> +
> +
> +def parse_args() -> argparse.Namespace:
> +    'Define program arguments and parse user input.'
> +
> +    parser = argparse.ArgumentParser(description='''
> +Map file offset to physical offset on the block device
> +
> +With --size provided get a list of mappings for the chunk''',
> +    formatter_class=argparse.RawTextHelpFormatter)
> +
> +    parser.add_argument('filename', type=str, help='filename to process')
> +    parser.add_argument('offset', type=str,
> +                        help='logical offset inside the file')
> +    parser.add_argument('-s', '--size', required=False, type=str,
> +                        help='size of the file chunk to get offsets for')
> +    args = parser.parse_args()
> +
> +    args.offset = parse_size(args.offset)
> +    if args.size:
> +        args.size = parse_size(args.size)
> +    else:
> +        # When no chunk size is provided (only offset), it's equivalent to
> +        # chunk size == 1
> +        args.size = 1
> +
> +    return args
> +
> +
> +def main() -> int:
> +    args = parse_args()
> +    preliminary_checks(args)
> +    extents = get_extent_maps(args)
> +    for ext in extents:
> +        print(ext)
> +
> +
> +if __name__ == '__main__':
> +    sys.exit(main())

Kevin



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 2/3] iotests/298: add testcase for async writes with preallocation filter
  2024-08-05 12:04   ` Kevin Wolf
@ 2024-08-05 12:56     ` Andrey Drobyshev
  2024-08-05 13:50       ` Kevin Wolf
  0 siblings, 1 reply; 14+ messages in thread
From: Andrey Drobyshev @ 2024-08-05 12:56 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: qemu-block, qemu-devel, hreitz, vsementsov, pbonzini, eesposit,
	den

On 8/5/24 3:04 PM, Kevin Wolf wrote:
> Am 16.07.2024 um 16:41 hat Andrey Drobyshev geschrieben:
>> The testcase simply creates a 64G image with 1M clusters, generates a list
>> of 1M aligned offsets and feeds aio_write commands with those offsets to
>> qemu-io run with '--aio native --nocache'.  Then we check the data
>> written at each of the offsets.  Before the previous commit this could
>> result into a race within the preallocation filter which would zeroize
>> some clusters after actually writing data to them.
>>
>> Note: the test doesn't fail in 100% cases as there's a race involved,
>> but the failures are pretty consistent so it should be good enough for
>> detecting the problem.
>>
>> Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
> 
> I left it running in a loop for a while, but couldn't reproduce the bug
> with this test.
> 

Hmmm, it seems to have stopped failing on my setup as well, no matter
how I increase the number of requests.  And it seems to be related to
the interleaving 'write-zeroes' requests.  My initial attempt was to
cover the case described by Vladimir here:
https://lists.nongnu.org/archive/html/qemu-block/2024-07/msg00415.html
Maybe we just leave it and try reproducing the corruption with just
regular write requests?  At least with this version it seems to be
failing pretty stably on my setup:

> +    def test_prealloc_async_writes(self):
> +        requests = 2048 # Number of write/read requests to feed to qemu-io
> +        total_clusters = 64 * 1024 # 64G / 1M
> +
> +        offsets = random.sample(range(0, total_clusters), requests)
> +        aio_write_cmds = [f'aio_write -P 0xaa  {off}M 1M' for off in offsets]
> +        read_cmds = [f'read -P 0xaa {off}M 1M' for off in offsets]
> +
> +        proc = iotests.QemuIoInteractive('--aio', 'native', '--nocache',
> +                                         '--image-opts', drive_opts)
> +        for cmd in aio_write_cmds:
> +            proc.cmd(cmd)
> +        proc.close()
> +
> +        proc = iotests.QemuIoInteractive('-f', iotests.imgfmt, disk)
> +        for cmd in read_cmds:
> +            out = proc.cmd(cmd)
> +            self.assertFalse('Pattern verification failed' in str(out))
> +        proc.close()
> +



>>  tests/qemu-iotests/298     | 49 ++++++++++++++++++++++++++++++++++++++
>>  tests/qemu-iotests/298.out |  4 ++--
>>  2 files changed, 51 insertions(+), 2 deletions(-)
>>
>> diff --git a/tests/qemu-iotests/298 b/tests/qemu-iotests/298
>> index 09c9290711..b7126e9e15 100755
>> --- a/tests/qemu-iotests/298
>> +++ b/tests/qemu-iotests/298
>> @@ -20,8 +20,10 @@
>>  
>>  import os
>>  import iotests
>> +import random
>>  
>>  MiB = 1024 * 1024
>> +GiB = MiB * 1024
>>  disk = os.path.join(iotests.test_dir, 'disk')
>>  overlay = os.path.join(iotests.test_dir, 'overlay')
>>  refdisk = os.path.join(iotests.test_dir, 'refdisk')
>> @@ -176,5 +178,52 @@ class TestTruncate(iotests.QMPTestCase):
>>          self.do_test('off', '150M')
>>  
>>  
>> +class TestPreallocAsyncWrites(iotests.QMPTestCase):
>> +    def setUp(self):
>> +        # Make sure we get reproducible write patterns on each run
>> +        random.seed(42)
>> +        iotests.qemu_img_create('-f', iotests.imgfmt, disk, '-o',
>> +                                f'cluster_size={MiB},lazy_refcounts=on',
>> +                                str(64 * GiB))
>> +
>> +    def tearDown(self):
>> +        os.remove(disk)
>> +
>> +    def test_prealloc_async_writes(self):
>> +        def gen_write_pattern():
>> +            n = 0
>> +            while True:
>> +                yield '-P 0xaa' if n else '-z'
>> +                n = 1 - n
> 
> This looks like a complicated way to write the following?
> 
>     # Alternate between write_zeroes and writing data
>     def gen_write_pattern():
>         while True:
>             yield '-z'
>             yield '-P 0xaa'
>

Agreed, thank you :)  Won't need this chunk though if we end up adopting
the version I posted above.

>> +        def gen_read_pattern():
>> +            n = 0
>> +            while True:
>> +                yield '-P 0xaa' if n else '-P 0x00'
>> +                n = 1 - n
> 
> Same here.
> 
> Kevin
> 



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 3/3] scripts: add filev2p.py script for mapping virtual file offsets mapping
  2024-08-05 12:29   ` Kevin Wolf
@ 2024-08-05 13:02     ` Andrey Drobyshev
  0 siblings, 0 replies; 14+ messages in thread
From: Andrey Drobyshev @ 2024-08-05 13:02 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: qemu-block, qemu-devel, hreitz, vsementsov, pbonzini, eesposit,
	den

On 8/5/24 3:29 PM, Kevin Wolf wrote:
> Am 16.07.2024 um 16:41 hat Andrey Drobyshev geschrieben:
>> The script is basically a wrapper around "filefrag" utility.  This might
>> be used to map virtual offsets within the file to the underlying block
>> device offsets.  In addition, a chunk size might be specified, in which
>> case a list of such mappings will be obtained:
>>
>> $ scripts/filev2p.py -s 100M /sparsefile 1768M
>> 1853882368..1895825407 (file)  ->  16332619776..16374562815 (/dev/sda4)  ->  84492156928..84534099967 (/dev/sda)
>> 1895825408..1958739967 (file)  ->  17213591552..17276506111 (/dev/sda4)  ->  85373128704..85436043263 (/dev/sda)
>>
>> This could come in handy when we need to map a certain piece of data
>> within a file inside VM to the same data within the image on the host
>> (e.g. physical offset on VM's /dev/sda would be the virtual offset
>> within QCOW2 image).
>>
>> Note: as of now the script only works with the files located on plain
>> partitions, i.e. it doesn't work with partitions built on top of LVM.
>> Partitions on LVM would require another level of mapping.
>>
>> Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
>> ---
>>  scripts/filev2p.py | 311 +++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 311 insertions(+)
>>  create mode 100755 scripts/filev2p.py
>>
>> diff --git a/scripts/filev2p.py b/scripts/filev2p.py
>> new file mode 100755
>> index 0000000000..3bd7d18b5e
>> --- /dev/null
>> +++ b/scripts/filev2p.py
>> @@ -0,0 +1,311 @@
>> +#!/usr/bin/env python3
>> +#
>> +# Map file virtual offset to the offset on the underlying block device.
>> +# Works by parsing 'filefrag' output.
>> +#
>> +# Copyright (c) 2024 Virtuozzo International GmbH.
>> +#
>> +# This program is free software; you can redistribute it and/or modify
>> +# it under the terms of the GNU General Public License as published by
>> +# the Free Software Foundation; either version 2 of the License, or
>> +# (at your option) any later version.
>> +#
>> +# This program is distributed in the hope that it will be useful,
>> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> +# GNU General Public License for more details.
>> +#
>> +# You should have received a copy of the GNU General Public License
>> +# along with this program.  If not, see <http://www.gnu.org/licenses/>.
>> +#
>> +
>> +import argparse
>> +import os
>> +import subprocess
>> +import re
>> +import sys
>> +
>> +from bisect import bisect_right
>> +from collections import namedtuple
>> +from dataclasses import dataclass
>> +from shutil import which
>> +from stat import S_ISBLK
>> +
>> +
>> +Partition = namedtuple('Partition', ['partpath', 'diskpath', 'part_offt'])
>> +
>> +
>> +@dataclass
>> +class Extent:
>> +    '''Class representing an individual file extent.
>> +
>> +    This is basically a piece of data within the file which is located
>> +    consecutively (i.e. not sparsely) on the underlying block device.
>> +    '''
> 
> Python docstrings should always be triple double quotes """...""" as per
> PEP 257.
> 
> Some functions below even use a single single quote because they are on
> a single line. They should still use the same convention.
> 
>> +
>> +    log_start:  int
>> +    log_end:    int
>> +    phys_start: int
>> +    phys_end:   int
>> +    length:     int
>> +    partition:  Partition
>> +
>> +    @property
>> +    def disk_start(self):
>> +        'Number of the first byte of this extent on the whole disk (/dev/sda)'
>> +        return self.partition.part_offt + self.phys_start
>> +
>> +    @property
>> +    def disk_end(self):
>> +        'Number of the last byte of this extent on the whole disk (/dev/sda)'
>> +        return self.partition.part_offt + self.phys_end
>> +
>> +    def __str__(self):
>> +        ischunk = self.log_end > self.log_start
>> +        maybe_end = lambda s: f'..{s}' if ischunk else ''
>> +        return '%s%s (file)  ->  %s%s (%s)  ->  %s%s (%s)' % (
>> +            self.log_start, maybe_end(self.log_end),
>> +            self.phys_start, maybe_end(self.phys_end), self.partition.partpath,
>> +            self.disk_start, maybe_end(self.disk_end), self.partition.diskpath
>> +        )
>> +
>> +    @classmethod
>> +    def ext_slice(cls, bigger_ext, start, end):
>> +        '''Constructor for the Extent class from a bigger extent.
>> +
>> +        Return Extent instance which is a slice of @bigger_ext contained
>> +        within the range [start, end].
>> +        '''
>> +
>> +        assert start >= bigger_ext.log_start
>> +        assert end <= bigger_ext.log_end
>> +
>> +        if start == bigger_ext.log_start and end == bigger_ext.log_end:
>> +            return bigger_ext
>> +
>> +        phys_start = bigger_ext.phys_start + (start - bigger_ext.log_start)
>> +        phys_end = bigger_ext.phys_end - (bigger_ext.log_end - end)
>> +        length = end - start + 1
>> +
>> +        return cls(start, end, phys_start, phys_end, length,
>> +                   bigger_ext.partition)
>> +
>> +
>> +def run_cmd(cmd: str) -> str:
>> +    '''Wrapper around subprocess.run.
>> +
>> +    Returns stdout in case of success, emits en error and exits in case
>> +    of failure.
>> +    '''
>> +
>> +    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
>> +                          check=False, shell=True)
>> +    if proc.stderr is not None:
>> +        stderr = f'\n{proc.stderr.decode().strip()}'
>> +    else:
>> +        stderr = ''
>> +
>> +    if proc.returncode:
>> +        sys.exit(f'Error: Command "{cmd}" returned {proc.returncode}:{stderr}')
>> +
>> +    return proc.stdout.decode().strip()
>> +
>> +
>> +def parse_size(offset: str) -> int:
>> +    'Convert human readable size to bytes'
>> +
>> +    suffixes = {
>> +        **dict.fromkeys(['k', 'K', 'Kb', 'KB', 'KiB'], 2 ** 10),
>> +        **dict.fromkeys(['m', 'M', 'Mb', 'MB', 'MiB'], 2 ** 20),
>> +        **dict.fromkeys(['g', 'G', 'Gb', 'GB', 'GiB'], 2 ** 30),
>> +        **dict.fromkeys(     ['T', 'Tb', 'TB', 'TiB'], 2 ** 40),
>> +        **dict.fromkeys([''],                          1)
>> +    }
>> +
>> +    sizematch = re.match(r'^([0-9]+)\s*([a-zA-Z]*)$', offset)
>> +    if not bool(sizematch):
>> +        sys.exit(f'Error: Couldn\'t parse size "{offset}". Pass offset '
>> +                  'either in bytes or in format 1K, 2M, 3G')
>> +
>> +    num, suff = sizematch.groups()
>> +    num = int(num)
>> +
>> +    mult = suffixes.get(suff)
>> +    if mult is None:
>> +        sys.exit(f'Error: Couldn\'t parse size "{offset}": '
>> +                 f'unknown suffix {suff}')
>> +
>> +    return num * mult
>> +
>> +
>> +def fpath2part(filename: str) -> str:
>> +    'Get partition on which @filename is located (i.e. /dev/sda1).'
>> +
>> +    partpath = run_cmd(f'df --output=source {filename} | tail -n+2')
> 
> Anything passed to a shell (like {filename}) certainly must have proper
> quoting applied to avoid shell injections?
> 
>> +    if not os.path.exists(partpath) or not S_ISBLK(os.stat(partpath).st_mode):
>> +        sys.exit(f'Error: file {filename} is located on {partpath} which '
>> +                 'isn\'t a block device')
>> +    return partpath
>> +
>> +
>> +def part2dev(partpath: str, filename: str) -> str:
>> +    'Get block device on which @partpath is located (i.e. /dev/sda).'
>> +    dev = run_cmd(f'lsblk -no PKNAME {partpath}')
> 
> Missing quoting here, too.
> 
>> +    diskpath = f'/dev/{dev}'
>> +    if not os.path.exists(diskpath) or not S_ISBLK(os.stat(diskpath).st_mode):
>> +        sys.exit(f'Error: file {filename} is located on {diskpath} which '
>> +                 'isn\'t a block device')
>> +    return diskpath
>> +
>> +
>> +def part2disktype(partpath: str) -> str:
>> +    'Parse /proc/devices and get block device type for @partpath'
>> +
>> +    major = os.major(os.stat(partpath).st_rdev)
>> +    assert major
>> +    with open('/proc/devices', encoding='utf-8') as devf:
>> +        for line in reversed(list(devf)):
>> +            # Our major cannot be absent among block devs
>> +            if line.startswith('Block'):
>> +                break
>> +            devmajor, devtype = line.strip().split()
>> +            if int(devmajor) == major:
>> +                return devtype
>> +
>> +    sys.exit('Error: We haven\'t found major {major} in /proc/devices, '
>> +             'and that can\'t be')
>> +
>> +
>> +def get_part_offset(part: str, disk: str) -> int:
>> +    'Get offset in bytes of the partition @part on the block device @disk.'
>> +
>> +    lines = run_cmd(f'fdisk -l {disk} | egrep "^(Units|{part})"').splitlines()
> 
> And here.
> 
> We should probably also match a space after {part} to avoid selecting
> other partitions that have {part} as a prefix (like partition 10 when we
> want partition 1). I think we would actually always get the wanted one
> first, but it would be cleaner to not even have the others in the
> output.
> 
>> +
>> +    unitmatch = re.match('^.* = ([0-9]+) bytes$', lines[0])
>> +    if not bool(unitmatch):
>> +        sys.exit(f'Error: Couldn\'t parse "fdisk -l" output:\n{lines[0]}')
>> +    secsize = int(unitmatch.group(1))
>> +
>> +    part_offt = int(lines[1].split()[1])
>> +    return part_offt * secsize
>> +
>> +
>> +def parse_frag_line(line: str, partition: Partition) -> Extent:
>> +    'Construct Extent instance from a "filefrag" output line.'
>> +
>> +    nums = [int(n) for n in re.findall(r'[0-9]+', line)]
>> +
>> +    log_start  = nums[1]
>> +    log_end    = nums[2]
>> +    phys_start = nums[3]
>> +    phys_end   = nums[4]
>> +    length     = nums[5]
>> +
>> +    assert log_start < log_end
>> +    assert phys_start < phys_end
>> +    assert (log_end - log_start + 1) == (phys_end - phys_start + 1) == length
>> +
>> +    return Extent(log_start, log_end, phys_start, phys_end, length, partition)
>> +
>> +
>> +def preliminary_checks(args: argparse.Namespace) -> None:
>> +    'A bunch of checks to emit an error and exit at the earlier stage.'
>> +
>> +    if which('filefrag') is None:
>> +        sys.exit('Error: Program "filefrag" doesn\'t exist')
>> +
>> +    if not os.path.exists(args.filename):
>> +        sys.exit(f'Error: File {args.filename} doesn\'t exist')
>> +
>> +    args.filesize = os.path.getsize(args.filename)
>> +    if args.offset >= args.filesize:
>> +        sys.exit(f'Error: Specified offset {args.offset} exceeds '
>> +                 f'file size {args.filesize}')
>> +    if args.size and (args.offset + args.size > args.filesize):
>> +        sys.exit(f'Error: Chunk of size {args.size} at offset '
>> +                 f'{args.offset} exceeds file size {args.filesize}')
>> +
>> +    args.partpath = fpath2part(args.filename)
>> +    args.disktype = part2disktype(args.partpath)
>> +    if args.disktype not in ('sd', 'virtblk'):
>> +        sys.exit(f'Error: Cannot analyze files on {args.disktype} disks')
>> +    args.diskpath = part2dev(args.partpath, args.filename)
>> +    args.part_offt = get_part_offset(args.partpath, args.diskpath)
>> +
>> +
>> +def get_extent_maps(args: argparse.Namespace) -> list[Extent]:
>> +    'Run "filefrag", parse its output and return a list of Extent instances.'
>> +
>> +    lines = run_cmd(f'filefrag -b1 -v {args.filename}').splitlines()
> 
> And the final missing quoting.
> 
>> +
>> +    ffinfo_re = re.compile('.* is ([0-9]+) .*of ([0-9]+) bytes')
>> +    ff_size, ff_block = re.match(ffinfo_re, lines[1]).groups()
>> +
>> +    # Paranoia checks
>> +    if int(ff_size) != args.filesize:
>> +        sys.exit('Error: filefrag and os.path.getsize() report different '
>> +                 f'sizes: {ff_size} and {args.filesize}')
>> +    if int(ff_block) != 1:
>> +        sys.exit(f'Error: "filefrag -b1" invoked, but block size is {ff_block}')
>> +
>> +    partition = Partition(args.partpath, args.diskpath, args.part_offt)
>> +
>> +    # Fill extents list from the output
>> +    extents = []
>> +    for line in lines:
>> +        if not re.match(r'^\s*[0-9]+:', line):
>> +            continue
>> +        extents += [parse_frag_line(line, partition)]
>> +
>> +    chunk_start = args.offset
>> +    chunk_end = args.offset + args.size - 1
>> +    ext_offsets = [ext.log_start for ext in extents]
>> +    start_ind = bisect_right(ext_offsets, chunk_start) - 1
>> +    end_ind = bisect_right(ext_offsets, chunk_end) - 1
>> +
>> +    res_extents = extents[start_ind : end_ind + 1]
>> +    for i, ext in enumerate(res_extents):
>> +        start = max(chunk_start, ext.log_start)
>> +        end = min(chunk_end, ext.log_end)
>> +        res_extents[i] = Extent.ext_slice(ext, start, end)
>> +
>> +    return res_extents
>> +
>> +
>> +def parse_args() -> argparse.Namespace:
>> +    'Define program arguments and parse user input.'
>> +
>> +    parser = argparse.ArgumentParser(description='''
>> +Map file offset to physical offset on the block device
>> +
>> +With --size provided get a list of mappings for the chunk''',
>> +    formatter_class=argparse.RawTextHelpFormatter)
>> +
>> +    parser.add_argument('filename', type=str, help='filename to process')
>> +    parser.add_argument('offset', type=str,
>> +                        help='logical offset inside the file')
>> +    parser.add_argument('-s', '--size', required=False, type=str,
>> +                        help='size of the file chunk to get offsets for')
>> +    args = parser.parse_args()
>> +
>> +    args.offset = parse_size(args.offset)
>> +    if args.size:
>> +        args.size = parse_size(args.size)
>> +    else:
>> +        # When no chunk size is provided (only offset), it's equivalent to
>> +        # chunk size == 1
>> +        args.size = 1
>> +
>> +    return args
>> +
>> +
>> +def main() -> int:
>> +    args = parse_args()
>> +    preliminary_checks(args)
>> +    extents = get_extent_maps(args)
>> +    for ext in extents:
>> +        print(ext)
>> +
>> +
>> +if __name__ == '__main__':
>> +    sys.exit(main())
> 
> Kevin
> 

Agreed on all the above comments, thank you.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 2/3] iotests/298: add testcase for async writes with preallocation filter
  2024-08-05 12:56     ` Andrey Drobyshev
@ 2024-08-05 13:50       ` Kevin Wolf
  0 siblings, 0 replies; 14+ messages in thread
From: Kevin Wolf @ 2024-08-05 13:50 UTC (permalink / raw)
  To: Andrey Drobyshev
  Cc: qemu-block, qemu-devel, hreitz, vsementsov, pbonzini, eesposit,
	den

Am 05.08.2024 um 14:56 hat Andrey Drobyshev geschrieben:
> On 8/5/24 3:04 PM, Kevin Wolf wrote:
> > Am 16.07.2024 um 16:41 hat Andrey Drobyshev geschrieben:
> >> The testcase simply creates a 64G image with 1M clusters, generates a list
> >> of 1M aligned offsets and feeds aio_write commands with those offsets to
> >> qemu-io run with '--aio native --nocache'.  Then we check the data
> >> written at each of the offsets.  Before the previous commit this could
> >> result into a race within the preallocation filter which would zeroize
> >> some clusters after actually writing data to them.
> >>
> >> Note: the test doesn't fail in 100% cases as there's a race involved,
> >> but the failures are pretty consistent so it should be good enough for
> >> detecting the problem.
> >>
> >> Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
> > 
> > I left it running in a loop for a while, but couldn't reproduce the bug
> > with this test.
> > 
> 
> Hmmm, it seems to have stopped failing on my setup as well, no matter
> how I increase the number of requests.  And it seems to be related to
> the interleaving 'write-zeroes' requests.  My initial attempt was to
> cover the case described by Vladimir here:
> https://lists.nongnu.org/archive/html/qemu-block/2024-07/msg00415.html
> Maybe we just leave it and try reproducing the corruption with just
> regular write requests?  At least with this version it seems to be
> failing pretty stably on my setup:
> 
> > +    def test_prealloc_async_writes(self):
> > +        requests = 2048 # Number of write/read requests to feed to qemu-io
> > +        total_clusters = 64 * 1024 # 64G / 1M
> > +
> > +        offsets = random.sample(range(0, total_clusters), requests)
> > +        aio_write_cmds = [f'aio_write -P 0xaa  {off}M 1M' for off in offsets]
> > +        read_cmds = [f'read -P 0xaa {off}M 1M' for off in offsets]
> > +
> > +        proc = iotests.QemuIoInteractive('--aio', 'native', '--nocache',
> > +                                         '--image-opts', drive_opts)
> > +        for cmd in aio_write_cmds:
> > +            proc.cmd(cmd)
> > +        proc.close()
> > +
> > +        proc = iotests.QemuIoInteractive('-f', iotests.imgfmt, disk)
> > +        for cmd in read_cmds:
> > +            out = proc.cmd(cmd)
> > +            self.assertFalse('Pattern verification failed' in str(out))
> > +        proc.close()
> > +

This doesn't seem to fail for me either. Neither on tmpfs nor in my home
directory (which is XFS on an encrypted LVM volume).

Are you using some more complicated setup than "./check -qcow2 298"?

Do you think we could use blkdebug to deterministically trigger the case
instead of trying to brute force it? If we can suspend the write_zeroes
request on the child node of preallocate, I think that's all we need to
reproduce the bug as described in the commit message of the fix.

Kevin



^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2024-08-05 13:51 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-07-16 14:41 [PATCH v3 0/3] Fix data corruption within preallocation Andrey Drobyshev
2024-07-16 14:41 ` [PATCH v3 1/3] block: zero data data corruption using prealloc-filter Andrey Drobyshev
2024-07-18 15:51   ` Kevin Wolf
2024-07-18 15:52     ` Denis V. Lunev
2024-07-18 19:46     ` Denis V. Lunev
2024-08-05 11:59       ` Kevin Wolf
2024-08-05 12:13         ` Denis V. Lunev
2024-07-16 14:41 ` [PATCH v3 2/3] iotests/298: add testcase for async writes with preallocation filter Andrey Drobyshev
2024-08-05 12:04   ` Kevin Wolf
2024-08-05 12:56     ` Andrey Drobyshev
2024-08-05 13:50       ` Kevin Wolf
2024-07-16 14:41 ` [PATCH v3 3/3] scripts: add filev2p.py script for mapping virtual file offsets mapping Andrey Drobyshev
2024-08-05 12:29   ` Kevin Wolf
2024-08-05 13:02     ` Andrey Drobyshev

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).