From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Kevin Wolf <kwolf@redhat.com>,
Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>,
Alberto Garcia <berto@igalia.com>,
qemu-block@nongnu.org, qemu-devel@nongnu.org,
Max Reitz <mreitz@redhat.com>,
linux-xfs@vger.kernel.org
Subject: Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster
Date: Fri, 21 Aug 2020 07:05:06 -0400
Message-ID: <20200821110506.GB212879@bfoster>
In-Reply-To: <20200820215811.GC7941@dread.disaster.area>

On Fri, Aug 21, 2020 at 07:58:11AM +1000, Dave Chinner wrote:
> On Thu, Aug 20, 2020 at 10:03:10PM +0200, Alberto Garcia wrote:
> > Cc: linux-xfs
> >
> > On Wed 19 Aug 2020 07:53:00 PM CEST, Brian Foster wrote:
> > > In any event, if you're seeing unclear or unexpected performance
> > > deltas between certain XFS configurations or other fs', I think the
> > > best thing to do is post a more complete description of the workload,
> > > filesystem/storage setup, and test results to the linux-xfs mailing
> > > list (feel free to cc me as well). As it is, aside from the questions
> > > above, it's not really clear to me what the storage stack looks like
> > > for this test, if/how qcow2 is involved, what the various
> > > 'preallocation=' modes actually mean, etc.
> >
> > (see [1] for a bit of context)
> >
> > I repeated the tests with a larger (125GB) filesystem. Things are a bit
> > faster but not radically different, here are the new numbers:
> >
> > |----------------------+-------+-------|
> > | preallocation mode | xfs | ext4 |
> > |----------------------+-------+-------|
> > | off | 8139 | 11688 |
> > | off (w/o ZERO_RANGE) | 2965 | 2780 |
> > | metadata | 7768 | 9132 |
> > | falloc | 7742 | 13108 |
> > | full | 41389 | 16351 |
> > |----------------------+-------+-------|
> >
> > The numbers are I/O operations per second as reported by fio, running
> > inside a VM.
> >
> > The VM is running Debian 9.7 with Linux 4.9.130 and the fio version is
> > 2.16-1. I'm using QEMU 5.1.0.
> >
> > fio is sending random 4KB write requests to a 25GB virtual drive; this
> > is the full command line:
> >
> > fio --filename=/dev/vdb --direct=1 --randrepeat=1 --eta=always
> > --ioengine=libaio --iodepth=32 --numjobs=1 --name=test --size=25G
> > --io_limit=25G --ramp_time=5 --rw=randwrite --bs=4k --runtime=60
> >
> > The virtual drive (/dev/vdb) is a freshly created qcow2 file stored on
> > the host (on an xfs or ext4 filesystem as the table above shows), and
> > it is attached to QEMU using a virtio-blk-pci device:
> >
> > -drive if=virtio,file=image.qcow2,cache=none,l2-cache-size=200M
>
> You're not using AIO on this image file, so it can't do
> concurrent IO. What happens when you add "aio=native" to this?
>
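(For reference, and untested on my end, that would just mean adding
aio=native to the existing -drive options, e.g.:

  -drive if=virtio,file=image.qcow2,cache=none,aio=native,l2-cache-size=200M

IIRC aio=native requires cache.direct=on, i.e. O_DIRECT, which
cache=none already implies, so the rest can stay as-is.)
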
> > cache=none means that the image is opened with O_DIRECT and
> > l2-cache-size is large enough so QEMU is able to cache all the
> > relevant qcow2 metadata in memory.
>
> What happens when you just use a sparse file (i.e. a raw image) with
> aio=native instead of using qcow2? XFS, ext4, btrfs, etc all support
> sparse files so using qcow2 to provide sparse image file support is
> largely an unnecessary layer of indirection and overhead...
>
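(If anyone wants to reproduce with a raw image, something like the
following should do it -- names and sizes are made up, and either
command produces a sparse file:

  $ qemu-img create -f raw image.raw 25G
  (or: truncate -s 25G image.raw)

  -drive if=virtio,file=image.raw,format=raw,cache=none,aio=native

Untested here; format=raw just avoids QEMU's format-probing warning.)
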
> And with XFS, you don't need qcow2 for snapshots either because you
> can use reflink copies to take an atomic copy-on-write snapshot of
> the raw image file... (assuming you made the xfs filesystem with
> reflink support (which is the TOT default now)).
>
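(FWIW, the reflink snapshot workflow is only a couple of commands,
assuming the filesystem was created with reflink enabled -- the paths
below are made up:

  $ mkfs.xfs -m reflink=1 /dev/sdX   # the default on recent xfsprogs
  $ cp --reflink=always golden.raw vm1.raw

cp --reflink=always fails rather than falling back to a full copy if
reflink isn't available, which makes misconfiguration obvious.)
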
> I've been using raw sparse files on XFS for all my VMs for over a
> decade now, and using reflink to create COW copies of golden
> image files when deploying new VMs for a couple of years now...
>
> > The host is running Linux 4.19.132 and has an SSD drive.
> >
> > About the preallocation modes: a qcow2 file is divided into clusters
> > of the same size (64KB in this case). That is the minimum unit of
> > allocation, so when writing 4KB to an unallocated cluster QEMU needs
> > to fill the other 60KB with zeroes. So here's what happens with the
> > different modes:
>
> Which is something that sparse files on filesystems do not need to
> do. If, on XFS, you really want 64kB allocation clusters, use an
> extent size hint of 64kB. Though for image files, I highly recommend
> using 1MB or larger extent size hints.
>
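(For completeness, an extent size hint can be set on an empty image
file with xfs_io before it is used -- untested example, hypothetical
path, size as suggested above:

  $ xfs_io -c "extsize 1m" /path/to/image.raw

The hint only affects allocations made after it is set, so it should be
applied before the guest starts writing.)
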
>
> > 1) off: for every write request QEMU initializes the cluster (64KB)
> > with fallocate(ZERO_RANGE) and then writes the 4KB of data.
> >
> > 2) off w/o ZERO_RANGE: QEMU writes the 4KB of data and fills the rest
> > of the cluster with zeroes.
> >
> > 3) metadata: all clusters were allocated when the image was created
> > but they are sparse; QEMU only writes the 4KB of data.
> >
> > 4) falloc: all clusters were allocated with fallocate() when the image
> > was created; QEMU only writes 4KB of data.
> >
> > 5) full: all clusters were allocated by writing zeroes to all of them
> > when the image was created; QEMU only writes 4KB of data.
> >
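(For anyone reproducing this, and if I'm reading the descriptions
right: modes 3-5 map to qemu-img's preallocation option at image
creation time, e.g.

  $ qemu-img create -f qcow2 -o preallocation=falloc image.qcow2 25G

with preallocation=off|metadata|falloc|full selecting between them.
Modes 1 and 2 differ only in how QEMU initialises clusters at write
time, not in how the image is created.)
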
> > As I said in a previous message I'm not familiar with xfs, but the
> > parts that I don't understand are
> >
> > - Why is (4) slower than (1)?
>
> Because fallocate() is a full IO serialisation barrier at the
> filesystem level. If you do:
>
> fallocate(whole file)
> <IO>
> <IO>
> <IO>
> .....
>
> The IO can run concurrently and does not serialise against anything in
> the filesystem except unwritten extent conversions at IO completion
> (see answer to next question!)
>
> However, if you just use (4) you get:
>
> falloc(64k)
> <wait for inflight IO to complete>
> <allocates 64k as unwritten>
> <4k io>
> ....
> falloc(64k)
> <wait for inflight IO to complete>
> ....
> <4k IO completes, converts 4k to written>
> <allocates 64k as unwritten>
> <4k io>
> falloc(64k)
> <wait for inflight IO to complete>
> ....
> <4k IO completes, converts 4k to written>
> <allocates 64k as unwritten>
> <4k io>
> ....
>

Option 4 is described above as initial file preallocation whereas option
1 is per 64k cluster prealloc. Prealloc mode mixup aside, Berto is
reporting that the initial file preallocation mode is slower than the
per cluster prealloc mode. Berto, am I following that right?

Brian

> until all the clusters in the qcow2 file are initialised. IOWs, each
> fallocate() call serialises all IO in flight. Compare that to using
> extent size hints on a raw sparse image file for the same thing:
>
> <set 64k extent size hint>
> <4k IO>
> <allocates 64k as unwritten>
> ....
> <4k IO>
> <allocates 64k as unwritten>
> ....
> <4k IO>
> <allocates 64k as unwritten>
> ....
> ...
> <4k IO completes, converts 4k to written>
> <4k IO completes, converts 4k to written>
> <4k IO completes, converts 4k to written>
> ....
>
> See the difference in IO pipelining here? You get the same "64kB
> cluster initialised at a time" behaviour as qcow2, but you don't get
> the IO pipeline stalls caused by fallocate() having to drain all the
> IO in flight before it does the allocation.
>
> > - Why is (5) so much faster than everything else?
>
> The full file allocation in (5) means the IO doesn't have to modify
> the extent map, hence all extent mapping uses shared locking and
> the entire IO path can run concurrently without serialisation at
> all.
>
> Thing is, once your writes into sparse image files regularly start
> hitting written extents, the performance of (1), (2) and (4) will
> trend towards (5) as writes hit already allocated ranges of the file
> and the serialisation of extent mapping changes goes away. This
> occurs with guest filesystems that perform overwrite in place (such
> as XFS) and hence overwrites of existing data will hit allocated
> space in the image file and not require further allocation.
>
> IOWs, typical "write once" benchmark testing indicates the *worst*
> performance you are going to see. As the guest filesystem ages and
> initialises more of the underlying image file, it will get faster,
> not slower.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>