qemu-devel.nongnu.org archive mirror
From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Kevin Wolf <kwolf@redhat.com>,
	Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>,
	Alberto Garcia <berto@igalia.com>,
	qemu-block@nongnu.org, qemu-devel@nongnu.org,
	Max Reitz <mreitz@redhat.com>,
	linux-xfs@vger.kernel.org
Subject: Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster
Date: Fri, 21 Aug 2020 07:05:06 -0400	[thread overview]
Message-ID: <20200821110506.GB212879@bfoster> (raw)
In-Reply-To: <20200820215811.GC7941@dread.disaster.area>

On Fri, Aug 21, 2020 at 07:58:11AM +1000, Dave Chinner wrote:
> On Thu, Aug 20, 2020 at 10:03:10PM +0200, Alberto Garcia wrote:
> > Cc: linux-xfs
> > 
> > On Wed 19 Aug 2020 07:53:00 PM CEST, Brian Foster wrote:
> > > In any event, if you're seeing unclear or unexpected performance
> > > deltas between certain XFS configurations or other fs', I think the
> > > best thing to do is post a more complete description of the workload,
> > > filesystem/storage setup, and test results to the linux-xfs mailing
> > > list (feel free to cc me as well). As it is, aside from the questions
> > > above, it's not really clear to me what the storage stack looks like
> > > for this test, if/how qcow2 is involved, what the various
> > > 'preallocation=' modes actually mean, etc.
> > 
> > (see [1] for a bit of context)
> > 
> > I repeated the tests with a larger (125GB) filesystem. Things are a bit
> > faster but not radically different, here are the new numbers:
> > 
> > |----------------------+-------+-------|
> > | preallocation mode   |   xfs |  ext4 |
> > |----------------------+-------+-------|
> > | off                  |  8139 | 11688 |
> > | off (w/o ZERO_RANGE) |  2965 |  2780 |
> > | metadata             |  7768 |  9132 |
> > | falloc               |  7742 | 13108 |
> > | full                 | 41389 | 16351 |
> > |----------------------+-------+-------|
> > 
> > The numbers are I/O operations per second as reported by fio, running
> > inside a VM.
> > 
> > The VM is running Debian 9.7 with Linux 4.9.130 and the fio version is
> > 2.16-1. I'm using QEMU 5.1.0.
> > 
> > fio is sending random 4KB write requests to a 25GB virtual drive; this
> > is the full command line:
> > 
> > fio --filename=/dev/vdb --direct=1 --randrepeat=1 --eta=always
> >     --ioengine=libaio --iodepth=32 --numjobs=1 --name=test --size=25G
> >     --io_limit=25G --ramp_time=5 --rw=randwrite --bs=4k --runtime=60
> >   
> > The virtual drive (/dev/vdb) is a freshly created qcow2 file stored on
> > the host (on an xfs or ext4 filesystem as the table above shows), and
> > it is attached to QEMU using a virtio-blk-pci device:
> > 
> >    -drive if=virtio,file=image.qcow2,cache=none,l2-cache-size=200M
> 
> You're not using AIO on this image file, so it can't do
> concurrent IO. What happens when you add "aio=native" to this?
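(FWIW, if you want to try that it should just be a matter of appending the
option to the same -drive string, e.g.:

    -drive if=virtio,file=image.qcow2,cache=none,aio=native,l2-cache-size=200M

... native AIO wants O_DIRECT, which cache=none already provides here.)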
> 
> > cache=none means that the image is opened with O_DIRECT and
> > l2-cache-size is large enough that QEMU can cache all the
> > relevant qcow2 metadata in memory.
> 
> What happens when you just use a sparse file (i.e. a raw image) with
> aio=native instead of using qcow2? XFS, ext4, btrfs, etc all support
> sparse files so using qcow2 to provide sparse image file support is
> largely an unnecessary layer of indirection and overhead...
> 
> And with XFS, you don't need qcow2 for snapshots either because you
> can use reflink copies to take an atomic copy-on-write snapshot of
> the raw image file... (assuming you made the xfs filesystem with
> reflink support (which is the TOT default now)).
> 
> I've been using raw sparse files on XFS for all my VMs for over a
> decade now, and using reflink to create COW copies of golden
> image files when deploying new VMs for a couple of years now...
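(To sketch what that raw file setup could look like -- paths and sizes
below are only examples:

    # sparse 25G raw image on an XFS filesystem made with reflink
    # support (mkfs.xfs -m reflink=1)
    qemu-img create -f raw /images/test.raw 25G

    # attach it with O_DIRECT and native AIO
    -drive if=virtio,file=/images/test.raw,format=raw,cache=none,aio=native

    # reflink snapshot/clone of a golden image
    cp --reflink=always /images/golden.raw /images/new-vm.raw
)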
> 
> > The host is running Linux 4.19.132 and has an SSD drive.
> > 
> > About the preallocation modes: a qcow2 file is divided into clusters
> > of the same size (64KB in this case). That is the minimum unit of
> > allocation, so when writing 4KB to an unallocated cluster QEMU needs
> > to fill the other 60KB with zeroes. So here's what happens with the
> > different modes:
> 
> Which is something that sparse files on filesystems do not need to
> do. If, on XFS, you really want 64kB allocation clusters, use an
> extent size hint of 64kB. Though for image files, I highly recommend
> using 1MB or larger extent size hints.
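(In case it's useful: the hint can be set with xfs_io on a new, still-empty
image file, or on the parent directory so that new files inherit it.
Hypothetical paths:

    # 1MB extent size hint on a freshly created image file
    xfs_io -c "extsize 1m" /images/test.raw

    # or on the directory, inherited by files created there afterwards
    xfs_io -c "extsize 1m" /images
)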
> 
> 
> > 1) off: for every write request QEMU initializes the cluster (64KB)
> >         with fallocate(ZERO_RANGE) and then writes the 4KB of data.
> > 
> > 2) off w/o ZERO_RANGE: QEMU writes the 4KB of data and fills the rest
> >         of the cluster with zeroes.
> > 
> > 3) metadata: all clusters were allocated when the image was created
> >         but they are sparse; QEMU only writes the 4KB of data.
> > 
> > 4) falloc: all clusters were allocated with fallocate() when the image
> >         was created, QEMU only writes 4KB of data.
> > 
> > 5) full: all clusters were allocated by writing zeroes to all of them
> >         when the image was created, QEMU only writes 4KB of data.
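(For reference, modes 1 and 3-5 map to the preallocation= option of
qemu-img create, i.e. something along the lines of:

    qemu-img create -f qcow2 -o preallocation=falloc,cluster_size=64k image.qcow2 25G

with preallocation= set to off / metadata / falloc / full respectively;
mode 2 is the same image as mode 1, just without the runtime ZERO_RANGE
call.)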
> > 
> > As I said in a previous message I'm not familiar with xfs, but the
> > parts that I don't understand are
> > 
> >    - Why is (4) slower than (1)?
> 
> Because fallocate() is a full IO serialisation barrier at the
> filesystem level. If you do:
> 
> fallocate(whole file)
> <IO>
> <IO>
> <IO>
> .....
> 
> The IO can run concurrently and does not serialise against anything in
> the filesystem except unwritten extent conversions at IO completion
> (see answer to next question!)
> 
> However, if you just use (4) you get:
> 
> falloc(64k)
>   <wait for inflight IO to complete>
>   <allocates 64k as unwritten>
> <4k io>
>   ....
> falloc(64k)
>   <wait for inflight IO to complete>
>   ....
>   <4k IO completes, converts 4k to written>
>   <allocates 64k as unwritten>
> <4k io>
> falloc(64k)
>   <wait for inflight IO to complete>
>   ....
>   <4k IO completes, converts 4k to written>
>   <allocates 64k as unwritten>
> <4k io>
>   ....
> 

Option 4 is described above as initial file preallocation whereas option
1 is per 64k cluster prealloc. Prealloc mode mixup aside, Berto is
reporting that the initial file preallocation mode is slower than the
per cluster prealloc mode. Berto, am I following that right?

Brian

> until all the clusters in the qcow2 file are initialised. IOWs, each
> fallocate() call serialises all IO in flight. Compare that to using
> extent size hints on a raw sparse image file for the same thing:
> 
> <set 64k extent size hint>
> <4k IO>
>   <allocates 64k as unwritten>
>   ....
> <4k IO>
>   <allocates 64k as unwritten>
>   ....
> <4k IO>
>   <allocates 64k as unwritten>
>   ....
> ...
>   <4k IO completes, converts 4k to written>
>   <4k IO completes, converts 4k to written>
>   <4k IO completes, converts 4k to written>
> ....
> 
> See the difference in IO pipelining here? You get the same "64kB
> cluster initialised at a time" behaviour as qcow2, but you don't get
> the IO pipeline stalls caused by fallocate() having to drain all the
> IO in flight before it does the allocation.
> 
> >    - Why is (5) so much faster than everything else?
> 
> The full file allocation in (5) means the IO doesn't have to modify
> the extent map, hence all extent mapping uses shared locking and
> the entire IO path can run concurrently without serialisation at
> all.
> 
> Thing is, once your writes into sparse image files regularly start
> hitting written extents, the performance of (1), (2) and (4) will
> trend towards (5) as writes hit already allocated ranges of the file
> and the serialisation of extent mapping changes goes away. This
> occurs with guest filesystems that perform overwrite in place (such
> as XFS) and hence overwrites of existing data will hit allocated
> space in the image file and not require further allocation.
> 
> IOWs, typical "write once" benchmark testing indicates the *worst*
> performance you are going to see. As the guest filesystem ages and
> initialises more of the underlying image file, it will get faster,
> not slower.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 




Thread overview: 33+ messages
2020-08-14 14:57 [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster Alberto Garcia
2020-08-14 14:57 ` [PATCH 1/1] " Alberto Garcia
2020-08-14 18:07   ` Vladimir Sementsov-Ogievskiy
2020-08-14 18:06 ` [PATCH 0/1] " Vladimir Sementsov-Ogievskiy
2020-08-17 10:10 ` Kevin Wolf
2020-08-17 15:31   ` Alberto Garcia
2020-08-17 15:53     ` Kevin Wolf
2020-08-17 15:58       ` Alberto Garcia
2020-08-17 18:18       ` Alberto Garcia
2020-08-18  8:18         ` Kevin Wolf
2020-08-19 14:25       ` Alberto Garcia
2020-08-19 15:07         ` Kevin Wolf
2020-08-19 15:37           ` Alberto Garcia
2020-08-19 15:53             ` Alberto Garcia
2020-08-19 17:53           ` Brian Foster
2020-08-20 20:03             ` Alberto Garcia
2020-08-20 21:58               ` Dave Chinner
2020-08-21 11:05                 ` Brian Foster [this message]
2020-08-21 11:42                   ` Alberto Garcia
2020-08-21 12:12                     ` Alberto Garcia
2020-08-21 17:02                       ` Brian Foster
2020-08-25 12:24                         ` Alberto Garcia
2020-08-25 16:54                           ` Brian Foster
2020-08-25 17:18                             ` Alberto Garcia
2020-08-25 19:47                               ` Brian Foster
2020-08-26 18:34                                 ` Alberto Garcia
2020-08-27 16:47                                   ` Brian Foster
2020-08-23 21:59                       ` Dave Chinner
2020-08-24 20:14                         ` Alberto Garcia
2020-08-21 12:59                     ` Brian Foster
2020-08-21 15:51                       ` Alberto Garcia
2020-08-23 22:16                       ` Dave Chinner
2020-08-21 16:09                 ` Alberto Garcia
