Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster


From: Brian Foster <bfoster@redhat.com>
To: Alberto Garcia <berto@igalia.com>
Cc: Dave Chinner <david@fromorbit.com>, Kevin Wolf <kwolf@redhat.com>,
	qemu-devel@nongnu.org, qemu-block@nongnu.org,
	Max Reitz <mreitz@redhat.com>,
	Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>,
	linux-xfs@vger.kernel.org
Subject: Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster
Date: Fri, 21 Aug 2020 08:59:44 -0400
Message-ID: <20200821125944.GC212879@bfoster>
In-Reply-To: <w51364gjkcj.fsf@maestria.local.igalia.com>

On Fri, Aug 21, 2020 at 01:42:52PM +0200, Alberto Garcia wrote:
> On Fri 21 Aug 2020 01:05:06 PM CEST, Brian Foster <bfoster@redhat.com> wrote:
> >> > 1) off: for every write request QEMU initializes the cluster (64KB)
> >> >         with fallocate(ZERO_RANGE) and then writes the 4KB of data.
> >> > 
> >> > 2) off w/o ZERO_RANGE: QEMU writes the 4KB of data and fills the rest
> >> >         of the cluster with zeroes.
> >> > 
> >> > 3) metadata: all clusters were allocated when the image was created
> >> >         but they are sparse, QEMU only writes the 4KB of data.
> >> > 
> >> > 4) falloc: all clusters were allocated with fallocate() when the image
> >> >         was created, QEMU only writes 4KB of data.
> >> > 
> >> > 5) full: all clusters were allocated by writing zeroes to all of them
> >> >         when the image was created, QEMU only writes 4KB of data.
> >> > 
> >> > As I said in a previous message I'm not familiar with xfs, but the
> >> > parts that I don't understand are
> >> > 
> >> >    - Why is (4) slower than (1)?
> >> 
> >> Because fallocate() is a full IO serialisation barrier at the
> >> filesystem level. If you do:
> >> 
> >> fallocate(whole file)
> >> <IO>
> >> <IO>
> >> <IO>
> >> .....
> >> 
> >> The IO can run concurrently and does not serialise against anything in
> >> the filesystem except unwritten extent conversions at IO completion
> >> (see answer to next question!)
> >> 
> >> However, if you just use (4) you get:
> >> 
> >> falloc(64k)
> >>   <wait for inflight IO to complete>
> >>   <allocates 64k as unwritten>
> >> <4k io>
> >>   ....
> >> falloc(64k)
> >>   <wait for inflight IO to complete>
> >>   ....
> >>   <4k IO completes, converts 4k to written>
> >>   <allocates 64k as unwritten>
> >> <4k io>
> >> falloc(64k)
> >>   <wait for inflight IO to complete>
> >>   ....
> >>   <4k IO completes, converts 4k to written>
> >>   <allocates 64k as unwritten>
> >> <4k io>
> >>   ....
> >> 
> >
> > Option 4 is described above as initial file preallocation whereas
> > option 1 is per 64k cluster prealloc. Prealloc mode mixup aside, Berto
> > is reporting that the initial file preallocation mode is slower than
> > the per cluster prealloc mode. Berto, am I following that right?
> 
> Option (1) means that no qcow2 cluster is allocated at the beginning of
> the test so, apart from updating the relevant qcow2 metadata, each write
> request clears the cluster first (with fallocate(ZERO_RANGE)) then
> writes the requested 4KB of data. Further writes to the same cluster
> don't need changes on the qcow2 metadata so they go directly to the area
> that was cleared with fallocate().
> 
> Option (4) means that all clusters are allocated when the image is
> created and they are initialized with fallocate() (actually with
> posix_fallocate() now that I read the code, I suppose it's the same for
> xfs?). Only after that the test starts. All write requests are simply
> forwarded to the disk, there is no need to touch any qcow2 metadata nor
> do anything else.
> 

Ok, I think that's consistent with what I described above (sorry, I find
the preallocation mode names rather confusing so I was trying to avoid
using them). Have you confirmed that posix_fallocate() in this case
translates directly to fallocate()? I suppose that's most likely the
case, otherwise you'd see numbers closer to those for preallocation=full
(file preallocated via writing zeroes).
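
For reference, a minimal sketch of the option (4) pattern as I understand
it: preallocate the whole file once, then issue plain 4k writes with no
allocation call in between. It uses Python's os.posix_fallocate(), which
wraps the libc call. Whether libc then issues fallocate(2) or falls back
to writing zeroes is not visible from here and would need something like
strace to confirm. The file name, image size and write offsets are made
up for illustration and this is not QEMU's code.

```python
import os
import tempfile

CLUSTER = 64 * 1024         # qcow2 cluster size used in the tests
BLOCK = 4 * 1024            # size of each write request
IMAGE_SIZE = 16 * CLUSTER   # hypothetical, tiny stand-in for a real image


def preallocate_then_write(path, offsets):
    """Preallocate the whole file once, then issue plain 4k writes."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # One up-front call, so no allocation call sits between the writes.
        os.posix_fallocate(fd, 0, IMAGE_SIZE)
        for off in offsets:
            os.pwrite(fd, b"\xab" * BLOCK, off)
        return os.fstat(fd).st_size
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    image = os.path.join(tmp, "test.img")  # hypothetical file name
    size = preallocate_then_write(image, [0, 3 * CLUSTER, 7 * CLUSTER + BLOCK])
    with open(image, "rb") as f:
        f.seek(3 * CLUSTER)
        written = f.read(BLOCK)     # the 4k block that was written
        untouched = f.read(BLOCK)   # preallocated but never written
```

The untouched part of a preallocated cluster reads back as zeroes, which
is why no separate zeroing step is needed before each write in this mode.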

> And yes, (4) is a bit slower than (1) in my tests. On ext4 I get 10%
> more IOPS.
> 
> I just ran the tests with aio=native and with a raw image instead of
> qcow2, here are the results:
> 
> qcow2:
> |----------------------+-------------+------------|
> | preallocation        | aio=threads | aio=native |
> |----------------------+-------------+------------|
> | off                  |        8139 |       7649 |
> | off (w/o ZERO_RANGE) |        2965 |       2779 |
> | metadata             |        7768 |       8265 |
> | falloc               |        7742 |       7956 |
> | full                 |       41389 |      56668 |
> |----------------------+-------------+------------|
> 

So it seems that Dave's suggestion to use native aio produced more
predictable results, with full file prealloc being a bit faster than per
cluster prealloc. Not sure why that isn't the case with aio=threads. I
was wondering if perhaps the threading affects something indirectly like
the qcow2 metadata allocation itself, but I guess that would be
inconsistent with ext4 showing a notable jump from (1) to (4) (assuming
the previous ext4 numbers were with aio=threads).
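
To put rough numbers on "a bit faster", here is a small sketch that
computes the relative change from per cluster prealloc (off) to full file
prealloc (falloc) for both aio modes. The IOPS values are copied from the
qcow2 table quoted above; the helper name is made up.

```python
# IOPS figures copied from the qcow2 table quoted above.
qcow2_iops = {
    "off":    {"threads": 8139, "native": 7649},
    "falloc": {"threads": 7742, "native": 7956},
}


def percent_change(base, new):
    """Relative change from base to new, in percent."""
    return 100.0 * (new - base) / base


# Change when going from "off" to "falloc", per aio mode, to one decimal.
delta = {
    aio: round(percent_change(qcow2_iops["off"][aio], qcow2_iops["falloc"][aio]), 1)
    for aio in ("threads", "native")
}
print(delta)  # {'threads': -4.9, 'native': 4.0}
```

So falloc is about 4% ahead with aio=native and about 5% behind with
aio=threads, which is the sign flip I don't have an explanation for.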

> raw:
> |---------------+-------------+------------|
> | preallocation | aio=threads | aio=native |
> |---------------+-------------+------------|
> | off           |        7647 |       7928 |
> | falloc        |        7662 |       7856 |
> | full          |       45224 |      58627 |
> |---------------+-------------+------------|
> 
> A qcow2 file with preallocation=metadata is more or less similar to a
> sparse raw file (and the numbers are indeed similar).
> 
> preallocation=off on qcow2 does not have an equivalent on raw files.
> 

It sounds like preallocation=off for qcow2 would be roughly equivalent
to a raw file with a 64k extent size hint (on XFS).

Brian

> Berto
>
