Re: [PATCH 2/2] qemu-img: align next status sector on destination alignment.

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: Maxim Levitsky <mlevitsk@redhat.com>
To: Eric Blake <eblake@redhat.com>, Peter Lieven <pl@kamp.de>,
	 qemu-devel@nongnu.org
Cc: Kevin Wolf <kwolf@redhat.com>, Jan Kara <jack@suse.cz>,
	qemu-block@nongnu.org,
	"Darrick J . Wong" <darrick.wong@oracle.com>,
	Max Reitz <mreitz@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [PATCH 2/2] qemu-img: align next status sector on destination alignment.
Date: Thu, 12 Nov 2020 17:04:24 +0200	[thread overview]
Message-ID: <621d739815b25f8b6bc1b8cdd266a89a20bdd97a.camel@redhat.com> (raw)
In-Reply-To: <b4c49375-451b-1bc2-8c98-9d4e6fc62347@redhat.com>

On Thu, 2020-11-12 at 07:45 -0600, Eric Blake wrote:
> On 11/12/20 6:40 AM, Peter Lieven wrote:
> 
> > >          /*
> > > -         * Avoid that s->sector_next_status becomes unaligned to the source
> > > -         * request alignment and/or cluster size to avoid unnecessary read
> > > -         * cycles.
> > > +         * Avoid that s->sector_next_status becomes unaligned to the
> > > +         * source/destination request alignment and/or cluster size to avoid
> > > +         * unnecessary read/write cycles.
> > >           */
> > > -        tail = (sector_num - src_cur_offset + n) % s->src_alignment[src_cur];
> > > +        alignment = MAX(s->src_alignment[src_cur], s->alignment);
> > > +        assert(is_power_of_2(alignment));
> > > +
> > > +        tail = (sector_num - src_cur_offset + n) % alignment;
> > >          if (n > tail) {
> > >              n -= tail;
> > >          }
> > 
> > I was also considering including the s->alignment when adding this chance. However, you need the least common multiple of both alignments, not the maximum, otherwise
> > 
> > you might get misaligned to either source or destination.
> > 
> > 
> > Why exactly do you need the power of two requirement?
> 
> The power of two requirement ensures that you h ave the least common
> multiple of both alignments ;)
-
Yes, and in addition to that both alignments are already power of two because we align them
to it. The assert I added is just in case.

This is how we calculate the destination alignment:
 
s.alignment = MAX(pow2floor(s.min_sparse),
                      DIV_ROUND_UP(out_bs->bl.request_alignment,
                                   BDRV_SECTOR_SIZE));
 
 
This is how we calculate the source alignments (it is possible to have several source files)
 
s.src_alignment[bs_i] = DIV_ROUND_UP(src_bs->bl.request_alignment,
                                             BDRV_SECTOR_SIZE);
if (!bdrv_get_info(src_bs, &bdi)) {
    s.src_alignment[bs_i] = MAX(s.src_alignment[bs_i], bdi.cluster_size / BDRV_SECTOR_SIZE);
}


The bl.request_alignment comment mentions that it must be power of 2,
and cluster sizes are also very likely to be power of 2 in all drivers
we support. An assert for s.src_alignment won't hurt though.

 
Note though that we don't really read the discard alignment.
We have max_pdiscard, and pdiscard_alignment, but max_pdiscard
is only used to split 'too large' discard requests, and pdiscard_alignment
is advisory and used only in a couple of places.
Neither are set by file-posix.
 
We do have request_alignment, which file-posix tries to align to the logical block
size which is still often 512 for backward compatibility on many devices (even nvme)
 
Now both source and destination alignments in qemu-img convert are based on request_aligment
and on min_sparse (which is by default 4K, controlled by qemu-img -S parameter
(which otherwise controls the minimum number of blocks to be zero, to consider
discarding them in convert)
 

This means that there is no guarantee to have 4K alignment, and besides,
with sufficiently fragmented source file, the bdrv_block_status_above
can return arbitrary short and unaligned extents which can't be aligned, 
thus as I said this patch alone can't guarantee that we won't have 
write and discard sharing the same page.
 
Another thing that can be discussed is why is_allocated_sectors was patched
to convert short discards to writes. Probably because underlying hardware
ignores them or something? In theory we should detect this and fail
back to regular zeroing in this case.
Again though, while in case of converting an empty file, this is the only
source of writes, and eliminating it, also 'fixes' this case, with sufficiently
fragmented source file we can even without this code get a write and discard
sharing a page.


Best regards,
	Maxim Levitsky

> 
> However, there ARE devices in practice that have advertised a
> non-power-of-2 discard granularity, such as 15MiB (see commits 430b26a8,
> 63188c24).  Which means you probably don't want to assert power-of-2,
> and in turn need to worry about least common multiple.
>

next prev parent reply	other threads:[~2020-11-12 15:05 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-11-11 15:39 [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Maxim Levitsky
2020-11-11 15:39 ` [PATCH 1/2] file-posix: allow -EBUSY errors during write zeros on raw block devices Maxim Levitsky
2020-11-16 14:48   ` Kevin Wolf
2021-01-07 12:44     ` Maxim Levitsky
2020-11-11 15:39 ` [PATCH 2/2] qemu-img: align next status sector on destination alignment Maxim Levitsky
2020-11-12 12:40   ` Peter Lieven
2020-11-12 13:45     ` Eric Blake
2020-11-12 15:04       ` Maxim Levitsky [this message]
2021-03-18  9:57         ` Maxim Levitsky
2020-11-11 15:44 ` [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Maxim Levitsky
2020-11-12 11:19   ` Jan Kara
2020-11-12 12:00     ` Jan Kara
2020-11-12 22:08       ` Darrick J. Wong
2020-12-07 17:23         ` Maxim Levitsky
2020-11-12 15:38     ` Maxim Levitsky
2020-11-13 10:07       ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=621d739815b25f8b6bc1b8cdd266a89a20bdd97a.camel@redhat.com \
    --to=mlevitsk@redhat.com \
    --cc=darrick.wong@oracle.com \
    --cc=eblake@redhat.com \
    --cc=jack@suse.cz \
    --cc=kwolf@redhat.com \
    --cc=mreitz@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=pl@kamp.de \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).