From: Jason Dillaman <jdillama@redhat.com>
To: Max Reitz <mreitz@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>, Josh Durgin <jdurgin@redhat.com>,
qemu-block <qemu-block@nongnu.org>, John Snow <jsnow@redhat.com>,
qemu-devel <qemu-devel@nongnu.org>,
dillaman <dillaman@redhat.com>,
Stefano Garzarella <sgarzare@redhat.com>
Subject: Re: [Qemu-devel] [PATCH v3] block/rbd: implement .bdrv_get_allocated_file_size callback
Date: Tue, 9 Jul 2019 08:55:19 -0400 [thread overview]
Message-ID: <CA+aFP1AW9yv+_4H6M4Pq-E2ehP7KHvULm3xogqnvvZUzDdEw0g@mail.gmail.com> (raw)
In-Reply-To: <9438ac63-949b-9e08-42a2-4b4ff2c778f8@redhat.com>
On Tue, Jul 9, 2019 at 5:45 AM Max Reitz <mreitz@redhat.com> wrote:
>
> On 09.07.19 10:55, Max Reitz wrote:
> > On 09.07.19 05:08, Jason Dillaman wrote:
> >> On Fri, Jul 5, 2019 at 6:43 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
> >>>
> >>> On Fri, Jul 05, 2019 at 11:58:43AM +0200, Max Reitz wrote:
> >>>> On 05.07.19 11:32, Stefano Garzarella wrote:
> >>>>> This patch allows 'qemu-img info' to show the 'disk size' for
> >>>>> the RBD images that have the fast-diff feature enabled.
> >>>>>
> >>>>> If this feature is enabled, we use the rbd_diff_iterate2() API
> >>>>> to calculate the allocated size for the image.
> >>>>>
> >>>>> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> >>>>> ---
> >>>>> v3:
> >>>>> - return -ENOTSUP instead of -1 when fast-diff is not available
> >>>>> [John, Jason]
> >>>>> v2:
> >>>>> - calculate the actual usage only if the fast-diff feature is
> >>>>> enabled [Jason]
> >>>>> ---
> >>>>> block/rbd.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>> 1 file changed, 54 insertions(+)
> >>>>
> >>>> Well, the librbd documentation is non-existing as always, but while
> >>>> googling, I at least found that libvirt has exactly the same code. So I
> >>>> suppose it must be quite correct, then.
> >>>>
> >>>
> >>> While I wrote this code I took a look at libvirt implementation and also
> >>> at the "rbd" tool in the ceph repository: compute_image_disk_usage() in
> >>> src/tools/rbd/action/DiskUsage.cc
> >>>
> >>>>> diff --git a/block/rbd.c b/block/rbd.c
> >>>>> index 59757b3120..b6bed683e5 100644
> >>>>> --- a/block/rbd.c
> >>>>> +++ b/block/rbd.c
> >>>>> @@ -1084,6 +1084,59 @@ static int64_t qemu_rbd_getlength(BlockDriverState *bs)
> >>>>> return info.size;
> >>>>> }
> >>>>>
> >>>>> +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int exists,
> >>>>> + void *arg)
> >>>>> +{
> >>>>> + int64_t *alloc_size = (int64_t *) arg;
> >>>>> +
> >>>>> + if (exists) {
> >>>>> + (*alloc_size) += len;
> >>>>> + }
> >>>>> +
> >>>>> + return 0;
> >>>>> +}
> >>>>> +
> >>>>> +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs)
> >>>>> +{
> >>>>> + BDRVRBDState *s = bs->opaque;
> >>>>> + uint64_t flags, features;
> >>>>> + int64_t alloc_size = 0;
> >>>>> + int r;
> >>>>> +
> >>>>> + r = rbd_get_flags(s->image, &flags);
> >>>>> + if (r < 0) {
> >>>>> + return r;
> >>>>> + }
> >>>>> +
> >>>>> + r = rbd_get_features(s->image, &features);
> >>>>> + if (r < 0) {
> >>>>> + return r;
> >>>>> + }
> >>>>> +
> >>>>> + /*
> >>>>> + * We use rbd_diff_iterate2() only if the RBD image have fast-diff
> >>>>> + * feature enabled. If it is disabled, rbd_diff_iterate2() could be
> >>>>> + * very slow on a big image.
> >>>>> + */
> >>>>> + if (!(features & RBD_FEATURE_FAST_DIFF) ||
> >>>>> + (flags & RBD_FLAG_FAST_DIFF_INVALID)) {
> >>>>> + return -ENOTSUP;
> >>>>> + }
> >>>>> +
> >>>>> + /*
> >>>>> + * rbd_diff_iterate2(), if the source snapshot name is NULL, invokes
> >>>>> + * the callback on all allocated regions of the image.
> >>>>> + */
> >>>>> + r = rbd_diff_iterate2(s->image, NULL, 0,
> >>>>> + bs->total_sectors * BDRV_SECTOR_SIZE, 0, 1,
> >>>>> + &rbd_allocated_size_cb, &alloc_size);
> >>>>
> >>>> But I have a question. This is basically block_status, right? So it
> >>>> gives us information on which areas are allocated and which are not.
> >>>> The result thus gives us a lower bound on the allocation size, but is it
> >>>> really exactly the allocation size?
> >>>>
> >>>> There are two things I’m concerned about:
> >>>>
> >>>> 1. What about metadata?
> >>>
> >>> Good question, I don't think it includes the size used by metadata and I
> >>> don't know if there is a way to know it. I'll check better.
> >>
> >> It does not include the size of metadata, the "rbd_diff_iterate2"
> >> function is literally just looking for touched data blocks within the
> >> RBD image.
> >>
> >>>>
> >>>> 2. If you have multiple snapshots, this will only report the overall
> >>>> allocation information, right? So say there is something like this:
> >>>>
> >>>> (“A” means an allocated MB, “-” is an unallocated MB)
> >>>>
> >>>> Snapshot 1: AAAA---
> >>>> Snapshot 2: --AAAAA
> >>>> Snapshot 3: -AAAA--
> >>>>
> >>>> I think the allocated data size is the number of As in total (13 MB).
> >>>> But I suppose this API will just return 7 MB, because it looks on
> >>>> everything an it sees the whole image range (7 MB) to be allocated. It
> >>>> doesn’t report in how many snapshots some region is allocated.
> >>
> >> It should return 13 dirty data blocks (multipled by the size of the
> >> data block) since when you don't provide a "from snapshot" name, it
> >> will iterate from the first snapshot to the HEAD revision.
> >
> > Have you tested that?
> >
> > I‘m so skeptical because the callback function interface has no way of
> > distinguishing between different layers of snapshots.
> >
> > And also because we have the bdrv_block_status_above() function which
> > just looks strikingly similar (with the difference that it does not
> > invoke a callback but just returns the next allocated range). If you
> > pass base=NULL to it, it will also “interpret that as the beginning of
> > time and return all allocated regions of the image” (or rather image
> > chain, in our case). But it would just return 7 MB as allocated. (Even
> > though it does in fact return layer information, i.e. where a given
> > continuous chunk of data is allocated.)
> >
> > Sure, there is no good reason for why our interface should by chance be
> > the same as librbd’s interface. But without having tested it, the fact
> > that the callback cannot detect which layer a chunk is allocated on just
> > makes me wary.
>
> And the librbd code doesn’t alleviate my concerns.
>
> From what I can see, it first creates a bitmap (two bits per entry) that
> covers the whole image and says which objects are allocated and which
> aren‘t. Through the whole chain, that is. So in the above example, the
> bitmap would report everything as allocated. (Or rather “updated“ in
> librbd‘s terms.)
>
> Then it has this piece:
>
> > uint64_t off = m_offset;
> > uint64_t left = m_length;
> >
> > while (left > 0) {
> > uint64_t period_off = off - (off % period);
> > uint64_t read_len = min(period_off + period - off, left);
> >
> > // map to extents
> > map<object_t,vector<ObjectExtent> > object_extents;
> > Striper::file_to_extents(cct, m_image_ctx.format_string,
> > &m_image_ctx.layout, off, read_len, 0,
> > object_extents, 0);
> >
> > // get snap info for each object
> > for (map<object_t,vector<ObjectExtent> >::iterator p =
> > object_extents.begin();
> > p != object_extents.end(); ++p)
> [...]
> > for (std::vector<ObjectExtent>::iterator q = p->second.begin();
> > q != p->second.end(); ++q) {
> > r = m_callback(off + q->offset, q->length, updated, m_callback_arg);
> [...]
> > }
> [...]
> > }
> >> left -= read_len;
> > off += read_len;
> > }
>
> So that iterates over the whole image (in our case, because m_offset is
> 0 and m_length is the image length), then picks out a chunk of read_len
> length, converts that to objects, and then iterates over all of those
> objects and all of their extents.
>
> file_to_extents looks like it’s just an arithmetic operation. So it
> doesn‘t look like that function returns extents in multiple snapshots.
> It just converts a range into subranges called “objects” and “extents”
> (at least that’s the way it looks to me).
>
> So overall, this looks awfully like it simply iterates over the whole
> image and then returns allocation information gathered as a top-down
> view through all of the snapshots, but not for each snapshot individually.
Sorry, you're correct. The API function was originally designed to
support delta extents for supporting diff exports, so while it does
open each snapshot's object-map, it only returns a unioned set of
delta extents. The rbd CLI's "disk-usage" action behaves how I
described by counting the actual dirty block usage between snapshots.
> Max
>
--
Jason
next prev parent reply other threads:[~2019-07-09 12:57 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-07-05 9:32 [Qemu-devel] [PATCH v3] block/rbd: implement .bdrv_get_allocated_file_size callback Stefano Garzarella
2019-07-05 9:58 ` Max Reitz
2019-07-05 10:43 ` Stefano Garzarella
2019-07-09 3:08 ` Jason Dillaman
2019-07-09 8:55 ` Max Reitz
2019-07-09 9:45 ` Max Reitz
2019-07-09 12:55 ` Jason Dillaman [this message]
2019-07-09 13:09 ` Stefano Garzarella
2019-07-09 15:32 ` Max Reitz
2019-07-10 1:42 ` Jason Dillaman
2019-07-10 15:28 ` Stefano Garzarella
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CA+aFP1AW9yv+_4H6M4Pq-E2ehP7KHvULm3xogqnvvZUzDdEw0g@mail.gmail.com \
--to=jdillama@redhat.com \
--cc=dillaman@redhat.com \
--cc=jdurgin@redhat.com \
--cc=jsnow@redhat.com \
--cc=kwolf@redhat.com \
--cc=mreitz@redhat.com \
--cc=qemu-block@nongnu.org \
--cc=qemu-devel@nongnu.org \
--cc=sgarzare@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).