From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:37890) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YImaF-0000cd-Kr for qemu-devel@nongnu.org; Tue, 03 Feb 2015 18:08:42 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1YImaB-0004qf-HT for qemu-devel@nongnu.org; Tue, 03 Feb 2015 18:08:39 -0500 Received: from mx1.redhat.com ([209.132.183.28]:42364) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YImaB-0004qT-7L for qemu-devel@nongnu.org; Tue, 03 Feb 2015 18:08:35 -0500 Message-ID: <54D154EA.7000708@redhat.com> Date: Tue, 03 Feb 2015 16:08:26 -0700 From: Eric Blake MIME-Version: 1.0 References: <1416844620-17717-1-git-send-email-mreitz@redhat.com> <1416844620-17717-2-git-send-email-mreitz@redhat.com> In-Reply-To: <1416844620-17717-2-git-send-email-mreitz@redhat.com> Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="RjN17lhV2QPKB7sdxPxQG4BeNU65D2QjS" Subject: Re: [Qemu-devel] [PATCH v2 01/12] qcow2: Add new overlap check functions List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Max Reitz , qemu-devel@nongnu.org Cc: Kevin Wolf , Peter Lieven , Stefan Hajnoczi This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --RjN17lhV2QPKB7sdxPxQG4BeNU65D2QjS Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable On 11/24/2014 08:56 AM, Max Reitz wrote: > The existing qcow2 metadata overlap detection function used existing > structures to determine the location of the image metadata, from plain > fields such as l1_table_offset and l1_size in the BDRVQcowState, over > image structures in memory such as the L1 table for the L2 tables' > positions, or it even read the required data directly from disk for > every requested check, such as the snapshot L1 tables for the inactive > L2 tables' positions. >=20 > These new functions instead keep a dedicated structure for keeping trac= k > of the metadata positions in memory. It consists of two parts: First, > there is one structure which is basically a list of all metadata > structures. Each entry has a bitmask of types (because some metadata > structures may actually overlap, such as active and inactive L2 tables)= , > a number of clusters occupied and the offset from the previous entry in= > clusters. This structure requires relatively few memory, but checking a= s/few/little/ > certain point may take relatively long. Each entry is called a > "fragment". >=20 > Therefore, there is another representation which is a bitmap, or rather= > a bytemap, of metadata types. The previously described list is split > into multiple windows with each describing a constant number of cluster= s > (WINDOW_SIZE). If the list is to be queried or changed, the respective > window is selected in constant time and the bitmap is generated from th= e > fragments belonging to the window. This bitmap can then be queried in > constant time and easily be changed. >=20 > Because the bitmap representation requires more memory, it is only used= > as a cache. Whenever a window is removed from the cache, the fragment > list will be rebuilt from the bitmap if the latter has been modified. > Therefore, the fragment list is only used as the background > representation to save memory, whereas the bitmap is used whenever > possible. >=20 > Regarding the size of the fragment list in memory: As one refcount bloc= k > can handle cluster_size / 2 entries and one L2 table can handle > cluster_size / 8 entries, for a qcow2 image with the standard cluster > size of 64 kB, there is a ratio of data to metadata of about 1/6000 > (1/32768 refblocks and 1/8192 L2 tables) if one ignores the fact that > not every cluster requires an L2 table entry. The refcount table and th= e > L1 table is generally negligible. At the worst, each metadata cluster > requires its own entry in the fragment list; each entry takes up four > bytes, therefore, at the worst, the fragment list should take up (for a= n > image with 64 kB clusters) (4 B) / (64 kB * 6000) of the image size, > which is about 1.e-8 (i.e., 11 kB for a 1 TB image, or 11 MB for a 1 PB= > image). >=20 > Signed-off-by: Max Reitz > --- > block/Makefile.objs | 3 +- > block/qcow2-overlap.c | 404 ++++++++++++++++++++++++++++++++++++++++++= ++++++++ > block/qcow2.h | 13 ++ > 3 files changed, 419 insertions(+), 1 deletion(-) > create mode 100644 block/qcow2-overlap.c Are you still hoping to get this in 2.3? > +++ b/block/qcow2-overlap.c > @@ -0,0 +1,404 @@ > +/* > + * QCOW2 runtime metadata overlap detection > + * > + * Copyright (c) 2014 Max Reitz Slow review means it is now 2015. > +/* Number of clusters which are covered by each metadata window; > + * note that this may not exceed 2^16 as long as > + * Qcow2MetadataFragment::relative_start is a uint16_t */ > +#define WINDOW_SIZE 4096 So this says that for every 4096 clusters, we have one bytemap representation, as well as a chain of up to 4096 fragment descriptors? > + > +/* Describes a fragment of a or a whole metadata range; does not neces= sarily s/of a or/of or/ > + * describe the whole range because it needs to be split on window bou= ndaries */ > +typedef struct Qcow2MetadataFragment { > + /* Bitmask of QCow2MetadataOverlap values */ > + uint8_t types; > + uint8_t nb_clusters; > + /* Number of clusters between the start of the window and this ran= ge */ > + uint16_t relative_start; So even I have a file with 4096 consecutive sectors all tied up in the same purpose within a given window, I have to represent it as 16 Qcow2MetadataFragments rather than 1 fragment, because the uint8_t size of nb_clusters limits me to at most 256 clusters per Fragment entry? And worst case, a window will have a list of 4096 fragments if every cluster alternates between some other type. > +} QEMU_PACKED Qcow2MetadataFragment; Is QEMU_PACKED really essential here? If I'm not mistaken, this struct is only ever kept in memory and not written out to disk. On the other hand, I understand that you are trying to ensure that the compiler packed this into 32 bits rather than injecting any padding. Would a BUG_ON(sizeof(Qcow2MetadataFragment) =3D=3D 4) be any better at represent= ing that fact? > + > +typedef struct Qcow2MetadataWindow { > + Qcow2MetadataFragment *fragments; > + int nb_fragments, fragments_array_size; > + > + /* If not NULL, this is an expanded version of the "RLE" version g= iven by > + * the fragments array; there are WINDOW_SIZE entries */ > + uint8_t *bitmap; > + bool bitmap_modified; > + > + /* Time of last access */ > + unsigned age; > + > + /* Index in Qcow2MetadataList::cached_windows */ > + int cached_windows_index; > +} Qcow2MetadataWindow; > + > +struct Qcow2MetadataList { > + Qcow2MetadataWindow *windows; > + uint64_t nb_windows; > + > + unsigned current_age; > + > + /* Index into the windows array */ > + int *cached_windows; > + size_t nb_cached_windows; > +}; Is there a maximum size for nb_cached_windows before you start evicting cached windows? > + > +/** > + * Destroys the cached window bitmap. If it has been modified, the fra= gment list > + * will be rebuilt accordingly. > + */ > +static void destroy_window_bitmap(Qcow2MetadataList *mdl, > + Qcow2MetadataWindow *window) > +{ > + if (!window->bitmap) { > + return; > + } > + > + if (window->bitmap_modified) { > + int bitmap_i, fragment_i =3D 0; > + QCow2MetadataOverlap current_types =3D 0; > + int current_nb_clusters =3D 0; > + > + /* Rebuild the fragment list; the case bitmap_i =3D=3D WINDOW_= SIZE is for > + * entering the last fragment at the bitmap end */ > + > + for (bitmap_i =3D 0; bitmap_i <=3D WINDOW_SIZE; bitmap_i++) { > + /* Qcow2MetadataFragment::nb_clusters is a uint8_t, so > + * current_nb_clusters may not exceed 255 */ Wait. Why 255 and not 256? Can't you use nb_clusters=3D=3D0 as a modulo f= or 256 consecutive clusters, as the fragments should never encode a 0-length run? That way, you can represent 4096 consecutive clusters in 16 fragments of 256 each, instead of 17 (16 of 255, and 1 of 16). > + if (bitmap_i < WINDOW_SIZE && > + current_types =3D=3D window->bitmap[bitmap_i] && > + current_nb_clusters < 255) > + { > + current_nb_clusters++; > + } else { > + if (current_types && current_nb_clusters) { > + if (fragment_i >=3D window->fragments_array_size) = { > + window->fragments_array_size =3D > + 3 * window->fragments_array_size / 2 + 1; > + > + /* new_nb_fragments should be small enough, an= d there is > + * nothing we can do on failure anyway, so do = not use > + * g_try_renew() here */ > + window->fragments =3D > + g_renew(Qcow2MetadataFragment, window->fra= gments, > + window->fragments_array_size); > + } > + > + window->fragments[fragment_i++] =3D (Qcow2Metadata= Fragment){ > + .types =3D current_types, > + .nb_clusters =3D current_nb_clusters, > + .relative_start =3D bitmap_i - current_nb_clus= ters, > + }; > + } > + > + current_nb_clusters =3D 0; > + if (bitmap_i < WINDOW_SIZE) { > + current_types =3D window->bitmap[bitmap_i]; > + } > + } > + } > + > + window->nb_fragments =3D fragment_i; Any need to clear window->bitmap_modified at this point? Or are you careful to only rely on it when window->bitmap is non-NULL. > + } > + > + g_free(window->bitmap); > + window->bitmap =3D NULL; > +} > + > +/** > + * Creates a bitmap from the fragment list. > + */ > +static void build_window_bitmap(Qcow2MetadataList *mdl, > + Qcow2MetadataWindow *window) > +{ > + int cache_i, oldest_cache_i =3D -1, i; > + unsigned oldest_cache_age =3D 0; > + > + for (cache_i =3D 0; cache_i < mdl->nb_cached_windows; cache_i++) {= > + unsigned age; > + > + if (mdl->cached_windows[cache_i] < 0) { > + break; > + } > + > + age =3D mdl->current_age - mdl->windows[mdl->cached_windows[ca= che_i]].age; > + if (age > oldest_cache_age) { > + oldest_cache_age =3D age; > + oldest_cache_i =3D cache_i; > + } > + } > + > + if (cache_i >=3D mdl->nb_cached_windows) { > + destroy_window_bitmap(mdl, > + &mdl->windows[mdl->cached_windows[oldest_cache_i]]); > + cache_i =3D oldest_cache_i; > + } > + > + assert(cache_i >=3D 0); > + mdl->cached_windows[cache_i] =3D window - mdl->windows; > + window->cached_windows_index =3D cache_i; > + > + window->age =3D mdl->current_age++; > + > + window->bitmap =3D g_new0(uint8_t, WINDOW_SIZE); > + > + for (i =3D 0; i < window->nb_fragments; i++) { > + Qcow2MetadataFragment *fragment =3D &window->fragments[i]; > + > + memset(&window->bitmap[fragment->relative_start], fragment->ty= pes, > + fragment->nb_clusters); Hmm. If you do use my idea of nb_clusters=3D=3D0 for 256, this needs a special case. Another option would be storing number of clusters - 1 (so a value of 0 is 1 cluster, a value of 255 is 256 clusters). > + } > + > + window->bitmap_modified =3D false; > +} > + > +/** > + * Enters a new range into the metadata list. > + */ > +void qcow2_metadata_list_enter(BlockDriverState *bs, uint64_t offset, > + int nb_clusters, QCow2MetadataOverlap t= ypes) > +{ > + BDRVQcowState *s =3D bs->opaque; > + uint64_t start_cluster =3D offset >> s->cluster_bits; > + uint64_t end_cluster =3D start_cluster + nb_clusters; > + uint64_t current_cluster =3D start_cluster; > + > + types &=3D s->overlap_check; > + if (!types) { > + return; > + } > + > + if (offset_into_cluster(s, offset)) { > + /* Do not enter apparently broken metadata ranges */ > + return; > + } > + > + while (current_cluster < end_cluster) { > + int bitmap_i; > + int bitmap_i_start =3D current_cluster % WINDOW_SIZE; > + int bitmap_i_end =3D MIN(WINDOW_SIZE, > + end_cluster - current_cluster + bitmap_= i_start); > + uint64_t window_i =3D current_cluster / WINDOW_SIZE; > + Qcow2MetadataWindow *window; > + > + if (window_i >=3D s->metadata_list->nb_windows) { > + /* This should not be happening too often, so it is fine t= o resize > + * the array to exactly the required size */ > + Qcow2MetadataWindow *new_windows; > + > + new_windows =3D g_try_renew(Qcow2MetadataWindow, > + s->metadata_list->windows, > + window_i + 1); > + if (!new_windows) { > + return; > + } > + > + memset(new_windows + s->metadata_list->nb_windows, 0, > + (window_i + 1 - s->metadata_list->nb_windows) * > + sizeof(Qcow2MetadataWindow)); > + > + s->metadata_list->windows =3D new_windows; > + s->metadata_list->nb_windows =3D window_i + 1; > + } > + > + window =3D &s->metadata_list->windows[window_i]; > + if (!window->bitmap) { > + build_window_bitmap(s->metadata_list, window); > + } > + > + for (bitmap_i =3D bitmap_i_start; bitmap_i < bitmap_i_end; bit= map_i++) { > + window->bitmap[bitmap_i] |=3D types; This adds in new types but keeps existing types listed. Is that okay? > + } > + > + window->age =3D s->metadata_list->current_age++; > + window->bitmap_modified =3D true; > + > + /* Go to the next window */ > + current_cluster +=3D WINDOW_SIZE - bitmap_i_start; > + } > +} =2E.. > +++ b/block/qcow2.h > @@ -159,6 +159,9 @@ typedef struct QCowSnapshot { > struct Qcow2Cache; > typedef struct Qcow2Cache Qcow2Cache; > =20 > +struct Qcow2MetadataList; > +typedef struct Qcow2MetadataList Qcow2MetadataList; > + > typedef struct Qcow2UnknownHeaderExtension { > uint32_t magic; > uint32_t len; > @@ -261,6 +264,7 @@ typedef struct BDRVQcowState { > =20 > bool discard_passthrough[QCOW2_DISCARD_MAX]; > =20 > + Qcow2MetadataList *metadata_list; > int overlap_check; /* bitmask of Qcow2MetadataOverlap values */ > bool signaled_corruption; > =20 > @@ -576,4 +580,13 @@ int qcow2_cache_get_empty(BlockDriverState *bs, Qc= ow2Cache *c, uint64_t offset, > void **table); > int qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, void **table)= ; > =20 > +/* qcow2-overlap.c functions */ > +int qcow2_create_empty_metadata_list(BlockDriverState *bs, size_t cach= e_size, > + Error **errp); > +void qcow2_metadata_list_destroy(BlockDriverState *bs); > +void qcow2_metadata_list_enter(BlockDriverState *bs, uint64_t offset, > + int nb_clusters, QCow2MetadataOverlap t= ype); > +void qcow2_metadata_list_remove(BlockDriverState *bs, uint64_t offset,= > + int nb_clusters, QCow2MetadataOverlap = type); > + > #endif >=20 Looks reasonable in general. --=20 Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org --RjN17lhV2QPKB7sdxPxQG4BeNU65D2QjS Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 Comment: Public key at http://people.redhat.com/eblake/eblake.gpg Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBCAAGBQJU0VTqAAoJEKeha0olJ0NqP6sH/2uKXu+E8WiWiXNt+tQBnl2H cZxSxa6Taf4OuNZPzVjuxoIuJk/HtgMwy/fMCB4nH6YDvgL38KRTVo1b4tGrKHP5 468W5RpsA/Wj7YEAAZBbzGsyfdyQYPhgWRRWA8z33H3u4/fVYUDxwIlTYyNhwv67 +8TDoIlNQIDSIWT1Nbey55tuR/nFCSBpxHOUK8+yUdpqw48SfHOOPGanNXzJzbVV nsbb6CfgBHyoH8js04E5YjM01odYXZEs/MUi17GInTnn0ua5Qg7QG6fHdXdP3IV8 Rzwf8Cm2vzC2yUSHyLVAwFQ52uwkPuhSlFEnSf5eVgRzkVIHpG9s5hYVzphCc8E= =8JUg -----END PGP SIGNATURE----- --RjN17lhV2QPKB7sdxPxQG4BeNU65D2QjS--