From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:36538) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bAi3k-0001y3-Gp for qemu-devel@nongnu.org; Wed, 08 Jun 2016 14:18:34 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1bAi3f-0007E6-G1 for qemu-devel@nongnu.org; Wed, 08 Jun 2016 14:18:32 -0400 Received: from mail-lf0-x244.google.com ([2a00:1450:4010:c07::244]:34899) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bAi3e-0007E2-Ox for qemu-devel@nongnu.org; Wed, 08 Jun 2016 14:18:27 -0400 Received: by mail-lf0-x244.google.com with SMTP id s186so1503246lfs.2 for ; Wed, 08 Jun 2016 11:18:26 -0700 (PDT) References: <1464138802-23503-1-git-send-email-cota@braap.org> <1464138802-23503-9-git-send-email-cota@braap.org> <5749E02A.3080909@gmail.com> <20160607010545.GB4418@flamenco> <5756EEC0.8090502@gmail.com> <20160608000224.GB16255@flamenco> <5758273B.7090603@gmail.com> <20160608180645.GA14106@flamenco> From: Sergey Fedorov Message-ID: <5758616E.8070202@gmail.com> Date: Wed, 8 Jun 2016 21:18:22 +0300 MIME-Version: 1.0 In-Reply-To: <20160608180645.GA14106@flamenco> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Subject: Re: [Qemu-devel] [PATCH v6 08/15] qdist: add module to represent frequency distributions of data List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: "Emilio G. Cota" Cc: QEMU Developers , MTTCG Devel , =?UTF-8?Q?Alex_Benn=c3=a9e?= , Paolo Bonzini , Richard Henderson On 08/06/16 21:06, Emilio G. Cota wrote: > On Wed, Jun 08, 2016 at 17:10:03 +0300, Sergey Fedorov wrote: >> On 08/06/16 03:02, Emilio G. Cota wrote: >>> - dist->entries = g_realloc(dist->entries, >>> - sizeof(*dist->entries) * (dist->n + 1)); >>> + if (unlikely(dist->n == dist->size)) { >>> + dist->size = dist->size ? dist->size * 2 : 1; >> We could initialize 'dist->size' to 1 and allocate a 1-entry >> 'dist->entries' array in qdist_init() to avoid this ternary operation ;-) > Done. This resulted in quite a few modifications, since dist->entries == NULL > had been used as an equivalent to dist->n == 0. > >>>>>> (snip) >>>> So our scale is not linear. I think some users might get confused by this. >>> That's correct. I think special-casing 0 makes sense though, since >>> it increases the signal-to-noise ratio of the histogram. For example: >>> >>> 1) 0 as ' ': >>> TB hash occupancy 31.84% avg chain occ. Histogram: [0,10)%|▆ █ ▅▁▃▁▁|[90,100]% >>> TB hash avg chain 1.015 buckets. Histogram: 1|█▁▁|3 >>> >>> 2) 0 as '1/8': >>> TB hash occupancy 32.07% avg chain occ. Histogram: [0,10)%|▆▁█▁▁▅▁▃▁▁|[90,100]% >>> TB hash avg chain 1.015 buckets. Histogram: 1|▇▁▁|3 >>> >>> I think in these examples most users would be less confused by 1) than by 2). >> I was meaning to represent all bars whose value < 1/8 as a space, not >> only whose value is pure zero. Otherwise we can see 1/8 bar where the >> actual value is negligibly differ from zero as in the second example. > I see. That would be (3): > > TB hash occupancy 32.79% avg chain occ. Histogram: [0,10)%|▅ █ ▅ ▂ |[90,100]% > TB hash avg chain 1.017 buckets. Histogram: 1|█ |3 > > I still think (1) is the representation that gives the most information. > IMO it's valuable that "close to zero" and "zero" are represented differently, > in the same way that max and "close to max" are represented differently > as well (only max gets 8/8). Anyhow, these histograms are only for rough estimation and to see some nice unicode output :) > > BTW, while looking into this I fixed a bug; sometimes we'd print 7/8 instead > of 8/8 for the max value, due to the ordering of FP computations > [see 1-3 in (2) above; it's 7/8 instead of 8/8]. Fixed with: > > diff --git a/util/qdist.c b/util/qdist.c > index cfe09e6..7842d34 100644 > --- a/util/qdist.c > +++ b/util/qdist.c > @@ -103,7 +103,7 @@ static const gunichar qdist_blocks[] = { > */ > static char *qdist_pr_internal(const struct qdist *dist) > { > - double min, max, step; > + double min, max; > GString *s = g_string_new(""); > size_t i; > > @@ -131,16 +131,14 @@ static char *qdist_pr_internal(const struct qdist *dist) > } > } > > - /* floor((count - min) * step) will give us the block index */ > - step = (QDIST_NR_BLOCK_CODES - 1) / (max - min); > - > for (i = 0; i < dist->n; i++) { > struct qdist_entry *e = &dist->entries[i]; > int index; > > /* make an exception with 0; instead of using block[0], print a space */ > if (e->count) { > - index = (int)((e->count - min) * step); > + /* divide first to avoid loss of precision when e->count == max */ > + index = (e->count - min) / (max - min) * (QDIST_NR_BLOCK_CODES - 1); > g_string_append_unichar(s, qdist_blocks[index]); > } else { > g_string_append_c(s, ' '); > > I also added a test to test-qdist (called "test_bin_precision") that > checks for this. > >>> The behaviour isn't the same though. With this we have >>> that the two outer bins (leftmost and rightmost) are unnecessarily >>> large (since they're out of the range of the input data). >>> >>> For example, assume the data is between 0 and 100 and n=5 (i.e. step=25), >>> it makes no sense to report the first bin as [-12.5,12.5). If we >>> then truncate the unnecessary edges, we'd have [0,12.5), but >>> then the second bin is [12.5,37.5). Bins of unequal size are >>> possible (although a bit unusual) in histograms, but given >>> our Unicode-based representation, we're limited to same-width bars. >> That is why I noted that I'm not sure what is the most correct from >> mathematical point of view. Maybe consider the second option? I.e. >> rounding to the middle of each bin with: >> >> x = left + step / 2; >> >> which would give the picture like this: >> >> >> xmin [*---*---*---*---*] xmax -- from >> | | | | | >> \ / \ / \ / \ / >> | | | | >> V V V V >> [* * * *] -- to > This binning is equivalent to what we do right now. > > The only difference is where the value is set (either at the left > of the bin, or at the center as above); this, however, isn't too > important, since this value is only used when printing > the labels, i.e. we could print [left, left+step) or > [center-step/2, center+step/2) and still get the same results. > >> Anyway, you may consider if you like whether it's possible to apply some >> simplifications from my code to the final version. > OK. This is how it looks like: > > diff --git a/util/qdist.c b/util/qdist.c > index 7842d34..3ca2227 100644 > --- a/util/qdist.c > +++ b/util/qdist.c > @@ -163,7 +163,7 @@ void qdist_bin__internal(struct qdist *to, const struct qdist *from, size_t n) > { > double xmin, xmax; > double step; > - size_t i, j, j_min; > + size_t i, j; > > qdist_init(to); > > @@ -194,7 +194,7 @@ void qdist_bin__internal(struct qdist *to, const struct qdist *from, size_t n) > } > > rebin: > - j_min = 0; > + j = 0; > for (i = 0; i < n; i++) { > double x; > double left, right; > @@ -210,19 +210,13 @@ void qdist_bin__internal(struct qdist *to, const struct qdist *from, size_t n) > * To avoid double-counting we capture [left, right) ranges, except for > * the righmost bin, which captures a [left, right] range. > */ > - for (j = j_min; j < from->n; j++) { > + while (j < from->n && > + (from->entries[j].x < right || > + (i == n - 1 && from->entries[j].x == right))) { If "i == n - 1" then we have to put everything what's left in 'from', right? If so, then: while (j < from->n && (from->entries[j].x < right || i == n - 1)) { Kind regards, Sergey > struct qdist_entry *o = &from->entries[j]; > > - /* entries are ordered so do not check beyond right */ > - if (o->x > right) { > - break; > - } > - if (o->x >= left && (o->x < right || > - (i == n - 1 && o->x == right))) { > - qdist_add(to, x, o->count); > - /* don't check this entry again */ > - j_min = j + 1; > - } > + qdist_add(to, x, o->count); > + j++; > } > } > } > > Thanks, > > Emilio