Re: [Qemu-devel] [PATCH v6 08/15] qdist: add module to represent frequency distributions of data

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: Sergey Fedorov <serge.fdrv@gmail.com>
To: "Emilio G. Cota" <cota@braap.org>
Cc: "QEMU Developers" <qemu-devel@nongnu.org>,
	"MTTCG Devel" <mttcg@listserver.greensocs.com>,
	"Alex Bennée" <alex.bennee@linaro.org>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Richard Henderson" <rth@twiddle.net>
Subject: Re: [Qemu-devel] [PATCH v6 08/15] qdist: add module to represent frequency distributions of data
Date: Wed, 8 Jun 2016 17:10:03 +0300	[thread overview]
Message-ID: <5758273B.7090603@gmail.com> (raw)
In-Reply-To: <20160608000224.GB16255@flamenco>

On 08/06/16 03:02, Emilio G. Cota wrote:
> On Tue, Jun 07, 2016 at 18:56:48 +0300, Sergey Fedorov wrote:
>> On 07/06/16 04:05, Emilio G. Cota wrote:
>>> On Sat, May 28, 2016 at 21:15:06 +0300, Sergey Fedorov wrote:
>>>> On 25/05/16 04:13, Emilio G. Cota wrote:
>>>>> diff --git a/util/qdist.c b/util/qdist.c
>>>>> new file mode 100644
>>>>> index 0000000..3343640
>>>>> --- /dev/null
>>>>> +++ b/util/qdist.c
>>>>> @@ -0,0 +1,386 @@
>>>> (snip)
>>>>> +
>>>>> +void qdist_add(struct qdist *dist, double x, long count)
>>>>> +{
>>>>> +    struct qdist_entry *entry = NULL;
>>>>> +
>>>>> +    if (dist->entries) {
>>>>> +        struct qdist_entry e;
>>>>> +
>>>>> +        e.x = x;
>>>>> +        entry = bsearch(&e, dist->entries, dist->n, sizeof(e), qdist_cmp);
>>>>> +    }
>>>>> +
>>>>> +    if (entry) {
>>>>> +        entry->count += count;
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> +    dist->entries = g_realloc(dist->entries,
>>>>> +                              sizeof(*dist->entries) * (dist->n + 1));
>>>> Repeated doubling?
>>> Can you please elaborate?
>> I mean dynamic array with a growth factor of 2
>> [https://en.wikipedia.org/wiki/Dynamic_array].
> Changed to:
>
> diff --git a/include/qemu/qdist.h b/include/qemu/qdist.h
> index 6d8b701..f30050c 100644
> --- a/include/qemu/qdist.h
> +++ b/include/qemu/qdist.h
> @@ -29,6 +29,7 @@ struct qdist_entry {
>  struct qdist {
>      struct qdist_entry *entries;
>      size_t n;
> +    size_t size;
>  };
>  
>  #define QDIST_PR_BORDER     BIT(0)
> diff --git a/util/qdist.c b/util/qdist.c
> index dc9dbd1..3b54354 100644
> --- a/util/qdist.c
> +++ b/util/qdist.c
> @@ -16,6 +16,7 @@
>  void qdist_init(struct qdist *dist)
>  {
>      dist->entries = NULL;
> +    dist->size = 0;
>      dist->n = 0;
>  }
>  
> @@ -58,8 +59,11 @@ void qdist_add(struct qdist *dist, double x, long count)
>          return;
>      }
>  
> -    dist->entries = g_realloc(dist->entries,
> -                              sizeof(*dist->entries) * (dist->n + 1));
> +    if (unlikely(dist->n == dist->size)) {
> +        dist->size = dist->size ? dist->size * 2 : 1;

We could initialize 'dist->size' to 1 and allocate a 1-entry
'dist->entries' array in qdist_init() to avoid this ternary operation ;-)

Otherwise looks good.

> +        dist->entries = g_realloc(dist->entries,
> +                                  sizeof(*dist->entries) * (dist->size));
> +    }
>      dist->n++;
>      entry = &dist->entries[dist->n - 1];
>      entry->x = x;
>
>
>>>> (snip)
>>>>> +static char *qdist_pr_internal(const struct qdist *dist)
>>>>> +{
>>>>> +    double min, max, step;
>>>>> +    GString *s = g_string_new("");
>>>>> +    size_t i;
>>>>> +
>>>>> +    /* if only one entry, its printout will be either full or empty */
>>>>> +    if (dist->n == 1) {
>>>>> +        if (dist->entries[0].count) {
>>>>> +            g_string_append_unichar(s, qdist_blocks[QDIST_NR_BLOCK_CODES - 1]);
>>>>> +        } else {
>>>>> +            g_string_append_c(s, ' ');
>>>>> +        }
>>>>> +        goto out;
>>>>> +    }
>>>>> +
>>>>> +    /* get min and max counts */
>>>>> +    min = dist->entries[0].count;
>>>>> +    max = min;
>>>>> +    for (i = 0; i < dist->n; i++) {
>>>>> +        struct qdist_entry *e = &dist->entries[i];
>>>>> +
>>>>> +        if (e->count < min) {
>>>>> +            min = e->count;
>>>>> +        }
>>>>> +        if (e->count > max) {
>>>>> +            max = e->count;
>>>>> +        }
>>>>> +    }
>>>>> +
>>>>> +    /* floor((count - min) * step) will give us the block index */
>>>>> +    step = (QDIST_NR_BLOCK_CODES - 1) / (max - min);
>>>>> +
>>>>> +    for (i = 0; i < dist->n; i++) {
>>>>> +        struct qdist_entry *e = &dist->entries[i];
>>>>> +        int index;
>>>>> +
>>>>> +        /* make an exception with 0; instead of using block[0], print a space */
>>>>> +        if (e->count) {
>>>>> +            index = (int)((e->count - min) * step);
>>>> So "e->count == min" gives us one eighth block instead of just space?
>>> Yes, only 0 can print a space.
>> So our scale is not linear. I think some users might get confused by this.
> That's correct. I think special-casing 0 makes sense though, since
> it increases the signal-to-noise ratio of the histogram. For example:
>
> 1) 0 as ' ':
> TB hash occupancy   31.84% avg chain occ. Histogram: [0,10)%|▆ █  ▅▁▃▁▁|[90,100]%
> TB hash avg chain   1.015 buckets. Histogram: 1|█▁▁|3
>
> 2) 0 as '1/8':
> TB hash occupancy   32.07% avg chain occ. Histogram: [0,10)%|▆▁█▁▁▅▁▃▁▁|[90,100]%
> TB hash avg chain   1.015 buckets. Histogram: 1|▇▁▁|3
>
> I think in these examples most users would be less confused by 1) than by 2).

I was meaning to represent all bars whose value < 1/8 as a space, not
only whose value is pure zero. Otherwise we can see 1/8 bar where the
actual value is negligibly differ from zero as in the second example.

>
> (snip)
>>>>> +        to->n = from->n;
>>>>> +        memcpy(to->entries, from->entries, sizeof(*to->entries) * to->n);
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> + rebin:
>> By the way, here's a space before the 'rebin' label.
> Yes, I always do this.
> It prevents diff from mistaking the label for a function definition,
> and thus wrongly using the label as context. See:
>   https://lkml.org/lkml/2010/6/16/312

Cool!

>
>
>>>>> +    j_min = 0;
>>>>> +    for (i = 0; i < n; i++) {
>>>>> +        double x;
>>>>> +        double left, right;
>>>>> +
>>>>> +        left = xmin + i * step;
>>>>> +        right = xmin + (i + 1) * step;
>>>>> +
>>>>> +        /* Add x, even if it might not get any counts later */
>>>>> +        x = left;
>>>> This way we round down to the left margin of each bin like this:
>>>>
>>>>     xmin [*---*---*---*---*] xmax   -- from
>>>>           |  /|  /|  /|  /
>>>>           | / | / | / | /
>>>>           |/  |/  |/  |/
>>>>           |   |   |   |
>>>>           V   V   V   V
>>>>          [*   *   *   *]            -- to
>>> (snip)
>>>>     xmin [*----*----*----*] xmax    -- from
>>>>         \   /\   /\   /\   /
>>>>          \ /  \ /  \ /  \ /
>>>>           |    |    |    |
>>>>           V    V    V    V
>>>>          [*    *    *    *]         -- to
>>>>
>>>> I'm not sure which is the more correct option from the mathematical
>>>> point of view; but multiple-binning with the last variant of the
>>>> algorithm we would still give the same result.
>>> There's no "right" or "wrong" way as long as we're consistent
>>> and we print the right counts in the right bins. I think the
>>> convention I chose is simple enough, and leads to simple printing
>>> of the labels. But yes other alternatives would be OK here.
>> Well, if we go ahead with my last suggestion the code would look like this:
>>
>> rebin:
>>     /* We do the binning using the following scheme:
>>      *
>>      *  xmin [*----*----*----*] xmax    -- from
>>      *      \   /\   /\   /\   /
>>      *       \ /  \ /  \ /  \ /
>>      *        |    |    |    |
>>      *        V    V    V    V
>>      *       [*    *    *    *]         -- to
>>      *
>>      */
>>     step = (xmax - xmin) / (n - 1);
>>     j = 0;
>>     for (i = 0; i < n; i++) {
>>         double x;
>>         double right;
>>
>>         x = xmin + i * step;
>>         right = x + 0.5 * step;
>>
>>         /* Add x, even if it might not get any counts later */
>>         qdist_add(to, x, 0);
>>
>>         /* To avoid double-counting we capture [left, right) ranges */
>>         while (from->entries[j].x < right && j < from->n) {
>>             qdist_add(to, x, from->entries[j].count);
>>             j++;
>>         }
>>     }
>>     assert(j == from->n);
>> }
>>
>> Actually it's simpler than current version.
> The behaviour isn't the same though. With this we have
> that the two outer bins (leftmost and rightmost) are unnecessarily
> large (since they're out of the range of the input data).
>
> For example, assume the data is between 0 and 100 and n=5 (i.e. step=25),
> it makes no sense to report the first bin as [-12.5,12.5). If we
> then truncate the unnecessary edges, we'd have [0,12.5), but
> then the second bin is [12.5,37.5). Bins of unequal size are
> possible (although a bit unusual) in histograms, but given
> our Unicode-based representation, we're limited to same-width bars.

That is why I noted that I'm not sure what is the most correct from
mathematical point of view. Maybe consider the second option? I.e.
rounding to the middle of each bin with:

    x = left + step / 2;

which would give the picture like this:


    xmin [*---*---*---*---*] xmax   -- from
          |   |   |   |   |
           \ / \ / \ / \ /
            |   |   |   |
            V   V   V   V
           [*   *   *   *]          -- to

Anyway, you may consider if you like whether it's possible to apply some
simplifications from my code to the final version.

Kind regards,
Sergey

next prev parent reply	other threads:[~2016-06-08 14:10 UTC|newest]

Thread overview: 63+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-05-25  1:13 [Qemu-devel] [PATCH v6 00/15] tb hash improvements Emilio G. Cota
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 01/15] compiler.h: add QEMU_ALIGNED() to enforce struct alignment Emilio G. Cota
2016-05-27 19:54   ` Sergey Fedorov
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 02/15] seqlock: remove optional mutex Emilio G. Cota
2016-05-27 19:55   ` Sergey Fedorov
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 03/15] seqlock: rename write_lock/unlock to write_begin/end Emilio G. Cota
2016-05-27 19:59   ` Sergey Fedorov
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 04/15] include/processor.h: define cpu_relax() Emilio G. Cota
2016-05-27 20:53   ` Sergey Fedorov
2016-05-27 21:10     ` Emilio G. Cota
2016-05-28 12:35       ` Sergey Fedorov
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 05/15] qemu-thread: add simple test-and-set spinlock Emilio G. Cota
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 06/15] exec: add tb_hash_func5, derived from xxhash Emilio G. Cota
2016-05-28 12:36   ` Sergey Fedorov
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 07/15] tb hash: hash phys_pc, pc, and flags with xxhash Emilio G. Cota
2016-05-28 12:39   ` Sergey Fedorov
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 08/15] qdist: add module to represent frequency distributions of data Emilio G. Cota
2016-05-28 18:15   ` Sergey Fedorov
2016-06-03 17:22     ` Emilio G. Cota
2016-06-03 17:29       ` Sergey Fedorov
2016-06-03 17:46         ` Sergey Fedorov
2016-06-06 23:40           ` Emilio G. Cota
2016-06-07 14:06             ` Sergey Fedorov
2016-06-07 22:53               ` Emilio G. Cota
2016-06-08 13:09                 ` Sergey Fedorov
2016-06-07  1:05     ` Emilio G. Cota
2016-06-07 15:56       ` Sergey Fedorov
2016-06-08  0:02         ` Emilio G. Cota
2016-06-08 14:10           ` Sergey Fedorov [this message]
2016-06-08 18:06             ` Emilio G. Cota
2016-06-08 18:18               ` Sergey Fedorov
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 09/15] qdist: add test program Emilio G. Cota
2016-05-28 18:56   ` Sergey Fedorov
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 10/15] qht: QEMU's fast, resizable and scalable Hash Table Emilio G. Cota
2016-05-29 19:52   ` Sergey Fedorov
2016-05-29 19:55     ` Sergey Fedorov
2016-05-31  7:46     ` Alex Bennée
2016-06-01 20:53       ` Sergey Fedorov
2016-06-03  9:18     ` Emilio G. Cota
2016-06-03 15:19       ` Sergey Fedorov
2016-06-03 11:01     ` Emilio G. Cota
2016-06-03 15:34       ` Sergey Fedorov
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 11/15] qht: add test program Emilio G. Cota
2016-05-29 20:15   ` Sergey Fedorov
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 12/15] qht: add qht-bench, a performance benchmark Emilio G. Cota
2016-05-29 20:45   ` Sergey Fedorov
2016-06-03 11:41     ` Emilio G. Cota
2016-06-03 15:41       ` Sergey Fedorov
2016-05-31 15:12   ` Alex Bennée
2016-05-31 16:44     ` Emilio G. Cota
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 13/15] qht: add test-qht-par to invoke qht-bench from 'check' target Emilio G. Cota
2016-05-29 20:53   ` Sergey Fedorov
2016-06-03 11:07     ` Emilio G. Cota
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 14/15] tb hash: track translated blocks with qht Emilio G. Cota
2016-05-29 21:09   ` Sergey Fedorov
2016-05-31  8:39   ` Alex Bennée
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 15/15] translate-all: add tb hash bucket info to 'info jit' dump Emilio G. Cota
2016-05-29 21:14   ` Sergey Fedorov
2016-06-08  6:25 ` [Qemu-devel] [PATCH v6 00/15] tb hash improvements Alex Bennée
2016-06-08 15:16   ` Emilio G. Cota
2016-06-08 15:35   ` Richard Henderson
2016-06-08 15:37     ` Sergey Fedorov
2016-06-08 16:45       ` Alex Bennée

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5758273B.7090603@gmail.com \
    --to=serge.fdrv@gmail.com \
    --cc=alex.bennee@linaro.org \
    --cc=cota@braap.org \
    --cc=mttcg@listserver.greensocs.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=rth@twiddle.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).