Re: [Qemu-devel] [PATCH v6 08/15] qdist: add module to represent frequency distributions of data

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Sergey Fedorov <serge.fdrv@gmail.com>
To: "Emilio G. Cota" <cota@braap.org>
Cc: "QEMU Developers" <qemu-devel@nongnu.org>,
	"MTTCG Devel" <mttcg@listserver.greensocs.com>,
	"Alex Bennée" <alex.bennee@linaro.org>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Richard Henderson" <rth@twiddle.net>
Subject: Re: [Qemu-devel] [PATCH v6 08/15] qdist: add module to represent frequency distributions of data
Date: Wed, 8 Jun 2016 17:10:03 +0300	[thread overview]
Message-ID: <5758273B.7090603@gmail.com> (raw)
In-Reply-To: <20160608000224.GB16255@flamenco>

On 08/06/16 03:02, Emilio G. Cota wrote:
> On Tue, Jun 07, 2016 at 18:56:48 +0300, Sergey Fedorov wrote:
>> On 07/06/16 04:05, Emilio G. Cota wrote:
>>> On Sat, May 28, 2016 at 21:15:06 +0300, Sergey Fedorov wrote:
>>>> On 25/05/16 04:13, Emilio G. Cota wrote:
>>>>> diff --git a/util/qdist.c b/util/qdist.c
>>>>> new file mode 100644
>>>>> index 0000000..3343640
>>>>> --- /dev/null
>>>>> +++ b/util/qdist.c
>>>>> @@ -0,0 +1,386 @@
>>>> (snip)
>>>>> +
>>>>> +void qdist_add(struct qdist *dist, double x, long count)
>>>>> +{
>>>>> +    struct qdist_entry *entry = NULL;
>>>>> +
>>>>> +    if (dist->entries) {
>>>>> +        struct qdist_entry e;
>>>>> +
>>>>> +        e.x = x;
>>>>> +        entry = bsearch(&e, dist->entries, dist->n, sizeof(e), qdist_cmp);
>>>>> +    }
>>>>> +
>>>>> +    if (entry) {
>>>>> +        entry->count += count;
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> +    dist->entries = g_realloc(dist->entries,
>>>>> +                              sizeof(*dist->entries) * (dist->n + 1));
>>>> Repeated doubling?
>>> Can you please elaborate?
>> I mean dynamic array with a growth factor of 2
>> [https://en.wikipedia.org/wiki/Dynamic_array].
> Changed to:
>
> diff --git a/include/qemu/qdist.h b/include/qemu/qdist.h
> index 6d8b701..f30050c 100644
> --- a/include/qemu/qdist.h
> +++ b/include/qemu/qdist.h
> @@ -29,6 +29,7 @@ struct qdist_entry {
>  struct qdist {
>      struct qdist_entry *entries;
>      size_t n;
> +    size_t size;
>  };
>  
>  #define QDIST_PR_BORDER     BIT(0)
> diff --git a/util/qdist.c b/util/qdist.c
> index dc9dbd1..3b54354 100644
> --- a/util/qdist.c
> +++ b/util/qdist.c
> @@ -16,6 +16,7 @@
>  void qdist_init(struct qdist *dist)
>  {
>      dist->entries = NULL;
> +    dist->size = 0;
>      dist->n = 0;
>  }
>  
> @@ -58,8 +59,11 @@ void qdist_add(struct qdist *dist, double x, long count)
>          return;
>      }
>  
> -    dist->entries = g_realloc(dist->entries,
> -                              sizeof(*dist->entries) * (dist->n + 1));
> +    if (unlikely(dist->n == dist->size)) {
> +        dist->size = dist->size ? dist->size * 2 : 1;

We could initialize 'dist->size' to 1 and allocate a 1-entry
'dist->entries' array in qdist_init() to avoid this ternary operation ;-)

Otherwise looks good.

> +        dist->entries = g_realloc(dist->entries,
> +                                  sizeof(*dist->entries) * (dist->size));
> +    }
>      dist->n++;
>      entry = &dist->entries[dist->n - 1];
>      entry->x = x;
>
>
>>>> (snip)
>>>>> +static char *qdist_pr_internal(const struct qdist *dist)
>>>>> +{
>>>>> +    double min, max, step;
>>>>> +    GString *s = g_string_new("");
>>>>> +    size_t i;
>>>>> +
>>>>> +    /* if only one entry, its printout will be either full or empty */
>>>>> +    if (dist->n == 1) {
>>>>> +        if (dist->entries[0].count) {
>>>>> +            g_string_append_unichar(s, qdist_blocks[QDIST_NR_BLOCK_CODES - 1]);
>>>>> +        } else {
>>>>> +            g_string_append_c(s, ' ');
>>>>> +        }
>>>>> +        goto out;
>>>>> +    }
>>>>> +
>>>>> +    /* get min and max counts */
>>>>> +    min = dist->entries[0].count;
>>>>> +    max = min;
>>>>> +    for (i = 0; i < dist->n; i++) {
>>>>> +        struct qdist_entry *e = &dist->entries[i];
>>>>> +
>>>>> +        if (e->count < min) {
>>>>> +            min = e->count;
>>>>> +        }
>>>>> +        if (e->count > max) {
>>>>> +            max = e->count;
>>>>> +        }
>>>>> +    }
>>>>> +
>>>>> +    /* floor((count - min) * step) will give us the block index */
>>>>> +    step = (QDIST_NR_BLOCK_CODES - 1) / (max - min);
>>>>> +
>>>>> +    for (i = 0; i < dist->n; i++) {
>>>>> +        struct qdist_entry *e = &dist->entries[i];
>>>>> +        int index;
>>>>> +
>>>>> +        /* make an exception with 0; instead of using block[0], print a space */
>>>>> +        if (e->count) {
>>>>> +            index = (int)((e->count - min) * step);
>>>> So "e->count == min" gives us one eighth block instead of just space?
>>> Yes, only 0 can print a space.
>> So our scale is not linear. I think some users might get confused by this.
> That's correct. I think special-casing 0 makes sense though, since
> it increases the signal-to-noise ratio of the histogram. For example:
>
> 1) 0 as ' ':
> TB hash occupancy   31.84% avg chain occ. Histogram: [0,10)%|▆ █  ▅▁▃▁▁|[90,100]%
> TB hash avg chain   1.015 buckets. Histogram: 1|█▁▁|3
>
> 2) 0 as '1/8':
> TB hash occupancy   32.07% avg chain occ. Histogram: [0,10)%|▆▁█▁▁▅▁▃▁▁|[90,100]%
> TB hash avg chain   1.015 buckets. Histogram: 1|▇▁▁|3
>
> I think in these examples most users would be less confused by 1) than by 2).

I was meaning to represent all bars whose value < 1/8 as a space, not
only whose value is pure zero. Otherwise we can see 1/8 bar where the
actual value is negligibly differ from zero as in the second example.

>
> (snip)
>>>>> +        to->n = from->n;
>>>>> +        memcpy(to->entries, from->entries, sizeof(*to->entries) * to->n);
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> + rebin:
>> By the way, here's a space before the 'rebin' label.
> Yes, I always do this.
> It prevents diff from mistaking the label for a function definition,
> and thus wrongly using the label as context. See:
>   https://lkml.org/lkml/2010/6/16/312

Cool!

>
>
>>>>> +    j_min = 0;
>>>>> +    for (i = 0; i < n; i++) {
>>>>> +        double x;
>>>>> +        double left, right;
>>>>> +
>>>>> +        left = xmin + i * step;
>>>>> +        right = xmin + (i + 1) * step;
>>>>> +
>>>>> +        /* Add x, even if it might not get any counts later */
>>>>> +        x = left;
>>>> This way we round down to the left margin of each bin like this:
>>>>
>>>>     xmin [*---*---*---*---*] xmax   -- from
>>>>           |  /|  /|  /|  /
>>>>           | / | / | / | /
>>>>           |/  |/  |/  |/
>>>>           |   |   |   |
>>>>           V   V   V   V
>>>>          [*   *   *   *]            -- to
>>> (snip)
>>>>     xmin [*----*----*----*] xmax    -- from
>>>>         \   /\   /\   /\   /
>>>>          \ /  \ /  \ /  \ /
>>>>           |    |    |    |
>>>>           V    V    V    V
>>>>          [*    *    *    *]         -- to
>>>>
>>>> I'm not sure which is the more correct option from the mathematical
>>>> point of view; but multiple-binning with the last variant of the
>>>> algorithm we would still give the same result.
>>> There's no "right" or "wrong" way as long as we're consistent
>>> and we print the right counts in the right bins. I think the
>>> convention I chose is simple enough, and leads to simple printing
>>> of the labels. But yes other alternatives would be OK here.
>> Well, if we go ahead with my last suggestion the code would look like this:
>>
>> rebin:
>>     /* We do the binning using the following scheme:
>>      *
>>      *  xmin [*----*----*----*] xmax    -- from
>>      *      \   /\   /\   /\   /
>>      *       \ /  \ /  \ /  \ /
>>      *        |    |    |    |
>>      *        V    V    V    V
>>      *       [*    *    *    *]         -- to
>>      *
>>      */
>>     step = (xmax - xmin) / (n - 1);
>>     j = 0;
>>     for (i = 0; i < n; i++) {
>>         double x;
>>         double right;
>>
>>         x = xmin + i * step;
>>         right = x + 0.5 * step;
>>
>>         /* Add x, even if it might not get any counts later */
>>         qdist_add(to, x, 0);
>>
>>         /* To avoid double-counting we capture [left, right) ranges */
>>         while (from->entries[j].x < right && j < from->n) {
>>             qdist_add(to, x, from->entries[j].count);
>>             j++;
>>         }
>>     }
>>     assert(j == from->n);
>> }
>>
>> Actually it's simpler than current version.
> The behaviour isn't the same though. With this we have
> that the two outer bins (leftmost and rightmost) are unnecessarily
> large (since they're out of the range of the input data).
>
> For example, assume the data is between 0 and 100 and n=5 (i.e. step=25),
> it makes no sense to report the first bin as [-12.5,12.5). If we
> then truncate the unnecessary edges, we'd have [0,12.5), but
> then the second bin is [12.5,37.5). Bins of unequal size are
> possible (although a bit unusual) in histograms, but given
> our Unicode-based representation, we're limited to same-width bars.

That is why I noted that I'm not sure what is the most correct from
mathematical point of view. Maybe consider the second option? I.e.
rounding to the middle of each bin with:

    x = left + step / 2;

which would give the picture like this:


    xmin [*---*---*---*---*] xmax   -- from
          |   |   |   |   |
           \ / \ / \ / \ /
            |   |   |   |
            V   V   V   V
           [*   *   *   *]          -- to

Anyway, you may consider if you like whether it's possible to apply some
simplifications from my code to the final version.

Kind regards,
Sergey

next prev parent reply	other threads:[~2016-06-08 14:10 UTC|newest]

Thread overview: 63+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-05-25  1:13 [Qemu-devel] [PATCH v6 00/15] tb hash improvements Emilio G. Cota
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 01/15] compiler.h: add QEMU_ALIGNED() to enforce struct alignment Emilio G. Cota
2016-05-27 19:54   ` Sergey Fedorov
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 02/15] seqlock: remove optional mutex Emilio G. Cota
2016-05-27 19:55   ` Sergey Fedorov
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 03/15] seqlock: rename write_lock/unlock to write_begin/end Emilio G. Cota
2016-05-27 19:59   ` Sergey Fedorov
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 04/15] include/processor.h: define cpu_relax() Emilio G. Cota
2016-05-27 20:53   ` Sergey Fedorov
2016-05-27 21:10     ` Emilio G. Cota
2016-05-28 12:35       ` Sergey Fedorov
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 05/15] qemu-thread: add simple test-and-set spinlock Emilio G. Cota
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 06/15] exec: add tb_hash_func5, derived from xxhash Emilio G. Cota
2016-05-28 12:36   ` Sergey Fedorov
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 07/15] tb hash: hash phys_pc, pc, and flags with xxhash Emilio G. Cota
2016-05-28 12:39   ` Sergey Fedorov
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 08/15] qdist: add module to represent frequency distributions of data Emilio G. Cota
2016-05-28 18:15   ` Sergey Fedorov
2016-06-03 17:22     ` Emilio G. Cota
2016-06-03 17:29       ` Sergey Fedorov
2016-06-03 17:46         ` Sergey Fedorov
2016-06-06 23:40           ` Emilio G. Cota
2016-06-07 14:06             ` Sergey Fedorov
2016-06-07 22:53               ` Emilio G. Cota
2016-06-08 13:09                 ` Sergey Fedorov
2016-06-07  1:05     ` Emilio G. Cota
2016-06-07 15:56       ` Sergey Fedorov
2016-06-08  0:02         ` Emilio G. Cota
2016-06-08 14:10           ` Sergey Fedorov [this message]
2016-06-08 18:06             ` Emilio G. Cota
2016-06-08 18:18               ` Sergey Fedorov
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 09/15] qdist: add test program Emilio G. Cota
2016-05-28 18:56   ` Sergey Fedorov
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 10/15] qht: QEMU's fast, resizable and scalable Hash Table Emilio G. Cota
2016-05-29 19:52   ` Sergey Fedorov
2016-05-29 19:55     ` Sergey Fedorov
2016-05-31  7:46     ` Alex Bennée
2016-06-01 20:53       ` Sergey Fedorov
2016-06-03  9:18     ` Emilio G. Cota
2016-06-03 15:19       ` Sergey Fedorov
2016-06-03 11:01     ` Emilio G. Cota
2016-06-03 15:34       ` Sergey Fedorov
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 11/15] qht: add test program Emilio G. Cota
2016-05-29 20:15   ` Sergey Fedorov
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 12/15] qht: add qht-bench, a performance benchmark Emilio G. Cota
2016-05-29 20:45   ` Sergey Fedorov
2016-06-03 11:41     ` Emilio G. Cota
2016-06-03 15:41       ` Sergey Fedorov
2016-05-31 15:12   ` Alex Bennée
2016-05-31 16:44     ` Emilio G. Cota
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 13/15] qht: add test-qht-par to invoke qht-bench from 'check' target Emilio G. Cota
2016-05-29 20:53   ` Sergey Fedorov
2016-06-03 11:07     ` Emilio G. Cota
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 14/15] tb hash: track translated blocks with qht Emilio G. Cota
2016-05-29 21:09   ` Sergey Fedorov
2016-05-31  8:39   ` Alex Bennée
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 15/15] translate-all: add tb hash bucket info to 'info jit' dump Emilio G. Cota
2016-05-29 21:14   ` Sergey Fedorov
2016-06-08  6:25 ` [Qemu-devel] [PATCH v6 00/15] tb hash improvements Alex Bennée
2016-06-08 15:16   ` Emilio G. Cota
2016-06-08 15:35   ` Richard Henderson
2016-06-08 15:37     ` Sergey Fedorov
2016-06-08 16:45       ` Alex Bennée

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5758273B.7090603@gmail.com \
    --to=serge.fdrv@gmail.com \
    --cc=alex.bennee@linaro.org \
    --cc=cota@braap.org \
    --cc=mttcg@listserver.greensocs.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=rth@twiddle.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.