Re: [Buildroot] [PATCH] utils/genrandconfig: reduce the maximum "size" of random configurations

From: "Yann E. MORIN" <yann.morin.1998@free.fr>
To: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
Cc: Romain Naour <romain.naour@gmail.com>,
	Buildroot List <buildroot@buildroot.org>
Subject: Re: [Buildroot] [PATCH] utils/genrandconfig: reduce the maximum "size" of random configurations
Date: Sun, 14 Nov 2021 19:35:38 +0100	[thread overview]
Message-ID: <20211114183538.GX247986@scaer> (raw)
In-Reply-To: <20211114140055.61997bd6@windsurf>

Thomas, All,

On 2021-11-14 14:00 +0100, Thomas Petazzoni spake thusly:
> On Sat, 13 Nov 2021 19:02:06 +0100
> "Yann E. MORIN" <yann.morin.1998@free.fr> wrote:
> > On IRC, we discussed about using a normal distrobution rather than the
> > uniform distribution that randint() provides.
> Yeah, but after thinking more about it, I'm wondering whether this
> makes sense or not. Indeed, why are we more interested by
> configurations having ~10% of options enabled compared to the ones
> having 1% or 20% ? All of them are equally interesting, aren't they?

Well, you said yourself that it is most probably that configurations
have few options turned on, so we'd probably favour smaller
configurations over bigger ones.

Especially since smaller configurations are more prone to exhibit
missing or otherwise-inherited dependencies. Conversely, though, bigger
configurations are also more prone to exhibit build failures on
unexpected dependencies.

So YMMV...

> > And here is the (10,7) gaussian (which has my preference):
[--SNIP--]
> I think the reality of Buildroot configurations is probably even more
> biased towards having just a few percents of options enabled.

We can maybe refine the parameters, like (3,7).

> > Since a gaussian can output values all over ℝ, we must limit the range
> > explicitly (note that the 0% above are not representing 0 occurences!):
> > 
> >     proba = 0
> >     while proba < 1 or proba > 100:
> >         proba = int(random.gauss(15, 5))
> 
> How much CPU is this loop going to consume before it gives a value that
> is within 1 and 100 ? Would a % 100 be better ?

Well, the normal distribution is very predictable [0]. Here, the mean µ
is 10, and the standard deviation σ is 7. This means that:

  - 50% of values are below µ, 50% above
  - ~68.29% of values are between µ-σ=3 and µ+σ=17
  - ~95.45% of values are between µ-2σ=-4 and µ+2σ=24
  - ~99.73% of values are between µ-3σ=-11 and µ+3σ=31
  - ~84.1% of values are greater than µ-σ=3
  - ~15.4% of values are below µ-σ=3
  - ~2.2% of values are below µ-2σ=-4

And with the following test:

    >>> import random
    >>> restart = (0, 0)
    >>> for i in range(100*1000*1000):  # 100M iterations
    ...     r = int(random.gauss(10, 7)+0.5)  # Round to nearest int
    ...     if r < 1:
    ...         restart = (restart[0]+1, restart[1])
    ...     elif r > 100:
    ...         restart = (restart[0], restart[1]+1)
    ...
    >>> print('restart={}'.format(restart))
    restart=(8736128, 0)

So, it means there were ~8.736% of values below 1, and none above
100, which is consistent with the theory. So, we would usually need
just a very few loops to find an appropriate value:

                 Cumulative probability the loop generates:
                | <1                    | >=1
    ------------+-----------------------+-----------------
    1st iter    | 0.08736               | 0.91264
    2nd iter    | 0.00763 =0.08736^2    | 0.99236
    3rd iter    | 0.00066 =0.08736^3    | 0.99933
    4th iter    | 0.00006 =0.08736^4    | 0.99994
    5th iter    | 0.000005 =0.08736^5   | 0.99999

I.e. the odds to get something less than one after the 5th iteration are
less than in in 100,000.

For the other extremity of the range, 100, it is so far away (almost
13σ !) that we did not even get one in 100 million iterations. That's
less than the odds of a dinosaur-killing asteroid hitting Earth in the
next million years (or a very big number like that!)...

Oh, and doing those 100M iterations took ~54s, which means the loop is,
well, cheap. Very cheap.

[0] https://en.wikipedia.org/wiki/Normal_distribution

(Yeah, I love maths and probablities. Even though I'm pretty bad at it!)

Regards,
Yann E. MORIN.

-- 
.-----------------.--------------------.------------------.--------------------.
|  Yann E. MORIN  | Real-Time Embedded | /"\ ASCII RIBBON | Erics' conspiracy: |
| +33 662 376 056 | Software  Designer | \ / CAMPAIGN     |  ___               |
| +33 561 099 427 `------------.-------:  X  AGAINST      |  \e/  There is no  |
| http://ymorin.is-a-geek.org/ | _/*\_ | / \ HTML MAIL    |   v   conspiracy.  |
'------------------------------^-------^------------------^--------------------'
_______________________________________________
buildroot mailing list
buildroot@buildroot.org
https://lists.buildroot.org/mailman/listinfo/buildroot