From mboxrd@z Thu Jan  1 00:00:00 1970
From: Edward Shishkin <edward@namesys.com>
Subject: Re: Linux Gazette benchmark Reiser 4
Date: Mon, 09 Jan 2006 01:07:46 +0300
Message-ID: <43C18D32.8020106@namesys.com>
References: <e50d039c0601061010k51b103e4qb799090d52e7b744@mail.gmail.com> <op.s2y0t2t2cigqcu@apollo13> <43BECFF3.10204@namesys.com>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Return-path: <reiserfs-list-return-27629-reiserfs=m.gmane.org@namesys.com>
list-help: <mailto:reiserfs-list-help@namesys.com>
list-unsubscribe: <mailto:reiserfs-list-unsubscribe@namesys.com>
list-post: <mailto:reiserfs-list@namesys.com>
Errors-To: flx@namesys.com
In-Reply-To: <43BECFF3.10204@namesys.com>
List-Id: <reiserfs-devel.vger.kernel.org>
Content-Type: text/plain; charset="us-ascii"; format="flowed"
To: Hans Reiser <reiser@namesys.com>
Cc: PFC <lists@peufeu.com>, jpiszcz@lucidpixels.com, reiserfs-list@namesys.com, Alexander Zarochentcev <zam@namesys.com>

Hans Reiser wrote:

>PFC wrote:
>
>  
>
>>    Hehe. Wow. Sure, a benchmark that runs in 0.03 seconds for the
>>fastest  one and 0.07 seconds for the slowest one looks pretty
>>reliable to me. How  much time does it take to spawn the "touch"
>>process 10k times ? Hm... I'd  guess most of the benchmark time ?
>>    
>>
>
>  
>

Let's consider this important aspect of benchmarking more carefully.
There is an interesting question here: how large must a difference be
before we can claim that some fs really wins on this statistic? Is
there any guarantee you won't get, say, 0.05 and 0.02 on the next run?
Sorry, but I didn't find any answer in Justin's notes. NOTE5 (Tests
Performed) says that questionable tests were re-run, but it seems we
need something more like research here rather than a re-run.

Below are some comments on how this problem is resolved (1*) in the
mongo benchmark. Look for example at this table:
http://www.namesys.com/benchmarks.html#mongo.2.6.11
Fractions like 0.982 (D/A), 1.017 (C/A) are shown in black, which means
that we _cannot_ make any assumptions about the winner, because
|1 - X/A| < 0.02. Where does the magic M = 0.02 come from?
Let's run the same phase with the same settings (file system, file set,
etc.) 10 times. For the same statistic X we will obtain a set of
different (because of errors) values x1, x2, ..., x10. Suppose that
X has a normal distribution (any objections?). It means that we can
calculate its confidence interval for a single measurement (2*) as
[X - d(P), X + d(P)], where d(P) = D*U(P), D is the dispersion (the
standard deviation of X) and U(P) is taken from the standard normal
table for the chosen confidence probability P (3*).
Now we have the following simple criterion (4*):

|A - X| >= 2d(P), i.e. |1 - X/A| >= 2D*U(P)/A

|           |<-d->|    |<-d->|
------<-----|----->----<-----|----->------
            A                X

The magic M = 0.02 for the mongo benchmark was calculated as 2D*U(P)/A
for the confidence probability P=0.85 (5*).
Now it is clear from the formula above why the measured values
shouldn't be too small: the criterion can no longer be satisfied. I am
sure (and it is easy to check) that 2d(P=0.85) is much more than
|0.07 - 0.03|, as it is in the case of find 10000 files. By the way,
some settings which produce small values (~5 sec) of the mongo STATS
statistic also fail this criterion.
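
To make the criterion concrete, here is a minimal Python sketch (not
part of mongo; the function names are made up for illustration). It
assumes the normal distribution mentioned above and takes U(P) as the
two-sided quantile of the standard normal law. The value D = 0.02 sec
in the usage note is purely hypothetical, since nobody has measured the
spread of that test.

```python
from statistics import NormalDist


def u_of_p(p):
    """Two-sided standard-normal quantile U(P): P(|Z| <= U) = p."""
    return NormalDist().inv_cdf((1.0 + p) / 2.0)


def is_significant(a, x, d, p=0.85):
    """Criterion from the text: |A - X| >= 2*d(P), where d(P) = D*U(P).

    a, x -- the two measured values being compared
    d    -- dispersion D (standard deviation of a single measurement)
    p    -- confidence probability P
    """
    return abs(a - x) >= 2.0 * d * u_of_p(p)
```

With the hypothetical D = 0.02, is_significant(0.03, 0.07, 0.02) is
False: the required gap 2*0.02*U(0.85) is about 0.058, which is more
than the observed 0.04.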


(1*) Maybe this is not a perfect way, but it is better than nothing
(2*) For N measurements the expression for the boundaries becomes a bit
     more complicated.
(3*) For P=0.85 (as can be found in any statistics textbook) U(P)=1.44
(4*) One more assumption here about identical distributions of A and X
(5*) Actually D = max(D_create, D_copy, D_read, D_delete, D_dd), where
     D for each phase was estimated once from 10 measurements with some
     fixed settings in the standard way:
     D^2 = ((x - x1)^2 + ... + (x - x10)^2)/(10 - 1), where
     x = (x1 + ... + x10)/10 is the average value.
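
The estimate in (5*) can be sketched in a few lines of Python. This is
only an illustration, not the mongo code: the function name and the
layout of the input (a dict mapping phase name to its list of repeated
timings) are assumptions. statistics.stdev divides by (n - 1), which
matches the formula for D^2 above.

```python
from statistics import stdev


def magic_threshold(phase_samples, a, u=1.44):
    """Return M = 2*D*U(P)/A as described in (5*).

    phase_samples -- hypothetical input: dict of phase name -> list of
                     repeated measurements of that phase
    a             -- the baseline value A the fractions are taken against
    u             -- U(P); 1.44 corresponds to P = 0.85
    """
    # D is the largest per-phase sample standard deviation.
    d = max(stdev(samples) for samples in phase_samples.values())
    return 2.0 * d * u / a
```

A fraction X/A is then treated as a real win or loss only when
|1 - X/A| >= magic_threshold(...).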

Edward.