All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Đoàn Trần Công Danh" <congdanhqx@gmail.com>
To: Derrick Stolee <stolee@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: test-tool: bloom: generate_filter for multiple string?
Date: Tue, 5 Jan 2021 20:34:56 +0700	[thread overview]
Message-ID: <X/RrAAXsQXdMJbpo@danh.dev> (raw)
In-Reply-To: <dfe3fbf9-4708-cd53-d78e-780375224b8e@gmail.com>

On 2020-12-31 06:31:52-0500, Derrick Stolee <stolee@gmail.com> wrote:
> On 12/30/2020 10:54 PM, Đoàn Trần Công Danh wrote:
> > I'm reading the code for Bloom Filter to see if arXiv:2012.00472
> > could be an improvement.
> > 
> > I'm not sure if I'm missing something or it's our test-bloom generate_filter
> > doesn't really support testing for multiple inputs.
> 
> It doesn't support creating a _realistic_ filter of the given size. The tests
> at this level are more about keeping the filters consistent so we can trust
> that we are not accidentally changing the hashing algorithm and its file format.

(Sorry for this late reply)

Yes, make sense.

> The test works if we provide multiple inputs, the problem is that the resulting
> filter has a higher density of bits than we expect, since we didn't size it
> according to the number of inputs.

OK.

> > If I understand correctly, we should either:
> > * allocate more entry for inputs; or
> > * abort if argc != 3
> 
> Either approach would be fine. I wonder what your goal is for testing the
> multiple inputs. Are you expecting a larger filter to demonstrate some
> behavior that is not tested by a small filter?

I was expecting a larger filter to trying with some ideas that I had
in mind before implement it in real code.
However, after toying with my idea, I think it's more of a solution
looking for problem.
(The original post was posted before I realised that).

> > diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
> > index 46e97b04eb..1026facc59 100644
> > --- a/t/helper/test-bloom.c
> > +++ b/t/helper/test-bloom.c
> > @@ -68,12 +68,14 @@ int cmd__bloom(int argc, const char **argv)
> >  	if (!strcmp(argv[1], "generate_filter")) {
> >  		struct bloom_filter filter;
> >  		int i = 2;
> > -		filter.len =  (settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
> > -		filter.data = xcalloc(filter.len, sizeof(unsigned char));
> >  
> >  		if (argc - 1 < i)
> >  			usage(bloom_usage);
> >  
> > +		filter.len = (settings.bits_per_entry * (argc - 2) + BITS_PER_WORD - 1)
> > +			/ BITS_PER_WORD;
> > +		filter.data = xcalloc(filter.len, sizeof(unsigned char));
> > +
> 
> Whitespace issues aside, this is the right approach to make the filter grow with
> the input.
> 
> >  		while (argv[i]) {
> >  			add_string_to_filter(argv[i], &filter);
> >  			i++;
> > diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
> > index 7e4ab1795f..6d83927c86 100755
> > --- a/t/t0095-bloom.sh
> > +++ b/t/t0095-bloom.sh
> > @@ -67,6 +67,17 @@ test_expect_success 'compute bloom key for test string 2' '
> >  	test_cmp expect actual
> >  '
> >  
> > +test_expect_success 'compute bloom key for multiple string' '
> > +	cat >expect <<-\EOF &&
> > +	Hashes:0xb270de9b|0x1bb6f26e|0x84fd0641|0xee431a14|0x57892de7|0xc0cf41ba|0x2a15558d|
> > +	Hashes:0x20ab385b|0xf5237fe2|0xc99bc769|0x9e140ef0|0x728c5677|0x47049dfe|0x1b7ce585|
> 
> Multiple hash outputs work already, helping us build confidence that our
> hash algorithm hasn't changed.
> 
> > +	Filter_Length:3
> > +	Filter_Data:45|ba|ac|
> 
> And this is the part of the output that you are changing.
> 
> > +	EOF
> > +	test-tool bloom generate_filter "Hello world!" file.txt >actual &&
> > +	test_cmp expect actual
> > +'
> > +
> 
> So, your suggested change has merit. I just wonder what value it provides on top
> of the current implementation? If your goal is to create a way to inspect a full-sized
> filter, then that would be interesting to explore.

Yes, originally, I was trying to creat some way to inspect a full-size
filter, i.e. a bloom filter for some set of object_id during fetch
negotiation.
But, I realise it's not a good idea, now.

-- 
Danh

  reply	other threads:[~2021-01-05 13:35 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-31  3:54 test-tool: bloom: generate_filter for multiple string? Đoàn Trần Công Danh
2020-12-31 11:31 ` Derrick Stolee
2021-01-05 13:34   ` Đoàn Trần Công Danh [this message]
2021-01-12 19:53 ` Taylor Blau
2021-01-13 11:59   ` Đoàn Trần Công Danh
2021-01-13 12:06     ` Derrick Stolee
2021-01-13 12:13       ` Đoàn Trần Công Danh
2021-01-13 15:13     ` Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=X/RrAAXsQXdMJbpo@danh.dev \
    --to=congdanhqx@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.