Libgit2 on the Summer of Code

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Libgit2 on the Summer of Code
@ 2010-05-27 16:25 Vicent Marti
  2010-05-27 17:46 ` Ilari Liusvaara
  0 siblings, 1 reply; 7+ messages in thread
From: Vicent Marti @ 2010-05-27 16:25 UTC (permalink / raw)
  To: git, srabbelier

Hello fine gentlemen,

I believe a short introduction is due. My name is Vicent Marti and I'm
the Google Summer of Code student who will be working on libgit2 this
year. If you haven't heard of it before, libgit2 [1] is a C library
which intents to re implement all of Git's core in a saner, reentrant
manner, with proper error handling and all the usual stuff you'd
expect from an actual library (including a less restrictive license).

I'm being kindly mentored by Scott Chacon on this task, and it seems I
might have been a *tad* too much mentored (;d), because I've spent so
much time discussing things with Scott and the other libgit2
developers that Sverre Rabbelier had to remind me that there's an
actual Git mailing list and I haven't even introduced myself to it
yet.

Well, fear do not. I'm here (a couple weeks late, but here after all)
and although I have been slacking on the community interaction part of
the Summer of Code, the actual project is well under tracks:

All the milestones for the first coding period are mostly finished
(yay for that) and you can find the code and follow my progress on my
public github repo [2]. There is some documentation in place, and
there are some tests in place. In the following days I'll aim for
about 90% code coverage on the tests, to finish the documentation, and
then prepare the patch series for review on this list.

Meanwhile, you are very much welcome to start flaming away before the
patch series are ready. We have roughly one month and a half to fix up
my code to the project's standards; hopefully it won't be *that* bad
and I will be able to implement some extra features before the
evaluation.

Oh, and before I forget, my awesome mentor Scott is writing some Ruby
bindings for the library [3], which at the moment are mostly in-sync
with all the new functionality I've added. Those will come in handy if
you want to try out the new features of Libgit2 with a more friendly
interface... as friendly as Ruby can be, that is. ;/

Thanks for your time, I'm looking forward to your input and I promise
I'll be much more active on this list from now on.

Cheers,
Vicent Martí
http://www.bellverde.org

[1]: http://repo.or.cz/w/libgit2.git
[2]: http://github.com/tanoku/libgit2-gsoc2010
[3]: http://github.com/schacon/ribbit

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Libgit2 on the Summer of Code
  2010-05-27 16:25 Libgit2 on the Summer of Code Vicent Marti
@ 2010-05-27 17:46 ` Ilari Liusvaara
  2010-05-27 18:05   ` Shawn O. Pearce
  0 siblings, 1 reply; 7+ messages in thread
From: Ilari Liusvaara @ 2010-05-27 17:46 UTC (permalink / raw)
  To: Vicent Marti; +Cc: git, srabbelier

On Thu, May 27, 2010 at 06:25:32PM +0200, Vicent Marti wrote:
> 
> All the milestones for the first coding period are mostly finished
> (yay for that) and you can find the code and follow my progress on my
> public github repo [2]. There is some documentation in place, and
> there are some tests in place. In the following days I'll aim for
> about 90% code coverage on the tests, to finish the documentation, and
> then prepare the patch series for review on this list.
> 
> Meanwhile, you are very much welcome to start flaming away before the
> patch series are ready. We have roughly one month and a half to fix up
> my code to the project's standards; hopefully it won't be *that* bad
> and I will be able to implement some extra features before the
> evaluation.

I start the flaming... :->

* I noticed that you seem to format if/while/for like this:

if (foo)
{
    something1;
    something2;
}
else
{
    something3;
    something4;
}

While the existing percedent seems to be:

if (foo) {
	something1;
	something2;
} else {
	something3;
	something4;
}

* Also, you appear to be using 4 spaces for indent, whereas the existing
percedent appears to be indent by tab. Some functions appear to use mixed
indents (sometimes even in one block construct).

* There seems to be some trailing whitespace. I don't know the policies of
libgit2 on that, but I suppose it is not supposed to be there.

* git_commit_list_push_back() fails silently if memory allocation fails. Is
it supposed to? Same for git_commit_list_push_front().

* Where algorithm in git_revpool_table__hash() is from? Since it appears to
hash binary object IDs, wouldn't just simple sum/xor over words be sufficient
(all SHA-1 output bits are very nearly independent). Or do you need to be
compatible with some other implementation (doesn't appear so, because hash
is computed differently depending on endianess)?

* gitrp_push() just returns if git_commit_parse_existing() fails. But causes
of that failing seemingly can include ODB read errors, which are fairly
serious...

* Is gitrp__enroot() supposed to just ignore failures of
git_commit_parse_existing()? is 'commit->parents.head' valid even in this
case?

* There are numerious cases where function that suspiciously lacks error
code is called (if error code is added, it presumably needs to be bubbled
back to caller).

-Ilari

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Libgit2 on the Summer of Code
  2010-05-27 17:46 ` Ilari Liusvaara
@ 2010-05-27 18:05   ` Shawn O. Pearce
  2010-05-27 18:22     ` Alex Riesen
  2010-05-27 18:35     ` Ilari Liusvaara
  0 siblings, 2 replies; 7+ messages in thread
From: Shawn O. Pearce @ 2010-05-27 18:05 UTC (permalink / raw)
  To: Ilari Liusvaara; +Cc: Vicent Marti, git, srabbelier

Ilari Liusvaara <ilari.liusvaara@elisanet.fi> wrote:
> * Where algorithm in git_revpool_table__hash() is from? Since it appears to
> hash binary object IDs, wouldn't just simple sum/xor over words be sufficient
> (all SHA-1 output bits are very nearly independent). Or do you need to be
> compatible with some other implementation (doesn't appear so, because hash
> is computed differently depending on endianess)?

If you need a hash value for a SHA-1, why not just cast the unsigned
char* to unsigned int* and load the first int as the hash code?
The output of SHA-1 is pretty evenly distributed, using the first
few bytes as an int should yield a sufficient distribution throughout
the hashtable.
 
-- 
Shawn.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Libgit2 on the Summer of Code
  2010-05-27 18:05   ` Shawn O. Pearce
@ 2010-05-27 18:22     ` Alex Riesen
  2010-06-02 18:42       ` Junio C Hamano
  2010-05-27 18:35     ` Ilari Liusvaara
  1 sibling, 1 reply; 7+ messages in thread
From: Alex Riesen @ 2010-05-27 18:22 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Ilari Liusvaara, Vicent Marti, git, srabbelier

On Thu, May 27, 2010 at 20:05, Shawn O. Pearce <spearce@spearce.org> wrote:
> Ilari Liusvaara <ilari.liusvaara@elisanet.fi> wrote:
>> * Where algorithm in git_revpool_table__hash() is from? Since it appears to
>> hash binary object IDs, wouldn't just simple sum/xor over words be sufficient
>> (all SHA-1 output bits are very nearly independent). Or do you need to be
>> compatible with some other implementation (doesn't appear so, because hash
>> is computed differently depending on endianess)?
>
> If you need a hash value for a SHA-1, why not just cast the unsigned
> char* to unsigned int* and load the first int as the hash code?
> The output of SHA-1 is pretty evenly distributed, using the first
> few bytes as an int should yield a sufficient distribution throughout
> the hashtable.

Just make sure the SHA1 data are properly aligned for your platform

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Libgit2 on the Summer of Code
  2010-05-27 18:22     ` Alex Riesen
@ 2010-06-02 18:42       ` Junio C Hamano
  0 siblings, 0 replies; 7+ messages in thread
From: Junio C Hamano @ 2010-06-02 18:42 UTC (permalink / raw)
  To: Alex Riesen
  Cc: Shawn O. Pearce, Ilari Liusvaara, Vicent Marti, git, srabbelier

Alex Riesen <raa.lkml@gmail.com> writes:

> On Thu, May 27, 2010 at 20:05, Shawn O. Pearce <spearce@spearce.org> wrote:
>> Ilari Liusvaara <ilari.liusvaara@elisanet.fi> wrote:
>>> * Where algorithm in git_revpool_table__hash() is from? Since it appears to
>>> hash binary object IDs, wouldn't just simple sum/xor over words be sufficient
>>> (all SHA-1 output bits are very nearly independent). Or do you need to be
>>> compatible with some other implementation (doesn't appear so, because hash
>>> is computed differently depending on endianess)?
>>
>> If you need a hash value for a SHA-1, why not just cast the unsigned
>> char* to unsigned int* and load the first int as the hash code?
>> The output of SHA-1 is pretty evenly distributed, using the first
>> few bytes as an int should yield a sufficient distribution throughout
>> the hashtable.
>
> Just make sure the SHA1 data are properly aligned for your platform

Also I'd prefer to see the code watch out for reproducibility across
platforms with different endianness and integer size.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Libgit2 on the Summer of Code
  2010-05-27 18:05   ` Shawn O. Pearce
  2010-05-27 18:22     ` Alex Riesen
@ 2010-05-27 18:35     ` Ilari Liusvaara
  2010-05-28  0:15       ` Vicent Marti
  1 sibling, 1 reply; 7+ messages in thread
From: Ilari Liusvaara @ 2010-05-27 18:35 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Vicent Marti, git, srabbelier

On Thu, May 27, 2010 at 11:05:54AM -0700, Shawn O. Pearce wrote:
> Ilari Liusvaara <ilari.liusvaara@elisanet.fi> wrote:
> > * Where algorithm in git_revpool_table__hash() is from? Since it appears to
> > hash binary object IDs, wouldn't just simple sum/xor over words be sufficient
> > (all SHA-1 output bits are very nearly independent). Or do you need to be
> > compatible with some other implementation (doesn't appear so, because hash
> > is computed differently depending on endianess)?
> 
> If you need a hash value for a SHA-1, why not just cast the unsigned
> char* to unsigned int* and load the first int as the hash code?
> The output of SHA-1 is pretty evenly distributed, using the first
> few bytes as an int should yield a sufficient distribution throughout
> the hashtable.

Yeah, With pseudorandom function[*], all ways of reducing the output to n bits are
at most as good as just taking first n bits. But if reducing output to [0, m),
the best way (distribution-wise, not speed-wise) is to take remainder of the whole
value divided by m...

[*] SHA-1 is not pseudorandom function, but for virtually all practical purposes
it is.

-Ilari

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Libgit2 on the Summer of Code
  2010-05-27 18:35     ` Ilari Liusvaara
@ 2010-05-28  0:15       ` Vicent Marti
  0 siblings, 0 replies; 7+ messages in thread
From: Vicent Marti @ 2010-05-28  0:15 UTC (permalink / raw)
  To: Ilari Liusvaara; +Cc: Shawn O. Pearce, git, srabbelier

Hey again,

thanks for the bites, Ilari. I've fixed all the formatting issues in
the latest batch of commits [1], and improved the error handling quite
a bit.

Regarding the hashing of OIDs, it's not a random hash, it's an
adaptation of MurmurHash [2], which is made of win -- specially when
it comes to dispersing stuff on hash tables. You are right however
that SHA1 raw bytes are already random enough, and probably not worth
the performance hit of hashing again on top of it -- I've dropped it
for the time being.

More work tomorrow,

Cheers,
Vicent Martí
http://www.bellverde.org

[1]: http://github.com/tanoku/libgit2-gsoc2010/commits/master
[2]: http://sites.google.com/site/murmurhash/

On Thu, May 27, 2010 at 8:35 PM, Ilari Liusvaara
<ilari.liusvaara@elisanet.fi> wrote:
> On Thu, May 27, 2010 at 11:05:54AM -0700, Shawn O. Pearce wrote:
>> Ilari Liusvaara <ilari.liusvaara@elisanet.fi> wrote:
>> > * Where algorithm in git_revpool_table__hash() is from? Since it appears to
>> > hash binary object IDs, wouldn't just simple sum/xor over words be sufficient
>> > (all SHA-1 output bits are very nearly independent). Or do you need to be
>> > compatible with some other implementation (doesn't appear so, because hash
>> > is computed differently depending on endianess)?
>>
>> If you need a hash value for a SHA-1, why not just cast the unsigned
>> char* to unsigned int* and load the first int as the hash code?
>> The output of SHA-1 is pretty evenly distributed, using the first
>> few bytes as an int should yield a sufficient distribution throughout
>> the hashtable.
>
> Yeah, With pseudorandom function[*], all ways of reducing the output to n bits are
> at most as good as just taking first n bits. But if reducing output to [0, m),
> the best way (distribution-wise, not speed-wise) is to take remainder of the whole
> value divided by m...
>
> [*] SHA-1 is not pseudorandom function, but for virtually all practical purposes
> it is.
>
> -Ilari
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2010-06-02 18:42 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-05-27 16:25 Libgit2 on the Summer of Code Vicent Marti
2010-05-27 17:46 ` Ilari Liusvaara
2010-05-27 18:05   ` Shawn O. Pearce
2010-05-27 18:22     ` Alex Riesen
2010-06-02 18:42       ` Junio C Hamano
2010-05-27 18:35     ` Ilari Liusvaara
2010-05-28  0:15       ` Vicent Marti

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).