Git development

Git development
 help / color / mirror / Atom feed

* Re: Possible --remove-empty bug
From: Junio C Hamano @ 2006-03-13  1:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Marco Costalba, git
In-Reply-To: <Pine.LNX.4.64.0603121450210.3618@g5.osdl.org>

Linus Torvalds <torvalds@osdl.org> writes:

> On Sun, 12 Mar 2006, Junio C Hamano wrote:
>> 
>> To be honest, I do not know how --remove-empty is intended to
>> work.
>
> It's supposed to stop traversing the tree once a pathname disappears.

Then what we should simplify is the parent commit that does not
have those pathnames (i.e. remove parents from that parent
commit).  In other words, currently the code removes the parent
commit that makes the tree disappear, but we would want to keep
that parent, remove the grandparents, and then mark the parent
uninteresting.

-- >8 --
[PATCH] revision traversal: --remove-empty fix (take #2).

Marco Costalba reports that --remove-empty omits the commit that
created paths we are interested in.  try_to_simplify_commit()
logic was dropping a parent we introduced those paths against,
which I think is not what we meant.  Instead, this makes such
parent parentless.

Signed-off-by: Junio C Hamano <junkio@cox.net>

---

diff --git a/revision.c b/revision.c
index c8d93ff..73fba5d 100644
--- a/revision.c
+++ b/revision.c
@@ -315,9 +315,18 @@ static void try_to_simplify_commit(struc
 			return;

 		case TREE_NEW:
-			if (revs->remove_empty_trees && same_tree_as_empty(p->tree)) {
-				*pp = parent->next;
-				continue;
+			if (revs->remove_empty_trees &&
+			    same_tree_as_empty(p->tree)) {
+				/* We are adding all the specified
+				 * paths from this parent, so the
+				 * history beyond this parent is not
+				 * interesting.  Remove its parents
+				 * (they are grandparents for us).
+				 * IOW, we pretend this parent is a
+				 * "root" commit.
+				 */
+				parse_commit(p);
+				p->parents = NULL;
 			}
 		/* fallthrough */
 		case TREE_DIFFERENT:

^ permalink raw reply related

* Re: git-diff-tree -M performance regression in 'next'
From: Linus Torvalds @ 2006-03-13  1:09 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Fredrik Kuivinen, git
In-Reply-To: <7vwtezw4ye.fsf@assigned-by-dhcp.cox.net>

On Sun, 12 Mar 2006, Junio C Hamano wrote:
> 
> To reduce wasted memory, wait until the hash fills up more
> densely before we rehash.  This reduces the working set size a
> bit further.

Umm. Why do you rehash at all?

Just take the size of the "src" file as the initial hash size. 

Also, I think it is likely really wasteful to try to actually hash at 
_each_ character. Instead, let's say that the chunk-size is 8 bytes (like 
you do now), and let's say that you have a 32-bit good hash of those 8 
bytes. What you can do is:

 - for each 8 bytes in the source, hash those 8 bytes (not every byte)
 - for each byte in the other file, hash 8 next bytes. IF it matches a 
   hash in the source with a non-zero count, subtract the count for that 
   hash and move up by _eight_ characters! If it doesn't, add one to 
   "characters not matched" counter, and move up _one_ character, and try 
   again.

At the end of this, you have two counts: the count of characters that you 
couldn't match in the other file, and the count of 8-byte hash-chunks that 
you couldn't match in the first one. Use those two counts to decide if 
it's close or not.

Especially for good matches, this should basically cut your work into an 
eight of what you do now.

Actually, even for bad matches, you cut the first source overhead into one 
eight (the second file will obviously do the "update by 1 byte" most of 
the time).

Don't you think that would be as accurate as what you're doing now (it's 
the same basic notion, after all), and noticeably faster?

			Linus

^ permalink raw reply

* Re: git-diff-tree -M performance regression in 'next'
From: Junio C Hamano @ 2006-03-13  1:22 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Fredrik Kuivinen, git
In-Reply-To: <Pine.LNX.4.64.0603121700410.3618@g5.osdl.org>

Linus Torvalds <torvalds@osdl.org> writes:

> Umm. Why do you rehash at all?
>
> Just take the size of the "src" file as the initial hash size. 

The code uses close to 16-bit hash and I had 65k flat array as a
hashtable.  That one was what you commented as "4-times as many
page misses".

Interestingly enough, that kind of flat array representation
seems to be too sparse and gives very bad performance behaviour.

The improvement I mentioned in the message you are replying to
is the result of making it into smaller (starting at (1<<9) or
something like that) linear-overflowing hash.

The latter suggestion I need to think about it a bit more.

^ permalink raw reply

* Re: git-diff-tree -M performance regression in 'next'
From: Linus Torvalds @ 2006-03-13  1:29 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Fredrik Kuivinen, git
In-Reply-To: <Pine.LNX.4.64.0603121700410.3618@g5.osdl.org>

On Sun, 12 Mar 2006, Linus Torvalds wrote:
> 
> Also, I think it is likely really wasteful to try to actually hash at 
> _each_ character. Instead, let's say that the chunk-size is 8 bytes (like 
> you do now), and let's say that you have a 32-bit good hash of those 8 
> bytes. What you can do is:

Side note: regardless, your new algorithm clearly does seem faster.

However, it worries me a bit that you don't check the source strings, 
especially since the hash you use seems somewhat suspect (why limit it to 
essentially just 16 bits? Wouldn't it be best to have the _biggest_ prime 
that fits in the "hashval" thing, which is at least 32 bits? Also, 
shouldn't you make that spanhash thing use a "unsigned int" instead of 
"unsigned long"?)

So I would suggest instead the hash function to be:

	typedef unsigned long long u64;

	/* Biggest prime in 32 bits */
	#define HASHVAL (4294967291u)

	u64 value = *(u64 *)src;
	src += 8;
	hash = value % 4294967291u;

which does a 64-bit modulus, but hey, 64-bit hw isn't _that_ uncommon any 
more, and it's not _excessively_ slow on x86 (gcc doesn't generate good 
code, but we could actually use the kernel "do_div()" routine for much 
faster division of 64 % 32 than what gcc can do - since the dividend is 
32-bit, you actually only need to do one 32/32 division and one 64/32 
division, so the optimized hash function on a good x86 will be just in 
the teens of cycles for the 64-bit modulus).

		Linus

^ permalink raw reply

* Re: git-diff-tree -M performance regression in 'next'
From: Linus Torvalds @ 2006-03-13  1:31 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Fredrik Kuivinen, git
In-Reply-To: <Pine.LNX.4.64.0603121710110.3618@g5.osdl.org>

On Sun, 12 Mar 2006, Linus Torvalds wrote:
>
> 	u64 value = *(u64 *)src;
> 	src += 8;
> 	hash = value % 4294967291u;

Btw, this assumes the "only hash every 8 bytes" at the source, in which 
case this is ok even on architectures that need aligned reads. For the 
non-aligned reads, you'd need your "shift the value" approach.

		Linus

^ permalink raw reply

* Re: git-diff-tree -M performance regression in 'next'
From: Linus Torvalds @ 2006-03-13  1:39 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Fredrik Kuivinen, git
In-Reply-To: <7vhd63w33n.fsf@assigned-by-dhcp.cox.net>

On Sun, 12 Mar 2006, Junio C Hamano wrote:
> 
> The code uses close to 16-bit hash and I had 65k flat array as a
> hashtable.  That one was what you commented as "4-times as many
> page misses".

Ahh. That explains the limited bits in the hash function too. I only 
looked at the current sources, not at the historic ones.

Btw, the page misses may come from the fact that you allocated and 
re-allocated the flat array all the time. That can be very expensive for 
big allocations, since most libraries may decide that it's a big enough 
area that it should be map/unmap'ed in order to give memory back to the 
system (without realizing that there's another allocation coming soon 
afterwards of the same size).

If you map/unmap, the kernel will end up having to not just use new pages, 
but obviously also clear them for security reasons. So it ends up sucking 
on many levels. In contrast, if you just have a 64k-entry array of "int", 
and allocate it _once_ (instead of once per file-pair) you'll still have 
to clear it in between file-pairs, but at least you won't have the 
overhead of mapping/unmapping it.

The clearing can still be pretty expensive (64k "int" entries is 256kB, 
and since most _files_ are just in the ~4-8kB range, you're spending a 
_lot_ of time just memset'ing). Which is why it's probably a good idea to 
instead default to having just "filesize / 8" entries, but then you can 
obviously not use the hash as the index any more.

		Linus

^ permalink raw reply

* Re: git-diff-tree -M performance regression in 'next'
From: Linus Torvalds @ 2006-03-13  2:29 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Fredrik Kuivinen, git
In-Reply-To: <Pine.LNX.4.64.0603121710110.3618@g5.osdl.org>



On Sun, 12 Mar 2006, Linus Torvalds wrote:
> 
> So I would suggest instead the hash function to be:
> 
> 	typedef unsigned long long u64;
> 
> 	/* Biggest prime in 32 bits */
> 	#define HASHVAL (4294967291u)
> 
> 
> 	u64 value = *(u64 *)src;
> 	src += 8;
> 	hash = value % 4294967291u;
> 
> which does a 64-bit modulus, but hey, 64-bit hw isn't _that_ uncommon any 
> more, and it's not _excessively_ slow on x86 (gcc doesn't generate good 
> code, but we could actually use the kernel "do_div()" routine for much 
> faster division of 64 % 32 than what gcc can do - since the dividend is 
> 32-bit, you actually only need to do one 32/32 division and one 64/32 
> division, so the optimized hash function on a good x86 will be just in 
> the teens of cycles for the 64-bit modulus).

Actually, on x86, where we can do the 64-by-32 division with a single 
instruction, this seems to be the best possible code:

	#define HASH_VAL (4294967291u)

	static inline unsigned int hash64x32(unsigned long long a)
	{
		unsigned int high, low;
		unsigned int modulus;

		asm(""
			:"=a" (low), "=d" (high)
			:"A" (a));
		if (high > HASH_VAL)
			high -= HASH_VAL;
		asm("divl %2"
			:"=a" (low), "=d" (modulus)
			:"rm" (HASH_VAL), "0" (low), "1" (high));
		return modulus;
	}

at least as far as I can think.

The magic is the reduction of the high 32 bits: for the general case you
want a modulus for that reduction, but since we're dividing with a
really big value, the modulus of the high bits really does end up
becoming just that simple

	if (high > HASH_VAL)
		high -= HASH_VAL;

and then we just need to do a single "divl" instruction.

So if you want a 32-bit hash of an 8-byte area, this should be a pretty
good and fast way to calculate it. 

Of course, maybe just doing an adler32() is simpler/better.  But with
the above, at least on x86, I suspect you get an even better
distribution at a lower cost (of course, different coress do differently
well on large divisions, but integer division being pretty important for
some codes, modern cores actually tend to be pretty good at it). 

			Linus

^ permalink raw reply

* Re: git-diff-tree -M performance regression in 'next'
From: Linus Torvalds @ 2006-03-13  2:53 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Fredrik Kuivinen, git
In-Reply-To: <Pine.LNX.4.64.0603121810140.3618@g5.osdl.org>



On Sun, 12 Mar 2006, Linus Torvalds wrote:
>
> 		if (high > HASH_VAL)
> 			high -= HASH_VAL;

That should be ">= HASH_VAL", of course. I'm a retard. 

		Linus

^ permalink raw reply

* Re: [PATCH] Trivial warning fix for imap-send.c
From: H. Peter Anvin @ 2006-03-13  2:59 UTC (permalink / raw)
  To: gitzilla; +Cc: Mark Wooding, git
In-Reply-To: <4414747B.7040700@gmail.com>

A Large Angry SCM wrote:
> Mark Wooding wrote:
> 
>> Linus Torvalds <torvalds@osdl.org> wrote:
>>
>>> So in modern C, using NULL at the end of a varargs array as a pointer 
>>> is perfectly sane, and the extra cast is just ugly and bowing to bad 
>>> programming practices and makes no sense to anybody who never saw the 
>>> horror that is K&R.
>>
>> No!  You can still get bitten.  You're lucky that on common platforms
>> all pointers look the same, but if you find one where `char *' (and
>> hence `void *') isn't the same as `struct foo *' then, under appropriate
>> circumstances you /will/ unless you put the casts in.
> 
> Please explain how malloc() can work on such a platform. My reading of 
> the '89 ANSI C spec. finds that _ALL_ (non function) pointers _are_ 
> cast-able to/from a void * and that NULL should be #defined as (void *). 
> See 3.2.2.3 and 4.1.5 if interested.

Consider the non-hypothetical example of a word-addressed machine, which 
has to have extra bits in a subword pointer like char *.  The C standard 
requires that void * has those bits as well, but it doesn't means that 
any void * can be cast to any arbitrary pointer -- the opposite, 
however, is required.

	-hpa

^ permalink raw reply

* Re: [PATCH] Use explicit pointers for execl...() sentinels.
From: Jeff King @ 2006-03-13  3:31 UTC (permalink / raw)
  To: git
In-Reply-To: <20060312200812.3fb04638.tihirvon@gmail.com>

On Sun, Mar 12, 2006 at 08:08:12PM +0200, Timo Hirvonen wrote:

> NULL pointer does not point to any data, it just says it's 'empty'.  So
> it doesn't need to be same type pointer as specified in the function
> prototype.  Pointers are just addresses, it doesn't matter from to code
> generation point of view whether it is (char *)0 or (void *)0.

Sorry, but I think you're wrong according to the C standard. Pointers of 
different types do NOT have to share the same representation (e.g.,
there have been some platforms where char* and int* were different
sizes). A void pointer must be capable of representing any type of
pointer (for example, holding the largest possible type). However, if
sizeof(void *) == 8 and sizeof(char *) == 4, you have a problem with
variadic functions which are expecting to pull 4 byte off the stack. 

In a non-variadic function, the compiler would do the right implicit
casting. In a variadic function, it can't. 

The real question is, does git want to care about portability to such
platforms.

If you remain unconvinced, I can try to find chapter and verse of the
standard.

> sizeof(unsigned long) is sizeof(void *) in real world.

Are you saying that because it encompasses all of the platforms you've
worked on, or do you have some evidence that it is largely the case? It
certainly isn't guaranteed by the C standard.

> > Because, according to the C and POSIX specs, they're not wrong.
> They didn't think of 64-bit architectures back then, I suppose.

No, they did think of those issues; they intentionally left such sizing
up to the implementation to allow C to grow with the hardware. Mostly
you don't have to care, but as I said, typing with variadic functions is
a pain.

-Peff

^ permalink raw reply

* Re: [PATCH] Trivial warning fix for imap-send.c
From: Jeff King @ 2006-03-13  3:38 UTC (permalink / raw)
  To: git
In-Reply-To: <4414747B.7040700@gmail.com>

On Sun, Mar 12, 2006 at 11:20:27AM -0800, A Large Angry SCM wrote:

> >No!  You can still get bitten.  You're lucky that on common platforms
> >all pointers look the same, but if you find one where `char *' (and
> >hence `void *') isn't the same as `struct foo *' then, under appropriate
> >circumstances you /will/ unless you put the casts in.
> 
> Please explain how malloc() can work on such a platform. My reading of 
> the '89 ANSI C spec. finds that _ALL_ (non function) pointers _are_ 
> cast-able to/from a void * and that NULL should be #defined as (void *). 
> See 3.2.2.3 and 4.1.5 if interested.

I think Linus has cut to the heart of the discussion (that it's worth
git maintainers' sanity not to worry about such problems). However, for
pedantry's sake, this is how malloc works:

A void pointer is guaranteed to be able to hold any type of pointer
(either char * or struct foo * or whatever). The declaration of malloc
indicates a return of void *. On a platform where it matters, the
compiler generates code so that 
  struct foo *bar = malloc(100);
converts the void * pointer into the correct size (in the same way that
assigning between differently sized integers works).

This breaks down with variadic functions, which have no typing
information. So doing this:
  execl("foo", "bar", my_struct_foo);
doesn't give the compiler a chance to do the implicit cast and you get
subtle breakage (in the same way that you would if you passed a long to
a variadic function expecting a short).

-Peff

^ permalink raw reply

* Re: Any news on an Eclipse plugin?
From: Shawn Pearce @ 2006-03-13  4:09 UTC (permalink / raw)
  To: lamikr; +Cc: git
In-Reply-To: <44147B8B.4050503@cc.jyu.fi>

lamikr <lamikr@cc.jyu.fi> wrote:
> Have you yet made any kind of planning of the features that would be
> available or put up the repository?
> I use novadays Eclipse basically for all of my editing and something
> like CVS/Subclipse plug-ins for git would be cool.
> (cdt cross-indexing is still a little bit slow with the amount of files
> in kernel so with kernel I have turned that off)

Right now I'm just trying to get the basic Eclipse team plugin
plumbing into place.  For my first `release' however I am planning
on implementing the following:

1) read the HEAD tree-ish into an in-memory index within Eclipse from
   loose objects only
2) add files/subtrees to that tree
3) delete files/subtrees from that tree
4) commit the modified tree as a single parent commit
5) update HEAD with the new commit

But that's actually a lot of work.  :-)

There's a lot of really critical stuff not in that list:

6) reading objects from packs
7) pulling or pushing from a remote
8) merges and multi-parent commits
9) synchronization of the workspace with the native git 'index' file
... plus lots of other useful things ...

I'm working on that first list (1-5) this week and will post
a complete repository as soon as I have that working such that
another Eclipse developer might be able to do something with it.
Then I'll likely start in on 6 and 7 and be forced to deal with
8 when 7 is doing pulls in an interesting way.  That's certainly
weeks away from being remotely useful.

> Noel Grandin wrote:
> >The subversion plugin (subclipse.tigris.org) might be a good starting
> >point since it delegates a lot of it's low-level work through an
> >interface called svnClientAdapter. Re-implementing that to talk to git
> >should get you something useful in a reasonable time-frame.
> >
> >Note that an eclipse team plugin is a pretty complicated beast.
>
>
> Yes, but very powerfull for the people like me who have who just have
> never bothered to learn VI/Emacs/sed properly
> and feel with them like having 5 thumps, code finders, search tools,
> refactoring tools, etc. available in Eclipse are very cool.
> 
> So if the repository for git plug-ins goes up somewhere I could try to
> help a little bit.

The more the merrier.  As has been pointed out an Eclipse team
plugin is not a trivial chunk of code.

As a side effect of this effort I'd also like to see a set of
Ant tasks written.  I'm building a non-Eclipse specific GIT API
in pure Java to provide implementation to the Eclipse plugins.
Some functions are likely going to just fork/exec the core GIT
tools as I'm not planning on implementing pack deltification or
rename tracking in Java.

Anyhoo - if you are still interested in this project look for
an email from me later this week.  I should have a repository
available then.

-- 
Shawn.

^ permalink raw reply

* Re: git-diff-tree -M performance regression in 'next'
From: Junio C Hamano @ 2006-03-13  4:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Fredrik Kuivinen, git
In-Reply-To: <Pine.LNX.4.64.0603121710110.3618@g5.osdl.org>

Linus Torvalds <torvalds@osdl.org> writes:

> However, it worries me a bit that you don't check the source strings, 
> especially since the hash you use seems somewhat suspect (why limit it to 
> essentially just 16 bits? Wouldn't it be best to have the _biggest_ prime 
> that fits in the "hashval" thing, which is at least 32 bits? Also, 
> shouldn't you make that spanhash thing use a "unsigned int" instead of 
> "unsigned long"?)
>
> So I would suggest instead the hash function to be:
>
> 	typedef unsigned long long u64;
>
> 	/* Biggest prime in 32 bits */
> 	#define HASHVAL (4294967291u)

Actually what you are seeing is the best compromise I've come up
with.  There were about a dozen crappoid that turned out to be
suboptimal I threw away.

The hashsize is an interesting example of such a compromise
between precision and performance.  It is true that full 32-bit
hash gives us more precise match detection.  If you change the
current hash function to divide with (4294967291u), the rename
similarity score assigned to some (but not all) paths in the
example dataset we have been using (diff between v2.6.12 and
v2.6.14) decreases by about 1% -- this comes from the fact that
the hashvalue capped to 16-bit gives some false hits that larger
hashbase value can tell apart.  Because of it, some paths near
the rename detection threshold stop being recognized as renames.

However, the runtime of the same dataset increases quite a bit
when we do this.  I think this is because we need to keep more
different hash values in the hashtable and the cost to randomly
access into a huge table starts to hurt, unlike the 16-bit
capped version whose hash table never needs to grow beyond 65k
entries.

        next    39.77user 0.22system 0:40.78elapsed
                0inputs+0outputs (0major+18855minor)

        32-bit  66.32user 0.15system 1:07.00elapsed
                0inputs+0outputs (0major+21057minor)

Admittedly, we should not optimize for one particular test case,
but we should start from somewhere.

Decreasing the hashsize to 14-bit or less had noticeable
degradation on the quality of renames the algorithm detects and
misses to detect, both in v2.6.12..v2.6.14 test and some tests
in git.git repository.

One obvious change we could do is to make the hashsize to
0x1800D (a prime halfway between 16- and 17-bit); currently the
hashtable grows double every time when the table of the current
size fills up sufficiently, but the current hashbase cannot fit
in 16-bit, so we _are_ already using a table with 128K entries
in some cases.

^ permalink raw reply

* Re: [BUG] imap-send.c fails to build on OSX
From: Shawn Pearce @ 2006-03-13  4:25 UTC (permalink / raw)
  To: Randal L. Schwartz; +Cc: git
In-Reply-To: <863bhnlo3r.fsf@blue.stonehenge.com>

Workarounds for compiling on MacOS X.

Signed-off-by: Shawn O. Pearce <spearce@spearce.org>

---
 "Randal L. Schwartz" <merlyn@stonehenge.com> wrote:
 > 
 >     gcc -o imap-send.o -c -g -O2 -Wall -I/sw/include -I/opt/local/include -DSHA1_HEADER='<openssl/sha.h>'  imap-send.c
 >     imap-send.c:376: error: static declaration of 'vasprintf' follows non-static declaration
 >     /usr/include/stdio.h:297: error: previous declaration of 'vasprintf' was here
 >     make: *** [imap-send.o] Error 1

 This should fix the build issues on Mac OS X.  Not tested on other
 (possibly) more important platforms, like Linux.  :-)

 generate-cmdlist.sh |    2 +-
 imap-send.c         |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

base 57f4a482a7fc31e31cafd31fe23f4a4779b2c2f0
last 6854050e1cdaf1b7ab352e1d4757ff5b29a5db93
diff --git a/generate-cmdlist.sh b/generate-cmdlist.sh
index 6ee85d5..76ba49c 100755
--- a/generate-cmdlist.sh
+++ b/generate-cmdlist.sh
@@ -42,7 +42,7 @@ EOF
 while read cmd
 do
     sed -n "/NAME/,/git-$cmd/H;
-	    \$ {x; s/.*git-$cmd - \\(.*\\)/  {\"$cmd\", \"\1\"},/; p}" \
+	    \$ {x; s/.*git-$cmd - \\(.*\\)/  {\"$cmd\", \"\1\"},/; p;}" \
 	"Documentation/git-$cmd.txt"
 done
 echo "};"
diff --git a/imap-send.c b/imap-send.c
index fddaac0..376929f 100644
--- a/imap-send.c
+++ b/imap-send.c
@@ -370,7 +370,7 @@ free_generic_messages( message_t *msgs )
 	}
 }
 
-static int
+int
 vasprintf( char **strp, const char *fmt, va_list ap )
 {
 	int len;
-- 
1.2.4.g806b

^ permalink raw reply related

* Re: [BUG] imap-send.c fails to build on OSX
From: Randal L. Schwartz @ 2006-03-13  4:29 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git
In-Reply-To: <20060313042552.GA18136@spearce.org>

>>>>> "Shawn" == Shawn Pearce <spearce@spearce.org> writes:

Shawn> Workarounds for compiling on MacOS X.
Shawn> Signed-off-by: Shawn O. Pearce <spearce@spearce.org>

Shawn>  This should fix the build issues on Mac OS X.  Not tested on other
Shawn>  (possibly) more important platforms, like Linux.  :-)

My C is a bit rusty, but wouldn't that provide a definition for vasprintf when
the libc is *also* providing a definition?  Might that not also break other
apps that want to link with this?

Maybe the right solution is to rename this local implementation so that it
can't conflict, like "git_vasprintf", or to include it only when the libc
doesn't provide it.

-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!

^ permalink raw reply

* Re: [PATCH] Trivial warning fix for imap-send.c
From: A Large Angry SCM @ 2006-03-13  4:36 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Mark Wooding, git
In-Reply-To: <4414E000.9030902@zytor.com>

H. Peter Anvin wrote:
> A Large Angry SCM wrote:
>> Mark Wooding wrote:
>>
>>> Linus Torvalds <torvalds@osdl.org> wrote:
>>>
>>>> So in modern C, using NULL at the end of a varargs array as a 
>>>> pointer is perfectly sane, and the extra cast is just ugly and 
>>>> bowing to bad programming practices and makes no sense to anybody 
>>>> who never saw the horror that is K&R.
>>>
>>> No!  You can still get bitten.  You're lucky that on common platforms
>>> all pointers look the same, but if you find one where `char *' (and
>>> hence `void *') isn't the same as `struct foo *' then, under appropriate
>>> circumstances you /will/ unless you put the casts in.
>>
>> Please explain how malloc() can work on such a platform. My reading of 
>> the '89 ANSI C spec. finds that _ALL_ (non function) pointers _are_ 
>> cast-able to/from a void * and that NULL should be #defined as (void 
>> *). See 3.2.2.3 and 4.1.5 if interested.
> 
> Consider the non-hypothetical example of a word-addressed machine, which 
> has to have extra bits in a subword pointer like char *.  The C standard 
> requires that void * has those bits as well, but it doesn't means that 
> any void * can be cast to any arbitrary pointer -- the opposite, 
> however, is required.

ANSI X3.159-1989

3.2.2.3 Pointers
	A pointer to *void* may be converted to or from a pointer to any 
incomplete or object type. A pointer to any incomplete or object type 
may be converted to a pointer to *void* and back again; the result shall 
compare equal to the original pointer.

For any qualifier /q/, a pointer to a non-/q/-qualified type may be 
converted to a pointer to the /q/-qualified version of the type; the 
values stored in the original and converted pointers shall compare equal.

In integral constant expression with value 0, or such an expression cast 
to type <bold>void *</bold>, is called a /null pointer constant.[*33*] 
If a null pointer constant is assigned to or compared for equality to a 
pointer, the constant is converted to a pointer of that type. Such a 
pointer, called a /null pointer/, is guaranteed to compare unequal to a 
pointer to any object or function.

Two null pointers, converted through possibly different sequences of 
casts to pointer types, shall compare equal.

[*33*] the macro *NULL* is defined in <stddef.h> as a null pointer 
constant; see 4.1.5.

^ permalink raw reply

* Re: What's in git.git
From: Junio C Hamano @ 2006-03-13  5:01 UTC (permalink / raw)
  To: Fredrik Kuivinen; +Cc: git, Johannes Schindelin, Martin Langhoff
In-Reply-To: <20060310104443.GA4491@c165.ib.student.liu.se>

Fredrik Kuivinen <freku045@student.liu.se> writes:

> On Mon, Mar 06, 2006 at 10:05:41PM +1300, Martin Langhoff wrote:
>> 
>> Add fuel to the fire  ;-) Can git-blame take cached git-rev-list
>> output like annotate does with -S?
>
> Currently it cannot do that. How is that option used? If you want to
> make annotate/blame faster for certain files you might as well cache
> the output of annotate/blame instead of the git-rev-list output, no?

There are two reasons Martin's git-cvsserver uses -S to feed you
the revision list.  One is that he already has that ancestry
chain information, and there is no point for him to have the
git-annotate command to recompute it.

But there is another, more important reason.  He is giving his
clients a modified world view where the branch he is exposing to
have _no_ merges -- just a single strand of pearls.  So what is
fed to git-annotate using -S from git-cvsserver has either one
object name (single root commit) or two (the commit and its sole
parent).  IOW, he does not want you to look at other parents
when dealing with a merge commit.

What this means is that in cases where your algorithm looked at
second and subsequent parents to pass remaining blame on after
looking at the first parent for a merge, the algorithm now needs
to assign the blame to the merge commit itself.  Your
process_commit() currently reads the commit object and loops
over its true parents, but the -S flag wants to supply its own
notion of who are the parents of whom (and in the case of
git-cvsserver, it always supplies at most one parent) so you
would need to honor that instead of looking at the real
ancestry.

^ permalink raw reply

* Re: [BUG] imap-send.c fails to build on OSX
From: Shawn Pearce @ 2006-03-13  5:03 UTC (permalink / raw)
  To: Randal L. Schwartz; +Cc: git
In-Reply-To: <86zmjvhsru.fsf@blue.stonehenge.com>

"Randal L. Schwartz" <merlyn@stonehenge.com> wrote:
> >>>>> "Shawn" == Shawn Pearce <spearce@spearce.org> writes:
> 
> Shawn> Workarounds for compiling on MacOS X.
> Shawn> Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
> 
> Shawn>  This should fix the build issues on Mac OS X.  Not tested on other
> Shawn>  (possibly) more important platforms, like Linux.  :-)
> 
> My C is a bit rusty, but wouldn't that provide a definition for vasprintf when
> the libc is *also* providing a definition?  Might that not also break other
> apps that want to link with this?
> 
> Maybe the right solution is to rename this local implementation so that it
> can't conflict, like "git_vasprintf", or to include it only when the libc
> doesn't provide it.

Your right.  But I figured it wasn't a big deal since this is just
an application and not library code.  Nobody should be linking
against it.

Perhaps the better solution is to rename the provided implementation
to git_; maybe even better to move it off to the compat library
area where other should-have-been-provided-by-the-OS definitions are.

-- 
Shawn.

^ permalink raw reply

* Re: [PATCH 0/6] http-push updates
From: Junio C Hamano @ 2006-03-13  5:21 UTC (permalink / raw)
  To: Nick Hengeveld; +Cc: git
In-Reply-To: <20060311041749.GB3997@reactrix.com>

Nick Hengeveld <nickh@reactrix.com> writes:

> I'm considering future support for initializing a remote repo if the
> remote url points to an empty directory and the --force arg is present.
> Any thoughts?

Repository maintenance tasks:

 - create a new repository
 - create new branch (and new tag) -- I think you can already do this
 - remove an unneeded branch and tag
 - (perhaps) running update-server-info
 - running repack

> I'm also planning to add support for using packs to send updates, and
> for updating remote server objects/info/packs.  I'm not sure whether it
> makes sense to always send packs or to only do so when enough objects
> need to be pushed.

If you have repack support somehow, always sending packs
(especially the thin kind) would make sense, but otherwise you
probably would want some arrangements to make sure many small
packs are periodically consolidated into larger smaller number
of packs, so that they would not fragment the vma of git
programs that run on the server side by mmap()ing them into
their address spaces.

^ permalink raw reply

* Re: [PATCH] Trivial warning fix for imap-send.c
From: Linus Torvalds @ 2006-03-13  5:22 UTC (permalink / raw)
  To: git; +Cc: H. Peter Anvin, Mark Wooding
In-Reply-To: <4414F6B1.9080107@gmail.com>

On Sun, 12 Mar 2006, A Large Angry SCM wrote:
> 
> 3.2.2.3 Pointers
> 	A pointer to *void* may be converted to or from a pointer to any
> incomplete or object type. A pointer to any incomplete or object type may be
> converted to a pointer to *void* and back again; the result shall compare
> equal to the original pointer.

Large, you're missing the point.

"void *" is guaranteed to be a _superset_ of all pointers.

But that dos not mean that any "void *" pointer can be cast to any other 
pointer. BUT IF IT STARTED OUT AS A POINTER OF THAT TYPE, IT'S GUARANTEED 
THAT IT CAN BE CAST _BACK_ TO THAT TYPE.

(And furthermore, NULL is special in that it will always compare equal 
regardless of how it has ever been cast).

This means, for example, that it's perfectly legal for a C implementation 
to have a 128-bit "void *", where the low bits are the "real pointer" and 
the high 64 bits are the "type descriptor". You could only cast such a 
pointer to that proper type, but you could not cast it to any other type 
(except for the special case of NULL).

My argument boils down to the fact that we don't care one whit about those 
theoretical architectures. It so happens that ia64 function pointers are 
sometimes described this way (due to totally broken reasons - don't ask), 
and that function pointers could indeed be seen as 128-bit quantities. But 
that is such a horribly broken thing, that what compilers on ia64 actually 
do is to instead of having a 128-bit "void *" (which would be legal per 
the standard), they make function pointers actually point to the function 
description (128-bit datum) rather than the actual start of the function.

(I think. I forget the exact details. I think the whole architecture is a 
total mess, and should never have been done in the firstplace).

Similarly, there are certain tagged architectures where the pointer 
actually contains the type it points to, and again, C _allows_ that, and 
if you want to be strictly conforming, you can't do certain things that 
seem obviously correct.

HOWEVER. The undeniable fact is that no sane architecture that anybody 
cares about today (and that, in turn, implies that nobody will care about 
it in the next quarter century - these things have a tendency to 
re-inforce themselves) actually does that. 

Another example is two's complement. C as a language actually allows other 
type representations than two's complement for integers, and there's lots 
of verbiage in the standard about how overflow is undefined etc. Then they 
go to pains to explain how "unsigned" integers are guaranteed to behave as 
if the machine was a regular binary machine, even though the language 
lawyers in general went to great pain to make it clear that if the integer 
representation is binary-packed-decimal, it's still legal from a C 
stanpoint.

But again, nobody sane would ever care. The likelihood that we'll see a 
ternary machine in the next few decades is pretty damn small, because 
while the C standard allows for something else, it would be painful in the 
extreme for anybody to actually convert all the programs that effectively 
depend on 8-bit bytes etc.

So again, in _theory_ the C standard works for some really odd crap out 
there. In practice, there are only certain pretty standard setups (ILP32, 
I32LP64, IL32P64), and some old ones (I16LP32) that nobody cares about, 
and then the really odd ones (36-bit word-addressable monsters where char, 
short, int, long and pointer are all the same size) that have a C 
compiler, but that you will never be able to port _any_ normal program 
to..

In other words, the C standard allows some really strange stuff. Trying to 
even worry about it is just not worth it. It often makes the code just 
much harder to read for absolutely zero gain.

So in practice, the strangest setup you'll ever really care about is 
actually Windows. And it's strange because it can have a totally broken 
size model (IL32LLP64 - although I think that's usually just a compiler 
switch), and because it has such strange system libraries and filesystem 
behaviour (which is sadly more than just a compiler switch).

Even windows (or, perhaps, Windows _in_particular_) will never have things 
like a "char" that isn't 8 bits, etc that could be possible in theory if 
you were to just read the C standard.

		Linus

^ permalink raw reply

* Re: Possible --remove-empty bug
From: Junio C Hamano @ 2006-03-13  5:41 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Marco Costalba, git
In-Reply-To: <7vlkvfw3px.fsf@assigned-by-dhcp.cox.net>

Junio C Hamano <junkio@cox.net> writes:

> Linus Torvalds <torvalds@osdl.org> writes:
>
>> It's supposed to stop traversing the tree once a pathname disappears.
>
> Then what we should simplify is the parent commit that does not
> have those pathnames (i.e. remove parents from that parent
> commit).  In other words, currently the code removes the parent
> commit that makes the tree disappear, but we would want to keep
> that parent, remove the grandparents, and then mark the parent
> uninteresting.

Sorry, the last clause in the above comment is wrong, and does
not describe what the code attached does.

It removes the grandparents from the parent, and leaves the
parent still interesting.  As a result, in your example:

    ... if you have

                a
               / \
              b   c
               \ /
                d

    where the pathname disappeared in "b"...

we would get this world view:

                a
               / \
              b   c
                 /
                d

because when inspecting a and finding that "b" does not have any
paths we are interested in, b loses all of its parents.

^ permalink raw reply

* Fix up diffcore-rename scoring
From: Linus Torvalds @ 2006-03-13  6:26 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List


The "score" calculation for diffcore-rename was totally broken.

It scaled "score" as

	score = src_copied * MAX_SCORE / dst->size;

which means that you got a 100% similarity score even if src and dest were 
different, if just every byte of dst was copied from src, even if source 
was much larger than dst (eg we had copied 85% of the bytes, but _deleted_ 
the remaining 15%).

That's clearly bogus. We should do the score calculation relative not to 
the destination size, but to the max size of the two.

This seems to fix it.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
---
diff --git a/diffcore-rename.c b/diffcore-rename.c
index ed99fe2..e992698 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -133,7 +133,7 @@ static int estimate_similarity(struct di
 	 * match than anything else; the destination does not even
 	 * call into this function in that case.
 	 */
-	unsigned long delta_size, base_size, src_copied, literal_added;
+	unsigned long max_size, delta_size, base_size, src_copied, literal_added;
 	unsigned long delta_limit;
 	int score;
 
@@ -144,9 +144,9 @@ static int estimate_similarity(struct di
 	if (!S_ISREG(src->mode) || !S_ISREG(dst->mode))
 		return 0;
 
-	delta_size = ((src->size < dst->size) ?
-		      (dst->size - src->size) : (src->size - dst->size));
+	max_size = ((src->size > dst->size) ? src->size : dst->size);
 	base_size = ((src->size < dst->size) ? src->size : dst->size);
+	delta_size = max_size - base_size;
 
 	/* We would not consider edits that change the file size so
 	 * drastically.  delta_size must be smaller than
@@ -174,12 +174,10 @@ static int estimate_similarity(struct di
 	/* How similar are they?
 	 * what percentage of material in dst are from source?
 	 */
-	if (dst->size < src_copied)
-		score = MAX_SCORE;
-	else if (!dst->size)
+	if (!dst->size)
 		score = 0; /* should not happen */
 	else
-		score = src_copied * MAX_SCORE / dst->size;
+		score = src_copied * MAX_SCORE / max_size;
 	return score;
 }
 

^ permalink raw reply related

* Re: [PATCH] Trivial warning fix for imap-send.c
From: H. Peter Anvin @ 2006-03-13  6:37 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git, Mark Wooding
In-Reply-To: <Pine.LNX.4.64.0603122103440.3618@g5.osdl.org>

On "real" machines, the biggest reason you'd care is that a lot of 
compilers, *especially* in C++ mode, really still define NULL as "0"; 
ostensibly because defining it as "((void *)0)" breaks some obscure C++ 
casting rule.

	-hpa

^ permalink raw reply

* Re: [PATCH] Trivial warning fix for imap-send.c
From: H. Peter Anvin @ 2006-03-13  6:41 UTC (permalink / raw)
  To: Jeff King; +Cc: git
In-Reply-To: <20060313033805.GB14601@coredump.intra.peff.net>

Jeff King wrote:
> 
> I think Linus has cut to the heart of the discussion (that it's worth
> git maintainers' sanity not to worry about such problems). However, for
> pedantry's sake, this is how malloc works:
> 
> A void pointer is guaranteed to be able to hold any type of pointer
> (either char * or struct foo * or whatever). The declaration of malloc
> indicates a return of void *. On a platform where it matters, the
> compiler generates code so that 
>   struct foo *bar = malloc(100);
> converts the void * pointer into the correct size (in the same way that
> assigning between differently sized integers works).
> 

Furthermore, the return value of malloc() (calloc, realloc, ...) is 
guaranteed to be a value suitable for casting to any pointer type for 
which at least one object can fit in the allocated memory block.

	-hpa

^ permalink raw reply

* Re: Fix up diffcore-rename scoring
From: Linus Torvalds @ 2006-03-13  6:44 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List
In-Reply-To: <Pine.LNX.4.64.0603122223160.3618@g5.osdl.org>

On Sun, 12 Mar 2006, Linus Torvalds wrote:
> 
> The "score" calculation for diffcore-rename was totally broken.
> 
> It scaled "score" as
> 
> 	score = src_copied * MAX_SCORE / dst->size;
> 
> which means that you got a 100% similarity score even if src and dest were 
> different, if just every byte of dst was copied from src, even if source 
> was much larger than dst (eg we had copied 85% of the bytes, but _deleted_ 
> the remaining 15%).
> 
> That's clearly bogus. We should do the score calculation relative not to 
> the destination size, but to the max size of the two.
> 
> This seems to fix it.

Btw, interestingly, this seems to actually improve on the rename 
detection from your previous one, even though at the face of it, it 
should just have made the scores go down.

I'm not quite sure why, but perhaps it gave a bogus high score to some 
rename that wasn't very good, allowing the _real_ rename to make itself 
seen.

Or maybe I did some mistake in testing it.

		Linus

PS. You can still get a "similarity score" of 100 with the fixed scaling 
even if the source and the destination were different. That happens if 
every byte was marked as "copied" by the similarity estimator. Which can 
happen if you just move things around in the file - the end result is 
different, but all the bytes are copied from the source.

At least with the fixed heuristic, that "perfect similarity" score can be 
_somehow_ be explained. The files are very similar in that they have the 
same content, just in a different order ;)

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox