Git development
 help / color / mirror / Atom feed
* Re: [PATCH v2 5/5] bulk-checkin: replace fast-import based implementation
From: Nguyen Thai Ngoc Duy @ 2011-12-01  8:05 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <1322699263-14475-6-git-send-email-gitster@pobox.com>

On Thu, Dec 1, 2011 at 7:27 AM, Junio C Hamano <gitster@pobox.com> wrote:
> @@ -458,11 +459,15 @@ int cmd_add(int argc, const char **argv, const char *prefix)
>                free(seen);
>        }
>
> +       plug_bulk_checkin();
> +
>        exit_status |= add_files_to_cache(prefix, pathspec, flags);
>
>        if (add_new_files)
>                exit_status |= add_files(&dir, flags);
>
> +       unplug_bulk_checkin();
> +
>  finish:
>        if (active_cache_changed) {
>                if (write_cache(newfd, active_cache, active_nr) ||

Also do the same for git-commit, except as-is commit case, for
updating large files?

update-index should also learn about this, although at plumbing level,
maybe we could have an option to let users choose what to stream (i.e.
skip size check for streaming).
-- 
Duy

^ permalink raw reply

* Re: [HELP] Adding git awareness to the darning patch management system.
From: Jeff King @ 2011-12-01  6:27 UTC (permalink / raw)
  To: Peter Williams; +Cc: git
In-Reply-To: <4ED6D0DB.3060800@users.sourceforge.net>

On Thu, Dec 01, 2011 at 10:56:59AM +1000, Peter Williams wrote:

> >I'm not exactly sure what this means.
> 
> If you look at the screenshots at sourceforge (which were produced on
> top of a Mercurial repo) you'll notice that file names in the left
> most tree have letters in front of them and appear in different
> foreground colours.  These letters are the same as those returned by
> Mercurial's status command and, hence, give a Mercurial user an easy
> to understand snapshot of the status of the files in the playground.
> The colour coding is (relatively) arbitrary (and chosen by me) and is
> intended to make it easier to detect the different file statuses.
>
> My main problem is that I can't find a git file status command (and
> there are a lot of them to choose from) that gives a snapshot of the
> statuses of all files in a directory (including those not tracked or
> ignored).

Thanks, that helps. You probably just want to use "git status
--porcelain", which will show you the state of file modification with
respect to the index and the prior commit, as well as any untracked
files. See the "porcelain format" section in "git help status".

Note that "git status" will not print files which are not modified. You
may want to also run "git ls-files" to get the full listing of files,
including unmodified ones.

> A secondary problem is that, if I could cobble together statuses from
> various commands, mapping git statuses to the Mercurial ones for
> display would not be a good solution as they would not necessarily
> make sense to a git user.  (It's fairly clear to me from my inability
> to make sense of git's CLI that git users think differently to me, a
> Mercurial user, and it's unlikely that I can, without help, make a
> file tree display that makes sense to a git user.)

I'm hoping that "git status --porcelain" will give you a fairly close
mapping of the basic "what happened to this file" concept, based on what
I see in the second screenshot you mentioned.

The trickiest thing is the index, which represents an in-between state
that is not usually exposed by other version control systems. If your
tool does not make use of the index, then it probably makes sense to
just consider a path as modified if it has modifications staged in the
index or in the working tree, which maps to other VCS's idea of
"modified" (because for them, marking something as to-be-committed and
commiting it are part of the same step).

> >For this, you probably want "git diff-files --name-only", which will
> >show files with differences in the working tree. Keep in mind that git
> >has an "index" or "staging area", which means that you have three states
> >of content for a given path:
> >
> >   1. the state of the prior commit (i.e., HEAD)
> >
> >   2. the state that is marked to be committed when "git commit" is run
> >      (i.e., the index)
> >
> >   3. the state in the working tree
> 
> This is a prime example of the different mindset of the git user to
> the hg user.

You don't have to use those features, of course. It's just that
something like "git status" is going to report on the differences
between those states, so as a tool writer you need to know they are
there (and as I said above, you are free to simplify if it fits into the
mental model of your tool).

> >You can compare the first two with "git diff-index", and the latter two
> >with "git diff-files". You can also use "git status --porcelain" to get
> >a machine-readable output that shows how the three states match up, with
> >one line per file.
> 
> This is an example of why I'm confused.  There are too many ways to
> do (similar) things and it's hard to know which to use.

Git is made of little building blocks. The original way to see the
differences between the index and the working tree was via diff-files.
But then people build bigger building blocks out of the smaller ones.
"git status" is really just a shorthand for:

  git diff-index HEAD &&
  git diff-files &&
  git ls-files -o

and is in fact implemented using those building blocks (originally as a
shell script, though these days it is written in C). So you can choose
either and get the same information. Choosing a higher-level building
block may save you some work, if the abstraction matches what you want.
Otherwise, you can compose what you want from the lower levels.

I know it sometimes leads to an overwhelming number of commands, and I'm
not trying to excuse git's tendency to confuse people. I'm just hoping
to unconfuse you in this particular situation.

In your case, I think "status" is the most convenient level of
abstraction for you, because you are interesting in looking at
differences to both the index and HEAD (i.e., the prior commit). But if
you find as you implement that want more flexibility, you can switch to
using the lower-level commands yourself.

> Maybe an example of why I think the feature is useful might help.
> Say that you start editing a file and then decide that you want to
> put this change into a patch rather than committing it.  If you were
> using quilt you would have to do this manually by any of a number or
> ways such as:
> 
> $ <git diff command> file > temp.patch
> $ <git revert command> file
> $ quilt new one.patch
> $ quilt add file
> $ patch -p1 file < temp.patch
> $ rm temp.patch
> 
> In darning, you just do:
> 
> $ darn new one.patch
> $ darn add --absorb file

Sure. We have stgit and topgit, which do similar patch management things
on top of git. I don't personally user either, though, so I don't have
much to say on how they compare to darning, or whether it is worth
looking at their implementations.

> The interface to the SCM to support this is two functions:
> 
> 1: get_files_with_uncommitted_changes() which called with no
> arguments returns a list of the paths of all files with uncommitted
> changes or when given a list of file paths (the more common case)
> returns the subset of that list which have uncommitted changes; and

"status" will do this for you, modulo the simplification of the concept
of the index, as we discussed above.

> 2. copy_clean_version_to(filepath, target_path) which makes a copy of
> the file as recorded in the prior commit and places it at the
> target_path (usually where darning stores the "original" for
> reference when creating diffs).

You probably want:

  git cat-file blob HEAD:filepath >target_path

-Peff

^ permalink raw reply

* Re: [PATCH] Implement fast hash-collision detection
From: Jeff King @ 2011-12-01  5:26 UTC (permalink / raw)
  To: Bill Zaumen; +Cc: git, gitster, pclouds, spearce, torvalds
In-Reply-To: <1322700075.1710.157.camel@yos>

On Wed, Nov 30, 2011 at 04:41:15PM -0800, Bill Zaumen wrote:

> Aside from that, suppose the attacker does what you suggests (providing
> a valid CRC so that git commands like verify-pack don't have an error
> to detect).  You can't tell that something is wrong, but Linus can - 
> the next time he does a fetch.  If he fetches, the server sends
> some SHA-1 hashes and the client responds with 'have' or 'want' in a
> reply.  If the client wants it, the client doesn't have a CRC, if the
> client sends 'have', the CRCs are available so those get sent.  The
> server then checks, notices a mismatch (probably in the CRC of the CRCs
> of all the blobs in the commit's tree), and generates an error.

OK. I thought your claim was that this would help detect
collision-related forgeries in the chain of hashes while verifying a
signed tag. But your claim is only that your scheme gives people who
already have the objects an additional chance to notice discrepencies in
what the server is serving and what they have. And then they can notify
other people of the problem in an out-of-band way (i.e., telling
kernel.org admins that the system is hacked, or telling users not to
fetch from there).

Cryptographically speaking, I think that claim is sound, and you can
certainly construct attack scenarios where this detection would help.
However, quantifying the effectiveness is difficult.  What is the
likelihood of a malicious collided-object replacement being detected
without your scheme? What is it with it?

There are many questions that factor into guessing the latter.

How often does Linus fetch from his own kernel.org repository? He would
usually push, I would think.  Even if he does fetch, he wouldn't be
getting the old objects that he already has. I guess this is the reason
for your digest-of-digests for each commit object sent? But what about
objects that are no longer in current commits, but are in older ones?
E.g., v1.0 of foo.c has sha1 X, and an attacker finds a collision and
replaces it. Meanwhile, the tip of development now has replaced foo.c
with X'. When Linus talks to the server, git will never care about X (it
is neither being transmitted, nor is part of a commit that is being
transmitted).  But people may still be cloning and using the v1.0 tag,
assuming it is valid.

What about the server being more clever about hiding the replacement
object? E.g., instead of just breaking into kernel.org and inserting a
replacement object, the attacker runs a malicious git-daemon that
returns the bogus object to cloners, but the real object to fetchers.
Thus there is nothing for Linus to detect, but new cloners get the
malicious object. Or you could give the bogus object to people who are
getting the object for the first time (since they presumably have no
crc for that object), but otherwise use the real object.

You could get around this deception by pretending to be a "victim";
i.e., cloning fresh and seeing what the server gives you, comparing it
to your known-good repository. Which is similar to what you suggest
here:

> It's also possible to write some additional commands to (for example)
> fetch the SHA-1 hashes and CRCs from all remote repositories you use
> and compare these to make sure they are all consistent, something that
> can be run ocassionally.

But we can already do that. Assume you have an existing repo "foo". To
verify the copy at git://example.com/foo.git, do a fresh clone to
"bar", and then compare the objects in "foo" to "bar", either byte-wise
or by digest.

> That result for birthday attack assumes the goal is to find two files
> that will have the same SHA-1 value (or SHA-1 + CRC).  The case I was
> talking about (apologies if I did not state this clearly) is that you
> have an existing object and an existing CRC and you want to generate
> a second object with the same SHA-1 and same CRC as the first.  A
> birthday attack doesn't work so well in that case - the number of tries
> is much higher than half the number of bits in the digest.

Yes. This is called a "pre-image" attack (to be more specific, it is a
"second pre-image attack", since you have the original object). And
AFAIK, it's not feasible against SHA-1, nor will it be in the near
future (even MD5, which is considered totally broken, does not have
feasible pre-image attacks).

> http://en.wikipedia.org/wiki/Birthday_attack#Digital_signature_susceptibility
> has a discussion regarding digital signatures.  The trick is for a
> person to create a large number of variations of a "fair" and "unfair"
> contract, and use a birthday attack to find a pair that have the same
> hash.  The variations are typically inconsequential changes (extra
> spacing, commas, etc.)

Right. This is the classic birthday attack. I don't keep up with the
state of the art in collision attacks these days, but I think in
practice they are usually executed by adding arbitrary binary garbage
into a "hidden" spot in a file. Which makes it much harder to execute
against text files like source code (as opposed to, say, a binary
firmware blob). IIRC, the practical MD5 attacks involved postscript
documents (with non-printing garbage sections that printers would
ignore).

> In the case I was discussing, a developer creates some code,  commits
> it and pushes it to a shared repository - the developer is not given
> code by the attacker.  The attacker can, however, see the code by
> fetching it.  An attack then consists of generating a collision,
> change the object in the attacker's local repository, and then push
> the original developer's commit (with the modified object) to another
> shared repository before someone else puts the correct objects into
> that repository.  A birthday attack does not work in this case.

Yeah, that is simply not feasible, and is not likely to become so
anytime soon.

> There one issue that this suggests however - it is not clear if the
> 2^57 number given for the best SHA-1 attacks were attempts to generate
> a new file with the same SHA-1 hash as an existing file or a pair of
> files that have the same SHA-1 hash.  If it is the latter, then an
> attack is significantly harder than I assumed as a worst case, but
> still possibly much, much better than brute force.

The 2^57 number is for a collision of two new objects, not for a
pre-image attack. AFAIK, there are currently no pre-image attacks on
SHA-1 at all.

I don't think there's a need to response individually to the points in
the rest of your email; they're all based on the assumption that the
attacker is replacing a known-good written-by-Linus commit with one
found in a pre-image attack.

-Peff

^ permalink raw reply

* Re: [PATCH/RFC 1/2] pull: pass the --no-ff-only flag through to merge, not fetch
From: Junio C Hamano @ 2011-12-01  4:58 UTC (permalink / raw)
  To: Samuel Bronson; +Cc: git
In-Reply-To: <1322703537-3914-1-git-send-email-naesten@gmail.com>

Samuel Bronson <naesten@gmail.com> writes:

> Without this, pull becomes unusable for merging branches when the config
> option `merge.ff` is set to `only`.
>
> Signed-off-by: Samuel Bronson <naesten@gmail.com>

I wonder why you need this. We have "git config --unset merge.ff" after
all. From purely mechanstic point of view, being able to temporarily
defeat a configuration variable makes perfect sense, but from the point of
view of workflow, I am not sure if it is a good thing to even allow it in
the first place in this particular case.

Setting merge.ff to 'only' means you are following along other people's
work and making nothing new on your own in this particular repository, no?
Hence you won't be asking the upstream to pull from this repository, which
in turn means that even if you made a merge by temporarily defeating the
configuration setting with this patch, your future pulls will no longer
fast forward, until somehow the upstream gets hold of your merge commit.

By the way (this is a digression), I also have to say --no-ff-only is too
*ugly* as a UI element, even though I know "git merge" already allows it
by accident.

"ff" is a tristate. By default, fast-forward is done when appropriate, and
people can _refuse_ to fast-forward and force Git to create an extra merge
commit. Or if you are strictly following along, you can say you _require_
fast-forward and reject anything else. So it may make the UI saner if we
updated the UI to allow users to say:

	--ff=normal	the default
        --ff=never	same as --no-ff that forces an extra merge commit
        --ff=required	same as --ff-only

while keeping the current --ff-only and --no-ff as backward compatibility
wart.

I dunno.

^ permalink raw reply

* Re: [PATCH] Implement fast hash-collision detection
From: Nguyen Thai Ngoc Duy @ 2011-12-01  4:43 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, Bill Zaumen, git, spearce, torvalds
In-Reply-To: <7vzkfdqye3.fsf@alter.siamese.dyndns.org>

On Thu, Dec 1, 2011 at 1:05 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:
>
>> What I'm thinking is whether it's possible to decouple two sha-1 roles
>> in git, as object identifier and digest, separately.
>
> Why it would be a good thing? If you have a collided identifier, somebody
> has to choose which blob a particular tree wants to have at the path, and
> because the tree would not record anything but the identifier, you cannot.

Accidental collision likelihood is small enough we don't have to care about.

>> ...
>> The day sha-1 is broken, a project can generate new digests from its
>> old good repo and enforce developers to use new digests for
>> verification instead of sha-1. sha-1 is still used by git as
>> identifier after that day.
>
> And an old blob that is identified with a SHA-1 now has a new blob that
> has different contents but happens to have the same SHA-1. How does Git
> decide which blob to use when a particular object is named by the SHA-1?

Again, I assume the likelihood that a content happens to have the same
sha-1 with another one is too low to care about. If they are, it's
must be an attack. We do not allow malicious objects to enter in the
first place using other digests. Once objects are in, they are safe to
use.
-- 
Duy

^ permalink raw reply

* Re: [PATCHv2 0/4] git-p4: small fixes to branches and labels; tests
From: Pete Wyckoff @ 2011-12-01  4:02 UTC (permalink / raw)
  To: Vitor Antunes; +Cc: Luke Diamand, git
In-Reply-To: <CAOpHH-U3PF0v7EPqnO0FNxNKh+uF1GH=cnA_MA09SaQTxo0cDA@mail.gmail.com>

vitor.hda@gmail.com wrote on Thu, 01 Dec 2011 00:33 +0000:
> On Wed, Nov 30, 2011 at 10:58 PM, Pete Wyckoff <pw@padd.com> wrote:
> > This is another fundamental disconnect between p4 and git.
> > Reading
> >
> > http://www.perforce.com/perforce/doc.current/manuals/p4guide/07_labels.html
> >
> > it is clear that labels are supposed to be used exactly where
> > tags cannot:  to specify a collection of files as they existed
> > at _different_ points in the commit history.
> 
> Check the "Use Tag Fixup Branches" section in fast-import manual, it
> might help on this. The basic concept is to create a special branch
> that puts all files in the same state the P4 label would put them and
> then tag it in git.
> 
> Tried to use this for my branch stuff, but with no success.

Interesting, thanks.  One could certainly construct any arbitrary
tree to represent the tag.  But it would be a truly evil merge of
possibly many commits.

> > Thus I think supporting labels is kind of pointless.  But in the
> > restricted use case that perforce docs tell us not to do, namely
> > using labels to identify change numbers, git can reflect that
> > with tags.
> 
> I still use labels as simple tags. Telling that we should use
> changelists instead of labels is the same as saying that we should use
> IP addresses instead of host names. It works, but I doubt you will
> ever remember it unless you write it down somewhere.

I see your point.  P4 labels are the only way that they support
tagging, apparently.  I'm okay with leaving label support in
git-p4.  And it will be nice if Luke makes it behave a bit
better.  But doing heroics to emulate cross-commit tags feels
like a lot of work, and the wrong direction.

		-- Pete

^ permalink raw reply

* [PATCH] git-svn.perl: close the edit for propedits even with no mods
From: Steven Walter @ 2011-12-01  2:37 UTC (permalink / raw)
  To: rmalperson, git; +Cc: Steven Walter

It's legitimate to update the mergeinfo property without
actually changing any files.  This can happen when changes are
backported to a branch, and then that branch is merged back
into mainline.  We still want to record the updated mergeinfo
for book-keeping.

Signed-off-by: Steven Walter <stevenrwalter@gmail.com>
---
 git-svn.perl |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/git-svn.perl b/git-svn.perl
index 177dd25..cd64659 100755
--- a/git-svn.perl
+++ b/git-svn.perl
@@ -4867,7 +4867,7 @@ sub apply_diff {
 			               $self->{mergeinfo});
 	}
 	$self->rmdirs if $_rmdir;
-	if (@$mods == 0) {
+	if (@$mods == 0 && !defined($self->{mergeinfo})) {
 		$self->abort_edit;
 	} else {
 		$self->close_edit;
-- 
1.7.5.4

^ permalink raw reply related

* Re: [HELP] Adding git awareness to the darning patch management system.
From: Peter Williams @ 2011-11-30 23:47 UTC (permalink / raw)
  To: Tay Ray Chuan; +Cc: git
In-Reply-To: <CALUzUxrcWB62jARtgifTOocGL4gEGXDFMiO=ppHXzFyUmb+3ww@mail.gmail.com>

On 30/11/11 19:04, Tay Ray Chuan wrote:
> On Wed, Nov 30, 2011 at 10:17 AM, Peter Williams
> <peter_ono@users.sourceforge.net>  wrote:
>> I'm the author of the darning patch management system
>> <http://darning.sourceforge.net/>  and would like some help adding git
>> awareness to the system.  At this stage of the development, "awareness" is
>> fairly simple concept with two broad aims:
>
> That link says it "combines the strengths of quilt and mq and
> eliminates their weaknesses", but I don't see any info on why this is
> the case with your program;

Documenting this is on my todo list (but not at the top).

Quilt's weaknesses are documented on their website and my issues with MQ 
is that it (potentially) compromises the Mercurial repository and others 
to which it may push.  (MQ also has the problem that its current 
maintainers don't understand the workflow for which it was designed and 
are changing it to suit a different workflow.)

The repository compromise issue also applies to stgit.

Other things darning adds are:

1. tracking of copies and renames (not in quilt),
2. tracking changes to files' modes (not in quilt),
3. binary diffs (using git binary diff format for import/export), and
4. help managing addition of trailing whitespace by patches.

> it would be great it if you could provide
> a quick start guide (probably easier to show this with some commands
> in the CLI rather than GUI screenshots).

At this stage, the CLI is really only there to allow me to do script 
based testing of darning's internals and is not a fully capable 
interface.  However, if you look in the directory test-cli for files 
with a ".test" suffix you will find examples of their use.

>
> Have a look at StGit's tutorial
> http://www.procode.org/stgit/doc/tutorial.html (very similar to quilt
> and mq too) to see what I mean.

OK

Peter

^ permalink raw reply

* [PATCH/RFC 2/2] merge, pull: Document the --no-ff-only merge option
From: Samuel Bronson @ 2011-12-01  1:38 UTC (permalink / raw)
  To: git; +Cc: Samuel Bronson
In-Reply-To: <1322703537-3914-1-git-send-email-naesten@gmail.com>

Now that --no-ff-only will work with pull and not just with merge, we
should probably tell the users about it.

Signed-off-by: Samuel Bronson <naesten@gmail.com>
---
 Documentation/merge-options.txt |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/Documentation/merge-options.txt b/Documentation/merge-options.txt
index 6bd0b04..2adc4f5 100644
--- a/Documentation/merge-options.txt
+++ b/Documentation/merge-options.txt
@@ -56,9 +56,13 @@ With --no-squash perform the merge and commit the result. This
 option can be used to override --squash.
 
 --ff-only::
+--no-ff-only::
 	Refuse to merge and exit with a non-zero status unless the
 	current `HEAD` is already up-to-date or the merge can be
 	resolved as a fast-forward.
++
+With --no-ff-only, don't refuse. This is can be used to override
+--ff-only, or the "`only`" value for the `merge.ff` configuration option.
 
 -s <strategy>::
 --strategy=<strategy>::
-- 
1.7.7.3

^ permalink raw reply related

* [PATCH/RFC 1/2] pull: pass the --no-ff-only flag through to merge, not fetch
From: Samuel Bronson @ 2011-12-01  1:38 UTC (permalink / raw)
  To: git; +Cc: Samuel Bronson

Without this, pull becomes unusable for merging branches when the config
option `merge.ff` is set to `only`.

Signed-off-by: Samuel Bronson <naesten@gmail.com>
---
 git-pull.sh |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/git-pull.sh b/git-pull.sh
index 9868a0b..a09fcbc 100755
--- a/git-pull.sh
+++ b/git-pull.sh
@@ -76,6 +76,8 @@ do
 		no_ff=--no-ff ;;
 	--ff-only)
 		ff_only=--ff-only ;;
+	--no-ff-only)
+		ff_only=--no-ff-only ;;
 	-s=*|--s=*|--st=*|--str=*|--stra=*|--strat=*|--strate=*|\
 		--strateg=*|--strategy=*|\
 	-s|--s|--st|--str|--stra|--strat|--strate|--strateg|--strategy)
-- 
1.7.7.3

^ permalink raw reply related

* Re: [HELP] Adding git awareness to the darning patch management system.
From: Peter Williams @ 2011-12-01  0:56 UTC (permalink / raw)
  To: Jeff King; +Cc: git
In-Reply-To: <20111130072248.GG5317@sigill.intra.peff.net>

On 30/11/11 17:22, Jeff King wrote:
> On Wed, Nov 30, 2011 at 12:17:22PM +1000, Peter Williams wrote:
>
>> 1. presenting the file tree of the sources being patched in a way
>> that makes sense to the user including the current status of files
>> from the point of view of the underlying SCM (in this case, git), and
>
> I'm not exactly sure what this means.

If you look at the screenshots at sourceforge (which were produced on 
top of a Mercurial repo) you'll notice that file names in the left most 
tree have letters in front of them and appear in different foreground 
colours.  These letters are the same as those returned by Mercurial's 
status command and, hence, give a Mercurial user an easy to understand 
snapshot of the status of the files in the playground.  The colour 
coding is (relatively) arbitrary (and chosen by me) and is intended to 
make it easier to detect the different file statuses.

My main problem is that I can't find a git file status command (and 
there are a lot of them to choose from) that gives a snapshot of the 
statuses of all files in a directory (including those not tracked or 
ignored).  A secondary problem is that, if I could cobble together 
statuses from various commands, mapping git statuses to the Mercurial 
ones for display would not be a good solution as they would not 
necessarily make sense to a git user.  (It's fairly clear to me from my 
inability to make sense of git's CLI that git users think differently to 
me, a Mercurial user, and it's unlikely that I can, without help, make a 
file tree display that makes sense to a git user.)

>
>> 2. detecting files with uncommitted changes (from the SCM's point of
>> view) when the user adds them to a patch (or pushes a patch that
>> contains them) so that they may be alerted to the fact and offered
>> the choice of absorbing the uncommitted changes into the patch (or
>> not).
>
> For this, you probably want "git diff-files --name-only", which will
> show files with differences in the working tree. Keep in mind that git
> has an "index" or "staging area", which means that you have three states
> of content for a given path:
>
>    1. the state of the prior commit (i.e., HEAD)
>
>    2. the state that is marked to be committed when "git commit" is run
>       (i.e., the index)
>
>    3. the state in the working tree

This is a prime example of the different mindset of the git user to the 
hg user.

>
> You can compare the first two with "git diff-index", and the latter two
> with "git diff-files". You can also use "git status --porcelain" to get
> a machine-readable output that shows how the three states match up, with
> one line per file.

This is an example of why I'm confused.  There are too many ways to do 
(similar) things and it's hard to know which to use.

>
>> I've already implemented this interface for Mercurial (with which I
>> am familiar) and looked at doing the same with git but had difficulty
>> discovering the definitive mechanisms for obtaining the necessary
>> data.  So I'm soliciting your help in overcoming these problems.
>
> I hope the above helps you some. If not, just ask. It might be easier to
> understand what you are looking for if you can give concrete examples.

Maybe an example of why I think the feature is useful might help.  Say 
that you start editing a file and then decide that you want to put this 
change into a patch rather than committing it.  If you were using quilt 
you would have to do this manually by any of a number or ways such as:

$ <git diff command> file > temp.patch
$ <git revert command> file
$ quilt new one.patch
$ quilt add file
$ patch -p1 file < temp.patch
$ rm temp.patch

In darning, you just do:

$ darn new one.patch
$ darn add --absorb file

If you're using the GUI (the primary interface), it will report that the 
file has uncommitted changes and offer the choice of absorbing the 
changes into the new patch, forcing the new patch to consider the 
current file state as its starting point or (of course) cancel the 
addition.  The CLI command will fail if an attempt to add a file which 
has uncommitted changes is made unless either the --absorb or --force 
(which uses the file's current content as the starting point from the 
patches point of view) options are used.  (So, whichever interface is in 
use, you have to explicitly state how you want uncommitted changes to be 
treated.

The interface to the SCM to support this is two functions:

1: get_files_with_uncommitted_changes() which called with no arguments 
returns a list of the paths of all files with uncommitted changes or 
when given a list of file paths (the more common case) returns the 
subset of that list which have uncommitted changes; and

2. copy_clean_version_to(filepath, target_path) which makes a copy of 
the file as recorded in the prior commit and places it at the 
target_path (usually where darning stores the "original" for reference 
when creating diffs).

A similar mechanism is in place for the case where a file is added to a 
patch and the file is in an underlying patch with unrefreshed changes 
but this requires no help from the underlying SCM.

Both of these mechanisms also come into play when a patch is 
pushed/applied so that the user has (relatively painless) control over 
which changes end up in which patch.

With MQ and the above example, the file would be automatically added to 
the current patch (or a new patch if you created one) absorbing the 
changes whether you wanted it to or not.  I.e. there is no way of 
creating MQ patches that don't automatically absorb all uncommitted changes.

Thanks for your reply,
Peter
PS Darning can be used on top of git repository without "git awareness" 
but is not as useful as it would be with it.

^ permalink raw reply

* Re: [PATCH] Implement fast hash-collision detection
From: Bill Zaumen @ 2011-12-01  0:41 UTC (permalink / raw)
  To: Jeff King; +Cc: git, gitster, pclouds, spearce, torvalds
In-Reply-To: <20111130062512.GA5317@sigill.intra.peff.net>

On Wed, 2011-11-30 at 01:25 -0500, Jeff King wrote:
> On Tue, Nov 29, 2011 at 01:56:28PM -0800, Bill Zaumen wrote:
> But I
> think the important attacks bypass your CRC anyway. Consider this attack
> scenario:
> 
>   1. Linus signs a tag (or a commit) and pushes it to kernel.org.
> 
>   2. kernel.org gets hacked, and the attacker replaces an object with
>      an evil colliding version[1].
> 
>   3. I clone from kernel.org, and run "git tag --verify". Git says it's
>      OK, because the signature checks out, but I have a bogus object.
> 
> How does your CRC help? If I understand your scheme correctly,
> kernel.org will have told me the CRC of all of the objects during the
> clone. But that isn't part of what Linus signed, so the attacker in step
> 2 could just as easily have overwritten kernel.org's crc file, and the
> signature will remain valid.

First, there is a misconception - the server will not tell you the CRC.
The CRC will be computed locally by the client instead.  

Aside from that, suppose the attacker does what you suggests (providing
a valid CRC so that git commands like verify-pack don't have an error
to detect).  You can't tell that something is wrong, but Linus can - 
the next time he does a fetch.  If he fetches, the server sends
some SHA-1 hashes and the client responds with 'have' or 'want' in a
reply.  If the client wants it, the client doesn't have a CRC, if the
client sends 'have', the CRCs are available so those get sent.  The
server then checks, notices a mismatch (probably in the CRC of the CRCs
of all the blobs in the commit's tree), and generates an error.

It's also possible to write some additional commands to (for example)
fetch the SHA-1 hashes and CRCs from all remote repositories you use
and compare these to make sure they are all consistent, something that
can be run ocassionally.


> > An efficient algorithm to do both simultaneously does not yet exist.
> > So, if we could generate a SHA-1 collision in one second, it would
> > presumably take billions of seconds (many decades of continuous
> > computation) to generate a SHA-1 hash with the same CRC, and well
> > before a year has elapsed, the original object should have been in all
> > the repositories, preventing a forged object from being inserted. Of
> > course, eventually you might need a real message digest.
> 
> This is wrong, for two reasons.
> 
>   1. The method for generating an object that collides in both sha-1 and
>      CRC is not necessarily to generate a colliding sha-1 and then do a
>      pre-image attack on the CRC. It is to do a birthday attack on the
>      sha-1 and the CRC together. Which halves the bit-strength of the
>      CRC to 16 bits (just as we can generally find collisions in 160-bit
>      sha1s in 2^80). 

That result for birthday attack assumes the goal is to find two files
that will have the same SHA-1 value (or SHA-1 + CRC).  The case I was
talking about (apologies if I did not state this clearly) is that you
have an existing object and an existing CRC and you want to generate
a second object with the same SHA-1 and same CRC as the first.  A
birthday attack doesn't work so well in that case - the number of tries
is much higher than half the number of bits in the digest.

http://en.wikipedia.org/wiki/Birthday_attack#Digital_signature_susceptibility
has a discussion regarding digital signatures.  The trick is for a
person to create a large number of variations of a "fair" and "unfair"
contract, and use a birthday attack to find a pair that have the same
hash.  The variations are typically inconsequential changes (extra
spacing, commas, etc.)  In the case I was discussing, a developer
creates some code,  commits it and pushes it to a shared repository -
the developer is not given code by the attacker.  The attacker can,
however, see the code by fetching it.  An attack then consists of
generating a collision, change the object in the attacker's local
repository, and then push the original developer's commit (with the
modified object) to another shared repository before someone else
puts the correct objects into that repository.  A birthday attack
does not work in this case.

There one issue that this suggests however - it is not clear if the
2^57 number given for the best SHA-1 attacks were attempts to generate
a new file with the same SHA-1 hash as an existing file or a pair of
files that have the same SHA-1 hash.  If it is the latter, then an
attack is significantly harder than I assumed as a worst case, but
still possibly much, much better than brute force.

http://en.wikipedia.org/wiki/Birthday_problem has an analysis of the
birthday problem (the basis of a birthday attack) and clearly notes that
this is different than the "same birthday as you" variation - you don't
do nearly so well in that case.

> Anyway, all of that is just reiterating that CRC should not be used
>      as a security function. It can easily be replaced in your scheme by
>      sha-256, which does have the desired properties.

Oh, I'm perfectly happy with sha-256 (and indicated that in the
documentation that is in the patch) - there's a tradeoff between error
detection and speed, and I simply guessed that the community was more
concerned with speed.

>   2. Your attack seems to be "find the sha-1 collision, publish one of
>      your colliding objects (i.e., the innocent-looking half), then try
>      to break the CRC". And then you claim that by the time you find the
>      CRC, everybody will already have the object.
> 
>      But wouldn't a smarter attack be to first find the collision, including
>      the CRC, and only _then_ start the attack? Then nobody will have
>      the object.

My assumption was that legitimate developer wrote some code, put it in
one of the remote repositories, and than an attacker downloaded that
code and tried to get a modified version into a different remote
repository.

> 
>      Moreover, it's not true that after a year everyone will have the
>      object. People still run "git clone" against kernel.org. Those
>      repos do not have the object.

This isn't the problem - a clone is going to be an exact copy of an
existing repository.  I was referring to the case where a commit made it
into one remote repository but not another - to get into the second one,
someone else would have had to review the code, creating a window where
an attacker with write access to the second repository could put the
modified object there.

> Right, but we are assuming that sha1 is broken. That's the whole
> security problem. So the existing digest is not worth much.

The assumption was more that "J Random Hacker's" code would be
trusted far less than code submitted by Linus, so if "J Random Hacker"
can generate create a replacement file with the same SHA-1 hash as
one in one of Linus' commits, others will initially assume that Linus
wrote that code. But, Git uses a "first in wins" rule so the bad guy
has to generate the replacement file and get it inserted before Linus'
commit reaches all the shared repositories for the project.  A SHA-1
collision is harmless if computed too late to get in.

> > Second, any value of the CRC that is stored permanently (baring bugs,
> > in my implementation, of course) is computed locally 

> But if I don't already have the object, then I have nothing to compare
> against. So when I get it from kernel.org, I have to simply accept that
> the object I'm getting is good, and write it into my object db.

The value is computed locally but can be compared remotely.

You (specifically) won't find anything while talking to kernel.org in
this example, but when you try to fetch from a different repository,
instead of just noting that you have the object, you'll get a
notification that there was a collision, so you'll find the problem
earlier than you would otherwise.  Also, anyone who had the right object
from kernel.org (the object before the repository was 'hacked' would get
a warning when trying to fetch from kernel.org after the change.

> Yes, the header has to go at the end of the existing headers. But I
> don't see any reason that would be a problem for the scheme I described.

Thanks for mentioning it - I tried putting the header in multiple
locations and probably missed the one spot where it would work.

> Current git should ignore headers that it doesn't understand. I haven't
> tested this, but Junio recently has been experimenting with
> gpg-signature lines in commits, and I'm pretty sure he checked that
> older gits properly ignore them.

Possibly there was another bug in the test I tried as well - something
was preventing the directory cleanup in one of the 'notes' test.

Anyway, thanks for the comments.

Bill

^ permalink raw reply

* Re: [PATCHv2 0/4] git-p4: small fixes to branches and labels; tests
From: Vitor Antunes @ 2011-12-01  0:37 UTC (permalink / raw)
  To: Pete Wyckoff; +Cc: Luke Diamand, git
In-Reply-To: <20111130230007.GA11598@arf.padd.com>

On Wed, Nov 30, 2011 at 11:00 PM, Pete Wyckoff <pw@padd.com> wrote:
> And avoids collision with some Vitor code that will get
> added eventually.

I'm starting to doubt I will ever be able to overcome the fast-import
limitation on not allowing branch delesetion. Sure, the code I wrote was
garbage! But they seem to be very relunctant on the concept of deleting
branches on the fly.
Did you ever take a look at the patch I sent? Maybe you could help me
shape it up a bit.

-- 
Vitor Antunes

^ permalink raw reply

* Re: [PATCHv2 0/4] git-p4: small fixes to branches and labels; tests
From: Vitor Antunes @ 2011-12-01  0:33 UTC (permalink / raw)
  To: Pete Wyckoff; +Cc: Luke Diamand, git
In-Reply-To: <20111130225813.GA11544@arf.padd.com>

On Wed, Nov 30, 2011 at 10:58 PM, Pete Wyckoff <pw@padd.com> wrote:
> This is another fundamental disconnect between p4 and git.
> Reading
>
> http://www.perforce.com/perforce/doc.current/manuals/p4guide/07_labels.html
>
> it is clear that labels are supposed to be used exactly where
> tags cannot:  to specify a collection of files as they existed
> at _different_ points in the commit history.

Check the "Use Tag Fixup Branches" section in fast-import manual, it
might help on this. The basic concept is to create a special branch
that puts all files in the same state the P4 label would put them and
then tag it in git.

Tried to use this for my branch stuff, but with no success.

> Thus I think supporting labels is kind of pointless.  But in the
> restricted use case that perforce docs tell us not to do, namely
> using labels to identify change numbers, git can reflect that
> with tags.

I still use labels as simple tags. Telling that we should use
changelists instead of labels is the same as saying that we should use
IP addresses instead of host names. It works, but I doubt you will
ever remember it unless you write it down somewhere.

-- 
Vitor Antunes

^ permalink raw reply

* [PATCH v2 5/5] bulk-checkin: replace fast-import based implementation
From: Junio C Hamano @ 2011-12-01  0:27 UTC (permalink / raw)
  To: git
In-Reply-To: <1322699263-14475-1-git-send-email-gitster@pobox.com>

This extends the earlier approach to stream a large file directly from the
filesystem to its own packfile, and allows "git add" to send large files
directly into a single pack. Older code used to spawn fast-import, but the
new bulk-checkin API replaces it.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Makefile               |    2 +
 builtin/add.c          |    5 +
 builtin/pack-objects.c |    6 +-
 bulk-checkin.c         |  275 ++++++++++++++++++++++++++++++++++++++++++++++++
 bulk-checkin.h         |   16 +++
 cache.h                |    2 +
 config.c               |    4 +
 environment.c          |    1 +
 sha1_file.c            |   67 +-----------
 t/t1050-large.sh       |   94 +++++++++++++++--
 zlib.c                 |    9 ++-
 11 files changed, 403 insertions(+), 78 deletions(-)
 create mode 100644 bulk-checkin.c
 create mode 100644 bulk-checkin.h

diff --git a/Makefile b/Makefile
index 3139c19..418dd2e 100644
--- a/Makefile
+++ b/Makefile
@@ -505,6 +505,7 @@ LIB_H += argv-array.h
 LIB_H += attr.h
 LIB_H += blob.h
 LIB_H += builtin.h
+LIB_H += bulk-checkin.h
 LIB_H += cache.h
 LIB_H += cache-tree.h
 LIB_H += color.h
@@ -591,6 +592,7 @@ LIB_OBJS += base85.o
 LIB_OBJS += bisect.o
 LIB_OBJS += blob.o
 LIB_OBJS += branch.o
+LIB_OBJS += bulk-checkin.o
 LIB_OBJS += bundle.o
 LIB_OBJS += cache-tree.o
 LIB_OBJS += color.o
diff --git a/builtin/add.c b/builtin/add.c
index c59b0c9..1c42900 100644
--- a/builtin/add.c
+++ b/builtin/add.c
@@ -13,6 +13,7 @@
 #include "diff.h"
 #include "diffcore.h"
 #include "revision.h"
+#include "bulk-checkin.h"
 
 static const char * const builtin_add_usage[] = {
 	"git add [options] [--] <filepattern>...",
@@ -458,11 +459,15 @@ int cmd_add(int argc, const char **argv, const char *prefix)
 		free(seen);
 	}
 
+	plug_bulk_checkin();
+
 	exit_status |= add_files_to_cache(prefix, pathspec, flags);
 
 	if (add_new_files)
 		exit_status |= add_files(&dir, flags);
 
+	unplug_bulk_checkin();
+
  finish:
 	if (active_cache_changed) {
 		if (write_cache(newfd, active_cache, active_nr) ||
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index b458b6d..dde913e 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -76,7 +76,7 @@ static struct pack_idx_option pack_idx_opts;
 static const char *base_name;
 static int progress = 1;
 static int window = 10;
-static unsigned long pack_size_limit, pack_size_limit_cfg;
+static unsigned long pack_size_limit;
 static int depth = 50;
 static int delta_search_threads;
 static int pack_to_stdout;
@@ -2009,10 +2009,6 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 			    pack_idx_opts.version);
 		return 0;
 	}
-	if (!strcmp(k, "pack.packsizelimit")) {
-		pack_size_limit_cfg = git_config_ulong(k, v);
-		return 0;
-	}
 	return git_default_config(k, v, cb);
 }
 
diff --git a/bulk-checkin.c b/bulk-checkin.c
new file mode 100644
index 0000000..807463e
--- /dev/null
+++ b/bulk-checkin.c
@@ -0,0 +1,275 @@
+/*
+ * Copyright (c) 2011, Google Inc.
+ */
+#include "bulk-checkin.h"
+#include "csum-file.h"
+#include "pack.h"
+
+static int pack_compression_level = Z_DEFAULT_COMPRESSION;
+
+static struct bulk_checkin_state {
+	unsigned plugged:1;
+
+	char *pack_tmp_name;
+	struct sha1file *f;
+	off_t offset;
+	struct pack_idx_option pack_idx_opts;
+
+	struct pack_idx_entry **written;
+	uint32_t alloc_written;
+	uint32_t nr_written;
+} state;
+
+static void finish_bulk_checkin(struct bulk_checkin_state *state)
+{
+	unsigned char sha1[20];
+	char packname[PATH_MAX];
+	int i;
+
+	if (!state->f)
+		return;
+
+	if (state->nr_written == 0) {
+		close(state->f->fd);
+		unlink(state->pack_tmp_name);
+		goto clear_exit;
+	} else if (state->nr_written == 1) {
+		sha1close(state->f, sha1, CSUM_FSYNC);
+	} else {
+		int fd = sha1close(state->f, sha1, 0);
+		fixup_pack_header_footer(fd, sha1, state->pack_tmp_name,
+					 state->nr_written, sha1,
+					 state->offset);
+		close(fd);
+	}
+
+	sprintf(packname, "%s/pack/pack-", get_object_directory());
+	finish_tmp_packfile(packname, state->pack_tmp_name,
+			    state->written, state->nr_written,
+			    &state->pack_idx_opts, sha1);
+	for (i = 0; i < state->nr_written; i++)
+		free(state->written[i]);
+
+clear_exit:
+	free(state->written);
+	memset(state, 0, sizeof(*state));
+
+	/* Make objects we just wrote available to ourselves */
+	reprepare_packed_git();
+}
+
+static int already_written(struct bulk_checkin_state *state, unsigned char sha1[])
+{
+	int i;
+
+	/* The object may already exist in the repository */
+	if (has_sha1_file(sha1))
+		return 1;
+
+	/* Might want to keep the list sorted */
+	for (i = 0; i < state->nr_written; i++)
+		if (!hashcmp(state->written[i]->sha1, sha1))
+			return 1;
+
+	/* This is a new object we need to keep */
+	return 0;
+}
+
+/*
+ * Read the contents from fd for size bytes, streaming it to the
+ * packfile in state while updating the hash in ctx. Signal a failure
+ * by returning a negative value when the resulting pack would exceed
+ * the pack size limit and this is not the first object in the pack,
+ * so that the caller can discard what we wrote from the current pack
+ * by truncating it and opening a new one. The caller will then call
+ * us again after rewinding the input fd.
+ *
+ * The already_hashed_to pointer is kept untouched by the caller to
+ * make sure we do not hash the same byte when we are called
+ * again. This way, the caller does not have to checkpoint its hash
+ * status before calling us just in case we ask it to call us again
+ * with a new pack.
+ */
+static int stream_to_pack(struct bulk_checkin_state *state,
+			  git_SHA_CTX *ctx, off_t *already_hashed_to,
+			  int fd, size_t size, enum object_type type,
+			  const char *path, unsigned flags)
+{
+	git_zstream s;
+	unsigned char obuf[16384];
+	unsigned hdrlen;
+	int status = Z_OK;
+	int write_object = (flags & HASH_WRITE_OBJECT);
+	off_t offset = 0;
+
+	memset(&s, 0, sizeof(s));
+	git_deflate_init(&s, pack_compression_level);
+
+	hdrlen = encode_in_pack_object_header(type, size, obuf);
+	s.next_out = obuf + hdrlen;
+	s.avail_out = sizeof(obuf) - hdrlen;
+
+	while (status != Z_STREAM_END) {
+		unsigned char ibuf[16384];
+
+		if (size && !s.avail_in) {
+			ssize_t rsize = size < sizeof(ibuf) ? size : sizeof(ibuf);
+			if (xread(fd, ibuf, rsize) != rsize)
+				die("failed to read %d bytes from '%s'",
+				    (int)rsize, path);
+			offset += rsize;
+			if (*already_hashed_to < offset) {
+				size_t hsize = offset - *already_hashed_to;
+				if (rsize < hsize)
+					hsize = rsize;
+				if (hsize)
+					git_SHA1_Update(ctx, ibuf, hsize);
+				*already_hashed_to = offset;
+			}
+			s.next_in = ibuf;
+			s.avail_in = rsize;
+			size -= rsize;
+		}
+
+		status = git_deflate(&s, size ? 0 : Z_FINISH);
+
+		if (!s.avail_out || status == Z_STREAM_END) {
+			if (write_object) {
+				size_t written = s.next_out - obuf;
+
+				/* would we bust the size limit? */
+				if (state->nr_written &&
+				    pack_size_limit_cfg &&
+				    pack_size_limit_cfg < state->offset + written) {
+					git_deflate_abort(&s);
+					return -1;
+				}
+
+				sha1write(state->f, obuf, written);
+				state->offset += written;
+			}
+			s.next_out = obuf;
+			s.avail_out = sizeof(obuf);
+		}
+
+		switch (status) {
+		case Z_OK:
+		case Z_BUF_ERROR:
+		case Z_STREAM_END:
+			continue;
+		default:
+			die("unexpected deflate failure: %d", status);
+		}
+	}
+	git_deflate_end(&s);
+	return 0;
+}
+
+/* Lazily create backing packfile for the state */
+static void prepare_to_stream(struct bulk_checkin_state *state,
+			      unsigned flags)
+{
+	if (!(flags & HASH_WRITE_OBJECT) || state->f)
+		return;
+
+	state->f = create_tmp_packfile(&state->pack_tmp_name);
+	reset_pack_idx_option(&state->pack_idx_opts);
+
+	/* Pretend we are going to write only one object */
+	state->offset = write_pack_header(state->f, 1);
+	if (!state->offset)
+		die_errno("unable to write pack header");
+}
+
+static int deflate_to_pack(struct bulk_checkin_state *state,
+			   unsigned char result_sha1[],
+			   int fd, size_t size,
+			   enum object_type type, const char *path,
+			   unsigned flags)
+{
+	off_t seekback, already_hashed_to;
+	git_SHA_CTX ctx;
+	unsigned char obuf[16384];
+	unsigned header_len;
+	struct sha1file_checkpoint checkpoint;
+	struct pack_idx_entry *idx = NULL;
+
+	seekback = lseek(fd, 0, SEEK_CUR);
+	if (seekback == (off_t) -1)
+		return error("cannot find the current offset");
+
+	header_len = sprintf((char *)obuf, "%s %" PRIuMAX,
+			     typename(type), (uintmax_t)size) + 1;
+	git_SHA1_Init(&ctx);
+	git_SHA1_Update(&ctx, obuf, header_len);
+
+	/* Note: idx is non-NULL when we are writing */
+	if ((flags & HASH_WRITE_OBJECT) != 0)
+		idx = xcalloc(1, sizeof(*idx));
+
+	already_hashed_to = seekback;
+
+	while (1) {
+		prepare_to_stream(state, flags);
+		if (idx) {
+			sha1file_checkpoint(state->f, &checkpoint);
+			idx->offset = state->offset;
+			crc32_begin(state->f);
+		}
+		if (!stream_to_pack(state, &ctx, &already_hashed_to,
+				    fd, size, type, path, flags))
+			break;
+		/*
+		 * Writing this object to the current pack will make
+		 * it too big; we need to truncate it, start a new
+		 * pack, and write into it.
+		 */
+		if (!idx)
+			die("BUG: should not happen");
+		sha1file_truncate(state->f, &checkpoint);
+		state->offset = checkpoint.offset;
+		finish_bulk_checkin(state);
+		if (lseek(fd, seekback, SEEK_SET) == (off_t) -1)
+			return error("cannot seek back");
+	}
+	git_SHA1_Final(result_sha1, &ctx);
+	if (!idx)
+		return 0;
+
+	if (already_written(state, result_sha1)) {
+		sha1file_truncate(state->f, &checkpoint);
+		state->offset = checkpoint.offset;
+		free(idx);
+	} else {
+		idx->crc32 = crc32_end(state->f);
+		hashcpy(idx->sha1, result_sha1);
+		ALLOC_GROW(state->written,
+			   state->nr_written + 1,
+			   state->alloc_written);
+		state->written[state->nr_written++] = idx;
+	}
+	return 0;
+}
+
+int index_bulk_checkin(unsigned char *sha1,
+		       int fd, size_t size, enum object_type type,
+		       const char *path, unsigned flags)
+{
+	int status = deflate_to_pack(&state, sha1, fd, size, type,
+				     path, flags);
+	if (!state.plugged)
+		finish_bulk_checkin(&state);
+	return status;
+}
+
+void plug_bulk_checkin(void)
+{
+	state.plugged = 1;
+}
+
+void unplug_bulk_checkin(void)
+{
+	state.plugged = 0;
+	if (state.f)
+		finish_bulk_checkin(&state);
+}
diff --git a/bulk-checkin.h b/bulk-checkin.h
new file mode 100644
index 0000000..4f599f8
--- /dev/null
+++ b/bulk-checkin.h
@@ -0,0 +1,16 @@
+/*
+ * Copyright (c) 2011, Google Inc.
+ */
+#ifndef BULK_CHECKIN_H
+#define BULK_CHECKIN_H
+
+#include "cache.h"
+
+extern int index_bulk_checkin(unsigned char sha1[],
+			      int fd, size_t size, enum object_type type,
+			      const char *path, unsigned flags);
+
+extern void plug_bulk_checkin(void);
+extern void unplug_bulk_checkin(void);
+
+#endif
diff --git a/cache.h b/cache.h
index 2e6ad36..4f20861 100644
--- a/cache.h
+++ b/cache.h
@@ -35,6 +35,7 @@ int git_inflate(git_zstream *, int flush);
 void git_deflate_init(git_zstream *, int level);
 void git_deflate_init_gzip(git_zstream *, int level);
 void git_deflate_end(git_zstream *);
+int git_deflate_abort(git_zstream *);
 int git_deflate_end_gently(git_zstream *);
 int git_deflate(git_zstream *, int flush);
 unsigned long git_deflate_bound(git_zstream *, unsigned long);
@@ -598,6 +599,7 @@ extern size_t packed_git_window_size;
 extern size_t packed_git_limit;
 extern size_t delta_base_cache_limit;
 extern unsigned long big_file_threshold;
+extern unsigned long pack_size_limit_cfg;
 extern int read_replace_refs;
 extern int fsync_object_files;
 extern int core_preload_index;
diff --git a/config.c b/config.c
index edf9914..c736802 100644
--- a/config.c
+++ b/config.c
@@ -797,6 +797,10 @@ int git_default_config(const char *var, const char *value, void *dummy)
 		return 0;
 	}
 
+	if (!strcmp(var, "pack.packsizelimit")) {
+		pack_size_limit_cfg = git_config_ulong(var, value);
+		return 0;
+	}
 	/* Add other config variables here and to Documentation/config.txt. */
 	return 0;
 }
diff --git a/environment.c b/environment.c
index 0bee6a7..31e4284 100644
--- a/environment.c
+++ b/environment.c
@@ -60,6 +60,7 @@ char *notes_ref_name;
 int grafts_replace_parents = 1;
 int core_apply_sparse_checkout;
 struct startup_info *startup_info;
+unsigned long pack_size_limit_cfg;
 
 /* Parallel index stat data preload? */
 int core_preload_index = 0;
diff --git a/sha1_file.c b/sha1_file.c
index 27f3b9b..c96e366 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -18,6 +18,7 @@
 #include "refs.h"
 #include "pack-revindex.h"
 #include "sha1-lookup.h"
+#include "bulk-checkin.h"
 
 #ifndef O_NOATIME
 #if defined(__linux__) && (defined(__i386__) || defined(__PPC__))
@@ -2679,10 +2680,8 @@ static int index_core(unsigned char *sha1, int fd, size_t size,
 }
 
 /*
- * This creates one packfile per large blob, because the caller
- * immediately wants the result sha1, and fast-import can report the
- * object name via marks mechanism only by closing the created
- * packfile.
+ * This creates one packfile per large blob unless bulk-checkin
+ * machinery is "plugged".
  *
  * This also bypasses the usual "convert-to-git" dance, and that is on
  * purpose. We could write a streaming version of the converting
@@ -2696,65 +2695,7 @@ static int index_stream(unsigned char *sha1, int fd, size_t size,
 			enum object_type type, const char *path,
 			unsigned flags)
 {
-	struct child_process fast_import;
-	char export_marks[512];
-	const char *argv[] = { "fast-import", "--quiet", export_marks, NULL };
-	char tmpfile[512];
-	char fast_import_cmd[512];
-	char buf[512];
-	int len, tmpfd;
-
-	strcpy(tmpfile, git_path("hashstream_XXXXXX"));
-	tmpfd = git_mkstemp_mode(tmpfile, 0600);
-	if (tmpfd < 0)
-		die_errno("cannot create tempfile: %s", tmpfile);
-	if (close(tmpfd))
-		die_errno("cannot close tempfile: %s", tmpfile);
-	sprintf(export_marks, "--export-marks=%s", tmpfile);
-
-	memset(&fast_import, 0, sizeof(fast_import));
-	fast_import.in = -1;
-	fast_import.argv = argv;
-	fast_import.git_cmd = 1;
-	if (start_command(&fast_import))
-		die_errno("index-stream: git fast-import failed");
-
-	len = sprintf(fast_import_cmd, "blob\nmark :1\ndata %lu\n",
-		      (unsigned long) size);
-	write_or_whine(fast_import.in, fast_import_cmd, len,
-		       "index-stream: feeding fast-import");
-	while (size) {
-		char buf[10240];
-		size_t sz = size < sizeof(buf) ? size : sizeof(buf);
-		ssize_t actual;
-
-		actual = read_in_full(fd, buf, sz);
-		if (actual < 0)
-			die_errno("index-stream: reading input");
-		if (write_in_full(fast_import.in, buf, actual) != actual)
-			die_errno("index-stream: feeding fast-import");
-		size -= actual;
-	}
-	if (close(fast_import.in))
-		die_errno("index-stream: closing fast-import");
-	if (finish_command(&fast_import))
-		die_errno("index-stream: finishing fast-import");
-
-	tmpfd = open(tmpfile, O_RDONLY);
-	if (tmpfd < 0)
-		die_errno("index-stream: cannot open fast-import mark");
-	len = read(tmpfd, buf, sizeof(buf));
-	if (len < 0)
-		die_errno("index-stream: reading fast-import mark");
-	if (close(tmpfd) < 0)
-		die_errno("index-stream: closing fast-import mark");
-	if (unlink(tmpfile))
-		die_errno("index-stream: unlinking fast-import mark");
-	if (len != 44 ||
-	    memcmp(":1 ", buf, 3) ||
-	    get_sha1_hex(buf + 3, sha1))
-		die_errno("index-stream: unexpected fast-import mark: <%s>", buf);
-	return 0;
+	return index_bulk_checkin(sha1, fd, size, type, path, flags);
 }
 
 int index_fd(unsigned char *sha1, int fd, struct stat *st,
diff --git a/t/t1050-large.sh b/t/t1050-large.sh
index deba111..29d6024 100755
--- a/t/t1050-large.sh
+++ b/t/t1050-large.sh
@@ -7,21 +7,97 @@ test_description='adding and checking out large blobs'
 
 test_expect_success setup '
 	git config core.bigfilethreshold 200k &&
-	echo X | dd of=large bs=1k seek=2000
+	echo X | dd of=large1 bs=1k seek=2000 &&
+	echo X | dd of=large2 bs=1k seek=2000 &&
+	echo X | dd of=large3 bs=1k seek=2000 &&
+	echo Y | dd of=huge bs=1k seek=2500
 '
 
-test_expect_success 'add a large file' '
-	git add large &&
-	# make sure we got a packfile and no loose objects
-	test -f .git/objects/pack/pack-*.pack &&
-	test ! -f .git/objects/??/??????????????????????????????????????
+test_expect_success 'add a large file or two' '
+	git add large1 huge large2 &&
+	# make sure we got a single packfile and no loose objects
+	bad= count=0 idx= &&
+	for p in .git/objects/pack/pack-*.pack
+	do
+		count=$(( $count + 1 ))
+		if test -f "$p" && idx=${p%.pack}.idx && test -f "$idx"
+		then
+			continue
+		fi
+		bad=t
+	done &&
+	test -z "$bad" &&
+	test $count = 1 &&
+	cnt=$(git show-index <"$idx" | wc -l) &&
+	test $cnt = 2 &&
+	for l in .git/objects/??/??????????????????????????????????????
+	do
+		test -f "$l" || continue
+		bad=t
+	done &&
+	test -z "$bad" &&
+
+	# attempt to add another copy of the same
+	git add large3 &&
+	bad= count=0 &&
+	for p in .git/objects/pack/pack-*.pack
+	do
+		count=$(( $count + 1 ))
+		if test -f "$p" && idx=${p%.pack}.idx && test -f "$idx"
+		then
+			continue
+		fi
+		bad=t
+	done &&
+	test -z "$bad" &&
+	test $count = 1
 '
 
 test_expect_success 'checkout a large file' '
-	large=$(git rev-parse :large) &&
-	git update-index --add --cacheinfo 100644 $large another &&
+	large1=$(git rev-parse :large1) &&
+	git update-index --add --cacheinfo 100644 $large1 another &&
 	git checkout another &&
-	cmp large another ;# this must not be test_cmp
+	cmp large1 another ;# this must not be test_cmp
+'
+
+test_expect_success 'packsize limit' '
+	test_create_repo mid &&
+	(
+		cd mid &&
+		git config core.bigfilethreshold 64k &&
+		git config pack.packsizelimit 256k &&
+
+		# mid1 and mid2 will fit within 256k limit but
+		# appending mid3 will bust the limit and will
+		# result in a separate packfile.
+		test-genrandom "a" $(( 66 * 1024 )) >mid1 &&
+		test-genrandom "b" $(( 80 * 1024 )) >mid2 &&
+		test-genrandom "c" $(( 128 * 1024 )) >mid3 &&
+		git add mid1 mid2 mid3 &&
+
+		count=0
+		for pi in .git/objects/pack/pack-*.idx
+		do
+			test -f "$pi" && count=$(( $count + 1 ))
+		done &&
+		test $count = 2 &&
+
+		(
+			git hash-object --stdin <mid1
+			git hash-object --stdin <mid2
+			git hash-object --stdin <mid3
+		) |
+		sort >expect &&
+
+		for pi in .git/objects/pack/pack-*.idx
+		do
+			git show-index <"$pi"
+		done |
+		sed -e "s/^[0-9]* \([0-9a-f]*\) .*/\1/" |
+		sort >actual &&
+
+		test_cmp expect actual
+	)
 '
 
 test_done
diff --git a/zlib.c b/zlib.c
index 3c63d48..2b2c0c7 100644
--- a/zlib.c
+++ b/zlib.c
@@ -188,13 +188,20 @@ void git_deflate_init_gzip(git_zstream *strm, int level)
 	    strm->z.msg ? strm->z.msg : "no message");
 }
 
-void git_deflate_end(git_zstream *strm)
+int git_deflate_abort(git_zstream *strm)
 {
 	int status;
 
 	zlib_pre_call(strm);
 	status = deflateEnd(&strm->z);
 	zlib_post_call(strm);
+	return status;
+}
+
+void git_deflate_end(git_zstream *strm)
+{
+	int status = git_deflate_abort(strm);
+
 	if (status == Z_OK)
 		return;
 	error("deflateEnd: %s (%s)", zerr_to_string(status),
-- 
1.7.8.rc4.177.g4d64

^ permalink raw reply related

* [PATCH v2 4/5] csum-file: introduce sha1file_checkpoint
From: Junio C Hamano @ 2011-12-01  0:27 UTC (permalink / raw)
  To: git
In-Reply-To: <1322699263-14475-1-git-send-email-gitster@pobox.com>

It is useful to be able to rewind a check-summed file to a certain
previous state after writing data into it using sha1write() API. The
fast-import command does this after streaming a blob data to the packfile
being generated and then noticing that the same blob has already been
written, and it does this with a private code truncate_pack() that is
commented as "Yes, this is a layering violation".

Introduce two API functions, sha1file_checkpoint(), that allows the caller
to save a state of a sha1file, and then later revert it to the saved state.
Use it to reimplement truncate_pack().

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 csum-file.c   |   20 ++++++++++++++++++++
 csum-file.h   |    9 +++++++++
 fast-import.c |   25 ++++++++-----------------
 3 files changed, 37 insertions(+), 17 deletions(-)

diff --git a/csum-file.c b/csum-file.c
index fc97d6e..53f5375 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -158,6 +158,26 @@ struct sha1file *sha1fd_throughput(int fd, const char *name, struct progress *tp
 	return f;
 }
 
+void sha1file_checkpoint(struct sha1file *f, struct sha1file_checkpoint *checkpoint)
+{
+	sha1flush(f);
+	checkpoint->offset = f->total;
+	checkpoint->ctx = f->ctx;
+}
+
+int sha1file_truncate(struct sha1file *f, struct sha1file_checkpoint *checkpoint)
+{
+	off_t offset = checkpoint->offset;
+
+	if (ftruncate(f->fd, offset) ||
+	    lseek(f->fd, offset, SEEK_SET) != offset)
+		return -1;
+	f->total = offset;
+	f->ctx = checkpoint->ctx;
+	f->offset = 0; /* sha1flush() was called in checkpoint */
+	return 0;
+}
+
 void crc32_begin(struct sha1file *f)
 {
 	f->crc32 = crc32(0, NULL, 0);
diff --git a/csum-file.h b/csum-file.h
index 6a7967c..3b540bd 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -17,6 +17,15 @@ struct sha1file {
 	unsigned char buffer[8192];
 };
 
+/* Checkpoint */
+struct sha1file_checkpoint {
+	off_t offset;
+	git_SHA_CTX ctx;
+};
+
+extern void sha1file_checkpoint(struct sha1file *, struct sha1file_checkpoint *);
+extern int sha1file_truncate(struct sha1file *, struct sha1file_checkpoint *);
+
 /* sha1close flags */
 #define CSUM_CLOSE	1
 #define CSUM_FSYNC	2
diff --git a/fast-import.c b/fast-import.c
index 8d8ea3c..a8db41b 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -1143,17 +1143,11 @@ static int store_object(
 	return 0;
 }
 
-static void truncate_pack(off_t to, git_SHA_CTX *ctx)
+static void truncate_pack(struct sha1file_checkpoint *checkpoint)
 {
-	if (ftruncate(pack_data->pack_fd, to)
-	 || lseek(pack_data->pack_fd, to, SEEK_SET) != to)
+	if (sha1file_truncate(pack_file, checkpoint))
 		die_errno("cannot truncate pack to skip duplicate");
-	pack_size = to;
-
-	/* yes this is a layering violation */
-	pack_file->total = to;
-	pack_file->offset = 0;
-	pack_file->ctx = *ctx;
+	pack_size = checkpoint->offset;
 }
 
 static void stream_blob(uintmax_t len, unsigned char *sha1out, uintmax_t mark)
@@ -1166,8 +1160,8 @@ static void stream_blob(uintmax_t len, unsigned char *sha1out, uintmax_t mark)
 	unsigned long hdrlen;
 	off_t offset;
 	git_SHA_CTX c;
-	git_SHA_CTX pack_file_ctx;
 	git_zstream s;
+	struct sha1file_checkpoint checkpoint;
 	int status = Z_OK;
 
 	/* Determine if we should auto-checkpoint. */
@@ -1175,11 +1169,8 @@ static void stream_blob(uintmax_t len, unsigned char *sha1out, uintmax_t mark)
 		|| (pack_size + 60 + len) < pack_size)
 		cycle_packfile();
 
-	offset = pack_size;
-
-	/* preserve the pack_file SHA1 ctx in case we have to truncate later */
-	sha1flush(pack_file);
-	pack_file_ctx = pack_file->ctx;
+	sha1file_checkpoint(pack_file, &checkpoint);
+	offset = checkpoint.offset;
 
 	hdrlen = snprintf((char *)out_buf, out_sz, "blob %" PRIuMAX, len) + 1;
 	if (out_sz <= hdrlen)
@@ -1245,14 +1236,14 @@ static void stream_blob(uintmax_t len, unsigned char *sha1out, uintmax_t mark)
 
 	if (e->idx.offset) {
 		duplicate_count_by_type[OBJ_BLOB]++;
-		truncate_pack(offset, &pack_file_ctx);
+		truncate_pack(&checkpoint);
 
 	} else if (find_sha1_pack(sha1, packed_git)) {
 		e->type = OBJ_BLOB;
 		e->pack_id = MAX_PACK_ID;
 		e->idx.offset = 1; /* just not zero! */
 		duplicate_count_by_type[OBJ_BLOB]++;
-		truncate_pack(offset, &pack_file_ctx);
+		truncate_pack(&checkpoint);
 
 	} else {
 		e->depth = 0;
-- 
1.7.8.rc4.177.g4d64

^ permalink raw reply related

* [PATCH v2 3/5] finish_tmp_packfile(): a helper function
From: Junio C Hamano @ 2011-12-01  0:27 UTC (permalink / raw)
  To: git
In-Reply-To: <1322699263-14475-1-git-send-email-gitster@pobox.com>

Factor out a small logic out of the private write_pack_file() function
in builtin/pack-objects.c.

This changes the order of finishing multi-pack generation slightly. The
code used to

 - adjust shared perm of temporary packfile
 - rename temporary packfile to the final name
 - update mtime of the packfile under the final name
 - adjust shared perm of temporary idxfile
 - rename temporary idxfile to the final name

but because the helper does not want to do the mtime thing, the updated
code does that step first and then all the rest.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin/pack-objects.c |   33 ++++++++++-----------------------
 pack-write.c           |   31 +++++++++++++++++++++++++++++++
 pack.h                 |    1 +
 3 files changed, 42 insertions(+), 23 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 3258fa9..b458b6d 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -617,20 +617,8 @@ static void write_pack_file(void)
 
 		if (!pack_to_stdout) {
 			struct stat st;
-			const char *idx_tmp_name;
 			char tmpname[PATH_MAX];
 
-			idx_tmp_name = write_idx_file(NULL, written_list, nr_written,
-						      &pack_idx_opts, sha1);
-
-			snprintf(tmpname, sizeof(tmpname), "%s-%s.pack",
-				 base_name, sha1_to_hex(sha1));
-			free_pack_by_name(tmpname);
-			if (adjust_shared_perm(pack_tmp_name))
-				die_errno("unable to make temporary pack file readable");
-			if (rename(pack_tmp_name, tmpname))
-				die_errno("unable to rename temporary pack file");
-
 			/*
 			 * Packs are runtime accessed in their mtime
 			 * order since newer packs are more likely to contain
@@ -638,28 +626,27 @@ static void write_pack_file(void)
 			 * packs then we should modify the mtime of later ones
 			 * to preserve this property.
 			 */
-			if (stat(tmpname, &st) < 0) {
+			if (stat(pack_tmp_name, &st) < 0) {
 				warning("failed to stat %s: %s",
-					tmpname, strerror(errno));
+					pack_tmp_name, strerror(errno));
 			} else if (!last_mtime) {
 				last_mtime = st.st_mtime;
 			} else {
 				struct utimbuf utb;
 				utb.actime = st.st_atime;
 				utb.modtime = --last_mtime;
-				if (utime(tmpname, &utb) < 0)
+				if (utime(pack_tmp_name, &utb) < 0)
 					warning("failed utime() on %s: %s",
 						tmpname, strerror(errno));
 			}
 
-			snprintf(tmpname, sizeof(tmpname), "%s-%s.idx",
-				 base_name, sha1_to_hex(sha1));
-			if (adjust_shared_perm(idx_tmp_name))
-				die_errno("unable to make temporary index file readable");
-			if (rename(idx_tmp_name, tmpname))
-				die_errno("unable to rename temporary index file");
-
-			free((void *) idx_tmp_name);
+			/* Enough space for "-<sha-1>.pack"? */
+			if (sizeof(tmpname) <= strlen(base_name) + 50)
+				die("pack base name '%s' too long", base_name);
+			snprintf(tmpname, sizeof(tmpname), "%s-", base_name);
+			finish_tmp_packfile(tmpname, pack_tmp_name,
+					    written_list, nr_written,
+					    &pack_idx_opts, sha1);
 			free(pack_tmp_name);
 			puts(sha1_to_hex(sha1));
 		}
diff --git a/pack-write.c b/pack-write.c
index 863cce8..cadc3e1 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -338,3 +338,34 @@ struct sha1file *create_tmp_packfile(char **pack_tmp_name)
 	*pack_tmp_name = xstrdup(tmpname);
 	return sha1fd(fd, *pack_tmp_name);
 }
+
+void finish_tmp_packfile(char *name_buffer,
+			 const char *pack_tmp_name,
+			 struct pack_idx_entry **written_list,
+			 uint32_t nr_written,
+			 struct pack_idx_option *pack_idx_opts,
+			 unsigned char sha1[])
+{
+	const char *idx_tmp_name;
+	char *end_of_name_prefix = strrchr(name_buffer, 0);
+
+	if (adjust_shared_perm(pack_tmp_name))
+		die_errno("unable to make temporary pack file readable");
+
+	idx_tmp_name = write_idx_file(NULL, written_list, nr_written,
+				      pack_idx_opts, sha1);
+	if (adjust_shared_perm(idx_tmp_name))
+		die_errno("unable to make temporary index file readable");
+
+	sprintf(end_of_name_prefix, "%s.pack", sha1_to_hex(sha1));
+	free_pack_by_name(name_buffer);
+
+	if (rename(pack_tmp_name, name_buffer))
+		die_errno("unable to rename temporary pack file");
+
+	sprintf(end_of_name_prefix, "%s.idx", sha1_to_hex(sha1));
+	if (rename(idx_tmp_name, name_buffer))
+		die_errno("unable to rename temporary index file");
+
+	free((void *)idx_tmp_name);
+}
diff --git a/pack.h b/pack.h
index 0027ac6..cfb0f69 100644
--- a/pack.h
+++ b/pack.h
@@ -86,5 +86,6 @@ extern int encode_in_pack_object_header(enum object_type, uintmax_t, unsigned ch
 extern int read_pack_header(int fd, struct pack_header *);
 
 extern struct sha1file *create_tmp_packfile(char **pack_tmp_name);
+extern void finish_tmp_packfile(char *name_buffer, const char *pack_tmp_name, struct pack_idx_entry **written_list, uint32_t nr_written, struct pack_idx_option *pack_idx_opts, unsigned char sha1[]);
 
 #endif
-- 
1.7.8.rc4.177.g4d64

^ permalink raw reply related

* [PATCH v2 2/5] create_tmp_packfile(): a helper function
From: Junio C Hamano @ 2011-12-01  0:27 UTC (permalink / raw)
  To: git
In-Reply-To: <1322699263-14475-1-git-send-email-gitster@pobox.com>

Factor out a small logic out of the private write_pack_file() function
in builtin/pack-objects.c

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin/pack-objects.c |   12 +++---------
 pack-write.c           |   10 ++++++++++
 pack.h                 |    3 +++
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 6643c16..3258fa9 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -584,16 +584,10 @@ static void write_pack_file(void)
 		unsigned char sha1[20];
 		char *pack_tmp_name = NULL;
 
-		if (pack_to_stdout) {
+		if (pack_to_stdout)
 			f = sha1fd_throughput(1, "<stdout>", progress_state);
-		} else {
-			char tmpname[PATH_MAX];
-			int fd;
-			fd = odb_mkstemp(tmpname, sizeof(tmpname),
-					 "pack/tmp_pack_XXXXXX");
-			pack_tmp_name = xstrdup(tmpname);
-			f = sha1fd(fd, pack_tmp_name);
-		}
+		else
+			f = create_tmp_packfile(&pack_tmp_name);
 
 		offset = write_pack_header(f, nr_remaining);
 		if (!offset)
diff --git a/pack-write.c b/pack-write.c
index 46f3f84..863cce8 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -328,3 +328,13 @@ int encode_in_pack_object_header(enum object_type type, uintmax_t size, unsigned
 	*hdr = c;
 	return n;
 }
+
+struct sha1file *create_tmp_packfile(char **pack_tmp_name)
+{
+	char tmpname[PATH_MAX];
+	int fd;
+
+	fd = odb_mkstemp(tmpname, sizeof(tmpname), "pack/tmp_pack_XXXXXX");
+	*pack_tmp_name = xstrdup(tmpname);
+	return sha1fd(fd, *pack_tmp_name);
+}
diff --git a/pack.h b/pack.h
index d429d8a..0027ac6 100644
--- a/pack.h
+++ b/pack.h
@@ -84,4 +84,7 @@ extern int encode_in_pack_object_header(enum object_type, uintmax_t, unsigned ch
 #define PH_ERROR_PACK_SIGNATURE	(-2)
 #define PH_ERROR_PROTOCOL	(-3)
 extern int read_pack_header(int fd, struct pack_header *);
+
+extern struct sha1file *create_tmp_packfile(char **pack_tmp_name);
+
 #endif
-- 
1.7.8.rc4.177.g4d64

^ permalink raw reply related

* [PATCH v2 1/5] write_pack_header(): a helper function
From: Junio C Hamano @ 2011-12-01  0:27 UTC (permalink / raw)
  To: git
In-Reply-To: <1322699263-14475-1-git-send-email-gitster@pobox.com>

Factor out a small logic out of the private write_pack_file() function
in builtin/pack-objects.c

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin/pack-objects.c |    9 +++------
 pack-write.c           |   12 ++++++++++++
 pack.h                 |    2 ++
 3 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index ba3705d..6643c16 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -571,7 +571,6 @@ static void write_pack_file(void)
 	uint32_t i = 0, j;
 	struct sha1file *f;
 	off_t offset;
-	struct pack_header hdr;
 	uint32_t nr_remaining = nr_result;
 	time_t last_mtime = 0;
 	struct object_entry **write_order;
@@ -596,11 +595,9 @@ static void write_pack_file(void)
 			f = sha1fd(fd, pack_tmp_name);
 		}
 
-		hdr.hdr_signature = htonl(PACK_SIGNATURE);
-		hdr.hdr_version = htonl(PACK_VERSION);
-		hdr.hdr_entries = htonl(nr_remaining);
-		sha1write(f, &hdr, sizeof(hdr));
-		offset = sizeof(hdr);
+		offset = write_pack_header(f, nr_remaining);
+		if (!offset)
+			die_errno("unable to write pack header");
 		nr_written = 0;
 		for (; i < nr_objects; i++) {
 			struct object_entry *e = write_order[i];
diff --git a/pack-write.c b/pack-write.c
index 9cd3bfb..46f3f84 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -178,6 +178,18 @@ const char *write_idx_file(const char *index_name, struct pack_idx_entry **objec
 	return index_name;
 }
 
+off_t write_pack_header(struct sha1file *f, uint32_t nr_entries)
+{
+	struct pack_header hdr;
+
+	hdr.hdr_signature = htonl(PACK_SIGNATURE);
+	hdr.hdr_version = htonl(PACK_VERSION);
+	hdr.hdr_entries = htonl(nr_entries);
+	if (sha1write(f, &hdr, sizeof(hdr)))
+		return 0;
+	return sizeof(hdr);
+}
+
 /*
  * Update pack header with object_count and compute new SHA1 for pack data
  * associated to pack_fd, and write that SHA1 at the end.  That new SHA1
diff --git a/pack.h b/pack.h
index 722a54e..d429d8a 100644
--- a/pack.h
+++ b/pack.h
@@ -2,6 +2,7 @@
 #define PACK_H
 
 #include "object.h"
+#include "csum-file.h"
 
 /*
  * Packed object header
@@ -74,6 +75,7 @@ extern const char *write_idx_file(const char *index_name, struct pack_idx_entry
 extern int check_pack_crc(struct packed_git *p, struct pack_window **w_curs, off_t offset, off_t len, unsigned int nr);
 extern int verify_pack_index(struct packed_git *);
 extern int verify_pack(struct packed_git *);
+extern off_t write_pack_header(struct sha1file *f, uint32_t);
 extern void fixup_pack_header_footer(int, unsigned char *, const char *, uint32_t, unsigned char *, off_t);
 extern char *index_pack_lockfile(int fd);
 extern int encode_in_pack_object_header(enum object_type, uintmax_t, unsigned char *);
-- 
1.7.8.rc4.177.g4d64

^ permalink raw reply related

* [PATCH v2 0/5] Bulk Check-in
From: Junio C Hamano @ 2011-12-01  0:27 UTC (permalink / raw)
  To: git

This is a re-roll of the earlier bulk-check-in series. The only change is
that the last one is a bit re-structured to pay attention to the packsize
limit from the get-go and also not write objects that already exist in the
repository, instead of "oops, I forgot to do that, and here is a fix".

Junio C Hamano (5):
  write_pack_header(): a helper function
  create_tmp_packfile(): a helper function
  finish_tmp_packfile(): a helper function
  csum-file: introduce sha1file_checkpoint
  bulk-checkin: replace fast-import based implementation

 Makefile               |    2 +
 builtin/add.c          |    5 +
 builtin/pack-objects.c |   62 +++--------
 bulk-checkin.c         |  275 ++++++++++++++++++++++++++++++++++++++++++++++++
 bulk-checkin.h         |   16 +++
 cache.h                |    2 +
 config.c               |    4 +
 csum-file.c            |   20 ++++
 csum-file.h            |    9 ++
 environment.c          |    1 +
 fast-import.c          |   25 ++---
 pack-write.c           |   53 +++++++++
 pack.h                 |    6 +
 sha1_file.c            |   67 +-----------
 t/t1050-large.sh       |   94 +++++++++++++++--
 zlib.c                 |    9 ++-
 16 files changed, 516 insertions(+), 134 deletions(-)
 create mode 100644 bulk-checkin.c
 create mode 100644 bulk-checkin.h

-- 
1.7.8.rc4.177.g4d64

^ permalink raw reply

* mini-wierdness with "git am"
From: Aghiles @ 2011-11-30 23:12 UTC (permalink / raw)
  To: git list

Hello,

I am using : git version 1.7.0.3, in Mac OS X.

A slightly strange behavior occurs with "git am --abort". Here
is the sequence:

> git am 000*
error: ... : patch doesn not apply
(oops, forgot to reset)
> git reset --hard HEAD~1
> git am 000*
previous rebase directory ... still exists but mbox given.
(oops, forgot to abort)
> git am --abort

At this point, my repository is back to ORIG_HEAD!
Meaning that it reverted my preivous "git reset --hard HEAD~1".

Hope this is clear.

As usual, thanks for the great product.

-- aghiles

^ permalink raw reply

* Re: [PATCHv2 0/4] git-p4: small fixes to branches and labels; tests
From: Pete Wyckoff @ 2011-11-30 23:00 UTC (permalink / raw)
  To: Luke Diamand; +Cc: Vitor Antunes, git
In-Reply-To: <20111130225813.GA11544@arf.padd.com>

P.S.  Since you're respinning anyway to change the label code,
can you stick the 'branch with shell char' test from t9801 in
with t9803?  It feels more appropriate there than with the branch
tests.  And avoids collision with some Vitor code that will get
added eventually.

		-- Pete

^ permalink raw reply

* Re: [PATCHv2 0/4] git-p4: small fixes to branches and labels; tests
From: Pete Wyckoff @ 2011-11-30 22:58 UTC (permalink / raw)
  To: Luke Diamand; +Cc: Vitor Antunes, git
In-Reply-To: <4ED6809A.9020703@diamand.org>

luke@diamand.org wrote on Wed, 30 Nov 2011 19:14 +0000:
> On 30/11/11 14:55, Vitor Antunes wrote:
> >Luke Diamand<luke<at>  diamand.org>  writes:
> >
> >>In adding the test case for labels I also found and fixed a few other
> >>small bugs in the label handling:
> >>
> >>  - labels missing a description or "EOT" in their text cause problems;
> >>  - labels without an owner cause problems.
> >>
> >>I also noticed, but did not fix, that you can't have more than one label
> >>per commit (the others are silently dropped) and the documentation for
> >>branch import could be improved. I've added a (failing) test case for
> >>the multiple label problem.

I was hanging onto your v1, and made a comment on the v1's 3/4
that perhaps you missed.  Also acked the entire thing.  I can
resend if my mailer ate it.

Don't expect Junio to pick it up until after 1.7.8 goes out.

> >Hi Luke,
> >
> >Seeing that you have some experience using labels, could I kindly ask you to
> >include some description of it in git-p4.txt?
> 
> OK, if you can help me understand what's going on...
> 
> The label-detection bug that I've described, on further
> investigation, looks to be a fundamental limitation.
> 
> With perforce, I can check out the head revision, and then tag just
> a single file. If I then check out on that tag, I get just that one
> file.
> 
> I think I can't do that with git; certainly fast-import can't do it.

This is another fundamental disconnect between p4 and git.
Reading

http://www.perforce.com/perforce/doc.current/manuals/p4guide/07_labels.html

it is clear that labels are supposed to be used exactly where
tags cannot:  to specify a collection of files as they existed
at _different_ points in the commit history.

Thus I think supporting labels is kind of pointless.  But in the
restricted use case that perforce docs tell us not to do, namely
using labels to identify change numbers, git can reflect that
with tags.

> So the code in git-p4 that is checking the file vs label counts
> (git-p4 around line 1496) is actually trying to say "this label
> can't be imported into git".
> 
> If my understanding is correct, I can then fix my test and update
> the docs and the code to explain this.

Yeah.  It's just a big restriction on how labels get imported.
A better error message and some docs would be useful.

> As an aside, git-p4.txt currently has quite good information on the
> config variables, but nothing on the command line variables.
> Possibly that should be fixed.

Recently, I wrote an asciidoc-style document for git-p4, and
tried to find all the options on all the commands.  There's a lot
more than I ever knew about.  :)  I'll take another pass
through it then send it out for review.  Maybe we can get rid
of the old git-p4.txt then and work on improving a more
structured document.

		-- Pete

^ permalink raw reply

* Re: [PATCH 5/3] revert: introduce --abort to cancel a failed cherry-pick
From: Junio C Hamano @ 2011-11-30 22:52 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Ramkumar Ramachandra, git, Christian Couder,
	Martin von Zweigbergk, Phil Hord, Jay Soffian
In-Reply-To: <20111122112001.GF7399@elie.hsd1.il.comcast.net>

Jonathan Nieder <jrnieder@gmail.com> writes:

> After running some ill-advised command like "git cherry-pick
> HEAD..linux-next", the bewildered novice may want to return to more
> familiar territory.  Introduce a "git cherry-pick --abort" command
> that rolls back the entire cherry-pick sequence and places the
> repository back on solid ground.

This is confusing; if you have many commits in the range, and a handful of
them replayed without conflicts and then you hit a conflict, where should
(I am not asking "where does ... with your patch") abort take us? The
state after the random commit that happened to have replayed successfully?
The state before the entire cherry-pick sequence started?  "back on solid
ground" does not tell us which one you meant.

I am assuming that it is the latter.

> diff --git a/Documentation/sequencer.txt b/Documentation/sequencer.txt
> index 75f8e869..5747f442 100644
> --- a/Documentation/sequencer.txt
> +++ b/Documentation/sequencer.txt
> @@ -7,3 +7,6 @@
>  	Forget about the current operation in progress.  Can be used
>  	to clear the sequencer state after a failed cherry-pick or
>  	revert.
> +
> +--abort::
> +	Cancel the operation and return to the pre-sequence state.

Ok, it is the latter.

> +static int reset_for_rollback(const unsigned char *sha1)
> +{
> +	const char *argv[4];	/* reset --merge <arg> + NULL */
> +	argv[0] = "reset";
> +	argv[1] = "--merge";
> +	argv[2] = sha1_to_hex(sha1);
> +	argv[3] = NULL;
> +	return run_command_v_opt(argv, RUN_GIT_CMD);
> +}

So you give the value of the HEAD before the sequence started to this
function and all should go well. Where do you read that value from?

> +static int rollback_single_pick(void)
> +{
> +	unsigned char head_sha1[20];
> +
> +	if (!file_exists(git_path("CHERRY_PICK_HEAD")) &&
> +	    !file_exists(git_path("REVERT_HEAD")))
> +		return error(_("no cherry-pick or revert in progress"));
> +	if (!resolve_ref("HEAD", head_sha1, 0, NULL))
> +		return error(_("cannot resolve HEAD"));
> +	if (is_null_sha1(head_sha1))
> +		return error(_("cannot abort from a branch yet to be born"));

Ok, this is for single-pick so HEAD is where we came from. Good.

> +	return reset_for_rollback(head_sha1);
> +}
> +
> +static int sequencer_rollback(struct replay_opts *opts)
> +{
> +	const char *filename;
> +	FILE *f;
> +	unsigned char sha1[20];
> +	struct strbuf buf = STRBUF_INIT;
> +
> +	filename = git_path(SEQ_HEAD_FILE);
> +	f = fopen(filename, "r");
> +	if (!f && errno == ENOENT) {
> +		/*
> +		 * There is no multiple-cherry-pick in progress.
> +		 * If CHERRY_PICK_HEAD or REVERT_HEAD indicates
> +		 * a single-cherry-pick in progress, abort that.
> +		 */
> +		return rollback_single_pick();
> +	}
> +	if (!f)
> +		return error(_("cannot open %s: %s"), filename,
> +						strerror(errno));
> +	if (strbuf_getline(&buf, f, '\n')) {
> +		error(_("cannot read %s: %s"), filename, ferror(f) ?
> +			strerror(errno) : _("unexpected end of file"));
> +		goto fail;
> +	}

And when we are in multi-pick, SEQ_HEAD_FILE has it.

Looks good from a cursory review. Thanks.

^ permalink raw reply

* Re: BUG: "--work-tree blah" does not imply "--git-dir blah/.git" or fix misleading error message
From: Junio C Hamano @ 2011-11-30 20:45 UTC (permalink / raw)
  To: Carlos Martín Nieto; +Cc: John Twilley, git
In-Reply-To: <20111130194233.GD12096@centaur.lab.cmartin.tk>

Carlos Martín Nieto <carlos@cmartin.tk> writes:

> On Wed, Nov 30, 2011 at 11:21:29AM -0800, Junio C Hamano wrote:
>> subdirectory "blah/.git"), but I do not think this is likely to change, as
>> I suspect that people and scripts are relying on the current behaviour to
>> be able to do something like this:
>> 
>>     cd /pub/scm/git/git.git ;# this is a bare repository
>>     mkdir /var/tmp/git
>>     git --work-tree=/var/tmp/git checkout
>
> This is in fact the way that many (or from what I can see the most
> popular) tutorials for abusing git as a deployment system tell you to
> run it (though more often than not setting GIT_WORKTREE in the
> environment).

Heh, *ab*using is a good description if they really mean to use it as a
deployment system. For one thing it won't do anything a proper deployment
should do if the target directory is not empty ;-)

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox