Git development

Git development
 help / color / mirror / Atom feed

* git, monorepos, and access control
From: Coiner, John @ 2018-12-05 20:13 UTC (permalink / raw)
  To: git@vger.kernel.org

Hi,

I'm an engineer with AMD. I'm looking at whether we could switch our 
internal version control to a monorepo, possibly one based on git and 
VFSForGit.

One obstacle to moving AMD to git/VFSForGit is the lack of access 
control support in git. AMD has a lot of data whose distribution must be 
limited. Sometimes it's a legal requirement, eg. CPU core designs are 
covered by US export control laws and not all employees may see them. 
Sometimes it's a contractual obligation, as when a third party shares 
data with us and we agree only to share this data with certain 
employees. Any hypothetical AMD monorepo should be able to securely deny 
read access in certain subtrees to users without required permissions.

Has anyone looked at adding access control to git, at a per-directory 
granularity? Is this a feature that the git community would possibly 
welcome?

Here's my rough thinking about how it might work:
  - an administrator can designate that a tree object requires zero or 
more named privileges to read
  - when a mortal user attempts to retrieve the tree object, a hook 
allows the server to check if the user has a given privilege. The hook 
can query an arbitrary user/group data base, LDAP or whatever. The 
details of this check are mostly in the hook; git only knows about 
abstract named privileges.
  - if the user has permission, everything goes as normal.
  - if the user lacks permission, they get a DeniedTree object which 
might carry some metadata about what permissions would be needed to see 
more. The DeniedTree lacks the real tree's entries. (TBD, how do we 
render a denied tree in the workspace? An un-writable directory 
containing only a GITDENIED file with some user friendly error message?)
  - hashes are secret. If the hashes from a protected tree leak, the 
data also leaks. No check on the server prevents it from handing out 
contents for correctly-guessed hashes.
  - mortal users shouldn't be able to alter permissions. Of course, 
mortal users will often modify tree objects that carry permissions. So 
the server should enforce that a user isn't pushing updates that alter 
permissions on the same logical directory.

I would welcome your feedback on whether this idea makes technical 
sense, and whether the feature could ever be a fit for git.

You might ask what alternatives we are looking at. At our scale, we'd 
really want a version control system that implements a virtual 
filesystem. That already limits us to ClearCase, VFSForGit, and maybe 
Vesta among public ones.  Am I missing any? We would also want one that 
permits branching enormous numbers of files without creating enormous 
amounts of data in the repo -- git gets that right, and perforce (our 
status quo) does not. That's how I got onto the idea of adding read 
authorization to git.

Thanks,

John

^ permalink raw reply

* Re: [PATCH 8/9] sha1-file: use loose object cache for quick existence check
From: Jeff King @ 2018-12-05 20:17 UTC (permalink / raw)
  To: René Scharfe
  Cc: Ævar Arnfjörð Bjarmason, Geert Jansen,
	Junio C Hamano, git@vger.kernel.org, Takuto Ikuta
In-Reply-To: <48ab861b-46c8-2bd5-0508-fa106efd8e56@web.de>

On Wed, Dec 05, 2018 at 07:41:44PM +0100, René Scharfe wrote:

> > If we start to convert those, there's a
> > little bit of a rabbit hole, but it's actually not too bad.
> 
> You don't need to crawl in just for quick_has_loose(), but eventually
> everything has to be converted.  It seems a bit much for one patch, but
> perhaps that's just my ever-decreasing attention span speaking.

Yeah, my normal process here is to dig to the bottom of the rabbit hole,
and then break it into actual patches. I just shared the middle state
here. ;)

I suspect the http bits could be split off into their own thing. The
bits in sha1-file.c I'd plan to mostly do all together, as they are
a family of related functions.

Mostly I wasn't sure how to wrap this up with the other changes. It's
obviously pretty invasive, and I don't want it to get in the way of
actual functional changes we've been discussing.

> > @@ -879,15 +879,15 @@ int git_open_cloexec(const char *name, int flags)
> >   * Note that it may point to static storage and is only valid until another
> >   * call to stat_sha1_file().
> >   */
> > -static int stat_sha1_file(struct repository *r, const unsigned char *sha1,
> > -			  struct stat *st, const char **path)
> > +static int stat_loose_object(struct repository *r, const struct object_id *oid,
> > +			     struct stat *st, const char **path)
> 
> Hmm, read_sha1_file() was renamed to read_object_file() earlier this
> year, and I'd have expected this to become stat_object_file().  Names..

read_object_file() is about reading an object from _any_ source. These
are specifically about loose objects, and I think that distinction is
important (both here and for map_loose_object, etc).

I'd actually argue that read_object_file() should just be read_object(),
but that already exists. Sigh. (I think it's fixable, but obviously
orthogonal to this topic).

> Anyway, the comment above and one a few lines below should be updated
> as well.

Thanks, fixed.

> >  static int open_sha1_file(struct repository *r,
> > -			  const unsigned char *sha1, const char **path)
> > +			  const struct object_id *oid, const char **path)
> 
> That function should lose the "sha1" in its name as well.

Yep, fixed.

> > -static void *unpack_sha1_rest(git_zstream *stream, void *buffer, unsigned long size, const unsigned char *sha1)
> > +static void *unpack_loose_rest(git_zstream *stream,
> > +			       void *buffer, unsigned long size,
> > +			       const struct object_id *oid)
> 
> Hmm, both old and new name here look weird to me at this point.

It makes more sense in the pairing of unpack_sha1_header() and
unpack_sha1_rest(). Maybe "body" would be better than "rest".

At any rate, it probably makes sense to rename them together (but I
didn't touch the "header" one here). Maybe the name changes should come
as a separate patch. I was mostly changing them here because I was
changing the signatures anyway, and had to touch all of the callers.

-Peff

^ permalink raw reply

* Re: git, monorepos, and access control
From: Ævar Arnfjörð Bjarmason @ 2018-12-05 20:34 UTC (permalink / raw)
  To: Coiner, John; +Cc: git@vger.kernel.org
In-Reply-To: <939efd87-b2af-29d7-efdd-9cf8f6de9d10@amd.com>

On Wed, Dec 05 2018, Coiner, John wrote:

> I'm an engineer with AMD. I'm looking at whether we could switch our
> internal version control to a monorepo, possibly one based on git and
> VFSForGit.
>
> One obstacle to moving AMD to git/VFSForGit is the lack of access
> control support in git. AMD has a lot of data whose distribution must be
> limited. Sometimes it's a legal requirement, eg. CPU core designs are
> covered by US export control laws and not all employees may see them.
> Sometimes it's a contractual obligation, as when a third party shares
> data with us and we agree only to share this data with certain
> employees. Any hypothetical AMD monorepo should be able to securely deny
> read access in certain subtrees to users without required permissions.
>
> Has anyone looked at adding access control to git, at a per-directory
> granularity? Is this a feature that the git community would possibly
> welcome?
>
> Here's my rough thinking about how it might work:
>   - an administrator can designate that a tree object requires zero or
> more named privileges to read
>   - when a mortal user attempts to retrieve the tree object, a hook
> allows the server to check if the user has a given privilege. The hook
> can query an arbitrary user/group data base, LDAP or whatever. The
> details of this check are mostly in the hook; git only knows about
> abstract named privileges.
>   - if the user has permission, everything goes as normal.
>   - if the user lacks permission, they get a DeniedTree object which
> might carry some metadata about what permissions would be needed to see
> more. The DeniedTree lacks the real tree's entries. (TBD, how do we
> render a denied tree in the workspace? An un-writable directory
> containing only a GITDENIED file with some user friendly error message?)
>   - hashes are secret. If the hashes from a protected tree leak, the
> data also leaks. No check on the server prevents it from handing out
> contents for correctly-guessed hashes.
>   - mortal users shouldn't be able to alter permissions. Of course,
> mortal users will often modify tree objects that carry permissions. So
> the server should enforce that a user isn't pushing updates that alter
> permissions on the same logical directory.
>
> I would welcome your feedback on whether this idea makes technical
> sense, and whether the feature could ever be a fit for git.
>
> You might ask what alternatives we are looking at. At our scale, we'd
> really want a version control system that implements a virtual
> filesystem. That already limits us to ClearCase, VFSForGit, and maybe
> Vesta among public ones.  Am I missing any? We would also want one that
> permits branching enormous numbers of files without creating enormous
> amounts of data in the repo -- git gets that right, and perforce (our
> status quo) does not. That's how I got onto the idea of adding read
> authorization to git.

All of what you've described is possible to implement in git, but as far
as I know there's no existing implementation of it.

Microsoft's GVFS probably comes closest, and they're actively
upstreaming bits of that, but as far as I know that doesn't in any way
try to solve this "users XYZ can't even list such-and-such a tree"
problem.

Google has much the same problem with Android, i.e. there's some parts
of the giant checkout of multiple repos the "repo" tool manages that
aren't available to some of Google's employees, or only to some partners
etc. They solve this by splitting those at repo boundaries, and are
working on moving that to submodules. Perhaps this would work in your
case? It's not as easy to work with, but could be made to work with
existing software.

In case you haven't, read "SECURITY" in the git-fetch(1) manpage,
although if that could be fixed to work for the case you're describing.

In a hypothetical implementation that does this, you are always going to
need to have something where you hand off the SHA-1 of the
"secret-stuff-here/" top-level tree object to clients that can't access
any of the sub-trees or blobs from that directory, since they're going
to need it to construct a commit object consisting of
"stuff-they-can-access/" plus the current state of "secret-stuff-here/"
to push upstream. But you can hide the blobs & trees under
"secret-stuff-here/" and their SHA-1s.

Note also that in the general case your "is this blob secret to user
xyz" or "is this tree secret to user xyz" will need to deal with there
happening to be some blob or tree in "secret-stuff-here/" that just so
happens to share the exact same tree/blob as something they need under
"stuff-they-can-access/". Think a directory that only contains a
".gitignore" file, or a BSD "LICENSE" file.

^ permalink raw reply

* Re: git, monorepos, and access control
From: Derrick Stolee @ 2018-12-05 20:43 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Coiner, John; +Cc: git@vger.kernel.org
In-Reply-To: <878t13zp8y.fsf@evledraar.gmail.com>

On 12/5/2018 3:34 PM, Ævar Arnfjörð Bjarmason wrote:
> On Wed, Dec 05 2018, Coiner, John wrote:
>
>> I'm an engineer with AMD. I'm looking at whether we could switch our
>> internal version control to a monorepo, possibly one based on git and
>> VFSForGit.
>>
>> Has anyone looked at adding access control to git, at a per-directory
>> granularity? Is this a feature that the git community would possibly
>> welcome?
> All of what you've described is possible to implement in git, but as far
> as I know there's no existing implementation of it.
>
> Microsoft's GVFS probably comes closest, and they're actively
> upstreaming bits of that, but as far as I know that doesn't in any way
> try to solve this "users XYZ can't even list such-and-such a tree"
> problem.
(Avar has a lot of good ideas in his message, so I'm just going to add
on a few here.)

This directory-level security is not a goal for VFS for Git, and I don't
see itbecoming a priority as it breaks a number of design decisions we
made in our object storage and communication models.

The best I can think about when considering Git as an approach would be
to use submodules for your security-related content, and then have server-
side security for access to those repos. Of course, submodules are not
supported in VFS for Git, either.

The Gerrit service has _branch_ level security, which is related to the
reachability questions that a directory security would need. However,
the problem is quite different. Gerrit does have a lot of experience in
dealing with submodules, though, so that's probably a good place to
start.

Thanks,
-Stolee

^ permalink raw reply

* Re: git, monorepos, and access control
From: Duy Nguyen @ 2018-12-05 20:58 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Ævar Arnfjörð Bjarmason, John.Coiner,
	Git Mailing List
In-Reply-To: <a5a3009e-346e-2b63-5b7c-3e9daf0c7de2@gmail.com>

On Wed, Dec 5, 2018 at 9:46 PM Derrick Stolee <stolee@gmail.com> wrote:
> This directory-level security is not a goal for VFS for Git, and I don't
> see itbecoming a priority as it breaks a number of design decisions we
> made in our object storage and communication models.
>
> The best I can think about when considering Git as an approach would be
> to use submodules for your security-related content, and then have server-
> side security for access to those repos. Of course, submodules are not
> supported in VFS for Git, either.

Another option is builtin per-blob encryption (maybe with just
clean/smudge filter), then access control will be about obtaining the
decryption key (*) and we don't break object storage and
communication. Of course pack delta compression becomes absolutely
useless. But that is perhaps an acceptable trade off.

(*) Git will not cache the key in any shape or form. Whenever it needs
to deflate an encrypted blob, it asks for the key from a separate
daemon. This guy handles all the access control.

> The Gerrit service has _branch_ level security, which is related to the
> reachability questions that a directory security would need. However,
> the problem is quite different. Gerrit does have a lot of experience in
> dealing with submodules, though, so that's probably a good place to
> start.
>
> Thanks,
> -Stolee
-- 
Duy

^ permalink raw reply

* Re: git, monorepos, and access control
From: Jeff King @ 2018-12-05 21:01 UTC (permalink / raw)
  To: Coiner, John; +Cc: git@vger.kernel.org
In-Reply-To: <939efd87-b2af-29d7-efdd-9cf8f6de9d10@amd.com>

On Wed, Dec 05, 2018 at 08:13:16PM +0000, Coiner, John wrote:

> One obstacle to moving AMD to git/VFSForGit is the lack of access 
> control support in git. AMD has a lot of data whose distribution must be 
> limited. Sometimes it's a legal requirement, eg. CPU core designs are 
> covered by US export control laws and not all employees may see them. 
> Sometimes it's a contractual obligation, as when a third party shares 
> data with us and we agree only to share this data with certain 
> employees. Any hypothetical AMD monorepo should be able to securely deny 
> read access in certain subtrees to users without required permissions.
> 
> Has anyone looked at adding access control to git, at a per-directory 
> granularity? Is this a feature that the git community would possibly 
> welcome?

In my opinion this feature is so contrary to Git's general assumptions
that it's likely to create a ton of information leaks of the supposedly
protected data.

For instance, Git is very eager to try to find delta-compression
opportunities between objects, even if they don't have any relationship
within the tree structure. So imagine I want to know the contents of
tree X. I push up a tree Y similar to X, then fetch it back, falsely
claiming to have X but not Y. If the server generates a delta, that may
reveal information about X (which you can then iterate to send Y', and
so on, treating the server as an oracle until you've guessed the content
of X).

You could work around that by teaching the server to refuse to use "X"
in any way when the client does not have the right permissions. But:

  - that's just one example; there are probably a number of such leaks

  - you're fighting an uphill battle against the way Git is implemented;
    there's no single point to enforce this kind of segmentation

The model that fits more naturally with how Git is implemented would be
to use submodules. There you leak the hash of the commit from the
private submodule, but that's probably obscure enough (and if you're
really worried, you can add a random nonce to the commit messages in the
submodule to make their hashes unguessable).

> I would welcome your feedback on whether this idea makes technical 
> sense, and whether the feature could ever be a fit for git.

Sorry I don't have a more positive response. What you want to do is
perfectly reasonable, but I just think it's a mismatch with how Git
works (and because of the security impact, one missed corner case
renders the whole thing useless).

-Peff

^ permalink raw reply

* Re: git, monorepos, and access control
From: Ævar Arnfjörð Bjarmason @ 2018-12-05 21:12 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Derrick Stolee, John.Coiner, Git Mailing List
In-Reply-To: <CACsJy8AzMvG3U5GnVkn0Ax3XP3NnPCwwc1AzdVV9JkePfMjwWg@mail.gmail.com>


On Wed, Dec 05 2018, Duy Nguyen wrote:

> On Wed, Dec 5, 2018 at 9:46 PM Derrick Stolee <stolee@gmail.com> wrote:
>> This directory-level security is not a goal for VFS for Git, and I don't
>> see itbecoming a priority as it breaks a number of design decisions we
>> made in our object storage and communication models.
>>
>> The best I can think about when considering Git as an approach would be
>> to use submodules for your security-related content, and then have server-
>> side security for access to those repos. Of course, submodules are not
>> supported in VFS for Git, either.
>
> Another option is builtin per-blob encryption (maybe with just
> clean/smudge filter), then access control will be about obtaining the
> decryption key (*) and we don't break object storage and
> communication. Of course pack delta compression becomes absolutely
> useless. But that is perhaps an acceptable trade off.

Right, this is another option, but from what John described wouldn't
work in this case. "Any hypothetical AMD monorepo should be able to
securely deny read access in certain subtrees to users without required
permissions".

I.e. in this case there will be a
secret-stuff-here/ryzen-microcode.code.encrypted or whatever,
unauthorized users can't see the content, but they can see from the
filename that it exists, and from "git log" who works on it.

It'll also baloon in size on the server-side since we can't delta any of
these objects, they'll all be X sized encrypted binaries.

> (*) Git will not cache the key in any shape or form. Whenever it needs
> to deflate an encrypted blob, it asks for the key from a separate
> daemon. This guy handles all the access control.
>
>> The Gerrit service has _branch_ level security, which is related to the
>> reachability questions that a directory security would need. However,
>> the problem is quite different. Gerrit does have a lot of experience in
>> dealing with submodules, though, so that's probably a good place to
>> start.
>>
>> Thanks,
>> -Stolee

^ permalink raw reply

* Re: [PATCH] rebase docs: fix incorrect format of the section Behavioral Differences
From: Johannes Sixt @ 2018-12-05 21:16 UTC (permalink / raw)
  To: Elijah Newren, Junio C Hamano; +Cc: Git Mailing List
In-Reply-To: <CABPp-BHzESYnQy5JwXvtXyLHgHR9u3UNVOZF2gU1m_uTMGkyfg@mail.gmail.com>

Am 05.12.18 um 16:37 schrieb Elijah Newren:
> On Tue, Dec 4, 2018 at 11:40 PM Junio C Hamano <gitster@pobox.com> wrote:
>>
>> Johannes Sixt <j6t@kdbg.org> writes:
>>
>>> Please let me deposit my objection. This paragraph is not the right
>>> place to explain what directory renme detection is and how it works
>>> under the hood. "works fine" in the original text is the right phrase
>>> here; if there is concern that this induces expectations that cannot
>>> be met, throw in the word "heuristics".
>>>
>>> Such as:
>>>     Directory rename heuristics work fine in the merge and interactive
>>>     backends. It does not in the am backend because...
>>
>> OK, good enough, I guess.  Or just s/work fine/is enabled/.
> 
> So...
> 
> Directory rename heuristics are enabled in the merge and interactive
> backends. They are not in the am backend because it operates on
> "fake ancestors" that involve trees containing only the files modified
> in the patch.  Due to the lack of accurate tree information, directory
> rename detection is disabled for the am backend.

OK with me. And also if you drop sentence no.2 and keep just no.1 and no.3.

Thanks,
-- Hannes

^ permalink raw reply

* Re: BUG: CR marker ^M doesn't show up in '-' lines of diffs when the ending of the removed line is CR+LF
From: Johannes Sixt @ 2018-12-05 21:31 UTC (permalink / raw)
  To: Frank Schäfer; +Cc: Junio C Hamano, brian m. carlson, git
In-Reply-To: <de3bd8e1-cdd7-5fcf-0912-a216cc8cb3e9@googlemail.com>

Am 05.12.18 um 20:29 schrieb Frank Schäfer:
> 
> Am 02.12.18 um 22:22 schrieb Johannes Sixt:
>> Am 02.12.18 um 20:31 schrieb Frank Schäfer:
>>> With other words:
>>> "If CR comes immediately before a LF, do the following with *all* lines:
>>> <RESET> after the CR if eol=lf but do not <RESET> after the CR if
>>> eol=crlf."
>>
>> Why? It is the pager's duty to highlight CR, IMO. If it doesn't, but
>> the user wants to see them, then they are using the wrong pager or the
>> wrong pager settings.
> AFAIU Junios explanation it's not the pagers fault.

Then Junio and I are in disagreement. IMO, Git does not have to be more 
clever than the pager when it comes to presentation of text.

>> As far as I am concerned, I do not have any of my files marked as
>> eol=crlf, but I do *not* want to see ^M in the pager. I.e., having git
>> insert <RESET> between CR and LF would do the wrong thing for me.
>>
> But doing the same thing in added lines is doing the right thing for you ?

Yes, I think so. As long as I'm not telling Git that my files are CRLF 
when they actual are, then the CR before LF is a whitespace error. 
Nevertheless, I do *NOT* want Git to outwit my pager by inserting 
<RESET> between CR and LF all the time so that it is forced to treat the 
lone CR like a control character that is to be made visible.

> Or are you suggesting to fix the behavior of added lines instead ?
> In any case, inconsistent behavior is not what we want.

I'm suggesting that users who knowingly store CRLF files in the 
database, but do not want to see ^M in added lines have to use 
whitespace=cr-at-eol and that's all. I do not see inconsistency.

-- Hannes

^ permalink raw reply

* Re: [RFC PATCH 3/3] test-lib: add the '--stress' option to run a test repeatedly under load
From: Jeff King @ 2018-12-05 21:36 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: git
In-Reply-To: <20181205103454.GJ30222@szeder.dev>

On Wed, Dec 05, 2018 at 11:34:54AM +0100, SZEDER Gábor wrote:

> 
> Just a quick reply to this one point for now:
> 
> On Wed, Dec 05, 2018 at 12:44:09AM -0500, Jeff King wrote:
> > On Tue, Dec 04, 2018 at 05:34:57PM +0100, SZEDER Gábor wrote:
> > > +	job_nr=0
> > > +	while test $job_nr -lt "$job_count"
> > > +	do
> > > +		wait
> > > +		job_nr=$(($job_nr + 1))
> > > +	done
> > 
> > Do we need to loop? Calling "wait" with no arguments should wait for all
> > children.
> 
> It should, but in dash, ksh, ksh93, Bash v4.2 or older it doesn't seem
> to do so, at least not on my machine, and I get back my shell prompt
> right after hitting ctrl-C or the first failed test, and then get the
> progress output from the remaining jobs later.

That's weird. I run my stress script under dash with a single "wait",
and it handles the failing case just fine. And switching to a single
"wait" in your script works for me.

But the ^C case is interesting. Try running your script under "sh -x"
and hitting ^C. The signal interrupts the first wait. In my script (or
yours modified to use a single wait), we then proceed immediately to the
"exit", and get output from the subshells after we've exited.

But in your script with the loop, the first wait is interrupted, and
then the _second_ wait (in the second loop iteration) picks up all of
the exiting processes. The subsequent waits are all noops that return
immediately, because there are no processes left.

So the right number of waits is either "1" or "2". Looping means we do
too many (which is mostly a harmless noop) or too few (in the off chance
that you have only a single job ;) ). So it works out in practice.

But I think a more clear way to do it is to simply wait in the signal
trap, to cover the case when our original wait was aborted.

I.e., something like this:

diff --git a/t/test-lib.sh b/t/test-lib.sh
index 9b7f687396..5e0c41626f 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -98,7 +98,7 @@ done,*)
 	mkdir -p "$(dirname "$TEST_RESULTS_BASE")"
 	stressfail="$TEST_RESULTS_BASE.stress-failed"
 	rm -f "$stressfail"
-	trap 'echo aborted >"$stressfail"' TERM INT HUP
+	trap 'echo aborted >"$stressfail"; wait' TERM INT HUP

 	job_nr=0
 	while test $job_nr -lt "$job_count"
@@ -128,12 +128,7 @@ done,*)
 		job_nr=$(($job_nr + 1))
 	done

-	job_nr=0
-	while test $job_nr -lt "$job_count"
-	do
-		wait
-		job_nr=$(($job_nr + 1))
-	done
+	wait

 	if test -f "$stressfail" && test "$(cat "$stressfail")" != "aborted"
 	then

which behaves well for me in all cases.

> Bash 4.3 or later are strange: I get back the shell prompt immediately
> after ctrl-C as well, so it doesn't appear to be waiting for all
> remaining jobs to finish either, but! I don't get any of the progress
> output from those jobs to mess up my next command.

Interesting. My bash 4.4 seems to behave the same as dash. It almost
sounds like the SIGINT is getting passed to the subshells for you.
Probably not really worth digging into, though.

In case anybody is playing with this and needed to simulate a stress
failure, here's what I used:

diff --git a/t/t0000-basic.sh b/t/t0000-basic.sh
index b6566003dd..a370cd9977 100755
--- a/t/t0000-basic.sh
+++ b/t/t0000-basic.sh
@@ -1171,4 +1171,11 @@ test_expect_success 'very long name in the index handled sanely' '
 	test $len = 4098
 '

+test_expect_success 'roll those dice' '
+	case "$(openssl rand -base64 1)" in
+	z*)
+		return 1
+	esac
+'
+
 test_done

^ permalink raw reply related

* [PATCH v3] list-objects.c: don't segfault for missing cmdline objects
From: Matthew DeVore @ 2018-12-05 21:43 UTC (permalink / raw)
  To: git; +Cc: Matthew DeVore, gitster, jonathantanmy, jeffhost, ramsay, matvore
In-Reply-To: <20181023215745.245333-1-matvore@google.com>

When a command is invoked with both --exclude-promisor-objects,
--objects-edge-aggressive, and a missing object on the command line,
the rev_info.cmdline array could get a NULL pointer for the value of
an 'item' field. Prevent dereferencing of a NULL pointer in that
situation.

Properly handle --ignore-missing. If it is not passed, die when an
object is missing. Otherwise, just silently ignore it.

Signed-off-by: Matthew DeVore <matvore@google.com>
---
 revision.c               |  2 ++
 t/t0410-partial-clone.sh | 16 ++++++++++++++--
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/revision.c b/revision.c
index 13e0519c02..293303b67d 100644
--- a/revision.c
+++ b/revision.c
@@ -1729,6 +1729,8 @@ int handle_revision_arg(const char *arg_, struct rev_info *revs, int flags, unsi
 	if (!cant_be_filename)
 		verify_non_filename(revs->prefix, arg);
 	object = get_reference(revs, arg, &oid, flags ^ local_flags);
+	if (!object)
+		return revs->ignore_missing ? 0 : -1;
 	add_rev_cmdline(revs, object, arg_, REV_CMD_REV, flags ^ local_flags);
 	add_pending_object_with_path(revs, object, arg, oc.mode, oc.path);
 	free(oc.path);
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index ba3887f178..169f7f10a7 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -349,7 +349,7 @@ test_expect_success 'rev-list stops traversal at promisor commit, tree, and blob
 	grep $(git -C repo rev-parse bar) out  # sanity check that some walking was done
 '
 
-test_expect_success 'rev-list accepts missing and promised objects on command line' '
+test_expect_success 'rev-list dies for missing objects on cmd line' '
 	rm -rf repo &&
 	test_create_repo repo &&
 	test_commit -C repo foo &&
@@ -366,7 +366,19 @@ test_expect_success 'rev-list accepts missing and promised objects on command li
 
 	git -C repo config core.repositoryformatversion 1 &&
 	git -C repo config extensions.partialclone "arbitrary string" &&
-	git -C repo rev-list --exclude-promisor-objects --objects "$COMMIT" "$TREE" "$BLOB"
+
+	for OBJ in "$COMMIT" "$TREE" "$BLOB"; do
+		test_must_fail git -C repo rev-list --objects \
+			--exclude-promisor-objects "$OBJ" &&
+		test_must_fail git -C repo rev-list --objects-edge-aggressive \
+			--exclude-promisor-objects "$OBJ" &&
+
+		# Do not die or crash when --ignore-missing is passed.
+		git -C repo rev-list --ignore-missing --objects \
+			--exclude-promisor-objects "$OBJ" &&
+		git -C repo rev-list --ignore-missing --objects-edge-aggressive \
+			--exclude-promisor-objects "$OBJ"
+	done
 '
 
 test_expect_success 'gc repacks promisor objects separately from non-promisor objects' '
-- 
2.20.0.rc1.387.gf8505762e3-goog


^ permalink raw reply related

* Re: [RFC PATCH 3/3] test-lib: add the '--stress' option to run a test repeatedly under load
From: Jeff King @ 2018-12-05 21:56 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: git
In-Reply-To: <20181205140106.GM30222@szeder.dev>

On Wed, Dec 05, 2018 at 03:01:06PM +0100, SZEDER Gábor wrote:

> > >   - Make '--stress' imply '--verbose-log' and discard the test's
> > >     standard ouput and error; dumping the output of several parallel
> > >     tests to the terminal would create a big ugly mess.
> > 
> > Makes sense. My script just redirects the output to a file, but it
> > predates --verbose-log.
> > 
> > My script always runs with "-x". I guess it's not that hard to add it if
> > you want, but it is annoying to finally get a failure after several
> > minutes only to realize that your are missing some important
> > information. ;)
> > 
> > Ditto for "-i", which leaves more readable output (you know the
> > interesting part is at the end of the log), and leaves a trash directory
> > state that is more likely to be useful for inspecting.
> 
> I wanted to imply only the bare minimum of options that are necessary
> to prevent the tests from stomping on each other.  Other than that I
> agree and wouldn't run '--stress' without '-x -i' myself, and at one
> point considered to recommend '-x -i' in the description of
> '--stress'.

Yeah, it probably is the right thing to make the options as orthogonal
as possible, and let the user decide what they want. I was just
wondering if I could replace my "stress" script with this. But I think
I'd still want it to remember to use a sane set of options (including
"--root" in my case).

> > > 'wait' for all parallel jobs before exiting (either because a failure
> > > was found or because the user lost patience and aborted the stress
> > > test), allowing the still running tests to finish.  Otherwise the "OK
> > > X.Y" progress output from the last iteration would likely arrive after
> > > the user got back the shell prompt, interfering with typing in the
> > > next command.  OTOH, this waiting might induce a considerable delay
> > > between hitting ctrl-C and the test actually exiting; I'm not sure
> > > this is the right tradeoff.
> > 
> > If there is a delay, it's actually quite annoying to _not_ wait; you
> > start doing something in the terminal, and then a bunch of extra test
> > statuses print at a random time.
> 
> At least the INT/TERM/etc. handler should say something like "Waiting
> for background jobs to finish", to let the user know why it doesn't
> exit right away.

That seems reasonable. There's also no point in waiting for them to
finish if we know we're aborting anyway. Could we just kill them all?

I guess it's a little tricky, because $! is going to give us the pid of
each subshell. We actually want to kill the test sub-process. That takes
a few contortions, but the output is nice (every sub-job immediately
says "ABORTED ...", and the final process does not exit until the whole
tree is done):

diff --git a/t/test-lib.sh b/t/test-lib.sh
index 9b7f687396..357dead3ff 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -98,8 +98,9 @@ done,*)
 	mkdir -p "$(dirname "$TEST_RESULTS_BASE")"
 	stressfail="$TEST_RESULTS_BASE.stress-failed"
 	rm -f "$stressfail"
-	trap 'echo aborted >"$stressfail"' TERM INT HUP
+	trap 'echo aborted >"$stressfail"; kill $job_pids 2>/dev/null; wait' TERM INT HUP
 
+	job_pids=
 	job_nr=0
 	while test $job_nr -lt "$job_count"
 	do
@@ -108,10 +109,13 @@ done,*)
 			GIT_TEST_STRESS_JOB_NR=$job_nr
 			export GIT_TEST_STRESS_STARTED GIT_TEST_STRESS_JOB_NR
 
+			trap 'kill $test_pid 2>/dev/null' TERM
+
 			cnt=0
 			while ! test -e "$stressfail"
 			do
-				if $TEST_SHELL_PATH "$0" "$@" >/dev/null 2>&1
+				$TEST_SHELL_PATH "$0" "$@" >/dev/null 2>&1 & test_pid=$!
+				if wait
 				then
 					printf >&2 "OK   %2d.%d\n" $GIT_TEST_STRESS_JOB_NR $cnt
 				elif test -f "$stressfail" &&
@@ -124,16 +128,11 @@ done,*)
 				fi
 				cnt=$(($cnt + 1))
 			done
-		) &
+		) & job_pids="$job_pids $!"
 		job_nr=$(($job_nr + 1))
 	done
 
-	job_nr=0
-	while test $job_nr -lt "$job_count"
-	do
-		wait
-		job_nr=$(($job_nr + 1))
-	done
+	wait
 
 	if test -f "$stressfail" && test "$(cat "$stressfail")" != "aborted"
 	then

> > > +				elif test -f "$stressfail" &&
> > > +				     test "$(cat "$stressfail")" = "aborted"
> > > +				then
> > > +					printf >&2 "ABORTED %2d.%d\n" $GIT_TEST_STRESS_JOB_NR $cnt
> > > +				else
> > > +					printf >&2 "FAIL %2d.%d\n" $GIT_TEST_STRESS_JOB_NR $cnt
> > > +					echo $GIT_TEST_STRESS_JOB_NR >>"$stressfail"
> > > +				fi
> > 
> > Hmm. What happens if we see a failure _and_ there's an abort? That's
> > actually pretty plausible if you see a failure whiz by, and you want to
> > abort the remaining scripts because they take a long time to run.
> > 
> > If the abort happens first, then the goal is to assume any errors are
> > due to the abort.
> 
> Yeah, that's why I added this ABORTED case.  When I hit ctrl-C, a
> couple of the tests tend to fail, but that's not the rare failure we
> are hoping to find when stress testing, so printing FAIL would be
> misleading.  The test script exits with 1 all the same, so we can't
> tell the difference, but since rare failures are, well, rare, this
> failure is most likely due to the abort.

With the patch as posted I actually see most of them still say "OK",
because the SIGINT does not make it to them (it would be a different
story if I killed the whole process group).

But with the modification above, I get "ABORTED" for most, which is what
I'd want. And it should be pretty race-free, since the "aborted" file is
put into place before triggering the signal cascade.

> > If the fail happens first (which I think is the more likely case), then
> > the abort handler will overwrite the file with "aborted", and we'll
> > forget that there was a real failure. This works in my script because of
> > the symlinking (if a real failure symlinked $fail to a directory, then
> > the attempt to write "aborted" will just be a noop).
> 
> OK, I think the abort handler should append, not overwrite, and then
> the file's contents should be checked e.g. with a case statement
> looking for *aborted*.
> 
> Or use two separate files for signaling abort and test failure.

Two separate files seems a lot simpler to me.

> > In my experience, outputting the failed log saves a step (especially
> > with "-i"), since seeing the failed test and its output is often
> > sufficient.
> 
> I just don't like when test output, especially with '-x', is dumped on
> my terminal :)

Don't you then use your terminal to display the file? ;) (I'm kidding, I
assume you load it in "less" or whatever, which is distinct).

I wish this could be optional somehow, though. I really prefer the dump.
I guess if I continue to use my script as a wrapper, it can dump for me
(after digging in the $stressfail file, I guess).

> > I'm also sad to lose the symlink to the failed trash directory, which
> > saves cutting and pasting when you have to inspect it. But I guess we
> > can't rely on symlinks everywhere.
> 
> You can tab-complete most of the trash directory, and then there are
> only those 1-2-3 digits of the job number that you have to type.  That
> worked out well enough for me so far (but I've only used it for a few
> days, so...).
> 
> We could rename the trash directory of the failed test to something
> short, sweet, and common, but that could be racy in the unlikely case
> that two tests fail at the same time.  And I like the 'trash
> directory.' prefix, because then 'make clean' will delete that one,
> too, without modifying 't/Makefile' (which is also the reason for
> adding the 'stress-N' suffix instead of a prefix, and putting the fail
> file into the 'test-results' directory).

We could rename it to "trash directory.xxx.failed", which is at least
predictable. That can even be done atomically, though I think it's a
little tricky to do portably with shell tools. I guess I can continue to
do the symlink thing in my wrapper script.

-Peff

^ permalink raw reply related

* Re: [PATCH 2/3] test-lib-functions: introduce the 'test_set_port' helper function
From: Jeff King @ 2018-12-05 21:59 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: git, Luke Diamand
In-Reply-To: <20181205122035.GL30222@szeder.dev>

On Wed, Dec 05, 2018 at 01:20:35PM +0100, SZEDER Gábor wrote:

> On Wed, Dec 05, 2018 at 12:17:09AM -0500, Jeff King wrote:
> > On Tue, Dec 04, 2018 at 05:34:56PM +0100, SZEDER Gábor wrote:
> > 
> > > Several test scripts run daemons like 'git-daemon' or Apache, and
> > > communicate with them through TCP sockets.  To have unique ports where
> > > these daemons are accessible, the ports are usually the number of the
> > > corresponding test scripts, unless the user overrides them via
> > > environment variables, and thus all those tests and test libs contain
> > > more or less the same bit of one-liner boilerplate code to find out
> > > the port.  The last patch in this series will make this a bit more
> > > complicated.
> > > 
> > > Factor out finding the port for a daemon into the common helper
> > > function 'test_set_port' to avoid repeating ourselves.
> > 
> > OK. Looks like this should keep the same port numbers for all of the
> > existing tests, which I think is good.
> 
> Not for all existing tests, though: t0410 and the 'git p4' tests do
> get different ports.

Yeah, I should have said "most". The important thing is that they remain
static and predictable for normal non-stress runs, which they do.

-Peff

^ permalink raw reply

* Re: What's cooking in git.git (Nov 2018, #06; Wed, 21)
From: Matthew DeVore @ 2018-12-05 22:04 UTC (permalink / raw)
  To: Junio C Hamano, git
In-Reply-To: <xmqqlg5m7qlb.fsf@gitster-ct.c.googlers.com>



On 11/21/2018 01:00 AM, Junio C Hamano wrote:
> * md/list-lazy-objects-fix (2018-10-29) 1 commit
>   - list-objects.c: don't segfault for missing cmdline objects
> 
>   "git rev-list --exclude-promissor-objects" had to take an object
>   that does not exist locally (and is lazily available) from the
>   command line without barfing, but the code dereferenced NULL.
> 
>   That sympotom may be worth addressing, but I think the "fix" is
>   overly broad and is wrong.  Giving a missing object should be
>   diagnosed as an error, not swept under the rug silently.

Thank you for taking a look. I just sent v3 of the patch: 
https://public-inbox.org/git/20181205214346.106217-1-matvore@google.com/T/#u

which is different from v2 in that it honors the presence or absence of 
--ignore-missing. It will only sweep things under the rug if that flag 
is passed.

^ permalink raw reply

* [PATCH 0/2] Add commit-graph fuzzer and fix buffer overflow
From: Josh Steadmon @ 2018-12-05 22:32 UTC (permalink / raw)
  To: git, stolee

Add a new fuzz test for the commit graph and fix a buffer read-overflow
that it discovered.

Josh Steadmon (2):
  commit-graph, fuzz: Add fuzzer for commit-graph
  commit-graph: fix buffer read-overflow

 .gitignore          |  1 +
 Makefile            |  1 +
 commit-graph.c      | 76 +++++++++++++++++++++++++++++++++------------
 fuzz-commit-graph.c | 18 +++++++++++
 4 files changed, 77 insertions(+), 19 deletions(-)
 create mode 100644 fuzz-commit-graph.c

-- 
2.20.0.rc2.403.gdbc3b29805-goog


^ permalink raw reply

* [PATCH 1/2] commit-graph, fuzz: Add fuzzer for commit-graph
From: Josh Steadmon @ 2018-12-05 22:32 UTC (permalink / raw)
  To: git, stolee
In-Reply-To: <cover.1544048946.git.steadmon@google.com>

Breaks load_commit_graph_one() into a new function,
parse_commit_graph(). The latter function operates on arbitrary buffers,
which makes it suitable as a fuzzing target.

Adds fuzz-commit-graph.c, which provides a fuzzing entry point
compatible with libFuzzer (and possibly other fuzzing engines).

Signed-off-by: Josh Steadmon <steadmon@google.com>
---
 .gitignore          |  1 +
 Makefile            |  1 +
 commit-graph.c      | 63 +++++++++++++++++++++++++++++++++------------
 fuzz-commit-graph.c | 18 +++++++++++++
 4 files changed, 66 insertions(+), 17 deletions(-)
 create mode 100644 fuzz-commit-graph.c

diff --git a/.gitignore b/.gitignore
index 0d77ea5894..8bcf153ed9 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,3 +1,4 @@
+/fuzz-commit-graph
 /fuzz_corpora
 /fuzz-pack-headers
 /fuzz-pack-idx
diff --git a/Makefile b/Makefile
index 1a44c811aa..6b72f37c29 100644
--- a/Makefile
+++ b/Makefile
@@ -684,6 +684,7 @@ SCRIPTS = $(SCRIPT_SH_INS) \
 
 ETAGS_TARGET = TAGS
 
+FUZZ_OBJS += fuzz-commit-graph.o
 FUZZ_OBJS += fuzz-pack-headers.o
 FUZZ_OBJS += fuzz-pack-idx.o
 
diff --git a/commit-graph.c b/commit-graph.c
index 40c855f185..0755359b1a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -46,6 +46,10 @@
 #define GRAPH_MIN_SIZE (GRAPH_HEADER_SIZE + 4 * GRAPH_CHUNKLOOKUP_WIDTH \
 			+ GRAPH_FANOUT_SIZE + GRAPH_OID_LEN)
 
+struct commit_graph *parse_commit_graph(void *graph_map, int fd,
+					size_t graph_size);
+
+
 char *get_commit_graph_filename(const char *obj_dir)
 {
 	return xstrfmt("%s/info/commit-graph", obj_dir);
@@ -84,16 +88,10 @@ static int commit_graph_compatible(struct repository *r)
 struct commit_graph *load_commit_graph_one(const char *graph_file)
 {
 	void *graph_map;
-	const unsigned char *data, *chunk_lookup;
 	size_t graph_size;
 	struct stat st;
-	uint32_t i;
-	struct commit_graph *graph;
+	struct commit_graph *ret;
 	int fd = git_open(graph_file);
-	uint64_t last_chunk_offset;
-	uint32_t last_chunk_id;
-	uint32_t graph_signature;
-	unsigned char graph_version, hash_version;
 
 	if (fd < 0)
 		return NULL;
@@ -108,27 +106,61 @@ struct commit_graph *load_commit_graph_one(const char *graph_file)
 		die(_("graph file %s is too small"), graph_file);
 	}
 	graph_map = xmmap(NULL, graph_size, PROT_READ, MAP_PRIVATE, fd, 0);
+	ret = parse_commit_graph(graph_map, fd, graph_size);
+
+	if (ret == NULL) {
+		munmap(graph_map, graph_size);
+		close(fd);
+		exit(1);
+	}
+
+	return ret;
+}
+
+/*
+ * This function is intended to be used only from load_commit_graph_one() or in
+ * fuzz tests.
+ */
+struct commit_graph *parse_commit_graph(void *graph_map, int fd,
+					size_t graph_size)
+{
+	const unsigned char *data, *chunk_lookup;
+	uint32_t i;
+	struct commit_graph *graph;
+	uint64_t last_chunk_offset;
+	uint32_t last_chunk_id;
+	uint32_t graph_signature;
+	unsigned char graph_version, hash_version;
+
+	/*
+	 * This should already be checked in load_commit_graph_one, but we still
+	 * need a check here for when we're calling parse_commit_graph directly
+	 * from fuzz tests. We can omit the error message in that case.
+	 */
+	if (graph_size < GRAPH_MIN_SIZE)
+		return NULL;
+
 	data = (const unsigned char *)graph_map;
 
 	graph_signature = get_be32(data);
 	if (graph_signature != GRAPH_SIGNATURE) {
 		error(_("graph signature %X does not match signature %X"),
 		      graph_signature, GRAPH_SIGNATURE);
-		goto cleanup_fail;
+		return NULL;
 	}
 
 	graph_version = *(unsigned char*)(data + 4);
 	if (graph_version != GRAPH_VERSION) {
 		error(_("graph version %X does not match version %X"),
 		      graph_version, GRAPH_VERSION);
-		goto cleanup_fail;
+		return NULL;
 	}
 
 	hash_version = *(unsigned char*)(data + 5);
 	if (hash_version != GRAPH_OID_VERSION) {
 		error(_("hash version %X does not match version %X"),
 		      hash_version, GRAPH_OID_VERSION);
-		goto cleanup_fail;
+		return NULL;
 	}
 
 	graph = alloc_commit_graph();
@@ -152,7 +184,8 @@ struct commit_graph *load_commit_graph_one(const char *graph_file)
 		if (chunk_offset > graph_size - GIT_MAX_RAWSZ) {
 			error(_("improper chunk offset %08x%08x"), (uint32_t)(chunk_offset >> 32),
 			      (uint32_t)chunk_offset);
-			goto cleanup_fail;
+			free(graph);
+			return NULL;
 		}
 
 		switch (chunk_id) {
@@ -187,7 +220,8 @@ struct commit_graph *load_commit_graph_one(const char *graph_file)
 
 		if (chunk_repeated) {
 			error(_("chunk id %08x appears multiple times"), chunk_id);
-			goto cleanup_fail;
+			free(graph);
+			return NULL;
 		}
 
 		if (last_chunk_id == GRAPH_CHUNKID_OIDLOOKUP)
@@ -201,11 +235,6 @@ struct commit_graph *load_commit_graph_one(const char *graph_file)
 	}
 
 	return graph;
-
-cleanup_fail:
-	munmap(graph_map, graph_size);
-	close(fd);
-	exit(1);
 }
 
 static void prepare_commit_graph_one(struct repository *r, const char *obj_dir)
diff --git a/fuzz-commit-graph.c b/fuzz-commit-graph.c
new file mode 100644
index 0000000000..420851d0d2
--- /dev/null
+++ b/fuzz-commit-graph.c
@@ -0,0 +1,18 @@
+#include "object-store.h"
+#include "commit-graph.h"
+
+struct commit_graph *parse_commit_graph(void *graph_map, int fd,
+					size_t graph_size);
+
+int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size);
+
+int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
+{
+	struct commit_graph *g;
+
+	g = parse_commit_graph((void *) data, -1, size);
+	if (g)
+		free(g);
+
+	return 0;
+}
-- 
2.20.0.rc2.403.gdbc3b29805-goog


^ permalink raw reply related

* [PATCH 2/2] commit-graph: fix buffer read-overflow
From: Josh Steadmon @ 2018-12-05 22:32 UTC (permalink / raw)
  To: git, stolee
In-Reply-To: <cover.1544048946.git.steadmon@google.com>

fuzz-commit-graph identified a case where Git will read past the end of
a buffer containing a commit graph if the graph's header has an
incorrect chunk count. A simple bounds check in parse_commit_graph()
prevents this.

Signed-off-by: Josh Steadmon <steadmon@google.com>
Helped-by: Derrick Stolee <stolee@gmail.com>
---
 commit-graph.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 0755359b1a..fee171a5f3 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -175,10 +175,19 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
 	last_chunk_offset = 8;
 	chunk_lookup = data + 8;
 	for (i = 0; i < graph->num_chunks; i++) {
-		uint32_t chunk_id = get_be32(chunk_lookup + 0);
-		uint64_t chunk_offset = get_be64(chunk_lookup + 4);
+		uint32_t chunk_id;
+		uint64_t chunk_offset;
 		int chunk_repeated = 0;
 
+		if (chunk_lookup + GRAPH_CHUNKLOOKUP_WIDTH > data + graph_size) {
+			error(_("chunk lookup table entry missing; graph file may be incomplete"));
+			free(graph);
+			return NULL;
+		}
+
+		chunk_id = get_be32(chunk_lookup + 0);
+		chunk_offset = get_be64(chunk_lookup + 4);
+
 		chunk_lookup += GRAPH_CHUNKLOOKUP_WIDTH;
 
 		if (chunk_offset > graph_size - GIT_MAX_RAWSZ) {
-- 
2.20.0.rc2.403.gdbc3b29805-goog


^ permalink raw reply related

* Re: git, monorepos, and access control
From: Ævar Arnfjörð Bjarmason @ 2018-12-05 22:37 UTC (permalink / raw)
  To: Coiner, John; +Cc: git@vger.kernel.org
In-Reply-To: <939efd87-b2af-29d7-efdd-9cf8f6de9d10@amd.com>

On Wed, Dec 05 2018, Coiner, John wrote:

I forgot to mention this in my initial reply in
<878t13zp8y.fsf@evledraar.gmail.com>, but on a re-reading I re-spotted
this:

>   - hashes are secret. If the hashes from a protected tree leak, the
> data also leaks. No check on the server prevents it from handing out
> contents for correctly-guessed hashes.

This is a thing I know *way* less about so maybe I'm completely wrong,
but even if we have all the rest of the things outlined in your post to
support this, isn't this part going to be susceptible to timing attacks?

We'll do more work if you send a SHA-1 during negotiation that shares a
prefix with an existing SHA-1, since we need to binary search & compare
further. SHA-1 is 160 bits which gives you a huge space of potential
hashes, but not if I can try one bit at a time working from the start of
the hash to work my way to a valid existing hash stored on the server.

Of course that assumes a way to do this over the network, it'll be on
AMD's internal network so much faster than average, but maybe this is
completely implausible.

NetSpectre was different and relied on executing code on the remote
computer in a sandbox, not waiting for network roundtrips for each try,
so maybe this would be a non-issue.

^ permalink raw reply

* Re: [PATCH] rebase docs: fix incorrect format of the section Behavioral Differences
From: Junio C Hamano @ 2018-12-05 22:46 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: Elijah Newren, Git Mailing List
In-Reply-To: <bd67cd33-dbe8-03a0-e760-c7bb23854084@kdbg.org>

Johannes Sixt <j6t@kdbg.org> writes:

> Am 05.12.18 um 16:37 schrieb Elijah Newren:
>> On Tue, Dec 4, 2018 at 11:40 PM Junio C Hamano <gitster@pobox.com> wrote:
>>>
>>> Johannes Sixt <j6t@kdbg.org> writes:
>>>
>>>> Please let me deposit my objection. This paragraph is not the right
>>>> place to explain what directory renme detection is and how it works
>>>> under the hood. "works fine" in the original text is the right phrase
>>>> here; if there is concern that this induces expectations that cannot
>>>> be met, throw in the word "heuristics".
>>>>
>>>> Such as:
>>>>     Directory rename heuristics work fine in the merge and interactive
>>>>     backends. It does not in the am backend because...
>>>
>>> OK, good enough, I guess.  Or just s/work fine/is enabled/.
>>
>> So...
>>
>> Directory rename heuristics are enabled in the merge and interactive
>> backends. They are not in the am backend because it operates on
>> "fake ancestors" that involve trees containing only the files modified
>> in the patch.  Due to the lack of accurate tree information, directory
>> rename detection is disabled for the am backend.
>
> OK with me. And also if you drop sentence no.2 and keep just no.1 and no.3.

Yeah, that shorter version may be sufficient to explain why we do
not use the heuristics in the "am" backend.

^ permalink raw reply

* Re: [PATCH 1/2] commit-graph, fuzz: Add fuzzer for commit-graph
From: Ævar Arnfjörð Bjarmason @ 2018-12-05 22:48 UTC (permalink / raw)
  To: Josh Steadmon; +Cc: git, stolee
In-Reply-To: <53e62baaa8769bf8e90991e32e0d123cc6629559.1544048946.git.steadmon@google.com>


On Wed, Dec 05 2018, Josh Steadmon wrote:

> Breaks load_commit_graph_one() into a new function,
> parse_commit_graph(). The latter function operates on arbitrary buffers,
> which makes it suitable as a fuzzing target.
>
> Adds fuzz-commit-graph.c, which provides a fuzzing entry point
> compatible with libFuzzer (and possibly other fuzzing engines).
>
> Signed-off-by: Josh Steadmon <steadmon@google.com>
> ---
>  .gitignore          |  1 +
>  Makefile            |  1 +
>  commit-graph.c      | 63 +++++++++++++++++++++++++++++++++------------
>  fuzz-commit-graph.c | 18 +++++++++++++
>  4 files changed, 66 insertions(+), 17 deletions(-)
>  create mode 100644 fuzz-commit-graph.c
>
> diff --git a/.gitignore b/.gitignore
> index 0d77ea5894..8bcf153ed9 100644
> --- a/.gitignore
> +++ b/.gitignore
> @@ -1,3 +1,4 @@
> +/fuzz-commit-graph
>  /fuzz_corpora
>  /fuzz-pack-headers
>  /fuzz-pack-idx
> diff --git a/Makefile b/Makefile
> index 1a44c811aa..6b72f37c29 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -684,6 +684,7 @@ SCRIPTS = $(SCRIPT_SH_INS) \
>
>  ETAGS_TARGET = TAGS
>
> +FUZZ_OBJS += fuzz-commit-graph.o
>  FUZZ_OBJS += fuzz-pack-headers.o
>  FUZZ_OBJS += fuzz-pack-idx.o
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 40c855f185..0755359b1a 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -46,6 +46,10 @@
>  #define GRAPH_MIN_SIZE (GRAPH_HEADER_SIZE + 4 * GRAPH_CHUNKLOOKUP_WIDTH \
>  			+ GRAPH_FANOUT_SIZE + GRAPH_OID_LEN)
>
> +struct commit_graph *parse_commit_graph(void *graph_map, int fd,
> +					size_t graph_size);
> +
> +
>  char *get_commit_graph_filename(const char *obj_dir)
>  {
>  	return xstrfmt("%s/info/commit-graph", obj_dir);
> @@ -84,16 +88,10 @@ static int commit_graph_compatible(struct repository *r)
>  struct commit_graph *load_commit_graph_one(const char *graph_file)
>  {
>  	void *graph_map;
> -	const unsigned char *data, *chunk_lookup;
>  	size_t graph_size;
>  	struct stat st;
> -	uint32_t i;
> -	struct commit_graph *graph;
> +	struct commit_graph *ret;
>  	int fd = git_open(graph_file);
> -	uint64_t last_chunk_offset;
> -	uint32_t last_chunk_id;
> -	uint32_t graph_signature;
> -	unsigned char graph_version, hash_version;
>
>  	if (fd < 0)
>  		return NULL;
> @@ -108,27 +106,61 @@ struct commit_graph *load_commit_graph_one(const char *graph_file)
>  		die(_("graph file %s is too small"), graph_file);
>  	}
>  	graph_map = xmmap(NULL, graph_size, PROT_READ, MAP_PRIVATE, fd, 0);
> +	ret = parse_commit_graph(graph_map, fd, graph_size);
> +
> +	if (ret == NULL) {

Code in git usually uses just !ret.

> +		munmap(graph_map, graph_size);
> +		close(fd);
> +		exit(1);

Ouch, exit(1) from load_commit_graph_one()? Can't we return NULL here
instead? Nasty to exit from a library routine, but then I see later...

> @@ -201,11 +235,6 @@ struct commit_graph *load_commit_graph_one(const char *graph_file)
>  	}
>
>  	return graph;
> -
> -cleanup_fail:
> -	munmap(graph_map, graph_size);
> -	close(fd);
> -	exit(1);
>  }

... ah, I see this is where you got the exit(1) from. So it was there
already.

>  static void prepare_commit_graph_one(struct repository *r, const char *obj_dir)
> diff --git a/fuzz-commit-graph.c b/fuzz-commit-graph.c
> new file mode 100644
> index 0000000000..420851d0d2
> --- /dev/null
> +++ b/fuzz-commit-graph.c
> @@ -0,0 +1,18 @@
> +#include "object-store.h"
> +#include "commit-graph.h"
> +
> +struct commit_graph *parse_commit_graph(void *graph_map, int fd,
> +					size_t graph_size);
> +
> +int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size);
> +
> +int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
> +{
> +	struct commit_graph *g;
> +
> +	g = parse_commit_graph((void *) data, -1, size);
> +	if (g)
> +		free(g);
> +
> +	return 0;
> +}

I hadn't looked at this before, but see your 5e47215080 ("fuzz: add
basic fuzz testing target.", 2018-10-12) for some prior art.

There's instructions there for a very long "make" invocation. Would be
nice if this were friendlier and we could just do "make test-fuzz" or
something...

^ permalink raw reply

* Re: [PATCH on sb/more-repo-in-api] revision: use commit graph in get_reference()
From: Junio C Hamano @ 2018-12-05 23:15 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git
In-Reply-To: <20181204224238.50966-1-jonathantanmy@google.com>

Jonathan Tan <jonathantanmy@google.com> writes:

> Looking at the bigger picture, the speed of the connectivity check
> during a fetch might be further improved by passing only the negotiation
> tips (obtained through --negotiation-tip) instead of "--all". This patch
> just handles the low-hanging fruit first.

That sounds like a good direction, when having to list excessive
number of refs is the primary problem.  When fetching their 'master'
into our 'remotes/origin/master' and doing nothing else, we may end
up showing only the latter, which will miss optimization opportunity
a lot if the latest change made over there is to merge in the change
we asked them to pull earlier (which would be greatly helped if we
let them know about the tip of the topic they earlier pulled from
us), but also avoids having to send irrelevant refs that point at
tags addded to months old states.  So there is a subtle trade-off
between sending more refs to reduce the resulting packfile, and
sending fewer refs to reduce the cost of the "have" exchange.

Changing the way to throw each object pointed at by a ref into the
queue to be emitted in the "have" exchange from regular object
parsing to peeking of precomputed data would reduce the local cost
of "have" exchange, but it does not reduce the network cost at all,
though.

As to the change being specific to get_reference() and not to
parse_object(), I think what we see here is probably better, simply
because parse_object() is not in the position to asssume that it is
likely to be asked to parse commits, but I think get_reference() is,
after looking at its callsites in revision.c.

I do share the meta-comment concern with Peff, though.

> ---
>  revision.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/revision.c b/revision.c
> index b5108b75ab..e7da2c57ab 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -212,7 +212,20 @@ static struct object *get_reference(struct rev_info *revs, const char *name,
>  {
>  	struct object *object;
>  
> -	object = parse_object(revs->repo, oid);
> +	/*
> +	 * If the repository has commit graphs, repo_parse_commit() avoids
> +	 * reading the object buffer, so use it whenever possible.
> +	 */
> +	if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT) {
> +		struct commit *c = lookup_commit(revs->repo, oid);
> +		if (!repo_parse_commit(revs->repo, c))
> +			object = (struct object *) c;
> +		else
> +			object = NULL;
> +	} else {
> +		object = parse_object(revs->repo, oid);
> +	}
> +
>  	if (!object) {
>  		if (revs->ignore_missing)
>  			return object;

^ permalink raw reply

* Re: git, monorepos, and access control
From: Coiner, John @ 2018-12-05 23:42 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Duy Nguyen
  Cc: Derrick Stolee, Git Mailing List, peff@peff.net
In-Reply-To: <877egnznhh.fsf@evledraar.gmail.com>

inline...

On 12/05/2018 04:12 PM, Ævar Arnfjörð Bjarmason wrote:
> On Wed, Dec 05 2018, Duy Nguyen wrote:
>
>>
>> Another option is builtin per-blob encryption (maybe with just
>> clean/smudge filter), then access control will be about obtaining the
>> decryption key (*) and we don't break object storage and
>> communication. Of course pack delta compression becomes absolutely
>> useless. But that is perhaps an acceptable trade off.
> Right, this is another option, but from what John described wouldn't
> work in this case. "Any hypothetical AMD monorepo should be able to
> securely deny read access in certain subtrees to users without required
> permissions".
>
> I.e. in this case there will be a
> secret-stuff-here/ryzen-microcode.code.encrypted or whatever,
> unauthorized users can't see the content, but they can see from the
> filename that it exists, and from "git log" who works on it.

Ah, clean/smudge has potential. It locates the security boundary 
entirely outside the git service and outside the repo. No big software 
engineering project is needed. It should work in VFSForGit just as it 
does in plain old git. Beautiful!

For our use case, it might be OK for commit logs, branch names, and 
filenames to be world-readable. Commit logs are probably the greatest 
concern; and perhaps we could address that through other means.

Maybe it's possible to design a text-to-text cipher that still permits 
git to store compact diffs of ciphered text, with only modest impact on 
security?

Thank you all for your thoughtful replies.

>
> This is a thing I know *way* less about so maybe I'm completely wrong,
> but even if we have all the rest of the things outlined in your post to
> support this, isn't this part going to be susceptible to timing attacks?

That's a good point. It's not easy to estimate exposure to something 
like this.

> For instance, Git is very eager to try to find delta-compression
> opportunities between objects, even if they don't have any relationship
> within the tree structure. So imagine I want to know the contents of
> tree X. I push up a tree Y similar to X, then fetch it back, falsely
> claiming to have X but not Y. If the server generates a delta, that may
> reveal information about X (which you can then iterate to send Y', and
> so on, treating the server as an oracle until you've guessed the content
> of X).
Another good point. I wouldn't have thought of either of these attacks. 
You're scaring me (appropriately) about the risks of adding security to 
a previously-unsecured interface. Let me push on the smudge/clean 
approach and maybe that will bear fruit.

Thanks again.

- John


^ permalink raw reply

* Re: [RFC PATCH 3/3] test-lib: add the '--stress' option to run a test repeatedly under load
From: Junio C Hamano @ 2018-12-06  0:22 UTC (permalink / raw)
  To: Jeff King; +Cc: SZEDER Gábor, git
In-Reply-To: <20181205213625.GD19936@sigill.intra.peff.net>

Jeff King <peff@peff.net> writes:

> But the ^C case is interesting. Try running your script under "sh -x"
> and hitting ^C. The signal interrupts the first wait. In my script (or
> yours modified to use a single wait), we then proceed immediately to the
> "exit", and get output from the subshells after we've exited.
>
> But in your script with the loop, the first wait is interrupted, and
> then the _second_ wait (in the second loop iteration) picks up all of
> the exiting processes. The subsequent waits are all noops that return
> immediately, because there are no processes left.
>
> So the right number of waits is either "1" or "2". Looping means we do
> too many (which is mostly a harmless noop) or too few (in the off chance
> that you have only a single job ;) ). So it works out in practice.

Well, if you time your ^C perfectly, no number of waits is right, I
am afraid.  You spawn N processes and while looping N times waiting
for them, you can ^C out of wait before these N processes all die,
no?

> But I think a more clear way to do it is to simply wait in the signal
> trap, to cover the case when our original wait was aborted.

Sounds good.

^ permalink raw reply

* Re: git, monorepos, and access control
From: brian m. carlson @ 2018-12-06  0:23 UTC (permalink / raw)
  To: Jeff King; +Cc: Coiner, John, git@vger.kernel.org
In-Reply-To: <20181205210104.GC19936@sigill.intra.peff.net>

[-- Attachment #1: Type: text/plain, Size: 1288 bytes --]

On Wed, Dec 05, 2018 at 04:01:05PM -0500, Jeff King wrote:
> You could work around that by teaching the server to refuse to use "X"
> in any way when the client does not have the right permissions. But:
> 
>   - that's just one example; there are probably a number of such leaks
> 
>   - you're fighting an uphill battle against the way Git is implemented;
>     there's no single point to enforce this kind of segmentation
> 
> The model that fits more naturally with how Git is implemented would be
> to use submodules. There you leak the hash of the commit from the
> private submodule, but that's probably obscure enough (and if you're
> really worried, you can add a random nonce to the commit messages in the
> submodule to make their hashes unguessable).

I tend to agree that submodules are the way to go here over a monorepo.
Not only do you have much better access control opportunities, in
general, smaller repositories are going to perform better and be easier
to work with than monorepos. While VFS for Git is a great technology,
using a set of smaller repositories is going to outperform it every
time, both on developer machines, and on your source code hosting
platform.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 868 bytes --]

^ permalink raw reply

* Re: Any way to make git-log to enumerate commits?
From: Junio C Hamano @ 2018-12-06  0:31 UTC (permalink / raw)
  To: Konstantin Khomoutov; +Cc: Konstantin Kharlamov, git
In-Reply-To: <20181205145419.vbbaghzzrnceez45@tigra>

Konstantin Khomoutov <kostix@bswap.ru> writes:

> On Wed, Dec 05, 2018 at 05:22:14PM +0300, Konstantin Kharlamov wrote:
>
>> It would be great if git-log has a formatting option to insert an
>> index of the current commit since HEAD.
>> 
>> It would allow after quitting the git-log to immediately fire up "git
>> rebase -i HEAD~index" instead of "git rebase -i
>> go-copy-paste-this-long-number-id".
>
> This may have little sense in a general case as the history maintained
> by Git is a graph, not a single line. Hence your prospective approach
> would only work for cases like `git log` called with the
> "--first-parent" command-line option.

I do not see why the "name each rev relative to HEAD" formatting
option cannot produce HEAD^2~2 etc.

It would be similar to "git log | git name-rev --stdin" but I do not
offhand recall if we had a way to tell name-rev to use only HEAD as
the anchoring point.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox