commit-message attack for extracting sensitive data from rewritten Git history

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* commit-message attack for extracting sensitive data from rewritten Git history
@ 2013-04-07 23:17 Roberto Tyley
  2013-04-08 15:40 ` Junio C Hamano
  0 siblings, 1 reply; 6+ messages in thread
From: Roberto Tyley @ 2013-04-07 23:17 UTC (permalink / raw)
  To: git

This is a demonstration of a mildly-interesting security concern
relating to Git & git-filter-branch - not a vulnerability in Git
itself, just in the way it can be used. I thought it was interesting
to demonstrate that there is sometimes an avenue of attack for
recovering sensitive data that's been removed from Git history using
git-filter-branch. I think it's a low-severity issue, you may wish to
ignore this, and indeed I've been very politely told already that it's
clearly nonsense :)

Here's an unmodified repo, in which the user unwisely committed a
database password:

https://github.com/bfg-repo-cleaner-demos/gma-demo-repo-original/commit/8c9cfe3c

The unwise commit is reverted with a second commit using 'git revert',
which obviously leaves the password in Git history, and - some time
later - it's decided to properly clean the repo history with
git-filter-branch & git gc, purging the password so the repo can be
more widely shared (open-sourced, or just externally hosted).

git-filter-branch works exactly as intended, purging the password, but
the one thing it does not- typically - do is update the commit
message. So in the cleaned repo, the commit message for the revert
commit still looks like this:

https://github.com/bfg-repo-cleaner-demos/gma-demo-repo-git-filter-branch-cleaned/commit/bf0637a5

It contains a commit id (8c9cfe3) which is no longer in the repo, but
can very easily be associated with an existing commit simply by
examining the subject line of the reverted commit ("Carelessly
checking password into source control"). It's also obvious, from
examining the repo, where the excised data was removed (ie at the
"db.password=" line). At this point it's possible to do a brute-force
attack where you generate possible passwords, insert them into the
available commit's tree, and compare them against the leaked commit
id. When the the commit id matches, the sensitive data has been
recovered.

A proof-of-concept implementation of this attack was indeed able to
recover the purged password:

--
$ java -jar gma-0.1.jar 8c9cfe3c attack-pinpoint
gma-demo-repo-git-filter-branch-cleaned

Brute-force search using these characters : 0123456789abcdefghijklmnopqrstuvwxyz
Available commit, presumed cleaned : 8ebbf661
File path : src/main/resources/config.properties
Template blob : dca1a2fb
Exhausted strings of length 1 or less
...
Exhausted strings of length 4 or less
Match with '0g6rw'
--

So all of this amounts to a fairly low severity issue - people should
always change credentials when they mistakenly commit them to a repo -
but I guess the point is that from a paranoia point of view, you want
to remove all information - including old commit hashes buried in
commit messages - that relate to sensitive data when you clean a repo
for sharing. The git-filter-branch command has a --msg-filter option
which could be used for this purpose, with the application of some
judicious bash-scripting, grep&sed-ing. However, I must confess that I
believe users would be better advised to use The BFG:

http://rtyley.github.io/bfg-repo-cleaner/

The BFG already addresses this issue by replacing all old Git
object-ids found in commit/tag messages with the updated id. For
instance, here's that exact same commit message when cleaned with the
BFG:

https://github.com/bfg-repo-cleaner-demos/gma-demo-repo-bfg-cleaned/commit/35840201

In the case that the users specifies a filtering operation is not
removing 'private' data, the BFG replaces old ids with text of the
form '"newid [formerly oldid]", but if the operation is in fact to
strip private data, the replacement value is simply the newid - and
without the old commit id, the attack described above is not possible.

I believe it's worth educating users to give them a more realistic
understanding of their exposure, and would like to update the
documentation of git-filter-branch to give them a better idea of their
options for removing private data - that would include noting the BFG
as alternative.

- Roberto Tyley

https://github.com/rtyley/bfg-repo-cleaner/blob/v1.2.0/src/main/scala/com/madgag/git/bfg/cleaner/ObjectIdSubstitutor.scala#L33-L60

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: commit-message attack for extracting sensitive data from rewritten Git history
  2013-04-07 23:17 commit-message attack for extracting sensitive data from rewritten Git history Roberto Tyley
@ 2013-04-08 15:40 ` Junio C Hamano
  2013-04-08 21:54   ` Jeff King
  0 siblings, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2013-04-08 15:40 UTC (permalink / raw)
  To: Roberto Tyley; +Cc: git

Roberto Tyley <roberto.tyley@gmail.com> writes:

> Here's an unmodified repo, in which the user unwisely committed a
> database password:
>
> https://github.com/bfg-repo-cleaner-demos/gma-demo-repo-original/commit/8c9cfe3c
>
> The unwise commit is reverted with a second commit using 'git revert',
> which obviously leaves the password in Git history, and - some time
> later - it's decided to properly clean the repo history with
> git-filter-branch & git gc, purging the password so the repo can be
> more widely shared (open-sourced, or just externally hosted).
>
> git-filter-branch works exactly as intended, purging the password, but
> the one thing it does not- typically - do is update the commit
> message....
> .... The git-filter-branch command has a --msg-filter option
> which could be used for this purpose, with the application of some
> judicious bash-scripting, grep&sed-ing. However, I must confess that I
> believe users would be better advised to use The BFG:
>
> http://rtyley.github.io/bfg-repo-cleaner/

With or without the security issue, leaving old object names that
will become irrelevant in the rewritten history will make the
resulting history less useful, simply because people cannot look at
the objects these messages refer to. The same argument is behind the
reason why "cherry-pick -x" was originally the default, found to be
a mistake and made optional.

filter-branch provides "map" helper function to help mapping old
object names to rewritten object names, but stops there; it leaves
it up to the message filter script to identify what string in the
message is an object name to be rewritten.

It can be taught to be more helpful to the message filter writers,
and you seem to have done so in BFG, which is very good.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: commit-message attack for extracting sensitive data from rewritten Git history
  2013-04-08 15:40 ` Junio C Hamano
@ 2013-04-08 21:54   ` Jeff King
  2013-04-09  6:03     ` Johannes Sixt
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff King @ 2013-04-08 21:54 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Roberto Tyley, git

On Mon, Apr 08, 2013 at 08:40:36AM -0700, Junio C Hamano wrote:

> With or without the security issue, leaving old object names that
> will become irrelevant in the rewritten history will make the
> resulting history less useful, simply because people cannot look at
> the objects these messages refer to. The same argument is behind the
> reason why "cherry-pick -x" was originally the default, found to be
> a mistake and made optional.
> 
> filter-branch provides "map" helper function to help mapping old
> object names to rewritten object names, but stops there; it leaves
> it up to the message filter script to identify what string in the
> message is an object name to be rewritten.
> 
> It can be taught to be more helpful to the message filter writers,
> and you seem to have done so in BFG, which is very good.

Yeah, it would make sense for filter-branch to have a "--map-commit-ids"
option or similar that does the update. At first I thought it might take
two passes, but I don't think it is necessary, as long as we traverse
the commits topologically (i.e., you cannot have mentioned X in a commit
that is an ancestor of X, so you do not have to worry about mapping it
until after it has been processed).

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: commit-message attack for extracting sensitive data from rewritten Git history
  2013-04-08 21:54   ` Jeff King
@ 2013-04-09  6:03     ` Johannes Sixt
  2013-04-09 17:01       ` Jeff King
  0 siblings, 1 reply; 6+ messages in thread
From: Johannes Sixt @ 2013-04-09  6:03 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Roberto Tyley, git

Am 4/8/2013 23:54, schrieb Jeff King:
> Yeah, it would make sense for filter-branch to have a "--map-commit-ids"
> option or similar that does the update. At first I thought it might take
> two passes, but I don't think it is necessary, as long as we traverse
> the commits topologically (i.e., you cannot have mentioned X in a commit
> that is an ancestor of X, so you do not have to worry about mapping it
> until after it has been processed).

Topological traversal is not sufficient. Consider this history:

     o--A--o--
    /     /
 --o--B--o

If A mentions B (think of cherry-pick -x), then you must ensure that the
branch containing B was traversed first.

-- Hannes

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: commit-message attack for extracting sensitive data from rewritten Git history
  2013-04-09  6:03     ` Johannes Sixt
@ 2013-04-09 17:01       ` Jeff King
  2013-04-09 18:08         ` Roberto Tyley
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff King @ 2013-04-09 17:01 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: Junio C Hamano, Roberto Tyley, git

On Tue, Apr 09, 2013 at 08:03:24AM +0200, Johannes Sixt wrote:

> Am 4/8/2013 23:54, schrieb Jeff King:
> > Yeah, it would make sense for filter-branch to have a "--map-commit-ids"
> > option or similar that does the update. At first I thought it might take
> > two passes, but I don't think it is necessary, as long as we traverse
> > the commits topologically (i.e., you cannot have mentioned X in a commit
> > that is an ancestor of X, so you do not have to worry about mapping it
> > until after it has been processed).
> 
> Topological traversal is not sufficient. Consider this history:
> 
>      o--A--o--
>     /     /
>  --o--B--o
> 
> If A mentions B (think of cherry-pick -x), then you must ensure that the
> branch containing B was traversed first.

Yeah, you're right. Multiple passes are necessary to get it
completely right. And because each pass may change more commit id's, you
have to recurse to pick up those changes, and keep going until you have
a pass with no changes.

But I haven't thought that hard about it. There might be a clever
optimization where you can prune out parts of the history (e.g., if you
know that all changes to consider are in descendants of a commit, you do
not have to care about rewriting the commit or its ancestors).

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: commit-message attack for extracting sensitive data from rewritten Git history
  2013-04-09 17:01       ` Jeff King
@ 2013-04-09 18:08         ` Roberto Tyley
  0 siblings, 0 replies; 6+ messages in thread
From: Roberto Tyley @ 2013-04-09 18:08 UTC (permalink / raw)
  To: Jeff King; +Cc: Johannes Sixt, Junio C Hamano, git

On 9 April 2013 18:01, Jeff King <peff@peff.net> wrote:
> On Tue, Apr 09, 2013 at 08:03:24AM +0200, Johannes Sixt wrote:
>> If A mentions B (think of cherry-pick -x), then you must ensure that the
>> branch containing B was traversed first.
>
> Yeah, you're right. Multiple passes are necessary to get it
> completely right. And because each pass may change more commit id's, you
> have to recurse to pick up those changes, and keep going until you have
> a pass with no changes.

Just to give some context on how the BFG handles this (without doing
multiple passes):

The BFG makes a design choice (based on it's intended use-case of
annihilating unwanted data) that a specific tree or blob will always
be cleaned in exactly the same way - because when you're trying to get
rid of large blobs or private data, you most likely /don't care/ where
it is, what commit it belongs to, how old it is. The id for a cleaned
tree or blob is always the same no matter where it came from, and so
the BFG maintains a in-memory mapping of 'dirty' to 'clean' object ids
while cleaning a repo - whenever an object (commit, tag, tree, blob)
is cleaned, these values are stored in the map:

  dirty-id -> clean-id
  clean-id -> clean-id

(in terms of memory overhead, this amounts to only ~ 128MB for even
quite a large repo like the linux kernel, so I don't spend much time
worrying about it)

The map memoises the cleaning functions on all objects, so an object
(particularly a tree) never gets cleaned more than once, which is one
of the things that makes the BFG fast.

Having these memoised functions makes cleaning commit messages fairly
easy - the message is grepped for hex strings more than a few
characters in length, and if a matched string resolves uniquely to an
object id in the repo, the clean() method is called on it to get the
cleaned id - which will either return immediately with a previously
calculated result, or if the id came from a different branch, trigger
a cascade of more cleaning, eventually returning the required cleaned
id.

In the case of git-filter-branch, the user has a lot more freedom to
change the tree-structure of commits on a commit-by-commit basis, so
memoising tree-cleaning is out of the question, but I guess it might
be possible to do memoisation of just the commit ids to short-cut the
multiple-pass problem.

- Roberto Tyley

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2013-04-09 18:08 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-04-07 23:17 commit-message attack for extracting sensitive data from rewritten Git history Roberto Tyley
2013-04-08 15:40 ` Junio C Hamano
2013-04-08 21:54   ` Jeff King
2013-04-09  6:03     ` Johannes Sixt
2013-04-09 17:01       ` Jeff King
2013-04-09 18:08         ` Roberto Tyley

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).