git.vger.kernel.org archive mirror
* [PATCH] docs: add filter-branch note about The BFG
@ 2013-12-17 10:53 Roberto Tyley
  2013-12-17 18:13 ` Junio C Hamano
  2013-12-17 18:40 ` [PATCH] docs: add filter-branch note about " Jonathan Nieder
  0 siblings, 2 replies; 7+ messages in thread
From: Roberto Tyley @ 2013-12-17 10:53 UTC
  To: git; +Cc: peff, tr, Roberto Tyley

The BFG is a tool specifically designed for the task of removing
unwanted data from Git repository history - a common use-case for which
git-filter-branch has been the traditional workhorse.

It's beneficial to let users know that filter-branch has an alternative
here:

* speed : The BFG is 10-50x faster
  http://rtyley.github.io/bfg-repo-cleaner/#speed
* complexity of configuration : filter-branch is a very flexible tool,
  but demands very careful usage in order to get the desired results
  http://rtyley.github.io/bfg-repo-cleaner/#examples

Obviously, filter-branch has its advantages too - it permits very
complex rewrites, and doesn't require a JVM - but for the common
use-case of deleting unwanted data, it's helpful to users to be aware
that an alternative exists.

The BFG was released under the GPL in February 2013, and has since seen
widespread production use (The Guardian, RedHat, Google, UK Government
Digital Service), been tested against large repos (~300K commits, ~5GB
packfiles) and received significant positive feedback from users:

http://rtyley.github.io/bfg-repo-cleaner/#feedback

Signed-off-by: Roberto Tyley <roberto.tyley@gmail.com>
---
 Documentation/git-filter-branch.txt | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-filter-branch.txt b/Documentation/git-filter-branch.txt
index e4c8e82..918e965 100644
--- a/Documentation/git-filter-branch.txt
+++ b/Documentation/git-filter-branch.txt
@@ -18,6 +18,12 @@ SYNOPSIS
 
 DESCRIPTION
 -----------
+
+NOTE: For simply removing unwanted data from repository history, you may
+want to use link:http://rtyley.github.io/bfg-repo-cleaner/[The BFG Repo-Cleaner]
+instead - it's generally faster and simpler for eliminating large files
+or private data.
+
 Lets you rewrite Git revision history by rewriting the branches mentioned
 in the <rev-list options>, applying custom filters on each revision.
 Those filters can modify each tree (e.g. removing a file or running
@@ -393,7 +399,7 @@ git filter-branch --index-filter \
 Checklist for Shrinking a Repository
 ------------------------------------
 
-git-filter-branch is often used to get rid of a subset of files,
+git-filter-branch can be used to get rid of a subset of files,
 usually with some combination of `--index-filter` and
 `--subdirectory-filter`.  People expect the resulting repository to
 be smaller than the original, but you need a few more steps to
@@ -429,6 +435,12 @@ warned.
   (or if your git-gc is not new enough to support arguments to
   `--prune`, use `git repack -ad; git prune` instead).
 
+SEE ALSO
+--------
+link:http://rtyley.github.io/bfg-repo-cleaner/[The BFG Repo-Cleaner]
+- a tool specifically designed for removing unwanted data from Git
+repository history.
+
 GIT
 ---
 Part of the linkgit:git[1] suite
-- 
1.8.3.4 (Apple Git-47)


* Re: [PATCH] docs: add filter-branch note about The BFG
  2013-12-17 10:53 [PATCH] docs: add filter-branch note about The BFG Roberto Tyley
@ 2013-12-17 18:13 ` Junio C Hamano
  2013-12-18  1:04   ` Roberto Tyley
  2013-12-17 18:40 ` [PATCH] docs: add filter-branch note about " Jonathan Nieder
  1 sibling, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2013-12-17 18:13 UTC
  To: Roberto Tyley; +Cc: git, peff, tr

Roberto Tyley <roberto.tyley@gmail.com> writes:

> The BFG is a tool specifically designed for the task of removing
> unwanted data from Git repository history - a common use-case for which
> git-filter-branch has been the traditional workhorse.
>
> It's beneficial to let users know that filter-branch has an alternative
> here:
>
> * speed : The BFG is 10-50x faster
>   http://rtyley.github.io/bfg-repo-cleaner/#speed
> * complexity of configuration : filter-branch is a very flexible tool,
>   but demands very careful usage in order to get the desired results
>   http://rtyley.github.io/bfg-repo-cleaner/#examples
>
> Obviously, filter-branch has its advantages too - it permits very
> complex rewrites, and doesn't require a JVM - but for the common
> use-case of deleting unwanted data, it's helpful to users to be aware
> that an alternative exists.
>
> The BFG was released under the GPL in February 2013, and has since seen
> widespread production use (The Guardian, RedHat, Google, UK Government
> Digital Service), been tested against large repos (~300K commits, ~5GB
> packfiles) and received significant positive feedback from users:
>
> http://rtyley.github.io/bfg-repo-cleaner/#feedback
>
> Signed-off-by: Roberto Tyley <roberto.tyley@gmail.com>
> ---
>  Documentation/git-filter-branch.txt | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/git-filter-branch.txt b/Documentation/git-filter-branch.txt
> index e4c8e82..918e965 100644
> --- a/Documentation/git-filter-branch.txt
> +++ b/Documentation/git-filter-branch.txt
> @@ -18,6 +18,12 @@ SYNOPSIS
>  
>  DESCRIPTION
>  -----------
> +
> +NOTE: For simply removing unwanted data from repository history, you may
> +want to use link:http://rtyley.github.io/bfg-repo-cleaner/[The BFG Repo-Cleaner]
> +instead - it's generally faster and simpler for eliminating large files
> +or private data.
> +

My understanding is that the primary speedup of BFG comes from the
design decision it made to filter each blob only once, unlike
filter-branch that allows you to (and forces you to) decide how the
same blob is filtered depending on the places it appears in space
(i.e. the path in the project's directory hierarchy) and time
(i.e. the commit it appears in).  For "removing unwanted data", I
think nobody needs the flexibility to filter differently depending
on the context, and it is a good idea to refer those with such need
to BFG.
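
The difference can be sketched in Python. This is only an illustration
under stated assumptions, not code from BFG or filter-branch: history
is modelled as a list of commits mapping paths to blob contents, and
the names HISTORY, blob_id, clean_blob, rewrite_per_context and
rewrite_once_per_blob are all hypothetical.

```python
import hashlib

# Toy history (an assumption for illustration): each commit maps a path
# to blob content. The same blob shows up at several paths and commits.
HISTORY = [
    {"config.txt": b"password=hunter2", "README": b"hello"},
    {"config.txt": b"password=hunter2",
     "old/config.txt": b"password=hunter2",
     "README": b"hello"},
    {"config.txt": b"password=hunter2", "README": b"hello world"},
]


def blob_id(content):
    # Content-derived key. Real Git also hashes a small header, so
    # these are not actual Git object ids.
    return hashlib.sha1(content).hexdigest()


def clean_blob(content):
    # Hypothetical filter: redact one known secret wherever it occurs.
    return content.replace(b"hunter2", b"***REMOVED***")


def rewrite_per_context(history):
    """Run the filter for every (commit, path) pair."""
    calls = 0
    rewritten = []
    for commit in history:
        new_commit = {}
        for path, content in commit.items():
            new_commit[path] = clean_blob(content)
            calls += 1
        rewritten.append(new_commit)
    return rewritten, calls


def rewrite_once_per_blob(history):
    """Run the filter once per distinct blob and reuse the result."""
    cache = {}
    calls = 0
    rewritten = []
    for commit in history:
        new_commit = {}
        for path, content in commit.items():
            key = blob_id(content)
            if key not in cache:
                cache[key] = clean_blob(content)
                calls += 1
            new_commit[path] = cache[key]
        rewritten.append(new_commit)
    return rewritten, calls
```

On this toy history both functions produce the same rewritten commits,
but the per-context version calls the filter 7 times and the
once-per-blob version 3 times.  The saving is only valid because
clean_blob never looks at the path or the commit.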

Having said that, "You may want to use ..." without giving the
reason why we recommend the other tool leaves the reader wondering
what the pros and cons are, and why git-filter-branch exists if BFG
is the first thing its document recommends even before it describes
what git-filter-branch is and does.  "You may want to check ..."
might be slightly better, but probably not by much.

Rewriting the "it's generally faster ..." part to give a bit more info,
so that readers can weigh the pros and cons themselves, may be needed.

>  Lets you rewrite Git revision history by rewriting the branches mentioned
>  in the <rev-list options>, applying custom filters on each revision.
>  Those filters can modify each tree (e.g. removing a file or running
> @@ -393,7 +399,7 @@ git filter-branch --index-filter \
>  Checklist for Shrinking a Repository
>  ------------------------------------
>  
> -git-filter-branch is often used to get rid of a subset of files,
> +git-filter-branch can be used to get rid of a subset of files,
>  usually with some combination of `--index-filter` and
>  `--subdirectory-filter`.  People expect the resulting repository to
>  be smaller than the original, but you need a few more steps to
> @@ -429,6 +435,12 @@ warned.
>    (or if your git-gc is not new enough to support arguments to
>    `--prune`, use `git repack -ad; git prune` instead).
>  
> +SEE ALSO
> +--------
> +link:http://rtyley.github.io/bfg-repo-cleaner/[The BFG Repo-Cleaner]
> +- a tool specifically designed for removing unwanted data from Git
> +repository history.
> +
>  GIT
>  ---
>  Part of the linkgit:git[1] suite


* Re: [PATCH] docs: add filter-branch note about The BFG
  2013-12-17 10:53 [PATCH] docs: add filter-branch note about The BFG Roberto Tyley
  2013-12-17 18:13 ` Junio C Hamano
@ 2013-12-17 18:40 ` Jonathan Nieder
  1 sibling, 0 replies; 7+ messages in thread
From: Jonathan Nieder @ 2013-12-17 18:40 UTC
  To: Roberto Tyley; +Cc: git, peff, tr

Hi,

Roberto Tyley wrote:

> The BFG is a tool specifically designed for the task of removing
> unwanted data from Git repository history - a common use-case for which
> git-filter-branch has been the traditional workhorse.
>
> It's beneficial to let users know that filter-branch has an alternative
> here:

That sounds like a good suggestion for the SEE ALSO section, and an
explanation of when each tool should be used would be a good thing to
put in NOTES, but...

[...]
> --- a/Documentation/git-filter-branch.txt
> +++ b/Documentation/git-filter-branch.txt
> @@ -18,6 +18,12 @@ SYNOPSIS
>  
>  DESCRIPTION
>  -----------
> +
> +NOTE: For simply removing unwanted data from repository history, you may
> +want to use link:http://rtyley.github.io/bfg-repo-cleaner/[The BFG Repo-Cleaner]
> +instead - it's generally faster and simpler for eliminating large files
> +or private data.

... this shouting NOTE at the top of the description is way over the
top.

So as is, this patch looks like a net negative.

Thanks and hope that helps,
Jonathan


* Re: [PATCH] docs: add filter-branch note about The BFG
  2013-12-17 18:13 ` Junio C Hamano
@ 2013-12-18  1:04   ` Roberto Tyley
  2013-12-18  5:57     ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Roberto Tyley @ 2013-12-18  1:04 UTC
  To: Junio C Hamano; +Cc: git, jrnieder, Jeff King, tr

On 17 December 2013 18:13, Junio C Hamano <gitster@pobox.com> wrote:
>
> Having said that, "You may want to use ..." without giving the
> reason why we recommend the other tool leaves the reader wondering
> what the pros and cons are, and why git-filter-branch exists if BFG
> is the first thing its document recommends even before it describes
> what git-filter-branch is and does.  "You may want to check ..."
> might be slightly better, but probably not by much.
>
> Rewriting the "it's generally faster ..." part to give a bit more info,
> so that readers can weigh the pros and cons themselves, may be needed.

Thanks for that feedback, it makes sense. Here's an alternative
version which gives more information on the pros and cons of each
tool, and why you might want to use either - as Jonathan suggested,
this would be for the NOTES section at the bottom of the file, where
it's less intrusive:

Notes
-----

git-filter-branch allows you to make complex shell-scripted rewrites
of your Git history, but you may not need this flexibility if you're
simply _removing unwanted data_ like large files or passwords. For
those operations you may want to consider
link:http://rtyley.github.io/bfg-repo-cleaner/[The BFG Repo-Cleaner],
a JVM-based alternative to git-filter-branch, typically at least
10-50x faster for those use-cases, and with quite different
properties:

* The BFG takes advantage of multi-core machines, cleaning commit
file-trees in parallel, which git-filter-branch currently does not do.
* Any particular version of a file is cleaned exactly _once_. The BFG,
unlike git-filter-branch, does not give you the opportunity to handle
a file differently based on where or when it was committed within your
history.
* The link:http://rtyley.github.io/bfg-repo-cleaner/#examples[command-set]
is much more restrictive than that of git-filter-branch, and is dedicated
just to the task of removing unwanted data - e.g.
`--strip-blobs-bigger-than 1M`.


I can re-submit this as a patch if it's acceptable?


* Re: [PATCH] docs: add filter-branch note about The BFG
  2013-12-18  1:04   ` Roberto Tyley
@ 2013-12-18  5:57     ` Junio C Hamano
  2013-12-18 14:25       ` [PATCH v2] Tweaked notes on gfb<->bfg differences Roberto Tyley
  0 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2013-12-18  5:57 UTC
  To: Roberto Tyley; +Cc: git, jrnieder, Jeff King, tr

Roberto Tyley <roberto.tyley@gmail.com> writes:

> * The BFG takes advantage of multi-core machines, cleaning commit
> file-trees in parallel, which git-filter-branch currently does not do.
> * Any particular version of a file is cleaned exactly _once_. The BFG,
> unlike git-filter-branch, does not give you the opportunity to handle
> a file differently based on where or when it was committed within your
> history.
> * The link:http://rtyley.github.io/bfg-repo-cleaner/#examples[command-set]
> is much more restrictive than that of git-filter-branch, and is dedicated
> just to the task of removing unwanted data - e.g.
> `--strip-blobs-bigger-than 1M`.

I do not know offhand if the above formats well with AsciiDoc.  You
may have to do it like this:

* The first line of the bulleted paragraph is
  followed by the second and subsequent lines indented
  to align with the first one.

The first bullet point may be somewhat misleading, though.  Nothing
stops the script you use in filter-branch from processing blobs
belonging to a single tree in parallel---the user just needs to do a
bit more work to do so.

I think the second point is the most characteristic of BFG (and that
is what allows easy parallelization of the filtering).  Also, it
cannot be stressed enough that the "removing unwanted contents" use
case can take advantage of the "bad contents in a blob is bad, no
matter where in the tree and when in the history the blob appears"
property.  That is what makes BFG particularly shine for the use case.
Its design aligns very well with the objective the use case wants to
achieve.
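
Why context-free cleaning parallelizes easily can be sketched in
Python.  This is an illustration under stated assumptions, not BFG's
implementation: history is modelled as a list of commits mapping paths
to blob contents, clean_blob and clean_unique_blobs are hypothetical
names, and a thread pool stands in for whatever scheduling a real tool
would use.

```python
from concurrent.futures import ThreadPoolExecutor


def clean_blob(content):
    # Hypothetical filter: depends only on the blob content, never on
    # the path or the commit the blob appears in.
    return content.replace(b"hunter2", b"***REMOVED***")


def clean_unique_blobs(history, workers=4):
    # Collect each distinct blob once, regardless of where it appears.
    unique = sorted({content
                     for commit in history
                     for content in commit.values()})
    # The blobs are independent of each other, so they can be handed
    # to a pool in any order; map() returns results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        cleaned = dict(zip(unique, pool.map(clean_blob, unique)))
    # Rebuild every commit by looking up the already-cleaned blobs.
    return [{path: cleaned[content] for path, content in commit.items()}
            for commit in history]
```

A filter that needed to know the path or commit could not be split up
this way without carrying that context along with each blob.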


* [PATCH v2] Tweaked notes on gfb<->bfg differences
  2013-12-18  5:57     ` Junio C Hamano
@ 2013-12-18 14:25       ` Roberto Tyley
  2013-12-18 14:25         ` [PATCH v2] docs: add filter-branch notes on The BFG Roberto Tyley
  0 siblings, 1 reply; 7+ messages in thread
From: Roberto Tyley @ 2013-12-18 14:25 UTC
  To: git; +Cc: gitster, jrnieder, Roberto Tyley

On 18 December 2013 05:57, Junio C Hamano <gitster@pobox.com> wrote:
> The first bullet point may be somewhat misleading, though.  Nothing
> stops the script you use in filter-branch from processing blobs
> belonging to a single tree in parallel---the user just needs to do a
> bit more work to do so.

Thanks, I've moved this entry down and clarified the capabilities -
I think there's quite a big difference to the user (of a multi-core
machine) as to whether parallelism happens by default, or whether they
have to work out how to introduce it into the operation they're trying
to perform.

> I think the second point is the most characteristic of BFG (and that
> is what allows easy parallelization of the filtering).  Also, it
> cannot be stressed enough that the "removing unwanted contents" use
> case can take advantage of the "bad contents in a blob is bad, no
> matter where in the tree and when in the history the blob appears"
> property.  That is what makes BFG particularly shine for the use case.
> Its design aligns very well with the objective the use case wants to
> achieve.

I've moved this up to the first bullet-point and added text emphasising
the role the constraint plays in making the BFG work quickly
while satisfying the common use-case.


Roberto Tyley (1):
  docs: add filter-branch notes on The BFG

 Documentation/git-filter-branch.txt | 33 ++++++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

-- 
1.8.3.4 (Apple Git-47)


* [PATCH v2] docs: add filter-branch notes on The BFG
  2013-12-18 14:25       ` [PATCH v2] Tweaked notes on gfb<->bfg differences Roberto Tyley
@ 2013-12-18 14:25         ` Roberto Tyley
  0 siblings, 0 replies; 7+ messages in thread
From: Roberto Tyley @ 2013-12-18 14:25 UTC
  To: git; +Cc: gitster, jrnieder, Roberto Tyley

The BFG is a tool specifically designed for the task of removing
unwanted data from Git repository history - a common use-case for which
git-filter-branch has been the traditional workhorse.

It's beneficial to let users know that filter-branch has an alternative
here:

* speed : The BFG is 10-50x faster
  http://rtyley.github.io/bfg-repo-cleaner/#speed
* complexity of configuration : filter-branch is a very flexible tool,
  but demands very careful usage in order to get the desired results
  http://rtyley.github.io/bfg-repo-cleaner/#examples

Obviously, filter-branch has its advantages too - it permits very
complex rewrites, and doesn't require a JVM - but for the common
use-case of deleting unwanted data, it's helpful to users to be aware
that an alternative exists.

The BFG was released under the GPL in February 2013, and has since seen
widespread production use (The Guardian, RedHat, Google, UK Government
Digital Service), been tested against large repos (~300K commits, ~5GB
packfiles) and received significant positive feedback from users:

http://rtyley.github.io/bfg-repo-cleaner/#feedback

Signed-off-by: Roberto Tyley <roberto.tyley@gmail.com>
---
 Documentation/git-filter-branch.txt | 33 ++++++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-filter-branch.txt b/Documentation/git-filter-branch.txt
index e4c8e82..2eba627 100644
--- a/Documentation/git-filter-branch.txt
+++ b/Documentation/git-filter-branch.txt
@@ -393,7 +393,7 @@ git filter-branch --index-filter \
 Checklist for Shrinking a Repository
 ------------------------------------
 
-git-filter-branch is often used to get rid of a subset of files,
+git-filter-branch can be used to get rid of a subset of files,
 usually with some combination of `--index-filter` and
 `--subdirectory-filter`.  People expect the resulting repository to
 be smaller than the original, but you need a few more steps to
@@ -429,6 +429,37 @@ warned.
   (or if your git-gc is not new enough to support arguments to
   `--prune`, use `git repack -ad; git prune` instead).
 
+Notes
+-----
+
+git-filter-branch allows you to make complex shell-scripted rewrites
+of your Git history, but you probably don't need this flexibility if
+you're simply _removing unwanted data_ like large files or passwords.
+For those operations you may want to consider
+link:http://rtyley.github.io/bfg-repo-cleaner/[The BFG Repo-Cleaner],
+a JVM-based alternative to git-filter-branch, typically at least
+10-50x faster for those use-cases, and with quite different
+characteristics:
+
+* Any particular version of a file is cleaned exactly _once_. The BFG,
+  unlike git-filter-branch, does not give you the opportunity to
+  handle a file differently based on where or when it was committed
+  within your history. This constraint gives the core performance
+  benefit of The BFG, and is well-suited to the task of cleansing bad
+  data - you don't care _where_ the bad data is, you just want it
+  _gone_.
+
+* By default The BFG takes full advantage of multi-core machines,
+  cleansing commit file-trees in parallel. git-filter-branch cleans
+  commits sequentially (i.e. in a single-threaded manner), though it
+  _is_ possible to write filters that include their own parallelism,
+  in the scripts executed against each commit.
+
+* The link:http://rtyley.github.io/bfg-repo-cleaner/#examples[command options]
+  are much more restrictive than those of git-filter-branch, and are
+  dedicated just to the task of removing unwanted data, e.g.
+  `--strip-blobs-bigger-than 1M`.
+
 GIT
 ---
 Part of the linkgit:git[1] suite
-- 
1.8.3.4 (Apple Git-47)

