* serious performance issues with images, audio files, and other "non-code" data
From: John @ 2010-05-12 18:53 UTC
To: git
Hi,
We're seeing serious performance issues with repos that store media files, even relatively small
files. For example, a web site with less than 100 MB of images can take minutes to commit, push, or
pull when images have changed.
Our first guess was that git is repeatedly attempting to compress/decompress data that had already
been compressed. We tried these configuration settings (shooting in the dark) to no avail:
core.compression 0 ## Docs say this disables compression. Didn't seem to work.
pack.depth 1 ## Unclear what this does.
pack.window 0 ## No idea what this does.
gc.auto 0 ## We hope this disables automatic packing.
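(For the record, we set these with `git config`, roughly like so:)
git config core.compression 0
git config pack.depth 1
git config pack.window 0
git config gc.auto 0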
Our guess that re-compression is to blame may not even be valid since we can manually re-compress
these files in seconds, not minutes.
Is there a trick to getting git to simply "copy files as is"? In other words, don't attempt to
compress them, don't attempt to "diff" them, just store/copy/transfer the files as-is?
Thanks,
-John
* Re: serious performance issues with images, audio files, and other "non-code" data
From: Jakub Narebski @ 2010-05-12 19:15 UTC
To: John; +Cc: git
John <john@puckerupgames.com> writes:
> We're seeing serious performance issues with repos that store media
> files, even relatively small files. For example, a web site with less
> than 100 MB of images can take minutes to commit, push, or pull when
> images have changed.
>
> Our first guess was that git is repeatedly attempting to
> compress/decompress data that had already been compressed. We tried
> these configuration settings (shooting in the dark) to no avail:
>
> core.compression 0 ## Docs say this disables compression. Didn't seem to work.
> pack.depth 1 ## Unclear what this does.
> pack.window 0 ## No idea what this does.
> gc.auto 0 ## We hope this disables automatic packing.
>
> Our guess that re-compression is to blame may not even be valid since
> we can manually re-compress these files in seconds, not minutes.
>
> Is there a trick to getting git to simply "copy files as is"? In
> other words, don't attempt to compress them, don't attempt to "diff"
> them, just store/copy/transfer the files as-is?
Search the gitattributes manpage for the `delta` attribute; unset it for
files against which you don't want git to attempt (binary) delta
compression.
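A minimal .gitattributes along those lines might look like this (the
extensions are illustrative; use whatever media types you have):
*.jpg -delta
*.png -delta
*.mp3 -delta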
P.S. There is also the git-bigfiles project, which might be of interest
to you.
--
Jakub Narebski
Poland
ShadeHawk on #git
* Re: serious performance issues with images, audio files, and other "non-code" data
From: Jeff King @ 2010-05-14 5:10 UTC
To: John; +Cc: git
On Wed, May 12, 2010 at 02:53:53PM -0400, John wrote:
> We're seeing serious performance issues with repos that store media
> files, even relatively small files. For example, a web site with less
> than 100 MB of images can take minutes to commit, push, or pull when
> images have changed.
That sounds way too slow in my experience. I have a repository with 3
gigabytes of photos and videos. Committing 20M of new images takes a
second or two. The biggest slowdown is doing the sha1 over the new data
(which actually happens during "git add").
What version of git are you using? Have you tried "commit -q" to
suppress the diff at the end of commit?
Can you show us exactly what commands you're using, along with timings
so we can see where the slowness is?
For pushing and pulling, you're probably seeing delta compression, which
can be slow for large files (though again, minutes seems kind of slow to
me). It _can_ be worth doing for images, if you do things like change
only exif tags but not the image data itself. But if the images
themselves are changing, you probably want to try setting the "-delta"
attribute. Like:
echo '*.jpg -delta' >.gitattributes
Also, consider repacking your repository, which will generate a packfile
that will be re-used during push and pull.
> Our first guess was that git is repeatedly attempting to
> compress/decompress data that had already been compressed. We tried
Git does spend a fair bit of time in zlib for some workloads, but it
should not create problems on the order of minutes.
> core.compression 0 ## Docs say this disables compression. Didn't seem to work.
That should disable zlib compression of loose objects and objects within
packfiles. It can save a little time for objects which won't compress,
but you will lose the size benefits for any text files.
But it won't turn off delta compression, which is what the
"compressing..." phase during push and pull is doing. And which is much
more likely the cause of slowness.
> pack.depth 1 ## Unclear what this does.
It says you can't make a chain of deltas deeper than 1. It's probably
not what you want.
> pack.window 0 ## No idea what this does.
It sets the number of other objects git will consider when doing delta
compression. Setting it low should improve your push/pull times. But you
will lose the substantial benefit of delta-compression of your non-image
files (and git's meta objects). So the "-delta" option above for
specific files is a much better solution.
> gc.auto 0 ## We hope this disables automatic packing.
It disables automatic repacking when you have a lot of objects. You
_have_ to pack when pushing and pulling, since packfiles are the
on-the-wire format. What will help is:
1. Having repositories already packed, since git can re-use the packed
data.
2. Using -delta so that things which delta poorly are just copied into
the packfile as-is.
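For point 1, a nightly cron entry on the hosting side is enough to keep
the upstream repo packed; a sketch, with a hypothetical repository path:
# repack the bare repo every night at 3am
0 3 * * * cd /srv/git/project.git && git gc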
> Is there a trick to getting git to simply "copy files as is"? In
> other words, don't attempt to compress them, don't attempt to "diff"
> them, just store/copy/transfer the files as-is?
Hopefully you can pick out the answer to that question from the above
statements. :)
-Peff
* Re: serious performance issues with images, audio files, and other "non-code" data
From: John @ 2010-05-14 12:54 UTC
To: git; +Cc: Jeff King
Thanks so much. It's version 1.5.6.5. I compiled it 3 months ago. For example,
in one repo, there are 1200 source files, each on average 109K in size, for a
total size of 127M. The largest source file is 82M. Most of the non-text source
files are already compressed.
I packed the bare repo, then ran `gc --aggressive`. Then I did a `git pull`,
which took 35 minutes. The git processes in `top` seemed to peak at around 300M
of memory. Since then, I added 'binary -delta' to the .gitattributes for various
files, based on suggestions from this mailing list, but by that time did not
wish to repeat the 35 minute pull to test it out. Let's hope that made a difference.
You can simulate it all by generating a batch of 1-100 MB files from
/dev/urandom (since they won't compress), commit them, then do it again many
times to simulate edits. Every few iterations, push it somewhere.
I noticed some other folks on this list apparently having the same issues, but
they don't know it yet ("git hangs while compressing objects", etc.). That's
probably the first symptom they'll see. It *appears* to hang, but it's really
spinning away on the `pack` gizmo.
I'm open to alternative suggestions -- some kind of dual-mode, where text files
are "fully" version'd, diff'd, delta'd, index'd, stash'd, pack'd, compress'd,
object'd and whatever else git needs to do, while non-text files are archived in
a "lesser" manner. On the other hand, I get the sense that the LAST thing git
needs is another "mode"!
On 05/14/2010 01:10 AM, Jeff King wrote:
> On Wed, May 12, 2010 at 02:53:53PM -0400, John wrote:
>
>> We're seeing serious performance issues with repos that store media
>> files, even relatively small files. For example, a web site with less
>> than 100 MB of images can take minutes to commit, push, or pull when
>> images have changed.
>
> That sounds way too slow in my experience. I have a repository with 3
> gigabytes of photos and videos. Committing 20M of new images takes a
> second or two. The biggest slowdown is doing the sha1 over the new data
> (which actually happens during "git add").
>
> What version of git are you using? Have you tried "commit -q" to
> suppress the diff at the end of commit?
>
> Can you show us exactly what commands you're using, along with timings
> so we can see where the slowness is?
>
> For pushing and pulling, you're probably seeing delta compression, which
> can be slow for large files (though again, minutes seems kind of slow to
> me). It _can_ be worth doing for images, if you do things like change
> only exif tags but not the image data itself. But if the images
> themselves are changing, you probably want to try setting the "-delta"
> attribute. Like:
>
> echo '*.jpg -delta' >.gitattributes
>
> Also, consider repacking your repository, which will generate a packfile
> that will be re-used during push and pull.
>
>> Our first guess was that git is repeatedly attempting to
>> compress/decompress data that had already been compressed. We tried
>
> Git does spend a fair bit of time in zlib for some workloads, but it
> should not create problems on the order of minutes.
>
>> core.compression 0 ## Docs say this disables compression. Didn't seem to work.
>
> That should disable zlib compression of loose objects and objects within
> packfiles. It can save a little time for objects which won't compress,
> but you will lose the size benefits for any text files.
>
> But it won't turn off delta compression, which is what the
> "compressing..." phase during push and pull is doing. And which is much
> more likely the cause of slowness.
>
>> pack.depth 1 ## Unclear what this does.
>
> It says you can't make a chain of deltas deeper than 1. It's probably
> not what you want.
>
>> pack.window 0 ## No idea what this does.
>
> It sets the number of other objects git will consider when doing delta
> compression. Setting it low should improve your push/pull times. But you
> will lose the substantial benefit of delta-compression of your non-image
> files (and git's meta objects). So the "-delta" option above for
> specific files is a much better solution.
>
>> gc.auto 0 ## We hope this disables automatic packing.
>
> It disables automatic repacking when you have a lot of objects. You
> _have_ to pack when pushing and pulling, since packfiles are the
> on-the-wire format. What will help is:
>
> 1. Having repositories already packed, since git can re-use the packed
> data.
>
> 2. Using -delta so that things which delta poorly are just copied into
> the packfile as-is.
>
>> Is there a trick to getting git to simply "copy files as is"? In
>> other words, don't attempt to compress them, don't attempt to "diff"
>> them, just store/copy/transfer the files as-is?
>
> Hopefully you can pick out the answer to that question from the above
> statements. :)
>
> -Peff
* Re: serious performance issues with images, audio files, and other "non-code" data
From: Dirk Süsserott @ 2010-05-14 17:26 UTC
To: John; +Cc: git, Jeff King
On 14.05.2010 14:54, John wrote:
> Thanks so much. It's version 1.5.6.5. I compiled it 3 months ago. For
> example, in one repo, there are 1200 source files, each on average 109K
> in size, for a total size of 127M. The largest source file is 82M. Most
> of the non-text source files are already compressed.
>
> I packed the bare repo, then ran `gc --aggressive`. Then I did a `git
> pull`, which took 35 minutes. The git processes in `top` seemed to peak
> at around 300M of memory. Since then, I added 'binary -delta' to the
> .gitattributes for various files, based on suggestions from this mailing
> list, but by that time did not wish to repeat the 35 minute pull to test
> it out. Let's hope that made a difference.
>
> You can simulate it all by generating a batch of 1-100 MB files from
> /dev/urandom (since they won't compress), commit them, then do it again
> many times to simulate edits. Every few iterations, push it somewhere.
>
>
> I noticed some other folks on this list apparently having the same
> issues, but they don't know it yet ("git hangs while compressing
> objects", etc.). That's probably the first symptom they'll see. It
> *appears* to hang, but it's really spinning away on the `pack` gizmo.
>
> I'm open to alternative suggestions -- some kind of dual-mode, where
> text files are "fully" version'd, diff'd, delta'd, index'd, stash'd,
> pack'd, compress'd, object'd and whatever else git needs to do, while
> non-text files are archived in a "lesser" manner. On the other hand, I
> get the sense that the LAST thing git needs is another "mode"!
>
>
>
>
> On 05/14/2010 01:10 AM, Jeff King wrote:
>> On Wed, May 12, 2010 at 02:53:53PM -0400, John wrote:
>>
>>> We're seeing serious performance issues with repos that store media
>>> files, even relatively small files. For example, a web site with less
>>> than 100 MB of images can take minutes to commit, push, or pull when
>>> images have changed.
>>
>> That sounds way too slow in my experience. I have a repository with 3
>> gigabytes of photos and videos. Committing 20M of new images takes a
>> second or two. The biggest slowdown is doing the sha1 over the new data
>> (which actually happens during "git add").
>>
>> What version of git are you using? Have you tried "commit -q" to
>> suppress the diff at the end of commit?
>>
>> Can you show us exactly what commands you're using, along with timings
>> so we can see where the slowness is?
>>
>> For pushing and pulling, you're probably seeing delta compression, which
>> can be slow for large files (though again, minutes seems kind of slow to
>> me). It _can_ be worth doing for images, if you do things like change
>> only exif tags but not the image data itself. But if the images
>> themselves are changing, you probably want to try setting the "-delta"
>> attribute. Like:
>>
>> echo '*.jpg -delta' >.gitattributes
>>
>> Also, consider repacking your repository, which will generate a packfile
>> that will be re-used during push and pull.
>>
>>> Our first guess was that git is repeatedly attempting to
>>> compress/decompress data that had already been compressed. We tried
>>
>> Git does spend a fair bit of time in zlib for some workloads, but it
>> should not create problems on the order of minutes.
>>
>>> core.compression 0 ## Docs say this disables compression.
>>> Didn't seem to work.
>>
>> That should disable zlib compression of loose objects and objects within
>> packfiles. It can save a little time for objects which won't compress,
>> but you will lose the size benefits for any text files.
>>
>> But it won't turn off delta compression, which is what the
>> "compressing..." phase during push and pull is doing. And which is much
>> more likely the cause of slowness.
>>
>>> pack.depth 1 ## Unclear what this does.
>>
>> It says you can't make a chain of deltas deeper than 1. It's probably
>> not what you want.
>>
>>> pack.window 0 ## No idea what this does.
>>
>> It sets the number of other objects git will consider when doing delta
>> compression. Setting it low should improve your push/pull times. But you
>> will lose the substantial benefit of delta-compression of your non-image
>> files (and git's meta objects). So the "-delta" option above for
>> specific files is a much better solution.
>>
>>> gc.auto 0 ## We hope this disables automatic packing.
>>
>> It disables automatic repacking when you have a lot of objects. You
>> _have_ to pack when pushing and pulling, since packfiles are the
>> on-the-wire format. What will help is:
>>
>> 1. Having repositories already packed, since git can re-use the packed
>> data.
>>
>> 2. Using -delta so that things which delta poorly are just copied into
>> the packfile as-is.
>>
>>> Is there a trick to getting git to simply "copy files as is"? In
>>> other words, don't attempt to compress them, don't attempt to "diff"
>>> them, just store/copy/transfer the files as-is?
>>
>> Hopefully you can pick out the answer to that question from the above
>> statements. :)
>>
>> -Peff
>
Hi John,
Peff explained it very well. Some time ago, I had a similar problem:
http://www.mentby.com/Group/git/how-to-prevent-git-from-compressing-certain-files.html
and he helped me as well. Probably you may want to have a look at that
thread.
Dirk
> --
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
* Re: serious performance issues with images, audio files, and other "non-code" data
From: Jeff King @ 2010-05-17 23:16 UTC
To: John; +Cc: git
On Fri, May 14, 2010 at 08:54:02AM -0400, John wrote:
> Thanks so much. It's version 1.5.6.5. I compiled it 3 months ago. For
By git standards, that version is ancient. You may want to try with a
more recent version of git (at the very least, multithreaded delta
compression has been enabled by default since then).
> I packed the bare repo, then ran `gc --aggressive`.
Note that "gc --aggressive" will repack from scratch, throwing away the
previous pack.
> Then I did a `git pull`, which took 35 minutes.
That sounds like a long time. What was taking so long? Was delta
compression pegging the CPU? Was it limited during the "Writing objects"
phase, which is going to be limited by either disk I/O or network speed?
How big is your packed repo? Given the pattern you describe below, I am
beginning to wonder if it is simply the case that even though a single
checkout of your repo isn't that large, the complete history of your
project may simply be gigantic (e.g., because you are repeatedly writing
new apparently-random versions of each file, so your repository size
will grow quite quickly).
Remember that a git clone transfers the full history (and a pull will
transfer all of the intermediate history). If you have rewritten those
files many times, you may be transferring many times your working
directory size in history.
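You can check both numbers cheaply: `git count-objects -v` reports object
counts and pack sizes, and a du over the pack directory shows the on-disk
total:
$ git count-objects -v
$ du -sh .git/objects/pack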
> You can simulate it all by generating a batch of 1-100 MB files from
> /dev/urandom (since they won't compress), commit them, then do it
> again many times to simulate edits. Every few iterations, push it
> somewhere.
I tried this script to make a 100M working directory with a 400M .git
directory:
-- >8 --
#!/bin/sh
rm -rf big-repo
mkdir big-repo && cd big-repo && git init
mark() {
echo "`date` $*"
}
randomize() {
mark randomize start
for i in `seq 1 100`; do
openssl rand $((1024*1024)) >$i.rand
done
mark randomize end
}
commit() {
mark add start
git add .
mark add end
mark commit start
git commit -m "$1"
mark commit end
}
randomize; commit base
randomize; commit one
randomize; commit two
randomize; commit three
-- 8< --
Here are a few timings I noted:
- it takes about 5 seconds to generate and write the random data
- git add runs in about 13 seconds. It pegs the CPU hashing all of the
data.
- the first commit is nearly instantaneous, as the summary diff takes
no work; subsequent commits spend about 9 seconds to create the
summary diff. Changing commit to "commit -q" drops that back to
near-instantaneous.
- with no attributes set, "time git gc --aggressive" reports:
real 1m31.983s
user 2m29.621s
sys 0m3.732s
Note the real/user discrepancy. It's a dual-core machine, and recent
git will multi-thread the delta phase, which is what dominates the
time. This should correspond roughly to the delta-compression phase
of your pull time, as that was just making a pack on the fly (but
now that we are packed, pulls will be limited only by the time to
transfer the objects themselves).
- Turning off delta compression for the .rand files makes repacking
much faster:
$ echo '*.rand -delta' >.gitattributes
$ time git gc --aggressive
...
real 0m25.354s
user 0m22.057s
sys 0m1.316s
The delta compression phase is very quick, and we spend most of our
time writing out the packfile to disk.
So I stand by my earlier statements:
1. Use "git commit -q" to avoid wasting time on the commit diff
summary (we should perhaps have a commit.quiet config option for
repos like this where you would almost always want to suppress it).
2. Make sure your upstream repo is packed so pullers do not have to
generate a new packfile all the time.
3. Use -delta where appropriate to avoid useless delta compression.
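As a concrete sketch of points 1 and 3 together (extensions illustrative):
$ echo '*.jpg -delta' >>.gitattributes
$ git add .gitattributes
$ git commit -q -m 'skip delta compression for media files'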
If things are still slow after that, you'll need to be more specific
about your exact workload and exactly what is slow (I am still not sure
if delta compression or network bandwidth is the limiting factor for
your slow pulls).
-Peff
* Re: serious performance issues with images, audio files, and other "non-code" data
From: Sverre Rabbelier @ 2010-05-17 23:33 UTC
To: Jeff King; +Cc: John, git
Heya,
On Tue, May 18, 2010 at 01:16, Jeff King <peff@peff.net> wrote:
> 1. Use "git commit -q" to avoid wasting time on the commit diff
> summary (we should perhaps have a commit.quiet config option for
> repos like this where you would almost always want to suppress it).
Do we respect the .gitattributes and not try to generate the diffstat
for files that are incompressible?
--
Cheers,
Sverre Rabbelier
* Re: serious performance issues with images, audio files, and other "non-code" data
From: John @ 2010-05-18 18:50 UTC
To: Jeff King; +Cc: git
On 05/17/2010 07:16 PM, Jeff King wrote:
> On Fri, May 14, 2010 at 08:54:02AM -0400, John wrote:
>
>> Thanks so much. It's version 1.5.6.5. I compiled it 3 months ago. For
>
> By git standards, that version is ancient. You may want to try with a
> more recent version of git (at the very least, multithreaded delta
> compression has been enabled by default since then).
I just compiled the latest git. It got worse!!
$ git --version
git version 1.5.6.5
$ time git gc --aggressive
Counting objects: 2086, done.
Compressing objects: 100% (2054/2054), done.
Writing objects: 100% (2086/2086), done.
Total 2086 (delta 676), reused 0 (delta 0)
real 4m28.573s
user 3m38.650s
sys 0m5.156s
$ git --version
git version 1.7.1
$ time git gc --aggressive
Counting objects: 2086, done.
Compressing objects: 100% (2054/2054), done.
Writing objects: 100% (2086/2086), done.
Total 2086 (delta 676), reused 0 (delta 0)
real 6m16.406s
user 5m28.665s
sys 0m6.196s
$ du -hs .git
203M .git
>
>> I packed the bare repo, then ran `gc --aggressive`.
>
> Note that "gc --aggressive" will repack from scratch, throwing away the
> previous pack.
>
>> Then I did a `git pull`, which took 35 minutes.
>
> That sounds like a long time. What was taking so long? Was delta
> compression pegging the CPU? Was it limited during the "Writing objects"
> phase, which is going to be limited by either disk I/O or network speed?
The compressing objects phase. Yes, pegging the CPU and hogging memory.
> How big is your packed repo? Given the pattern you describe below, I am
> beginning to wonder if it is simply the case that even though a single
> checkout of your repo isn't that large, the complete history of your
> project may simply be gigantic (e.g., because you are repeatedly writing
> new apparently-random versions of each file, so your repository size
> will grow quite quickly).
The packed .git dir is 203 MB. Yes, we make frequent changes to these files, and push/pull
frequently as well. Just a normal development pattern, though. It's all manually done -- i.e.,
there's no automated bot doing excessive git operations.
>
> Remember that a git clone transfers the full history (and a pull will
> transfer all of the intermediate history). If you have rewritten those
> files many times, you may be transferring many times your working
> directory size in history.
>
>> You can simulate it all by generating a batch of 1-100 MB files from
>> /dev/urandom (since they won't compress), commit them, then do it
>> again many times to simulate edits. Every few iterations, push it
>> somewhere.
>
> I tried this script to make a 100M working directory with a 400M .git
> directory:
>
> -- >8 --
> #!/bin/sh
>
> rm -rf big-repo
> mkdir big-repo && cd big-repo && git init
>
> mark() {
> echo "`date` $*"
> }
>
> randomize() {
> mark randomize start
> for i in `seq 1 100`; do
> openssl rand $((1024*1024)) >$i.rand
> done
> mark randomize end
> }
>
> commit() {
> mark add start
> git add .
> mark add end
> mark commit start
> git commit -m "$1"
> mark commit end
> }
>
> randomize; commit base
> randomize; commit one
> randomize; commit two
> randomize; commit three
> -- 8< --
>
> Here are a few timings I noted:
>
> - it takes about 5 seconds to generate and write the random data
>
> - git add runs in about 13 seconds. It pegs the CPU hashing all of the
> data.
>
> - the first commit is nearly instantaneous, as the summary diff takes
> no work; subsequent commits spend about 9 seconds to create the
> summary diff. Changing commit to "commit -q" drops that back to
> near-instantaneous.
>
> - with no attributes set, "time git gc --aggressive" reports:
>
> real 1m31.983s
> user 2m29.621s
> sys 0m3.732s
>
> Note the real/user discrepancy. It's a dual-core machine, and recent
> git will multi-thread the delta phase, which is what dominates the
> time. This should correspond roughly to the delta-compression phase
> of your pull time, as that was just making a pack on the fly (but
> now that we are packed, pulls will be limited only by the time to
> transfer the objects themselves).
>
> - Turning off delta compression for the .rand files makes repacking
> much faster:
>
> $ echo '*.rand -delta' >.gitattributes
> $ time git gc --aggressive
> ...
> real 0m25.354s
> user 0m22.057s
> sys 0m1.316s
>
> The delta compression phase is very quick, and we spend most of our
> time writing out the packfile to disk.
>
> So I stand by my earlier statements:
>
> 1. Use "git commit -q" to avoid wasting time on the commit diff
> summary (we should perhaps have a commit.quiet config option for
> repos like this where you would almost always want to suppress it).
Thanks, I will try that.
>
> 2. Make sure your upstream repo is packed so pullers do not have to
> generate a new packfile all the time.
Got that in cron now.
> 3. Use -delta where appropriate to avoid useless delta compression.
Already in there (thanks to your previous advice).
> If things are still slow after that, you'll need to be more specific
> about your exact workload and exactly what is slow (I am still not sure
> if delta compression or network bandwidth is the limiting factor for
> your slow pulls).
It's definitely the pull/push in git. Not knowing my way around git internals at all, I don't know
(nor do I really want to know, to be honest) which "sub-processes" of `git pull` or `git push` are
the culprit. Yes, network bandwidth is always a factor, but I guess my expectation is that git
shouldn't transfer too much more info than the amount of recent changes. For example, if we change
10 files for a total of 10MB, then my admittedly naive expectation is that git will send that 10MB
of changes, plus some small constant amount of meta info... not the whole repo every time. No?
* Re: serious performance issues with images, audio files, and other "non-code" data
From: Sverre Rabbelier @ 2010-05-18 18:54 UTC
To: John; +Cc: Jeff King, git
Heya,
On Tue, May 18, 2010 at 20:50, John <john@puckerupgames.com> wrote:
> I just compiled the latest git. It got worse!!
I think that's just --aggressive getting more aggressive :). We now do
--window=200 and --depth=200 for --aggressive gc's.
--
Cheers,
Sverre Rabbelier
* Re: serious performance issues with images, audio files, and other "non-code" data
From: Jeff King @ 2010-05-18 19:07 UTC
To: Sverre Rabbelier; +Cc: John, git
On Tue, May 18, 2010 at 01:33:35AM +0200, Sverre Rabbelier wrote:
> On Tue, May 18, 2010 at 01:16, Jeff King <peff@peff.net> wrote:
> > 1. Use "git commit -q" to avoid wasting time on the commit diff
> > summary (we should perhaps have a commit.quiet config option for
> > repos like this where you would almost always want to suppress it).
>
> Do we respect the .gitattributes and not try to generate the diffstat
> for files that are incompressible?
No, not to my knowledge. Even the "binary" attribute just says "this
file is binary, don't text diff it". I think we will always still do
rewrite-detection for operations like "git status" and the diff summary
of "git commit".
-Peff
* Re: serious performance issues with images, audio files, and other "non-code" data
From: Sverre Rabbelier @ 2010-05-18 19:10 UTC
To: Jeff King; +Cc: John, git
Heya,
On Tue, May 18, 2010 at 21:07, Jeff King <peff@peff.net> wrote:
> No, not to my knowledge. Even the "binary" attribute just says "this
> file is binary, don't text diff it". I think we will always still do
> rewrite-detection for operations like "git status" and the diff summary
> of "git commit".
Would that not be a very sensible optimization that would help John
(and other users of big files) a lot?
--
Cheers,
Sverre Rabbelier
* Re: serious performance issues with images, audio files, and other "non-code" data
From: Jeff King @ 2010-05-18 19:19 UTC
To: John; +Cc: git
On Tue, May 18, 2010 at 02:50:16PM -0400, John wrote:
> I just compiled the latest git. It got worse!!
I think Sverre is right that this is simply that --aggressive got more
so in the last few versions. But do note that aggressive implies that we
should pack from scratch, not reusing previously found deltas (or
accepting that we didn't find deltas previously).
So you might want "git gc --aggressive" the _first_ time you pack, or
possibly even very occasionally. But if you are packing every day, you
should just use "git gc", which will run much more quickly (and would
probably have acceptable behavior even without the -delta attribute, as
it would only have to look at _new_ objects).
It will have to write the whole 200M packfile out each time, though.
From your timings that looks to take about 50 seconds or so (just
looking at the difference between wall clock time and CPU time, which is
presumably spent in I/O).
Packing nightly won't hurt, but is perhaps excessive. It sounds like you
actually have a fairly normal workload.
> >How big is your packed repo? Given the pattern you describe below, I am
> [...]
> The packed .git dir is 203 MB. Yes, we make frequent changes to these
> files, and push/pull frequently as well. Just a normal development
> pattern, though. It's all manually done -- i.e., there's no automated
> bot doing excessive git operations.
OK, that is not very big. Once packed, you really should not see
performance issues.
> culprit. Yes, network bandwidth is always a factor, but I guess my
> expectation is that git shouldn't transfer too much more info than
> the amount of recent changes. For example, if we change 10 files for
> a total of 10MB, then my admittedly naive expectation is that git
> will send that 10MB of changes, plus some small constant amount of
> meta info... not the whole repo every time. No?
Your assumption is correct. Git should transmit at _worst_ 10MB in such
a scenario (i.e., often much less because of delta compression, but in
your case of apparently-random media files, probably about 10MB).
It wasn't clear from your message: you indicated the changes you made,
but are you still having performance problems, or are you still waiting
to get data?
-Peff
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: serious performance issues with images, audio files, and other "non-code" data
2010-05-18 19:10 ` Sverre Rabbelier
@ 2010-05-18 19:27 ` Jeff King
2010-05-18 19:37 ` Nicolas Pitre
0 siblings, 1 reply; 28+ messages in thread
From: Jeff King @ 2010-05-18 19:27 UTC (permalink / raw)
To: Sverre Rabbelier; +Cc: John, git
On Tue, May 18, 2010 at 09:10:58PM +0200, Sverre Rabbelier wrote:
> On Tue, May 18, 2010 at 21:07, Jeff King <peff@peff.net> wrote:
> > No, not to my knowledge. Even the "binary" attribute just says "this
> > file is binary, don't text diff it". I think we will always still do
> > rewrite-detection for operations like "git status" and the diff summary
> > of "git commit".
>
> Would that not be a very sensible optimization that would help John
> (and other users of big files) a lot?
It might help some, but I worry about overloading the meaning of
"-delta". Right now it has a very clear meaning: don't delta for
packfiles. But that doesn't mean I might not want to see break detection
(or inexact rename detection, for that matter) at some time.
Large binary files shouldn't be taxing on regular diffs. If you have
marked a file as "binary" and we are not creating a binary diff (i.e.,
just printing "binary files differ"), then we shouldn't even need to
pull the blob from storage (since we can tell from the sha1 that it is
different). I haven't checked to see if we do that simple optimization
(if you haven't marked it with a binary attribute, then obviously we do
have to look at the blob to find out that it is binary).
So:
1. I think it would need a separate attribute that is about diffing
(possibly even just options to a custom diff filter).
2. I am not clear exactly what options would work best. Do you want to
disable diffing entirely? Disable just inexact rename detection and
break detection? If break detection is disabled, do you assume it
is _always_ a rewrite, or never?
So I am open to the idea, but I think we would need a more concrete
proposal and some timings to show how it is a benefit.
-Peff
* Re: serious performance issues with images, audio files, and other "non-code" data
From: Nicolas Pitre @ 2010-05-18 19:33 UTC
To: Jeff King; +Cc: John, git
On Tue, 18 May 2010, Jeff King wrote:
> So you might want "git gc --aggressive" the _first_ time you pack, or
> possibly even very occasionally. But if you are packing every day, you
> should just use "git gc", which will run much more quickly (and would
> probably have acceptable behavior even without the -delta attribute, as
> it would only have to look at _new_ objects).
>
> It will have to write the whole 200M packfile out each time, though.
No. gc will only create a pack with new loose objects by default.
Only if the number of packs grows too large will it combine them into one
pack.
> Packing nightly won't hurt, but is perhaps excessive. It sounds like you
> actually have a fairly normal workload.
Packing nightly with a simple "git gc" i.e. without extra options should
be perfectly fine.
Nicolas
* Re: serious performance issues with images, audio files, and other "non-code" data
From: Nicolas Pitre @ 2010-05-18 19:37 UTC
To: Jeff King; +Cc: Sverre Rabbelier, John, git
On Tue, 18 May 2010, Jeff King wrote:
> On Tue, May 18, 2010 at 09:10:58PM +0200, Sverre Rabbelier wrote:
>
> > On Tue, May 18, 2010 at 21:07, Jeff King <peff@peff.net> wrote:
> > > No, not to my knowledge. Even the "binary" attribute just says "this
> > > file is binary, don't text diff it". I think we will always still do
> > > rewrite-detection for operations like "git status" and the diff summary
> > > of "git commit".
> >
> > Would that not be a very sensible optimization that would help John
> > (and other users of big files) a lot?
>
> It might help some, but I worry about overloading the meaning of
> "-delta". Right now it has a very clear meaning: don't delta for
> packfiles. But that doesn't mean I might not want to see break detection
> (or inexact rename detection, for that matter) at some time.
Indeed. Please keep the delta attribute for what it is named after:
deltas. And those are meant to be used in the context of object packing
only.
Nicolas
* Re: serious performance issues with images, audio files, and other "non-code" data
From: Jeff King @ 2010-05-18 19:41 UTC
To: Nicolas Pitre; +Cc: John, git
On Tue, May 18, 2010 at 03:33:58PM -0400, Nicolas Pitre wrote:
> > It will have to write the whole 200M packfile out each time, though.
>
> No. gc will only create a pack with new loose objects by default.
> Only if the number of packs grows too large will it combine them into one
> pack.
I think that is only "gc --auto". With regular gc:
$ git init
$ echo content >file && git add file && git commit -m one
$ git gc
Counting objects: 3, done.
Writing objects: 100% (3/3), done.
Total 3 (delta 0), reused 0 (delta 0)
$ du -a .git/objects/pack
4 .git/objects/pack/pack-5f6fe4b14529d73f51d7c8efa69306edd35f2302.idx
4 .git/objects/pack/pack-5f6fe4b14529d73f51d7c8efa69306edd35f2302.pack
12 .git/objects/pack
$ echo content >>file && git commit -a -m two
$ git gc
Counting objects: 6, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (6/6), done.
Total 6 (delta 0), reused 3 (delta 0)
$ du -a .git/objects/pack
4 .git/objects/pack/pack-ecf41a1c120eb911f50fdd2c159e94d5832974f7.idx
4 .git/objects/pack/pack-ecf41a1c120eb911f50fdd2c159e94d5832974f7.pack
12 .git/objects/pack
So six objects written in the second gc, and obviously a brand new
single pack.
-Peff
* Re: serious performance issues with images, audio files, and other "non-code" data
From: Nicolas Pitre @ 2010-05-18 19:59 UTC
To: Jeff King; +Cc: John, git
On Tue, 18 May 2010, Jeff King wrote:
> On Tue, May 18, 2010 at 03:33:58PM -0400, Nicolas Pitre wrote:
>
> > > It will have to write the whole 200M packfile out each time, though.
> >
> > No. gc will only create a pack with new loose objects by default.
> > Only if the number of packs grows too large will it combine them into one
> > pack.
>
> I think that is only "gc --auto".
Argh. You're right. And "gc --auto" is already run by many commands.
It is "git repack" that doesn't combine packs by default.
Nicolas
* Re: serious performance issues with images, audio files, and other "non-code" data
From: John @ 2010-05-24 0:21 UTC
To: Nicolas Pitre; +Cc: Jeff King, git
Just to follow up: the two solutions that have had a noticeable effect are,
first, to run a daily `gc`, and, second, to configure a ".gitattributes" file as such:
*.jpg binary -delta
*.png binary -delta
*.psd binary -delta
*.gz binary -delta
*.bz2 binary -delta
.. and so on.
On my first go-round with ".gitattributes" (earlier in this thread), my patterns
were set up incorrectly, as in,
*.{gz,bz2,tgz,psd,png,jpg} binary -delta
Since git does not perform brace expansion, the above patterns never matched.
After revising the .gitattributes file, a ~6 minute gc dropped down to just
under ~3 minutes.
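One way to catch a pattern that never matches is `git check-attr`, which
reports how attributes resolve for a given path (filename here is
hypothetical):
$ git check-attr delta -- images/logo.jpg
images/logo.jpg: delta: unset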
Is there any reason why someone would NOT want the above ".gitattributes"
defined by default?
On 05/18/2010 03:59 PM, Nicolas Pitre wrote:
> On Tue, 18 May 2010, Jeff King wrote:
>
>> On Tue, May 18, 2010 at 03:33:58PM -0400, Nicolas Pitre wrote:
>>
>>>> It will have to write the whole 200M packfile out each time, though.
>>>
>>> No. gc will only create a pack with new loose objects by default.
>>> Only if the number of packs grows too large will it combine them into one
>>> pack.
>>
>> I think that is only "gc --auto".
>
> Argh. You're right. And "gc --auto" is already run by many commands.
>
> It is "git repack" that doesn't combine packs by default.
>
>
> Nicolas
* Re: serious performance issues with images, audio files, and other "non-code" data
From: Junio C Hamano @ 2010-05-24 1:16 UTC
To: John; +Cc: Nicolas Pitre, Jeff King, git
John <john@puckerupgames.com> writes:
> Is there any reason why someone would NOT want the above
> ".gitattributes" defined by default?
Other than that our originally intended target audience are people who use
git as a source code control system, not much.
* Re: serious performance issues with images, audio files, and other "non-code" data
From: Jeff King @ 2010-05-24 5:39 UTC
To: John; +Cc: Nicolas Pitre, git
On Sun, May 23, 2010 at 08:21:12PM -0400, John wrote:
> *.jpg binary -delta
> *.png binary -delta
> *.psd binary -delta
> *.gz binary -delta
> *.bz2 binary -delta
> .. and so on.
>
> [...]
>
> Is there any reason why someone would NOT want the above
> ".gitattributes" defined by default?
I delta jpgs in one of my repositories. It is useful if the exif
metadata changes but the image data does not. I assume you could do the
same with other formats which have compressed and uncompressed portions
(I also do it with video containers). I don't think it would ever make
sense to try to delta gzip'd or bzip'd contents.
I also don't use "binary", as I use a custom diff driver instead (binary
implies "-diff").
As for what should be the default, until now the default has always
been that no gitattributes are defined by default. This is nice because
it's simple to understand; git doesn't care about filenames unless you
tell it to. The downside obviously is that it may not perform optimally
for some unusual workloads without extra configuration.
We could probably do defaults for some common extensions, but I'm not
really sure where such a thing should end up. For example, I consider
*.psd a uselessly obscure extension, as Adobe doesn't write software for
my platform of choice. Not that I mind having it in git, but rather that
we are inevitably going to miss somebody's pet extension, and then we
are right back where we started with them needing to configure, except
now they also have to figure out which extensions have default
attributes.
-Peff
* Re: serious performance issues with images, audio files, and other "non-code" data
From: John @ 2010-05-24 6:44 UTC
To: Jeff King; +Cc: Nicolas Pitre, git
I agree, no defaults are better than arbitrary defaults. So why is the default
"text"?
On 05/24/2010 01:39 AM, Jeff King wrote:
> On Sun, May 23, 2010 at 08:21:12PM -0400, John wrote:
>
>> *.jpg binary -delta
>> *.png binary -delta
>> *.psd binary -delta
>> *.gz binary -delta
>> *.bz2 binary -delta
>> .. and so on.
>>
>> [...]
>>
>> Is there any reason why someone would NOT want the above
>> ".gitattributes" defined by default?
>
> I delta jpgs in one of my repositories. It is useful if the exif
> metadata changes but the image data does not. I assume you could do the
> same with other formats which have compressed and uncompressed portions
> (I also do it with video containers). I don't think it would ever make
> sense to try to delta gzip'd or bzip'd contents.
>
> I also don't use "binary", as I use a custom diff driver instead (binary
> implies "-diff").
>
> As for what should be the default, until now the default has always
> been that no gitattributes are defined by default. This is nice because
> it's simple to understand; git doesn't care about filenames unless you
> tell it to. The downside obviously is that it may not perform optimally
> for some unusual workloads without extra configuration.
>
> We could probably do defaults for some common extensions, but I'm not
> really sure where such a thing should end up. For example, I consider
> *.psd a uselessly obscure extension, as Adobe doesn't write software for
> my platform of choice. Not that I mind having it in git, but rather that
> we are inevitably going to miss somebody's pet extension, and then we
> are right back where we started with them needing to configure, except
> now they also have to figure out which extensions have default
> attributes.
>
> -Peff
* Re: serious performance issues with images, audio files, and other "non-code" data
From: Jeff King @ 2010-05-24 6:45 UTC
To: John; +Cc: Nicolas Pitre, git
On Mon, May 24, 2010 at 02:44:36AM -0400, John wrote:
> I agree, no defaults are better than arbitrary defaults. So why is
> the default "text"?
Because git was designed as a source control system?
-Peff
* Re: serious performance issues with images, audio files, and other "non-code" data
From: John @ 2010-05-24 7:01 UTC
To: Junio C Hamano; +Cc: Nicolas Pitre, Jeff King, git
Ok, fair enough. It's your project, and you are defining "source control" as
that which git supports: non-binary, line-by-line text only, C, bash .. no
images, documents, etc.
I only wish that definition of "source" had been more clear from the get-go.
Perhaps a front and center blurb on the git home page or mission statement might
clarify things for those of us who have different definitions of "source"? That
way, you wouldn't have to be bothered by folks trying to version all their
project assets with git. For example, you could specify that non-text is out of
scope for git, (or however you wish to define "source").
On 05/23/2010 09:16 PM, Junio C Hamano wrote:
> John <john@puckerupgames.com> writes:
>
>> Is there any reason why someone would NOT want the above
>> ".gitattributes" defined by default?
>
> Other than that our originally intended target audience are people who use
> git as a source code control system, not much.
>
* Re: serious performance issues with images, audio files, and other "non-code" data
From: Jeff King @ 2010-05-25 6:33 UTC
To: John; +Cc: Junio C Hamano, Nicolas Pitre, git
On Mon, May 24, 2010 at 03:01:37AM -0400, John wrote:
> Ok, fair enough. It's your project, and you are defining "source
> control" as that which git supports: non-binary, line-by-line text
> only, C, bash .. no images, documents, etc.
>
> I only wish that definition of "source" had been more clear from the get-go.
I think both Junio's and my responses were not "we can't do a better job
with non-text sources" but rather "this is how we ended up with the
current state".
I'll admit I have some reservations about trying to figure out a sane
set of extensions for default .gitattributes, but that doesn't mean you
can't try if you want to. :)
> Perhaps a front and center blurb on the git home page or mission
> statement might clarify things for those of us who have different
> definitions of "source"? That way, you wouldn't have to be bothered
> by folks trying to version all their project assets with git. For
> example, you could specify that non-text is out of scope for git, (or
> however you wish to define "source").
I don't know that we want to explicitly discourage such use. Obviously
certain workflows don't work as well with randomly-changing binary blobs
(e.g., reading format-patch output is next to useless, though it does
still work as a transport if your project relies on emailed patch
submissions).
In general, I think we are happy to take patches making binary storage
more pleasant (e.g., textconv) as long as they don't somehow make the
"normal" case of text worse. There are some things for which git is
simply not well suited (single files in the gigabytes, for example), and
those aren't likely to change because some of the issues are
fundamental to how git works (though there are often workarounds, like
putting gigantic files in their own individual packs). But certainly
100M of jpgs does not seem like an unusable workload to me (as I
mentioned, I have a several-gigabyte photo repository that git does just
fine with).
-Peff
* Re: serious performance issues with images, audio files, and other "non-code" data
From: Michael J Gruber @ 2010-05-25 7:28 UTC
To: Junio C Hamano; +Cc: John, Nicolas Pitre, Jeff King, git
Junio C Hamano wrote on 24.05.2010 03:16:
> John <john@puckerupgames.com> writes:
>
>> Is there any reason why someone would NOT want the above
>> ".gitattributes" defined by default?
>
> Other than that our originally intended target audience are people who use
> git as a source code control system, not much.
>
and other than that many people use clean/smudge filters to make git
happily and efficiently deltify compressed file formats (such as gz,
bz2, zip) and still keep compressed checkouts...
and other than that which you (plural) and I are not thinking of right now.
Let the defaults be as they are (fit for source control in the proper
sense), it's easy enough to change them for other use cases.
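A minimal sketch of the clean/smudge trick mentioned above, storing .gz
files uncompressed inside the repository so they deltify (note that
re-compression on checkout is not guaranteed to be byte-identical to the
original):
$ git config filter.gzip.clean 'gzip -cd'    # decompress on checkin
$ git config filter.gzip.smudge 'gzip -c'    # recompress on checkout
$ echo '*.gz filter=gzip' >>.gitattributes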
Michael
* Re: serious performance issues with images, audio files, and other "non-code" data
From: John @ 2010-05-25 16:12 UTC
To: Michael J Gruber; +Cc: Junio C Hamano, Nicolas Pitre, Jeff King, git
On 05/25/2010 03:28 AM, Michael J Gruber wrote:
> Junio C Hamano wrote on 24.05.2010 03:16:
>> John <john@puckerupgames.com> writes:
>>
>>> Is there any reason why someone would NOT want the above
>>> ".gitattributes" defined by default?
>>
>> Other than that our originally intended target audience are people who use
>> git as a source code control system, not much.
>>
>
> and other than that many people use clean/smudge filters to make git
> happily and efficiently deltify compressed file formats (such as gz,
> bz2, zip) and still keep compressed checkouts...
>
> and other than that which you (plural) and I are not thinking of right now.
>
> Let the defaults be as they are (fit for source control in the proper
> sense), it's easy enough to change them for other use cases.
That's fine. We all have different ideas of what revision control means. So long as it's clear what git
considers "source" and what it considers out of scope, what the defaults are, and what the
limitations are, potential users can more fairly evaluate git to see if it fits their needs.
For example, code libraries and shell utilities may not require anything more complicated than
line-by-line text-based patches in revision control.
On the other hand, projects such as web sites, mobile phone apps, desktop applications (and games
:) have lots of "source" that is not code. Even XML, which is text-based, but not line-based (and
need not contain any newlines), may present a problem for git in this respect.
Perhaps a section in the manual with a header such as "Handling non-text files", or "Revision
control for media, XML, and other non line-oriented files" would clear this all up. You could almost
cull the body of it from this thread and other similar threads.
* Re: serious performance issues with images, audio files, and other "non-code" data
From: Nicolas Pitre @ 2010-05-25 17:18 UTC
To: John; +Cc: Michael J Gruber, Junio C Hamano, Jeff King, git
On Tue, 25 May 2010, John wrote:
> Perhaps a section in the manual with a header such as "Handling non-text
> files", or "Revision control for media, XML, and other non line-oriented
> files" would clear this all up. You could almost cull the body of it from this
> thread and other similar threads.
That is indeed a good idea.
Do you volunteer?
Nicolas
* Re: serious performance issues with images, audio files, and other "non-code" data
From: John @ 2010-05-25 17:47 UTC
To: Nicolas Pitre; +Cc: Michael J Gruber, Junio C Hamano, Jeff King, git
On 05/25/2010 01:18 PM, Nicolas Pitre wrote:
> On Tue, 25 May 2010, John wrote:
>
>> Perhaps a section in the manual with a header such as "Handling non-text
>> files", or "Revision control for media, XML, and other non line-oriented
>> files" would clear this all up. You could almost cull the body of it from this
>> thread and other similar threads.
>
> That is indeed a good idea.
>
> Do you volunteer?
Yes, of course. If y'all are not up to it, I'd be happy to give it a shot.