* Performance impact of a large number of commits
From: Samuel Abels @ 2008-10-24 19:02 UTC
  To: git

Hi,

I am considering Git to maintain a repository of approximately 300,000
files totaling 1 GB, with ~100,000 commits per day, all in a single
branch. The only operations performed are "git commit", "git show", and
"git checkout", all on one local machine. Does this sound like a
reasonable thing to do with Git?

-Samuel


* Re: Performance impact of a large number of commits
From: david @ 2008-10-24 19:43 UTC
  To: Samuel Abels; +Cc: git

On Fri, 24 Oct 2008, Samuel Abels wrote:

> Hi,
>
> I am considering Git to maintain a repository of approximately 300,000
> files totaling 1 GB, with ~100,000 commits per day, all in a single
> branch. The only operations performed are "git commit", "git show", and
> "git checkout", all on one local machine. Does this sound like a
> reasonable thing to do with Git?

100,000 commits per day??

that's ~1.2 commits/second. what is updating files that quickly?

I suspect that you will have some issues here, but it's going to depend
on how many files get updated every ~0.9 seconds.

David Lang


* Re: Performance impact of a large number of commits
From: Samuel Abels @ 2008-10-24 19:56 UTC
  To: david; +Cc: git

On Fri, 2008-10-24 at 12:43 -0700, david@lang.hm wrote:
> 100,000 commits per day??
> 
> that's ~1.2 commits/second. what is updating files that quickly?

This is an automated process taking snapshots of rapidly changing files
containing statistical data. Unfortunately, our needs go beyond what a
versioning file system has to offer, and the data consists of
unstructured text files (in other words, a relational database is not a
good option).

> I suspect that you will have some issues here, but it's going to depend
> on how many files get updated every ~0.9 seconds.

That would be 5 to 10 changed files per commit, and those are passed to
git commit explicitly (i.e., walking the tree to stat files for finding
changes is not necessary).
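
For illustration (the paths and commit message are made up), each
snapshot boils down to something like:

  # stage and commit only the files known to have changed;
  # no tree walk is needed to find them
  git add stats/cpu.txt stats/mem.txt stats/net.txt
  git commit --quiet -m "snapshot 2008-10-24 19:55:00 UTC"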

-Samuel


* Re: Performance impact of a large number of commits
From: david @ 2008-10-24 20:11 UTC
  To: Samuel Abels; +Cc: git

On Fri, 24 Oct 2008, Samuel Abels wrote:

> On Fri, 2008-10-24 at 12:43 -0700, david@lang.hm wrote:
>> 100,000 commits per day??
>>
>> that's ~1.2 commits/second. what is updating files that quickly?
>
> This is an automated process taking snapshots of rapidly changing files
> containing statistical data. Unfortunately, our needs go beyond what a
> versioning file system has to offer, and the data consists of
> unstructured text files (in other words, a relational database is not a
> good option).
>
>> I suspect that you will have some issues here, but it's going to depend
>> on how many files get updated every ~0.9 seconds.
>
> That would be 5 to 10 changed files per commit, and those are passed to
> git commit explicitly (i.e., walking the tree to stat files for finding
> changes is not necessary).

I suspect that your limits would be filesystem/OS limits more than git
limits.

at 5-10 files/commit you are going to be creating 0.5-1 million new
object files per day. even spread across git's 256 object directories,
that is a _lot_ of files (roughly 2,000-4,000 per directory, per day).

packing this may help (depending on how much the files change), but with
this many files the work of doing the packing would be expensive.
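
as a rough sketch (best run during a quiet period), this shows the
loose-object count and then folds everything into a single pack:

  # report loose objects vs. objects already in packs
  git count-objects -v
  # move all loose objects into one pack and delete the loose copies
  git repack -a -d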

David Lang


* Re: Performance impact of a large number of commits
From: Samuel Abels @ 2008-10-25  5:18 UTC
  To: david; +Cc: git

On Fri, 2008-10-24 at 13:11 -0700, david@lang.hm wrote:
> > git commit explicitly (i.e., walking the tree to stat files for finding
> > changes is not necessary).
> 
> I suspect that your limits would be filesystem/OS limits more than git
> limits.
> 
> at 5-10 files/commit you are going to be creating 0.5-1 million new
> object files per day. even spread across git's 256 object directories,
> that is a _lot_ of files (roughly 2,000-4,000 per directory, per day).

The files are organized in a way that places no more than ~1,000 files
into each directory. Will Git create a directory containing a larger
number of object files? I can see that this would be a problem in our
use case.

> packing this may help (depending on how much the files change), but with 
> this many files the work of doing the packing would be expensive.

We can probably do that even if it takes several hours.
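
For example (the repository path is hypothetical), a nightly cron job
along these lines would do:

  # repack the snapshot repository at 03:00 every night
  0 3 * * * cd /var/data/stats-repo && git repack -a -d -q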

-Samuel


* Re: Performance impact of a large number of commits
From: david @ 2008-10-25  5:29 UTC
  To: Samuel Abels; +Cc: git

On Sat, 25 Oct 2008, Samuel Abels wrote:

> On Fri, 2008-10-24 at 13:11 -0700, david@lang.hm wrote:
>>> git commit explicitly (i.e., walking the tree to stat files for finding
>>> changes is not necessary).
>>
>> I suspect that your limits would be filesystem/OS limits more than git
>> limits.
>>
>> at 5-10 files/commit you are going to be creating 0.5-1 million new
>> object files per day. even spread across git's 256 object directories,
>> that is a _lot_ of files (roughly 2,000-4,000 per directory, per day).
>
> The files are organized in a way that places no more than ~1,000 files
> into each directory. Will Git create a directory containing a larger
> number of object files? I can see that this would be a problem in our
> use case.

when git stores the copies of the files it does a sha1 hash of the file
contents and then stores the file in the directory
.git/objects/<first two hex digits of the hash>/<remaining 38 digits>
this means that if you have files that have the same content they all
fold together, but with lots of files changing rapidly the result is a
lot of files in these object directories.
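
you can see this with the plumbing directly. for example, a one-line
blob always hashes to the same object and lands in the matching
subdirectory:

  $ echo 'hello' | git hash-object -w --stdin
  ce013625030ba8dba906f756967f9e9ca394464a
  $ ls .git/objects/ce/
  013625030ba8dba906f756967f9e9ca394464a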

it would be a pretty minor change to git to have it use more directories 
(in fact, there's another thread going on today where people are looking 
at making this configurable, in that case to reduce the number of 
directories)

the other storage format that git has is the pack file. it takes a bunch
of the objects, does some comparisons between them (to find duplicate
bits of files), and then stores the result (base objects plus deltas to
re-create the other objects). the resulting compression is _extremely_
efficient, and it collapses many object files into one pack file
(addressing the issue of many files in one directory).
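
once objects are packed you can inspect the delta chains git built, for
example:

  # list packed objects with type, size, size-in-pack, and delta depth
  git verify-pack -v .git/objects/pack/pack-*.idx | head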

>> packing this may help (depending on how much the files change), but with
>> this many files the work of doing the packing would be expensive.
>
> We can probably do that even if it takes several hours.

my concern is that spending time creating the pack files will mean that 
you don't have time to insert the new files.

that being said, there may be other ways of dealing with this data rather 
than putting it into files and then adding it to the git repository.

Git has a fast-import streaming format that is designed for programs
that convert repositories from other SCM systems. if you can tell us
more about what you are doing (how is the data being gathered? are the
files re-created for each commit, or are they being modified? if they
are being modified, is it appending data, changing some data, or
randomly writing throughout the file? etc.) there may be some other
options available.
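
as a rough sketch (branch name, author, and file contents are made up;
each "data <n>" line announces exactly n bytes of payload), a stream
that records one snapshot on a fresh branch looks like:

  commit refs/heads/snapshots
  committer Snapshot Bot <bot@example.com> 1224900000 +0000
  data 8
  snapshot
  M 100644 inline stats/cpu.txt
  data 10
  cpu: 42.0

piping a stream of such commits into 'git fast-import' bypasses the
working tree and index entirely, which is far cheaper than running
'git commit' roughly once per second.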

at this point I don't know if git can work for you or not, but I'm
pretty sure nothing else would stand a chance at your scale.

David Lang


* Re: Performance impact of a large number of commits
From: Samuel Abels @ 2008-10-25 15:12 UTC
  To: david; +Cc: git

On Fri, 2008-10-24 at 22:29 -0700, david@lang.hm wrote:
> when git stores the copies of the files it does a sha1 hash of the file 
> contents and then stores the file in the directory
> .git/objects/<first two hex digits of the hash>/<remaining 38 digits>

> it would be a pretty minor change to git to have it use more directories 

Ah, I see how this works. Well, I'll think of a way to cope with this (I
might patch my Git installation, or see how well it performs on an
indexed file system). If all else fails we'll have to slash the number
of commits even if this means that some files are not added to the
history.

> my concern is that spending time creating the pack files will mean that 
> you don't have time to insert the new files.
> 
> that being said, there may be other ways of dealing with this data rather 
> than putting it into files and then adding it to the git repository.
> 
> Git has a fast-import streaming format that is designed for programs
> that convert repositories from other SCM systems.

I'm pretty sure that the streaming format won't do us much good, as the
files are re-created from scratch between commits.

Thanks a lot for the information; this was very helpful.

-Samuel
