git.vger.kernel.org archive mirror
* GIT and large files
@ 2014-05-20 15:37 Stewart, Louis (IS)
  2014-05-20 16:03 ` Jason Pyeron
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Stewart, Louis (IS) @ 2014-05-20 15:37 UTC (permalink / raw)
  To: git@vger.kernel.org

Can GIT handle versioning of large 20+ GB files in a directory?

Lou Stewart
AOCWS Software Configuration Management
757-269-2388

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: GIT and large files
  2014-05-20 15:37 GIT and large files Stewart, Louis (IS)
@ 2014-05-20 16:03 ` Jason Pyeron
       [not found] ` <CALygMcCifDd4LAddZJ4tNcqqwBSvb6BGzTODHBzshBOjCwSrHQ@mail.gmail.com>
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: Jason Pyeron @ 2014-05-20 16:03 UTC (permalink / raw)
  To: 'Stewart, Louis (IS)', git

> -----Original Message-----
> From: Stewart, Louis (IS)
> Sent: Tuesday, May 20, 2014 11:38
> 
> Can GIT handle versioning of large 20+ GB files in a directory?

Are you asking about 20 files of 1 GB each, or files of 20 GB each?

Telling us the what and the why may help with the underlying questions.

v/r,

Jason Pyeron

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
-                                                               -
- Jason Pyeron                      PD Inc. http://www.pdinc.us -
- Principal Consultant              10 West 24th Street #100    -
- +1 (443) 269-1555 x333            Baltimore, Maryland 21218   -
-                                                               -
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
This message is copyright PD Inc, subject to license 20080407P00.

 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: EXT :Re: GIT and large files
       [not found] ` <CALygMcCifDd4LAddZJ4tNcqqwBSvb6BGzTODHBzshBOjCwSrHQ@mail.gmail.com>
@ 2014-05-20 16:53   ` Stewart, Louis (IS)
  0 siblings, 0 replies; 10+ messages in thread
From: Stewart, Louis (IS) @ 2014-05-20 16:53 UTC (permalink / raw)
  To: Gary Fixler; +Cc: git@vger.kernel.org

The files in question would be in a directory containing many files, some small and others huge (for example: text files, docs, and JPGs are MBs, but executables and OVA images are GBs, etc.).

Lou

From: Gary Fixler [mailto:gfixler@gmail.com] 
Sent: Tuesday, May 20, 2014 12:09 PM
To: Stewart, Louis (IS)
Cc: git@vger.kernel.org
Subject: EXT :Re: GIT and large files

Technically yes, but from a practical standpoint, not really. Facebook recently revealed that they have a 54GB git repo[1], but I doubt it has 20+GB files in it. I've put 18GB of photos into a git repo, but everything about the process was fairly painful, and I don't plan to do it again.
Are your files non-mergeable binaries (e.g. videos)? The biggest problem here is with branching and merging. Conflict resolution with non-mergeable assets ends up as an us-vs-them fight, and I don't understand all of the particulars of that. From git's standpoint it's simple - you just have to choose one or the other. From a workflow standpoint, you end up causing trouble if two people have changed an asset and both consider their change important. Centralized systems get around this problem with locks.
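For concreteness - this is just the standard approach, and the branch and file names are only examples - resolving a conflict on a binary usually comes down to picking one side wholesale:

    $ git merge feature                # conflicts on video.bin
    $ git checkout --ours video.bin    # keep our copy (or --theirs for the other side)
    $ git add video.bin
    $ git commit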
Git could do this, and I've thought about it quite a bit. I work in games - we have code, but also a lot of binaries that I'd like to keep in sync with the code. For a while I considered suggesting some ideas to this group, but I'm pretty sure the locking issue makes it a non-starter. The basic idea - skipping locking for the moment - would be to allow setting git attributes by file type, file size threshold, folder, etc., to let git know that some files are considered "bigfiles." These could be placed into the objects folder, but I'd actually prefer they go into a .git/bigfile folder. They'd still be saved as contents under their hash, but a normal git transfer wouldn't send them. They'd be in the tree as 'big' or 'bigfile' (instead of 'blob', 'tree', or 'commit' (for submodules)).

Git would warn you on push that there were bigfiles to send, and you could add, say, --with-big to also send them, or send them later with, say, `git push --big`. They'd simply be zipped up and sent over, without any packfile fanciness. When you clone, you wouldn't get the bigfiles unless you specified --with-big, and it would warn you that there are also bigfiles and tell you what command to run to also get them (`git fetch --big`, perhaps). Git status would always let you know if you were missing bigfiles. I think hopping around between commits would follow the same strategy: you'd always have to, e.g., `git checkout foo --with-big`, or `git checkout foo` and then `git update big` (or whatever - I'm not married to any of these names).

Resolving conflicts on merge would simply have to be up to you. It would be documented clearly that you're entering weird territory, and that your team has to deal with bigfiles somehow, perhaps with some suggested strategies ("Pass the conch?"). I could imagine some strategies for this. Maybe bigfiles require connecting to a blessed repo to grab the right to make a commit on it. That has many problems, of course, and now I can feel everyone reading this shifting uneasily in their seats :)
-g

[1] https://twitter.com/feross/status/459259593630433280

On Tue, May 20, 2014 at 8:37 AM, Stewart, Louis (IS) <louis.stewart@ngc.com> wrote:
Can GIT handle versioning of large 20+ GB files in a directory?

Lou Stewart
AOCWS Software Configuration Management
757-269-2388



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: GIT and large files
  2014-05-20 15:37 GIT and large files Stewart, Louis (IS)
  2014-05-20 16:03 ` Jason Pyeron
       [not found] ` <CALygMcCifDd4LAddZJ4tNcqqwBSvb6BGzTODHBzshBOjCwSrHQ@mail.gmail.com>
@ 2014-05-20 17:08 ` Marius Storm-Olsen
  2014-05-20 17:18 ` Junio C Hamano
  3 siblings, 0 replies; 10+ messages in thread
From: Marius Storm-Olsen @ 2014-05-20 17:08 UTC (permalink / raw)
  To: Stewart, Louis (IS), git@vger.kernel.org

On 5/20/2014 10:37 AM, Stewart, Louis (IS) wrote:
> Can GIT handle versioning of large 20+ GB files in a directory?

Maybe you're looking for git-annex?

https://git-annex.branchable.com/
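
A minimal sketch of the typical workflow (the file names are only
examples):

    $ git annex init "main repo"
    $ git annex add disk-image.ova   # content goes under .git/annex, a symlink is committed
    $ git commit -m "add disk image"
    $ git annex get disk-image.ova   # fetch the actual content in another clone when needed
    $ git annex drop disk-image.ova  # free local space once another copy is known to exist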

-- 
.marius

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: GIT and large files
  2014-05-20 15:37 GIT and large files Stewart, Louis (IS)
                   ` (2 preceding siblings ...)
  2014-05-20 17:08 ` Marius Storm-Olsen
@ 2014-05-20 17:18 ` Junio C Hamano
  2014-05-20 17:24   ` EXT :Re: " Stewart, Louis (IS)
  3 siblings, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2014-05-20 17:18 UTC (permalink / raw)
  To: Stewart, Louis (IS); +Cc: git@vger.kernel.org

"Stewart, Louis (IS)" <louis.stewart@ngc.com> writes:

> Can GIT handle versioning of large 20+ GB files in a directory?

I think you can "git add" such files, push/fetch histories that
contain such files over the wire, and "git checkout" such files,
but naturally reading, processing and writing 20+GB would take some
time.  In order to run operations that need to see the changes,
e.g. "git log -p", a real content-level merge, etc., you would also
need sufficient memory because we do things in-core.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: EXT :Re: GIT and large files
  2014-05-20 17:18 ` Junio C Hamano
@ 2014-05-20 17:24   ` Stewart, Louis (IS)
  2014-05-20 18:14     ` Junio C Hamano
  2014-05-20 18:27     ` Thomas Braun
  0 siblings, 2 replies; 10+ messages in thread
From: Stewart, Louis (IS) @ 2014-05-20 17:24 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git@vger.kernel.org

Thanks for the reply.  I just read the intro to GIT and I am concerned about the part that says it will copy the whole repository to the developers' work area.  They really just need the one directory and the files under that one directory.  The history has TBs of data.

Lou

-----Original Message-----
From: Junio C Hamano [mailto:gitster@pobox.com] 
Sent: Tuesday, May 20, 2014 1:18 PM
To: Stewart, Louis (IS)
Cc: git@vger.kernel.org
Subject: EXT :Re: GIT and large files

"Stewart, Louis (IS)" <louis.stewart@ngc.com> writes:

> Can GIT handle versioning of large 20+ GB files in a directory?

I think you can "git add" such files, push/fetch histories that contains such files over the wire, and "git checkout" such files, but naturally reading, processing and writing 20+GB would take some time.  In order to run operations that need to see the changes, e.g. "git log -p", a real content-level merge, etc., you would also need sufficient memory because we do things in-core.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: EXT :Re: GIT and large files
  2014-05-20 17:24   ` EXT :Re: " Stewart, Louis (IS)
@ 2014-05-20 18:14     ` Junio C Hamano
  2014-05-20 18:18       ` Stewart, Louis (IS)
  2014-05-20 18:27     ` Thomas Braun
  1 sibling, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2014-05-20 18:14 UTC (permalink / raw)
  To: Stewart, Louis (IS); +Cc: git@vger.kernel.org

"Stewart, Louis (IS)" <louis.stewart@ngc.com> writes:

> Thanks for the reply.  I just read the intro to GIT and I am
> concerned about the part that it will copy the whole repository to
> the developers work area.  They really just need the one directory
> and files under that one directory. The history has TBs of data.

Then you will spend time reading, processing and writing TBs of data
when you clone, unless your developers do something to limit the
history they fetch, e.g. by shallowly cloning.
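
For example (the URL is only a placeholder):

    git clone --depth 1 git://example.com/project.git

would fetch only the latest commit instead of the full history.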

>
> Lou
>
> -----Original Message-----
> From: Junio C Hamano [mailto:gitster@pobox.com] 
> Sent: Tuesday, May 20, 2014 1:18 PM
> To: Stewart, Louis (IS)
> Cc: git@vger.kernel.org
> Subject: EXT :Re: GIT and large files
>
> "Stewart, Louis (IS)" <louis.stewart@ngc.com> writes:
>
>> Can GIT handle versioning of large 20+ GB files in a directory?
>
> I think you can "git add" such files, push/fetch histories that contains such files over the wire, and "git checkout" such files, but naturally reading, processing and writing 20+GB would take some time.  In order to run operations that need to see the changes, e.g. "git log -p", a real content-level merge, etc., you would also need sufficient memory because we do things in-core.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: EXT :Re: GIT and large files
  2014-05-20 18:14     ` Junio C Hamano
@ 2014-05-20 18:18       ` Stewart, Louis (IS)
  2014-05-20 19:01         ` Konstantin Khomoutov
  0 siblings, 1 reply; 10+ messages in thread
From: Stewart, Louis (IS) @ 2014-05-20 18:18 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git@vger.kernel.org

From your response, then, there is a method to obtain only the Project, Directory and Files (which could hold 80 GBs of data) and not the rest of the Repository that contains the full overall Projects?

-----Original Message-----
From: Junio C Hamano [mailto:gitster@pobox.com] 
Sent: Tuesday, May 20, 2014 2:15 PM
To: Stewart, Louis (IS)
Cc: git@vger.kernel.org
Subject: Re: EXT :Re: GIT and large files

"Stewart, Louis (IS)" <louis.stewart@ngc.com> writes:

> Thanks for the reply.  I just read the intro to GIT and I am concerned 
> about the part that it will copy the whole repository to the 
> developers work area.  They really just need the one directory and 
> files under that one directory. The history has TBs of data.

Then you will spend time reading, processing and writing TBs of data when you clone, unless your developers do something to limit the history they fetch, e.g. by shallowly cloning.

>
> Lou
>
> -----Original Message-----
> From: Junio C Hamano [mailto:gitster@pobox.com]
> Sent: Tuesday, May 20, 2014 1:18 PM
> To: Stewart, Louis (IS)
> Cc: git@vger.kernel.org
> Subject: EXT :Re: GIT and large files
>
> "Stewart, Louis (IS)" <louis.stewart@ngc.com> writes:
>
>> Can GIT handle versioning of large 20+ GB files in a directory?
>
> I think you can "git add" such files, push/fetch histories that contains such files over the wire, and "git checkout" such files, but naturally reading, processing and writing 20+GB would take some time.  In order to run operations that need to see the changes, e.g. "git log -p", a real content-level merge, etc., you would also need sufficient memory because we do things in-core.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: EXT :Re: GIT and large files
  2014-05-20 17:24   ` EXT :Re: " Stewart, Louis (IS)
  2014-05-20 18:14     ` Junio C Hamano
@ 2014-05-20 18:27     ` Thomas Braun
  1 sibling, 0 replies; 10+ messages in thread
From: Thomas Braun @ 2014-05-20 18:27 UTC (permalink / raw)
  To: Stewart, Louis (IS); +Cc: Junio C Hamano, git@vger.kernel.org

Am Dienstag, den 20.05.2014, 17:24 +0000 schrieb Stewart, Louis (IS):
> Thanks for the reply.  I just read the intro to GIT and I am concerned
> about the part that it will copy the whole repository to the developers
> work area.  They really just need the one directory and files under
> that one directory. The history has TBs of data.
> 
> Lou
> 
> -----Original Message-----
> From: Junio C Hamano [mailto:gitster@pobox.com] 
> Sent: Tuesday, May 20, 2014 1:18 PM
> To: Stewart, Louis (IS)
> Cc: git@vger.kernel.org
> Subject: EXT :Re: GIT and large files
> 
> "Stewart, Louis (IS)" <louis.stewart@ngc.com> writes:
> 
> > Can GIT handle versioning of large 20+ GB files in a directory?
> 
> I think you can "git add" such files, push/fetch histories that
> contains such files over the wire, and "git checkout" such files, but
> naturally reading, processing and writing 20+GB would take some time. 
> In order to run operations that need to see the changes, e.g. "git log
> -p", a real content-level merge, etc., you would also need sufficient
> memory because we do things in-core.

You can prevent a clone from fetching the whole history by using the
--depth option of git clone.

The question is what you want to do with these 20GB files.
Just store them in the repo and *very* occasionally change them?
For that you need a 64-bit build of git with enough RAM; 32GB
does the trick here. Everything below was done with git 1.9.1.

Doing some tests on my machine with a normal hard disk gives (sorry for
LC_ALL != C - the German status lines below mean "On branch master /
nothing to commit, working directory unchanged"):
$time git add file.dat; time git commit -m "add file"; time git status

real    16m17.913s
user    13m3.965s
sys     0m22.461s
[master 15fa953] add file
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 file.dat

real    15m36.666s
user    13m26.962s
sys     0m16.185s
# Auf Branch master
nichts zu committen, Arbeitsverzeichnis unverändert

real    11m58.936s
user    11m50.300s
sys     0m5.468s

$ls -lh
-rw-r--r-- 1 thomas thomas 20G Mai 20 19:01 file.dat

So this works, but it isn't fast.

Playing some tricks with --assume-unchanged helps here:
$git update-index --assume-unchanged file.dat
$time git status
# Auf Branch master
nichts zu committen, Arbeitsverzeichnis unverändert

real    0m0.003s
user    0m0.000s
sys     0m0.000s

This trick is only safe if you *know* that file.dat does not change.

And btw I also set 
$cat .gitattributes 
*.dat -delta
as delta compression should be skipped in any case.

Pushing and pulling these files to and from a server needs some tweaking
on the server side, otherwise the occasional git gc might kill the box.
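
As a starting point (the values are only what I would try first, not
tuned numbers), something like this on the server keeps repack memory
bounded and stops automatic gc:

$git config core.bigFileThreshold 512m
$git config pack.windowMemory 256m
$git config pack.threads 1
$git config gc.auto 0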
 
Btw, I happily have files of 1.5GB in my git repositories and also
change them, and I also work with Git for Windows. So in this region of
file sizes things work quite well.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: EXT :Re: GIT and large files
  2014-05-20 18:18       ` Stewart, Louis (IS)
@ 2014-05-20 19:01         ` Konstantin Khomoutov
  0 siblings, 0 replies; 10+ messages in thread
From: Konstantin Khomoutov @ 2014-05-20 19:01 UTC (permalink / raw)
  To: Stewart, Louis (IS); +Cc: Junio C Hamano, git@vger.kernel.org

On Tue, 20 May 2014 18:18:08 +0000
"Stewart, Louis (IS)" <louis.stewart@ngc.com> wrote:

> From you response then there is a method to only obtain the Project,
> Directory and Files (which could hold 80 GBs of data) and not the
> rest of the Repository that contained the full overall Projects?

Please google the phrase "Git shallow cloning".
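
A minimal example (the URL and path are only placeholders), combining a
shallow clone with a sparse checkout so a developer gets only the
directory they care about:

  git clone --depth 1 --no-checkout <repository-url> project
  cd project
  git config core.sparseCheckout true
  echo "path/to/directory/" > .git/info/sparse-checkout
  git checkout master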

I would also recommend reading up on git-annex [1].

You might also consider using Subversion, as it seems you do not need
most of the benefits Git has over it and do want certain benefits
Subversion has over Git:
* You don't need a distributed VCS (as you don't want each developer to
  have a full clone).
* You only need a single slice of the repository history at any given
  revision on a developer's machine, and this is *almost* what
  Subversion does: it will keep the so-called "base" (or "pristine")
  versions of files comprising the revision you will check out, plus
  the checked-out files themselves.  So, twice the space of the files
  comprising a revision.
* Subversion allows you to check out only a single folder out of the
  entire revision.
* IIRC, Subversion supports locks, where a developer can tell the
  server they're editing a file, and this will prevent other devs from
  locking the same file.  This can be used to serialize edits of huge
  and/or unmergeable files (see the sketch after this list).  Git can't
  do that (without non-standard tools deployed on the side or a
  centralized "meeting point" repository).

My point is that while Git is fantastic for managing source code
projects and projects of a similar type with regard to their contents,
it seems your requirements are mainly not suitable for the use case
Git is best tailored for.  Your apparent lack of familiarity with Git
might well bite you later should you pick it right now.  At least
please consider reading a book or some other introduction-level
material on Git to get the feeling of typical workflows used with it.


1. https://git-annex.branchable.com/

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2014-05-20 19:09 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-20 15:37 GIT and large files Stewart, Louis (IS)
2014-05-20 16:03 ` Jason Pyeron
     [not found] ` <CALygMcCifDd4LAddZJ4tNcqqwBSvb6BGzTODHBzshBOjCwSrHQ@mail.gmail.com>
2014-05-20 16:53   ` EXT :Re: " Stewart, Louis (IS)
2014-05-20 17:08 ` Marius Storm-Olsen
2014-05-20 17:18 ` Junio C Hamano
2014-05-20 17:24   ` EXT :Re: " Stewart, Louis (IS)
2014-05-20 18:14     ` Junio C Hamano
2014-05-20 18:18       ` Stewart, Louis (IS)
2014-05-20 19:01         ` Konstantin Khomoutov
2014-05-20 18:27     ` Thomas Braun
