* GIT and large files
@ 2014-05-20 15:37 Stewart, Louis (IS)
2014-05-20 16:03 ` Jason Pyeron
` (3 more replies)
0 siblings, 4 replies; 10+ messages in thread
From: Stewart, Louis (IS) @ 2014-05-20 15:37 UTC (permalink / raw)
To: git@vger.kernel.org
Can GIT handle versioning of large 20+ GB files in a directory?
Lou Stewart
AOCWS Software Configuration Management
757-269-2388
* RE: GIT and large files
2014-05-20 15:37 GIT and large files Stewart, Louis (IS)
@ 2014-05-20 16:03 ` Jason Pyeron
[not found] ` <CALygMcCifDd4LAddZJ4tNcqqwBSvb6BGzTODHBzshBOjCwSrHQ@mail.gmail.com>
` (2 subsequent siblings)
3 siblings, 0 replies; 10+ messages in thread
From: Jason Pyeron @ 2014-05-20 16:03 UTC (permalink / raw)
To: 'Stewart, Louis (IS)', git
> -----Original Message-----
> From: Stewart, Louis (IS)
> Sent: Tuesday, May 20, 2014 11:38
>
> Can GIT handle versioning of large 20+ GB files in a directory?
Are you asking about 20 files of 1 GB each, or files of 20 GB each?
Knowing the what and why may help with the underlying questions.
v/r,
Jason Pyeron
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
- -
- Jason Pyeron PD Inc. http://www.pdinc.us -
- Principal Consultant 10 West 24th Street #100 -
- +1 (443) 269-1555 x333 Baltimore, Maryland 21218 -
- -
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
This message is copyright PD Inc, subject to license 20080407P00.
* RE: EXT :Re: GIT and large files
[not found] ` <CALygMcCifDd4LAddZJ4tNcqqwBSvb6BGzTODHBzshBOjCwSrHQ@mail.gmail.com>
@ 2014-05-20 16:53 ` Stewart, Louis (IS)
0 siblings, 0 replies; 10+ messages in thread
From: Stewart, Louis (IS) @ 2014-05-20 16:53 UTC (permalink / raw)
To: Gary Fixler; +Cc: git@vger.kernel.org
The files in question would be in a directory containing many files, some small and others huge (for example, text files, docs, and jpgs are MBs, but executables and OVA images are GBs, etc.).
Lou
From: Gary Fixler [mailto:gfixler@gmail.com]
Sent: Tuesday, May 20, 2014 12:09 PM
To: Stewart, Louis (IS)
Cc: git@vger.kernel.org
Subject: EXT :Re: GIT and large files
Technically yes, but from a practical standpoint, not really. Facebook recently revealed that they have a 54GB git repo[1], but I doubt it has 20+GB files in it. I've put 18GB of photos into a git repo, but everything about the process was fairly painful, and I don't plan to do it again.
Are your files non-mergeable binaries (e.g. videos)? The biggest problem here is with branching and merging. Conflict resolution with non-mergeable assets ends up as an us-vs-them fight, and I don't understand all of the particulars of that. From git's standpoint it's simple - you just have to choose one or the other. From a workflow standpoint, you end up causing trouble if two people have changed an asset, and both people consider their change important. Centralized systems get around this problem with locks.
Git could do this, and I've thought about it quite a bit. I work in games - we have code, but also a lot of binaries that I'd like to keep in sync with the code. For a while I considered suggesting some ideas to this group, but I'm pretty sure the locking issue makes it a non-starter. The basic idea - skipping locking for the moment - would be to allow setting git attributes by file type, file size threshold, folder, etc., so that git knows that some files are considered "bigfiles." These could be placed into the objects folder, but I'd actually prefer they go into a .git/bigfile folder. They'd still be saved as contents under their hash, but a normal git transfer wouldn't send them. They'd be in the tree as 'big' or 'bigfile' (instead of 'blob', 'tree', or 'commit' (for submodules)).
Git would warn you on push that there were bigfiles to send, and you could add, say, --with-big to also send them, or send them later with, say, `git push --big`. They'd simply be zipped up and sent over, without any packfile fanciness. When you clone, you wouldn't get the bigfiles, unless you specified --with-big, and it would warn you that there are also bigfiles, and tell you what command to run to also get them (`git fetch --big`, perhaps). Git status would always let you know if you were missing bigfiles. I think hopping around between commits would follow the same strategy: you'd always have to, e.g., `git checkout foo --with-big`, or `git checkout foo` and then `git update big` (or whatever - I'm not married to any of these names).
Resolving conflicts on merge would simply have to be up to you. It would be documented clearly that you're entering weird territory, and that your team has to deal with bigfiles somehow, perhaps with some suggested strategies ("Pass the conch?"). I could imagine some strategies for this. Maybe bigfiles require connecting to a blessed repo to grab the right to make a commit on it. That has many problems, of course, and now I can feel everyone reading this shifting uneasily in their seats :)
-g
[1] https://twitter.com/feross/status/459259593630433280
On Tue, May 20, 2014 at 8:37 AM, Stewart, Louis (IS) <louis.stewart@ngc.com> wrote:
Can GIT handle versioning of large 20+ GB files in a directory?
Lou Stewart
AOCWS Software Configuration Management
757-269-2388
* Re: GIT and large files
2014-05-20 15:37 GIT and large files Stewart, Louis (IS)
2014-05-20 16:03 ` Jason Pyeron
[not found] ` <CALygMcCifDd4LAddZJ4tNcqqwBSvb6BGzTODHBzshBOjCwSrHQ@mail.gmail.com>
@ 2014-05-20 17:08 ` Marius Storm-Olsen
2014-05-20 17:18 ` Junio C Hamano
3 siblings, 0 replies; 10+ messages in thread
From: Marius Storm-Olsen @ 2014-05-20 17:08 UTC (permalink / raw)
To: Stewart, Louis (IS), git@vger.kernel.org
On 5/20/2014 10:37 AM, Stewart, Louis (IS) wrote:
> Can GIT handle versioning of large 20+ GB files in a directory?
Maybe you're looking for git-annex?
https://git-annex.branchable.com/
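A minimal sketch of a git-annex workflow, assuming git-annex is installed (the repository and file names here are only placeholders):
$git init bigrepo && cd bigrepo
$git annex init "my workstation"
$git annex add image.ova        # content goes under .git/annex, a symlink is committed
$git commit -m "add image.ova"
$git annex get image.ova        # in a clone: fetch the large content on demand
$git annex drop image.ova       # free local space once another copy exists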
--
.marius
* Re: GIT and large files
2014-05-20 15:37 GIT and large files Stewart, Louis (IS)
` (2 preceding siblings ...)
2014-05-20 17:08 ` Marius Storm-Olsen
@ 2014-05-20 17:18 ` Junio C Hamano
2014-05-20 17:24 ` EXT :Re: " Stewart, Louis (IS)
3 siblings, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2014-05-20 17:18 UTC (permalink / raw)
To: Stewart, Louis (IS); +Cc: git@vger.kernel.org
"Stewart, Louis (IS)" <louis.stewart@ngc.com> writes:
> Can GIT handle versioning of large 20+ GB files in a directory?
I think you can "git add" such files, push/fetch histories that
contain such files over the wire, and "git checkout" such files,
but naturally reading, processing and writing 20+GB would take some
time. In order to run operations that need to see the changes,
e.g. "git log -p", a real content-level merge, etc., you would also
need sufficient memory because we do things in-core.
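As an aside, a common mitigation is to stop git from even trying to diff or delta-compress such files; the threshold and patterns below are only illustrative:
$git config core.bigFileThreshold 512m         # blobs above this size are not deltified or diffed
$echo '*.ova -diff -delta' >> .gitattributes   # treat known-huge types as opaque binaries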
* RE: EXT :Re: GIT and large files
2014-05-20 17:18 ` Junio C Hamano
@ 2014-05-20 17:24 ` Stewart, Louis (IS)
2014-05-20 18:14 ` Junio C Hamano
2014-05-20 18:27 ` Thomas Braun
0 siblings, 2 replies; 10+ messages in thread
From: Stewart, Louis (IS) @ 2014-05-20 17:24 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git@vger.kernel.org
Thanks for the reply. I just read the intro to GIT and I am concerned about the part that it will copy the whole repository to the developers work area. They really just need the one directory and files under that one directory. The history has TBs of data.
Lou
-----Original Message-----
From: Junio C Hamano [mailto:gitster@pobox.com]
Sent: Tuesday, May 20, 2014 1:18 PM
To: Stewart, Louis (IS)
Cc: git@vger.kernel.org
Subject: EXT :Re: GIT and large files
"Stewart, Louis (IS)" <louis.stewart@ngc.com> writes:
> Can GIT handle versioning of large 20+ GB files in a directory?
I think you can "git add" such files, push/fetch histories that contain such files over the wire, and "git checkout" such files, but naturally reading, processing and writing 20+GB would take some time. In order to run operations that need to see the changes, e.g. "git log -p", a real content-level merge, etc., you would also need sufficient memory because we do things in-core.
* Re: EXT :Re: GIT and large files
2014-05-20 17:24 ` EXT :Re: " Stewart, Louis (IS)
@ 2014-05-20 18:14 ` Junio C Hamano
2014-05-20 18:18 ` Stewart, Louis (IS)
2014-05-20 18:27 ` Thomas Braun
1 sibling, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2014-05-20 18:14 UTC (permalink / raw)
To: Stewart, Louis (IS); +Cc: git@vger.kernel.org
"Stewart, Louis (IS)" <louis.stewart@ngc.com> writes:
> Thanks for the reply. I just read the intro to GIT and I am
> concerned about the part that it will copy the whole repository to
> the developers work area. They really just need the one directory
> and files under that one directory. The history has TBs of data.
Then you will spend time reading, processing and writing TBs of data
when you clone, unless your developers do something to limit the
history they fetch, e.g. by shallowly cloning.
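For example, a shallow, single-branch clone that fetches only the most recent revision might look like this (the URL is a placeholder):
$git clone --depth 1 --single-branch git://server/project.git
$git fetch --depth 10           # deepen the history later if more of it is needed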
>
> Lou
>
> -----Original Message-----
> From: Junio C Hamano [mailto:gitster@pobox.com]
> Sent: Tuesday, May 20, 2014 1:18 PM
> To: Stewart, Louis (IS)
> Cc: git@vger.kernel.org
> Subject: EXT :Re: GIT and large files
>
> "Stewart, Louis (IS)" <louis.stewart@ngc.com> writes:
>
>> Can GIT handle versioning of large 20+ GB files in a directory?
>
> I think you can "git add" such files, push/fetch histories that contain such files over the wire, and "git checkout" such files, but naturally reading, processing and writing 20+GB would take some time. In order to run operations that need to see the changes, e.g. "git log -p", a real content-level merge, etc., you would also need sufficient memory because we do things in-core.
* RE: EXT :Re: GIT and large files
2014-05-20 18:14 ` Junio C Hamano
@ 2014-05-20 18:18 ` Stewart, Louis (IS)
2014-05-20 19:01 ` Konstantin Khomoutov
0 siblings, 1 reply; 10+ messages in thread
From: Stewart, Louis (IS) @ 2014-05-20 18:18 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git@vger.kernel.org
From your response, is there then a method to obtain only the project, directory, and files (which could hold 80 GB of data) and not the rest of the repository that contains the full overall projects?
-----Original Message-----
From: Junio C Hamano [mailto:gitster@pobox.com]
Sent: Tuesday, May 20, 2014 2:15 PM
To: Stewart, Louis (IS)
Cc: git@vger.kernel.org
Subject: Re: EXT :Re: GIT and large files
"Stewart, Louis (IS)" <louis.stewart@ngc.com> writes:
> Thanks for the reply. I just read the intro to GIT and I am concerned
> about the part that it will copy the whole repository to the
> developers work area. They really just need the one directory and
> files under that one directory. The history has TBs of data.
Then you will spend time reading, processing and writing TBs of data when you clone, unless your developers do something to limit the history they fetch, e.g. by shallowly cloning.
>
> Lou
>
> -----Original Message-----
> From: Junio C Hamano [mailto:gitster@pobox.com]
> Sent: Tuesday, May 20, 2014 1:18 PM
> To: Stewart, Louis (IS)
> Cc: git@vger.kernel.org
> Subject: EXT :Re: GIT and large files
>
> "Stewart, Louis (IS)" <louis.stewart@ngc.com> writes:
>
>> Can GIT handle versioning of large 20+ GB files in a directory?
>
> I think you can "git add" such files, push/fetch histories that contain such files over the wire, and "git checkout" such files, but naturally reading, processing and writing 20+GB would take some time. In order to run operations that need to see the changes, e.g. "git log -p", a real content-level merge, etc., you would also need sufficient memory because we do things in-core.
* Re: EXT :Re: GIT and large files
2014-05-20 17:24 ` EXT :Re: " Stewart, Louis (IS)
2014-05-20 18:14 ` Junio C Hamano
@ 2014-05-20 18:27 ` Thomas Braun
1 sibling, 0 replies; 10+ messages in thread
From: Thomas Braun @ 2014-05-20 18:27 UTC (permalink / raw)
To: Stewart, Louis (IS); +Cc: Junio C Hamano, git@vger.kernel.org
On Tuesday, 2014-05-20 at 17:24 +0000, Stewart, Louis (IS) wrote:
> Thanks for the reply. I just read the intro to GIT and I am concerned
> about the part that it will copy the whole repository to the developers
> work area. They really just need the one directory and files under
> that one directory. The history has TBs of data.
>
> Lou
>
> -----Original Message-----
> From: Junio C Hamano [mailto:gitster@pobox.com]
> Sent: Tuesday, May 20, 2014 1:18 PM
> To: Stewart, Louis (IS)
> Cc: git@vger.kernel.org
> Subject: EXT :Re: GIT and large files
>
> "Stewart, Louis (IS)" <louis.stewart@ngc.com> writes:
>
> > Can GIT handle versioning of large 20+ GB files in a directory?
>
> I think you can "git add" such files, push/fetch histories that
> contain such files over the wire, and "git checkout" such files, but
> naturally reading, processing and writing 20+GB would take some time.
> In order to run operations that need to see the changes, e.g. "git log
> -p", a real content-level merge, etc., you would also need sufficient
> memory because we do things in-core.
Preventing a clone from fetching the whole history can be done with the
--depth option of git clone.
The question is: what do you want to do with these 20G files?
Just store them in the repo and *very* occasionally change them?
For that you need a 64-bit build of git with enough RAM; 32G
does the trick here. Everything below was done with git 1.9.1.
Doing some tests on my machine with a normal hard disk gives:
$time git add file.dat; time git commit -m "add file"; time git status
real 16m17.913s
user 13m3.965s
sys 0m22.461s
[master 15fa953] add file
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100644 file.dat
real 15m36.666s
user 13m26.962s
sys 0m16.185s
# On branch master
nothing to commit, working directory clean
real 11m58.936s
user 11m50.300s
sys 0m5.468s
$ls -lh
-rw-r--r-- 1 thomas thomas 20G May 20 19:01 file.dat
So this works but isn't fast.
Playing some tricks with --assume-unchanged helps here:
$git update-index --assume-unchanged file.dat
$time git status
# On branch master
nothing to commit, working directory clean
real 0m0.003s
user 0m0.000s
sys 0m0.000s
This trick is only safe if you *know* that file.dat does not change.
And btw I also set
$cat .gitattributes
*.dat -delta
as delta compression should be skipped in any case.
Pushing and pulling these files to and from a server needs some tweaking
on the server side, otherwise the occasional git gc might kill the box.
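One possible set of server-side settings for that (the values are purely illustrative):
$git config gc.auto 0              # no automatic gc; repack manually at quiet times
$git config pack.windowMemory 256m # cap memory used per thread while repacking
$git config pack.packSizeLimit 2g  # split the result into several packs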
Btw, I happily have 1.5GB files in my git repositories and also change
them, and I also work with Git for Windows. So in this region of file
sizes things work quite well.
* Re: EXT :Re: GIT and large files
2014-05-20 18:18 ` Stewart, Louis (IS)
@ 2014-05-20 19:01 ` Konstantin Khomoutov
0 siblings, 0 replies; 10+ messages in thread
From: Konstantin Khomoutov @ 2014-05-20 19:01 UTC (permalink / raw)
To: Stewart, Louis (IS); +Cc: Junio C Hamano, git@vger.kernel.org
On Tue, 20 May 2014 18:18:08 +0000
"Stewart, Louis (IS)" <louis.stewart@ngc.com> wrote:
> From your response, is there then a method to obtain only the project,
> directory, and files (which could hold 80 GB of data) and not the
> rest of the repository that contains the full overall projects?
Please google the phrase "Git shallow cloning".
I would also recommend reading up on git-annex [1].
You might also consider using Subversion, as it seems you do not need
most of the benefits Git has over it and do want certain benefits
Subversion has over Git:
* You don't need a distributed VCS (as you don't want each developer to
have a full clone).
* You only need a single slice of the repository history at any given
revision on a developer's machine, and this is *almost* what
Subversion does: it will keep the so-called "base" (or "pristine")
versions of files comprising the revision you will check out, plus
the checked-out files themselves. So, twice the space of the files
comprising a revision.
* Subversion allows you to check out only a single folder out of the
entire revision.
* IIRC, Subversion supports locks, where a developer can tell the
server they're editing a file, and this will prevent other devs from
locking the same file. This might be used to serialize edits of
huge and/or unmergeable files (see the sketch after this list). Git
can't do that (without non-standard tools deployed on the side or a
centralized "meeting point" repository).
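A rough sketch of those last two points (the URL and file names are placeholders):
$svn checkout https://svn.example.com/repo/trunk/bigassets   # check out a single directory
$cd bigassets
$svn lock image.ova -m "editing the disk image"              # others cannot lock or commit it now
$svn commit -m "update image" image.ova                      # committing releases the lock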
My point is that while Git is fantastic for managing source code
projects and projects of similar types with regard to their contents,
it seems your requirements are mostly not suited to the use case
Git is best tailored for. Your apparent lack of familiarity with Git
may well bite you later should you pick it right now. At least
please consider reading a book or some other introduction-level
material on Git to get a feel for the typical workflows used with it.
1. https://git-annex.branchable.com/
Thread overview: 10+ messages
2014-05-20 15:37 GIT and large files Stewart, Louis (IS)
2014-05-20 16:03 ` Jason Pyeron
[not found] ` <CALygMcCifDd4LAddZJ4tNcqqwBSvb6BGzTODHBzshBOjCwSrHQ@mail.gmail.com>
2014-05-20 16:53 ` EXT :Re: " Stewart, Louis (IS)
2014-05-20 17:08 ` Marius Storm-Olsen
2014-05-20 17:18 ` Junio C Hamano
2014-05-20 17:24 ` EXT :Re: " Stewart, Louis (IS)
2014-05-20 18:14 ` Junio C Hamano
2014-05-20 18:18 ` Stewart, Louis (IS)
2014-05-20 19:01 ` Konstantin Khomoutov
2014-05-20 18:27 ` Thomas Braun