Git as electronic lab notebook

Git development

* Git as electronic lab notebook
@ 2009-12-19 12:23 Thomas Johnson
  2009-12-19 13:38 ` Ciprian Dorin, Craciun
  0 siblings, 1 reply; 6+ messages in thread
From: Thomas Johnson @ 2009-12-19 12:23 UTC
  To: git

Hello group,

I've been using git on a few different projects over the last couple of months,
and as a former svn user I really like it. Recently, I've been using it as an
'electronic lab notebook' for an empirical project. My workflow looks like this:
1. Start with the stable code base on head
2. Create and change to branch 'Experiment123'
3. Make some changes
4. Run the program, which generates a giant (10MB-4G) output text file,
Experiment123.log. Update my LabNotebook.txt file.
5. Were the new changes helpful?
5.yes: Bzip Experiment123.log, and commit it on the branch. Merge the
Experiment123 branch to head and goto 1.
5.no: Bzip Experiment123.log, and commit it on the branch. Merge LabNotebook.txt
and Experiment123.log back to head. Switch back to head and goto 1.
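
One iteration of the steps above can be sketched as follows. This is only an illustration under assumptions: the scratch repository, `program.txt`, the file contents, the commit messages, the identity settings and the `helpful` variable are all placeholders, and the bzip step is left out so the log is committed as plain text.

```shell
#!/bin/sh
# Sketch of one pass through steps 1-5 in a throwaway repository.
set -e
helpful=${helpful:-yes}            # hypothetical switch: "yes" or "no"

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Lab User"                 # placeholder identity
git config user.email "lab@example.invalid"
main=$(git symbolic-ref --short HEAD)           # name of the default branch

# 1. Stable code base.
echo "stable code" > program.txt
echo "Lab notebook" > LabNotebook.txt
git add program.txt LabNotebook.txt
git commit -q -m "Stable baseline"

# 2-3. Create the experiment branch and change something.
git checkout -q -b Experiment123
echo "experimental change" >> program.txt

# 4. Stand-in for running the program and updating the notebook.
echo "run output" > Experiment123.log
echo "Experiment123: notes on this run" >> LabNotebook.txt

# 5. Commit everything on the branch, then return to the stable branch.
git add program.txt LabNotebook.txt Experiment123.log
git commit -q -m "Experiment123: code change, notes and log"
git checkout -q "$main"
if [ "$helpful" = yes ]; then
    # 5.yes: take the whole branch.
    git merge -q --no-ff -m "Merge Experiment123" Experiment123
else
    # 5.no: take only the notebook and the log, not the code change.
    git checkout Experiment123 -- LabNotebook.txt Experiment123.log
    git commit -q -m "Record Experiment123 notes and log only"
fi
git --no-pager log --oneline --graph
```

Running it with `helpful=no` in the environment exercises the second path, where the code change stays on the branch.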

The thing is, Experiment123.log is going to be very similar to Experiment122.log
and Experiment124.log except for a few details. My understanding is that git is
great at compressing groups of files like this, is that correct? Should I not be
bzipping them myself? On the other hand, I don't want HEAD to contain hundreds
of gigs of uncompressed files that bzip down to only a few hundred megs.
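
One way to check the packing question is to measure it on stand-in data. The sketch below assumes a throwaway repository and two generated files that only imitate similar logs (they differ in a single line); the file contents, sizes and identity settings are placeholders, and real logs may behave differently.

```shell
#!/bin/sh
# Sketch: commit two similar plain-text logs and see how large the pack is.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Lab User"                 # placeholder identity
git config user.email "lab@example.invalid"

# Two stand-in logs that differ in one line.
awk 'BEGIN { for (i = 1; i <= 20000; i++) print "step", i, "value", i * 7 }' \
    > Experiment122.log
sed 's/^step 10000 .*/step 10000 changed/' Experiment122.log > Experiment123.log

git add Experiment122.log Experiment123.log
git commit -q -m "Add uncompressed logs"

# Repack, then compare the raw size with the size of the pack on disk.
git gc -q
raw_bytes=$(cat Experiment122.log Experiment123.log | wc -c)
pack_kib=$(git count-objects -v | awk '/^size-pack:/ { print $2 }')
echo "raw logs: ${raw_bytes} bytes, pack: ${pack_kib} KiB"
```

Repeating the run with the logs bzipped before `git add` gives the figure to compare against.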

Any thoughts on the workflow itself would also be very welcome.


* Re: Git as electronic lab notebook
  2009-12-19 12:23 Git as electronic lab notebook Thomas Johnson
@ 2009-12-19 13:38 ` Ciprian Dorin, Craciun
  2009-12-20  0:15   ` Johan 't Hart
  0 siblings, 1 reply; 6+ messages in thread
From: Ciprian Dorin, Craciun @ 2009-12-19 13:38 UTC
  To: Thomas Johnson; +Cc: git

On Sat, Dec 19, 2009 at 2:23 PM, Thomas Johnson
<thomas.j.johnson@gmail.com> wrote:
> Hello group,
>
> I've been using git on a few different projects over the last couple of months,
> and as a former svn user I really like it. Recently, I've been using it as an
> 'electronic lab notebook' for an empirical project. My workflow looks like this:
> 1. Start with the stable code base on head
> 2. Create  and change to branch 'Experiment123'
> 3. Make some changes
> 4. Run the program, which generates a giant (10MB-4G) output text file,
> Experiment123.log. Update my LabNotebook.txt file.
> 5. Were the new changes helpful?
> 5.yes: Bzip Experiment123.log, and commit it on the branch. Merge the
> Experiment123 branch to head and goto 1.
> 5.no: Bzip Experiment123.log, and commit it on the branch. Merge LabNotebook.txt
> and Experiment123.log back to head. Switch back to head and goto 1.
>
> The thing is, Experiment123.log is going to be very similar to Experiment122.log
> and Experiment124.log except for a few details. My understanding is that git is
> great at compressing groups of files like this, is that correct? Should I not be
> bzipping them myself? On the other hand, I don't want HEAD to contain hundreds
> of gigs of uncompressed files that bzip down to only a few hundred megs.
>
> Any thoughts on the workflow itself would also be very welcome.


    I have used a similar workflow myself for parametric studies
on some genetic algorithms, and below are my observations related to
your question:
    * saving the entire log file (zipped or not) in the repository
has a drawback when cloning; (in my setup I ran the tests in parallel
on a different machine, and used Git to synchronize between the
development machine and the test machine;) the problem is that when I
wanted to "clean" the test machine and start over I had to clone the
repository, which also held all the unneeded log files;
    * (actually I used two Git repositories -- one for the actual
source code, where I make the commits by hand, and another one which I
use for the synchronization;)
    * even if you prefer keeping the logs, it's best to let Git handle
the compression; because even if only some small parts of the original
txt file change, I would guess that the BZip-ped file looks quite
different;
    * maybe it would be better, instead of keeping the whole experiment
log, to keep just a summary of it (only the important stuff); and even
if you do need the entire log, you could always recreate it by running
the code again; (this was the road I took in the end, by keeping a
small SQLite database of each experiment;)
    * (and of course there is another little trick I've used: just
put the log files in a `log` directory which is "git-ignored"; that
way you can switch between branches and Git won't touch the `log`
directory, unless you force it by issuing `git clean -f -d -x`;)
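
The last trick in the list can be sketched like this. It is a minimal illustration, assuming a throwaway repository; the branch name, log file name and identity settings are placeholders.

```shell
#!/bin/sh
# Sketch: keep logs in an ignored directory so branch switches leave them alone.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Lab User"                 # placeholder identity
git config user.email "lab@example.invalid"
main=$(git symbolic-ref --short HEAD)           # name of the default branch

# Ignore everything under log/.
echo "log/" > .gitignore
git add .gitignore
git commit -q -m "Ignore the log directory"

# A log written while on an experiment branch is not seen by Git.
git checkout -q -b Experiment123
mkdir log
echo "run output" > log/Experiment123.log
git status --porcelain             # the ignored file is not listed

# Switching branches leaves the ignored file in place.
git checkout -q "$main"
[ -f log/Experiment123.log ] && echo "log survived the branch switch"

# Only an explicit clean with -x touches ignored files.
git clean -n -d -x                 # dry run: shows what would be removed
git clean -f -d -x                 # actually removes log/
```

The dry run with `-n` is worth doing first, since `-x` also deletes any other ignored files in the working tree.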

    Hope I've been useful,
    Ciprian.


* Re: Git as electronic lab notebook
  2009-12-19 13:38 ` Ciprian Dorin, Craciun
@ 2009-12-20  0:15   ` Johan 't Hart
  2009-12-20  3:15     ` Nicolas Pitre
  0 siblings, 1 reply; 6+ messages in thread
From: Johan 't Hart @ 2009-12-20  0:15 UTC
  To: Ciprian Dorin, Craciun; +Cc: Thomas Johnson, git

Ciprian Dorin, Craciun wrote:
> On Sat, Dec 19, 2009 at 2:23 PM, Thomas Johnson
> <thomas.j.johnson@gmail.com> wrote:

>> 4. Run the program, which generates a giant (10MB-4G) output text file,
>> Experiment123.log. Update my LabNotebook.txt file.

>     * even if you prefer having the logs, it's best to let Git handle
> the compression; because even if only some small parts change from the
> original txt file, I would guess that the BZip-ped file looks quite
> different;
>

Is git able to handle 4Gig files? I've heard git loads every file 
completely in memory before handling it...


* Re: Git as electronic lab notebook
  2009-12-20  0:15   ` Johan 't Hart
@ 2009-12-20  3:15     ` Nicolas Pitre
  2009-12-20  4:43       ` Bill Lear
  0 siblings, 1 reply; 6+ messages in thread
From: Nicolas Pitre @ 2009-12-20  3:15 UTC
  To: Johan 't Hart; +Cc: Ciprian Dorin, Craciun, Thomas Johnson, git

On Sun, 20 Dec 2009, Johan 't Hart wrote:

> Is git able to handle 4Gig files? I've heard git loads every file completely
> in memory before handling it...

Right.  So with current Git you will be able to deal with 4GB files only
if you have a 64-bit machine and more than 4GB of RAM.


Nicolas


* Re: Git as electronic lab notebook
  2009-12-20  3:15     ` Nicolas Pitre
@ 2009-12-20  4:43       ` Bill Lear
  2009-12-20  4:55         ` Nicolas Pitre
  0 siblings, 1 reply; 6+ messages in thread
From: Bill Lear @ 2009-12-20  4:43 UTC
  To: Nicolas Pitre
  Cc: Johan 't Hart, Ciprian Dorin, Craciun, Thomas Johnson, git

On Saturday, December 19, 2009 at 22:15:00 (-0500) Nicolas Pitre writes:
>On Sun, 20 Dec 2009, Johan 't Hart wrote:
>
>> Is git able to handle 4Gig files? I've heard git loads every file completely
>> in memory before handling it...
>
>Right.  So with current Git you will be able to deal with 4GB files only
>if you have a 64-bit machine and more than 4GB of RAM.

??

% uname -a
Linux pppp 2.6.31.6-166.fc12.i686 #1 SMP Wed Dec 9 11:14:59 EST 2009 i686 i686 i386 GNU/Linux
% cat /proc/meminfo  | grep MemTotal
MemTotal:        3095296 kB
% mkdir gogle
% cd gogle
% git init
% dd if=/dev/zero of=zerofile.tst bs=1k count=4700000
% git add *
% git commit -a -m new
[master (root-commit) 35a25be] new
 1 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 zerofile.tst
% git --version
git version 1.6.5.7

Seems ok to me...

Though, I find this interesting:

% git log -p
commit 35a25be3fff2f8bbd6ec22c94b9a5c0d66053d21
Author: Bill Lear <rael@zopyra.com>
Date:   Sat Dec 19 22:38:48 2009 -0600

    new

diff --git a/zerofile.tst b/zerofile.tst
new file mode 100644
index 0000000..e5bd39d
Binary files /dev/null and b/zerofile.tst differ


Bill


* Re: Git as electronic lab notebook
  2009-12-20  4:43       ` Bill Lear
@ 2009-12-20  4:55         ` Nicolas Pitre
  0 siblings, 0 replies; 6+ messages in thread
From: Nicolas Pitre @ 2009-12-20  4:55 UTC
  To: Bill Lear; +Cc: Johan 't Hart, Ciprian Dorin, Craciun, Thomas Johnson, git

On Sat, 19 Dec 2009, Bill Lear wrote:

> On Saturday, December 19, 2009 at 22:15:00 (-0500) Nicolas Pitre writes:
> >On Sun, 20 Dec 2009, Johan 't Hart wrote:
> >
> >> Is git able to handle 4Gig files? I've heard git loads every file completely
> >> in memory before handling it...
> >
> >Right.  So with current Git you will be able to deal with 4GB files only
> >if you have a 64-bit machine and more than 4GB of RAM.
> 
> ??
> 
> % uname -a
> Linux pppp 2.6.31.6-166.fc12.i686 #1 SMP Wed Dec 9 11:14:59 EST 2009 i686 i686 i386 GNU/Linux
> % cat /proc/meminfo  | grep MemTotal
> MemTotal:        3095296 kB
> % mkdir gogle
> % cd gogle
> % git init
> % dd if=/dev/zero of=zerofile.tst bs=1k count=4700000
> % git add *
> % git commit -a -m new
> [master (root-commit) 35a25be] new
>  1 files changed, 0 insertions(+), 0 deletions(-)
>  create mode 100644 zerofile.tst
> % git --version
> git version 1.6.5.7
> 
> Seems ok to me...

That's the easy part.  Diffing such files and delta compressing them, or
even checking them out, especially when delta compressed, just won't work
if you don't have the RAM.  Fixing this limitation would introduce
significant complexity in the code that no one felt was worth it.

I had some thoughts about supporting the addition of really huge files
in a Git repository where only add/commit/checkout/fetch/push would work
with no delta compression.  That hasn't materialized yet, though.
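
As a partial illustration of the "no delta compression" idea: the gitattributes documentation describes a `delta` attribute that can be unset per path so that delta compression is not attempted for matching blobs, and unsetting `diff` makes Git treat matching files as binary when diffing. The sketch below assumes a Git version that supports both attributes and uses a hypothetical `*.log` pattern; it does not remove the memory needed by other operations.

```shell
#!/bin/sh
# Sketch: opt *.log files out of delta compression and textual diffs.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

# Assumption: logs are the only huge files, so match them by extension.
printf '%s\n' '*.log -delta -diff' > .gitattributes

# Ask Git how it resolves the two attributes for a log file name.
git check-attr delta diff -- Experiment123.log
```

`git check-attr` only reports how the attributes resolve for a path, so it is a cheap way to confirm the pattern matches before committing large files.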


Nicolas

