Tackling Git Limitations with Singular Large Line-seperated Plaintext files

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Tackling Git Limitations with Singular Large Line-seperated Plaintext files
@ 2014-06-27  8:45 Jarrad Hope
  2014-06-27 15:45 ` Shawn Pearce
  0 siblings, 1 reply; 10+ messages in thread
From: Jarrad Hope @ 2014-06-27  8:45 UTC (permalink / raw)
  To: git

Hello,

As a software developer I've used git for years and have found it the
perfect solution for source control.

Lately I have found myself using git in a unique use-case - modifying
DNA/RNA sequences and storing them in git, which are essentially
software/source code for cells/life. For Bacteria and Viruses the
repo's are very small <10mb & compress nicely.

However on the extreme end of the spectrum a human genome can run in
at 50gb or say ~1gb per file/chromosome.

Now, this is not the binary problem and it is not the same as storing
media inside git - I have reviewed the solutions that exist for the
binary problem, such as git-annex, git-media & bup. But they don't
provide the featureset of git and the data i'm storing is more like
plaintext sourcecode with relatively small edits per commit.

I have googled and asked in #git which discussion mostly revolved
around these tools.

The only project that holds interest is a 2009 project, git-bigfiles -
however it is abit dated & the author is not interested in reviving
this project - referring me to git-annex. Unfortunately.

With that background;
I wanted to discuss the problems with git and how I can contribute to
the core project to best solve them.

>From my understanding the largest problem revolves around git's delta
discovery method, holding 2 files in memory at once - is there a
reason this could not be adapted to page/chunk the data in a sliding
window fashion ?

Are there any other issues I need to know about, is anyone else
working on making git more capable of handling large source files that
I can collaborate with?

Thanks for your time,
Jarrad

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files
  2014-06-27  8:45 Tackling Git Limitations with Singular Large Line-seperated Plaintext files Jarrad Hope
@ 2014-06-27 15:45 ` Shawn Pearce
  2014-06-27 17:48   ` Junio C Hamano
  0 siblings, 1 reply; 10+ messages in thread
From: Shawn Pearce @ 2014-06-27 15:45 UTC (permalink / raw)
  To: Jarrad Hope; +Cc: git

On Fri, Jun 27, 2014 at 1:45 AM, Jarrad Hope <me@jarradhope.com> wrote:
> As a software developer I've used git for years and have found it the
> perfect solution for source control.
>
> Lately I have found myself using git in a unique use-case - modifying
> DNA/RNA sequences and storing them in git, which are essentially
> software/source code for cells/life. For Bacteria and Viruses the
> repo's are very small <10mb & compress nicely.
>
> However on the extreme end of the spectrum a human genome can run in
> at 50gb or say ~1gb per file/chromosome.

Interesting. Unfortunately not everything is used like source code. :)

Git does source code well. I don't know enough to judge if DNA/RNA
sequence storage is similar enough to source code to benefit from
things like `git log -p` showing deltas over time, or if some other
algorithm would be more effective.

> From my understanding the largest problem revolves around git's delta
> discovery method, holding 2 files in memory at once - is there a
> reason this could not be adapted to page/chunk the data in a sliding
> window fashion ?

During delta discovery Git holds like 11 files in memory at once. One
T is the target file that you are trying to delta compress. The other
10 are in a window and Git compares T to each one of them in turn,
selecting the file that produces the smallest delta instruction
sequence to recreate T.

Because T is compared to 10ish other files (the window size is
tuneable), Git needs a full copy of T in memory for the entire compare
step. For any single compare, T is scanned through only once. If you
were doing a single compare (window size of "1"), T could be "on disk"
and paged through sequentially.

The files in the window need to be held entirely in memory, along with
a matching index. The actual delta compression algorithm is a
Rabin-Karp sliding window hash function. Copies can be made from any
part of the source file with no regard to ordering. This makes
paging/chunking the source file at both compression and decompression
time nearly impossible. Git jumps around the source file many times,
but it allows for efficient storage for movement of long sequences
within a file (e.g. move function foo() later in the file).

Maybe if you limited the window to 1 and limited the hash function to
avoid backing up in the source file so it could be paged, you can get
somewhere.

But you mentioned the files are O(1 GiB). Just buy more RAM? Modern
workstations have pretty good memory capacity.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files
  2014-06-27 15:45 ` Shawn Pearce
@ 2014-06-27 17:48   ` Junio C Hamano
  2014-06-27 19:38     ` Linus Torvalds
  0 siblings, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2014-06-27 17:48 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Jarrad Hope, git

Shawn Pearce <spearce@spearce.org> writes:

> Git does source code well. I don't know enough to judge if DNA/RNA
> sequence storage is similar enough to source code to benefit from
> things like `git log -p` showing deltas over time, or if some other
> algorithm would be more effective.
>
>> From my understanding the largest problem revolves around git's delta
>> discovery method, holding 2 files in memory at once - is there a
>> reason this could not be adapted to page/chunk the data in a sliding
>> window fashion ?
>
> During delta discovery Git holds like 11 files in memory at once....

Even though the original question mentioned "delta discovery", I
think what was being asked is not "delta" in the Git sense (which
your answer is about) but is "can we diff two long sequences of text
(that happens to consist of only 4-letter alphabet but that is a
irrelevant detail) without holding both in-core in their entirety?",
which is a more relevant question/desire from the application point
of view.

"Is there a reason this could not be adapted?"  No, there is no
particular reason why this "could not".  I think that the only
reason we only do in-core diff is because "adapting to page/chunk"
hasn't been anybody's high priority list of itches to scratch.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files
  2014-06-27 17:48   ` Junio C Hamano
@ 2014-06-27 19:38     ` Linus Torvalds
  2014-06-27 19:47       ` Linus Torvalds
                         ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Linus Torvalds @ 2014-06-27 19:38 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Shawn Pearce, Jarrad Hope, git

On Fri, Jun 27, 2014 at 10:48 AM, Junio C Hamano <gitster@pobox.com> wrote:
>
> Even though the original question mentioned "delta discovery", I
> think what was being asked is not "delta" in the Git sense (which
> your answer is about) but is "can we diff two long sequences of text
> (that happens to consist of only 4-letter alphabet but that is a
> irrelevant detail) without holding both in-core in their entirety?",
> which is a more relevant question/desire from the application point
> of view.

.. even there, there's another issue. With enough memory, the diff
itself should be fairly reasonable to do, but we do not have any sane
*format* for diffing those kinds of things.

The regular textual diff is line-based, and is not amenable to
comparing two long lines. You'll just get a diff that says "the two
really long lines are different".

The binary diff option should work, but it is a horrible output
format, and not very helpful. It contains all the relevant data ("copy
this chunk from here to here"), but it's then shown in a binary
encoding that isn't really all that useful if you want to say "what
are the differences between these two chromosomes".

I think it might be possible to just specify a special diff algorithm
(git already supports that, obviously), and just introduce a new "use
binary diffs with a textual representation" model.

But it also sounds like there might be some actual performance problem
with these 1GB file delta-calculations. Which I wouldn't be surprised
about at all.

Jarrad - is there any public data you could give as an example and for
people to play with?

                Linus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files
  2014-06-27 19:38     ` Linus Torvalds
@ 2014-06-27 19:47       ` Linus Torvalds
  2014-06-27 19:55       ` Jason Pyeron
  2014-06-30 12:56       ` Jakub Narębski
  2 siblings, 0 replies; 10+ messages in thread
From: Linus Torvalds @ 2014-06-27 19:47 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Shawn Pearce, Jarrad Hope, git

On Fri, Jun 27, 2014 at 12:38 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I think it might be possible to just specify a special diff algorithm
> (git already supports that, obviously), and just introduce a new "use
> binary diffs with a textual representation" model.

Another model would be to just insert newlines in the data, and use
the regular textual diff on that "preprocessed" format.

The problem of *where* to insert the newlines is somewhat interesting,
since the stupid approaches ("chunk it up in 64-byte lines") don't
work with data insertion/deletion (all the lines will now be different
just because the data is offset), but there are algorithms that handle
that reasonably well, like breaking lines at certain well-defined
patterns (the patterns can then be defined either explicitly or
algorithmically - like calculating a hash/crc over the last rolling N
characters and breaking if the result  matches some modulo
calculation).

                Linus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: Tackling Git Limitations with Singular Large Line-seperated Plaintext files
  2014-06-27 19:38     ` Linus Torvalds
  2014-06-27 19:47       ` Linus Torvalds
@ 2014-06-27 19:55       ` Jason Pyeron
  2014-06-27 20:13         ` Linus Torvalds
  2014-06-30 12:56       ` Jakub Narębski
  2 siblings, 1 reply; 10+ messages in thread
From: Jason Pyeron @ 2014-06-27 19:55 UTC (permalink / raw)
  To: 'Linus Torvalds', 'Junio C Hamano'
  Cc: 'Shawn Pearce', 'Jarrad Hope', 'git'

> -----Original Message-----
> From: Linus Torvalds
> Sent: Friday, June 27, 2014 15:39
> 
> On Fri, Jun 27, 2014 at 10:48 AM, Junio C Hamano 
> <gitster@pobox.com> wrote:
> >
> > Even though the original question mentioned "delta discovery", I
> > think what was being asked is not "delta" in the Git sense (which
> > your answer is about) but is "can we diff two long sequences of text
> > (that happens to consist of only 4-letter alphabet but that is a
> > irrelevant detail) without holding both in-core in their entirety?",
> > which is a more relevant question/desire from the application point
> > of view.
> 
> .. even there, there's another issue. With enough memory, the diff
> itself should be fairly reasonable to do, but we do not have any sane
> *format* for diffing those kinds of things.
> 
> The regular textual diff is line-based, and is not amenable to
> comparing two long lines. You'll just get a diff that says "the two
> really long lines are different".
> 
> The binary diff option should work, but it is a horrible output
> format, and not very helpful. It contains all the relevant data ("copy
> this chunk from here to here"), but it's then shown in a binary
> encoding that isn't really all that useful if you want to say "what
> are the differences between these two chromosomes".
> 
> I think it might be possible to just specify a special diff algorithm
> (git already supports that, obviously), and just introduce a new "use
> binary diffs with a textual representation" model.
> 
> But it also sounds like there might be some actual performance problem
> with these 1GB file delta-calculations. Which I wouldn't be surprised
> about at all.
> 
> Jarrad - is there any public data you could give as an example and for
> people to play with?

Until Jarrad replies see sample here:
http://www.genomatix.de/online_help/help/sequence_formats.html

The issue will be, if we talk about changes other than same length substitutions
(e.g. Down's Syndrome where it has an insertion of code) would require one code
per line for the diffs to work nicely.

-Jason

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
-                                                               -
- Jason Pyeron                      PD Inc. http://www.pdinc.us -
- Principal Consultant              10 West 24th Street #100    -
- +1 (443) 269-1555 x333            Baltimore, Maryland 21218   -
-                                                               -
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
This message is copyright PD Inc, subject to license 20080407P00.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files
  2014-06-27 19:55       ` Jason Pyeron
@ 2014-06-27 20:13         ` Linus Torvalds
  2014-06-28  6:51           ` Jarrad Hope
  0 siblings, 1 reply; 10+ messages in thread
From: Linus Torvalds @ 2014-06-27 20:13 UTC (permalink / raw)
  To: Jason Pyeron; +Cc: Junio C Hamano, Shawn Pearce, Jarrad Hope, git

On Fri, Jun 27, 2014 at 12:55 PM, Jason Pyeron <jpyeron@pdinc.us> wrote:
>
> The issue will be, if we talk about changes other than same length substitutions
> (e.g. Down's Syndrome where it has an insertion of code) would require one code
> per line for the diffs to work nicely.

Not my area of expertise, but depending on what you are interested in
- like protein encoding etc, I really think you don't need to do
things character-per-character. You might want to break at interesting
sequences (TATA box, and/or known long repeating sequences).

So you could basically turn the "one long line" representation into
multiple lines, by just looking for particular known interesting (or
known particularly *UN*interesting) patterns, and whenever you see the
pattern you create a new line, describing the pattern ("TATAAA" or
"run of 128 U"), and then continue on the next line.

Then you diff those "semantically enriched" streams instead of the raw data.

But it probably depends on what you are looking for and at. Sometimes
you might be looking at individual base pairs. And sometimes maybe you
want to look at the codons, and consider condons that transcribe to
the same amino acid to be the same, and not show up as a difference.
So I could well imagine that you might want to have multiple different
ways to generate these diffs. No?

               Linus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files
  2014-06-27 20:13         ` Linus Torvalds
@ 2014-06-28  6:51           ` Jarrad Hope
  0 siblings, 0 replies; 10+ messages in thread
From: Jarrad Hope @ 2014-06-28  6:51 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jason Pyeron, Junio C Hamano, Shawn Pearce, git

Thank-you all for replying,

It's just as Jason suggests - Genbank, FASTA & EMBL are more or less
the defacto standards, I suspect FASTA will be phased out because (to
my knowledge) it does not support gene annotation, nevertheless, they
are all text based.

These formats usually insert linebreaks around 80 characters (a
cultural/human readability relic, whatever terminal output they had
the time)

I tried to find a Penguin genome sequence for you, The best I can find
is the complete penguin mitochrondrian dna, as you can see, fairly
small.
http://www.ncbi.nlm.nih.gov/nuccore/558603183?report=fasta
http://www.ncbi.nlm.nih.gov/nuccore/558603183?report=genbank

If you would like to checkout the source for a Human, please see
ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/
Don't ask me for a Makefile :) in near future you'll be able to print
sequences of this length, today we're limited to small sequences (such
as bacteria/virus) at ~30cents per basepair

Each chromosome packs quite well ~80MB packed, ~240MB unpacked
However these formats allow you to repesent multiple sequences in one file
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz
<- ~850MB packed

Sidenote, Humans aren't that particulary more complicated than Rice
(in terms of genome size)
http://www.plantgdb.org/XGDB/phplib/download.php?GDB=Os

Other animal sequences - http://www.ensembl.org/index.html

Git is already being used very successfully for SBML, Synthetic
Biology Markup Language, an XML dialect for cell modelling.

I would show an example git repo of some open source cancer treatments
(various oncolytic viruses) I've been working on, unfortunately it's
not finished yet, but you can imagine something the size of penguin
mitochrondrial dna with essentially just text being deleted (gene
deletions) as commits.

I hope that helps - With the advancement of Synthetic and Systems
Biology, I really see these sequences benefiting from git.


On Sat, Jun 28, 2014 at 3:13 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Jun 27, 2014 at 12:55 PM, Jason Pyeron <jpyeron@pdinc.us> wrote:
>>
>> The issue will be, if we talk about changes other than same length substitutions
>> (e.g. Down's Syndrome where it has an insertion of code) would require one code
>> per line for the diffs to work nicely.
>
> Not my area of expertise, but depending on what you are interested in
> - like protein encoding etc, I really think you don't need to do
> things character-per-character. You might want to break at interesting
> sequences (TATA box, and/or known long repeating sequences).
>
> So you could basically turn the "one long line" representation into
> multiple lines, by just looking for particular known interesting (or
> known particularly *UN*interesting) patterns, and whenever you see the
> pattern you create a new line, describing the pattern ("TATAAA" or
> "run of 128 U"), and then continue on the next line.
>
> Then you diff those "semantically enriched" streams instead of the raw data.
>
> But it probably depends on what you are looking for and at. Sometimes
> you might be looking at individual base pairs. And sometimes maybe you
> want to look at the codons, and consider condons that transcribe to
> the same amino acid to be the same, and not show up as a difference.
> So I could well imagine that you might want to have multiple different
> ways to generate these diffs. No?
>
>                Linus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files
  2014-06-27 19:38     ` Linus Torvalds
  2014-06-27 19:47       ` Linus Torvalds
  2014-06-27 19:55       ` Jason Pyeron
@ 2014-06-30 12:56       ` Jakub Narębski
  2014-08-10 21:45         ` Øyvind A. Holm
  2 siblings, 1 reply; 10+ messages in thread
From: Jakub Narębski @ 2014-06-30 12:56 UTC (permalink / raw)
  To: Linus Torvalds, Junio C Hamano; +Cc: Shawn Pearce, Jarrad Hope, git

Linus Torvalds wrote:
> On Fri, Jun 27, 2014 at 10:48 AM, Junio C Hamano <gitster@pobox.com> wrote:
>>
>> Even though the original question mentioned "delta discovery", I
>> think what was being asked is not "delta" in the Git sense (which
>> your answer is about) but is "can we diff two long sequences of text
>> (that happens to consist of only 4-letter alphabet but that is a
>> irrelevant detail) without holding both in-core in their entirety?",
>> which is a more relevant question/desire from the application point
>> of view.
>
> .. even there, there's another issue. With enough memory, the diff
> itself should be fairly reasonable to do, but we do not have any sane
> *format* for diffing those kinds of things.
>
> The regular textual diff is line-based, and is not amenable to
> comparing two long lines. You'll just get a diff that says "the two
> really long lines are different".
>
> The binary diff option should work, but it is a horrible output
> format, and not very helpful. It contains all the relevant data ("copy
> this chunk from here to here"), but it's then shown in a binary
> encoding that isn't really all that useful if you want to say "what
> are the differences between these two chromosomes".

There is also --word-diff[=<mode>] word-based textual diff,
and I think one can abuse --word-diff-regex=<regex> for
character-based diff... or maybe not, as <regex> specifies
word characters, not words or word separators.

-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files
  2014-06-30 12:56       ` Jakub Narębski
@ 2014-08-10 21:45         ` Øyvind A. Holm
  0 siblings, 0 replies; 10+ messages in thread
From: Øyvind A. Holm @ 2014-08-10 21:45 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Linus Torvalds, Junio C Hamano, Shawn Pearce, Jarrad Hope, git

On 30 June 2014 14:56, Jakub Narębski <jnareb@gmail.com> wrote:
> Linus Torvalds wrote:
> > .. even there, there's another issue. With enough memory, the diff
> > itself should be fairly reasonable to do, but we do not have any
> > sane *format* for diffing those kinds of things.
> >
> > The regular textual diff is line-based, and is not amenable to
> > comparing two long lines. You'll just get a diff that says "the two
> > really long lines are different".
> >
> > The binary diff option should work, but it is a horrible output
> > format, and not very helpful. It contains all the relevant data
> > ("copy this chunk from here to here"), but it's then shown in a
> > binary encoding that isn't really all that useful if you want to say
> > "what are the differences between these two chromosomes".
>
> There is also --word-diff[=<mode>] word-based textual diff, and I
> think one can abuse --word-diff-regex=<regex> for character-based
> diff... or maybe not, as <regex> specifies word characters, not words
> or word separators.

Yes, I have this alias defined:

  dww = diff --word-diff --word-diff-regex=.

It creates nice diffs on a character level. Sometimes specifying
--patience to this helps.

-- Øyvind

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2014-08-10 21:45 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-06-27  8:45 Tackling Git Limitations with Singular Large Line-seperated Plaintext files Jarrad Hope
2014-06-27 15:45 ` Shawn Pearce
2014-06-27 17:48   ` Junio C Hamano
2014-06-27 19:38     ` Linus Torvalds
2014-06-27 19:47       ` Linus Torvalds
2014-06-27 19:55       ` Jason Pyeron
2014-06-27 20:13         ` Linus Torvalds
2014-06-28  6:51           ` Jarrad Hope
2014-06-30 12:56       ` Jakub Narębski
2014-08-10 21:45         ` Øyvind A. Holm

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).