* Tackling Git Limitations with Singular Large Line-seperated Plaintext files
@ 2014-06-27 8:45 Jarrad Hope
2014-06-27 15:45 ` Shawn Pearce
0 siblings, 1 reply; 10+ messages in thread
From: Jarrad Hope @ 2014-06-27 8:45 UTC (permalink / raw)
To: git
Hello,
As a software developer I've used git for years and have found it the
perfect solution for source control.
Lately I have found myself using git in a unique use-case - modifying
DNA/RNA sequences and storing them in git, which are essentially
software/source code for cells/life. For Bacteria and Viruses the
repo's are very small <10mb & compress nicely.
However on the extreme end of the spectrum a human genome can run in
at 50gb or say ~1gb per file/chromosome.
Now, this is not the binary problem and it is not the same as storing
media inside git - I have reviewed the solutions that exist for the
binary problem, such as git-annex, git-media & bup. But they don't
provide the featureset of git and the data i'm storing is more like
plaintext sourcecode with relatively small edits per commit.
I have googled and asked in #git which discussion mostly revolved
around these tools.
The only project that holds interest is a 2009 project, git-bigfiles -
however it is abit dated & the author is not interested in reviving
this project - referring me to git-annex. Unfortunately.
With that background;
I wanted to discuss the problems with git and how I can contribute to
the core project to best solve them.
>From my understanding the largest problem revolves around git's delta
discovery method, holding 2 files in memory at once - is there a
reason this could not be adapted to page/chunk the data in a sliding
window fashion ?
Are there any other issues I need to know about, is anyone else
working on making git more capable of handling large source files that
I can collaborate with?
Thanks for your time,
Jarrad
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files 2014-06-27 8:45 Tackling Git Limitations with Singular Large Line-seperated Plaintext files Jarrad Hope @ 2014-06-27 15:45 ` Shawn Pearce 2014-06-27 17:48 ` Junio C Hamano 0 siblings, 1 reply; 10+ messages in thread From: Shawn Pearce @ 2014-06-27 15:45 UTC (permalink / raw) To: Jarrad Hope; +Cc: git On Fri, Jun 27, 2014 at 1:45 AM, Jarrad Hope <me@jarradhope.com> wrote: > As a software developer I've used git for years and have found it the > perfect solution for source control. > > Lately I have found myself using git in a unique use-case - modifying > DNA/RNA sequences and storing them in git, which are essentially > software/source code for cells/life. For Bacteria and Viruses the > repo's are very small <10mb & compress nicely. > > However on the extreme end of the spectrum a human genome can run in > at 50gb or say ~1gb per file/chromosome. Interesting. Unfortunately not everything is used like source code. :) Git does source code well. I don't know enough to judge if DNA/RNA sequence storage is similar enough to source code to benefit from things like `git log -p` showing deltas over time, or if some other algorithm would be more effective. > From my understanding the largest problem revolves around git's delta > discovery method, holding 2 files in memory at once - is there a > reason this could not be adapted to page/chunk the data in a sliding > window fashion ? During delta discovery Git holds like 11 files in memory at once. One T is the target file that you are trying to delta compress. The other 10 are in a window and Git compares T to each one of them in turn, selecting the file that produces the smallest delta instruction sequence to recreate T. Because T is compared to 10ish other files (the window size is tuneable), Git needs a full copy of T in memory for the entire compare step. For any single compare, T is scanned through only once. If you were doing a single compare (window size of "1"), T could be "on disk" and paged through sequentially. The files in the window need to be held entirely in memory, along with a matching index. The actual delta compression algorithm is a Rabin-Karp sliding window hash function. Copies can be made from any part of the source file with no regard to ordering. This makes paging/chunking the source file at both compression and decompression time nearly impossible. Git jumps around the source file many times, but it allows for efficient storage for movement of long sequences within a file (e.g. move function foo() later in the file). Maybe if you limited the window to 1 and limited the hash function to avoid backing up in the source file so it could be paged, you can get somewhere. But you mentioned the files are O(1 GiB). Just buy more RAM? Modern workstations have pretty good memory capacity. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files 2014-06-27 15:45 ` Shawn Pearce @ 2014-06-27 17:48 ` Junio C Hamano 2014-06-27 19:38 ` Linus Torvalds 0 siblings, 1 reply; 10+ messages in thread From: Junio C Hamano @ 2014-06-27 17:48 UTC (permalink / raw) To: Shawn Pearce; +Cc: Jarrad Hope, git Shawn Pearce <spearce@spearce.org> writes: > Git does source code well. I don't know enough to judge if DNA/RNA > sequence storage is similar enough to source code to benefit from > things like `git log -p` showing deltas over time, or if some other > algorithm would be more effective. > >> From my understanding the largest problem revolves around git's delta >> discovery method, holding 2 files in memory at once - is there a >> reason this could not be adapted to page/chunk the data in a sliding >> window fashion ? > > During delta discovery Git holds like 11 files in memory at once.... Even though the original question mentioned "delta discovery", I think what was being asked is not "delta" in the Git sense (which your answer is about) but is "can we diff two long sequences of text (that happens to consist of only 4-letter alphabet but that is a irrelevant detail) without holding both in-core in their entirety?", which is a more relevant question/desire from the application point of view. "Is there a reason this could not be adapted?" No, there is no particular reason why this "could not". I think that the only reason we only do in-core diff is because "adapting to page/chunk" hasn't been anybody's high priority list of itches to scratch. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files 2014-06-27 17:48 ` Junio C Hamano @ 2014-06-27 19:38 ` Linus Torvalds 2014-06-27 19:47 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 10+ messages in thread From: Linus Torvalds @ 2014-06-27 19:38 UTC (permalink / raw) To: Junio C Hamano; +Cc: Shawn Pearce, Jarrad Hope, git On Fri, Jun 27, 2014 at 10:48 AM, Junio C Hamano <gitster@pobox.com> wrote: > > Even though the original question mentioned "delta discovery", I > think what was being asked is not "delta" in the Git sense (which > your answer is about) but is "can we diff two long sequences of text > (that happens to consist of only 4-letter alphabet but that is a > irrelevant detail) without holding both in-core in their entirety?", > which is a more relevant question/desire from the application point > of view. .. even there, there's another issue. With enough memory, the diff itself should be fairly reasonable to do, but we do not have any sane *format* for diffing those kinds of things. The regular textual diff is line-based, and is not amenable to comparing two long lines. You'll just get a diff that says "the two really long lines are different". The binary diff option should work, but it is a horrible output format, and not very helpful. It contains all the relevant data ("copy this chunk from here to here"), but it's then shown in a binary encoding that isn't really all that useful if you want to say "what are the differences between these two chromosomes". I think it might be possible to just specify a special diff algorithm (git already supports that, obviously), and just introduce a new "use binary diffs with a textual representation" model. But it also sounds like there might be some actual performance problem with these 1GB file delta-calculations. Which I wouldn't be surprised about at all. Jarrad - is there any public data you could give as an example and for people to play with? Linus ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files 2014-06-27 19:38 ` Linus Torvalds @ 2014-06-27 19:47 ` Linus Torvalds 2014-06-27 19:55 ` Jason Pyeron 2014-06-30 12:56 ` Jakub Narębski 2 siblings, 0 replies; 10+ messages in thread From: Linus Torvalds @ 2014-06-27 19:47 UTC (permalink / raw) To: Junio C Hamano; +Cc: Shawn Pearce, Jarrad Hope, git On Fri, Jun 27, 2014 at 12:38 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > I think it might be possible to just specify a special diff algorithm > (git already supports that, obviously), and just introduce a new "use > binary diffs with a textual representation" model. Another model would be to just insert newlines in the data, and use the regular textual diff on that "preprocessed" format. The problem of *where* to insert the newlines is somewhat interesting, since the stupid approaches ("chunk it up in 64-byte lines") don't work with data insertion/deletion (all the lines will now be different just because the data is offset), but there are algorithms that handle that reasonably well, like breaking lines at certain well-defined patterns (the patterns can then be defined either explicitly or algorithmically - like calculating a hash/crc over the last rolling N characters and breaking if the result matches some modulo calculation). Linus ^ permalink raw reply [flat|nested] 10+ messages in thread
* RE: Tackling Git Limitations with Singular Large Line-seperated Plaintext files 2014-06-27 19:38 ` Linus Torvalds 2014-06-27 19:47 ` Linus Torvalds @ 2014-06-27 19:55 ` Jason Pyeron 2014-06-27 20:13 ` Linus Torvalds 2014-06-30 12:56 ` Jakub Narębski 2 siblings, 1 reply; 10+ messages in thread From: Jason Pyeron @ 2014-06-27 19:55 UTC (permalink / raw) To: 'Linus Torvalds', 'Junio C Hamano' Cc: 'Shawn Pearce', 'Jarrad Hope', 'git' > -----Original Message----- > From: Linus Torvalds > Sent: Friday, June 27, 2014 15:39 > > On Fri, Jun 27, 2014 at 10:48 AM, Junio C Hamano > <gitster@pobox.com> wrote: > > > > Even though the original question mentioned "delta discovery", I > > think what was being asked is not "delta" in the Git sense (which > > your answer is about) but is "can we diff two long sequences of text > > (that happens to consist of only 4-letter alphabet but that is a > > irrelevant detail) without holding both in-core in their entirety?", > > which is a more relevant question/desire from the application point > > of view. > > .. even there, there's another issue. With enough memory, the diff > itself should be fairly reasonable to do, but we do not have any sane > *format* for diffing those kinds of things. > > The regular textual diff is line-based, and is not amenable to > comparing two long lines. You'll just get a diff that says "the two > really long lines are different". > > The binary diff option should work, but it is a horrible output > format, and not very helpful. It contains all the relevant data ("copy > this chunk from here to here"), but it's then shown in a binary > encoding that isn't really all that useful if you want to say "what > are the differences between these two chromosomes". > > I think it might be possible to just specify a special diff algorithm > (git already supports that, obviously), and just introduce a new "use > binary diffs with a textual representation" model. > > But it also sounds like there might be some actual performance problem > with these 1GB file delta-calculations. Which I wouldn't be surprised > about at all. > > Jarrad - is there any public data you could give as an example and for > people to play with? Until Jarrad replies see sample here: http://www.genomatix.de/online_help/help/sequence_formats.html The issue will be, if we talk about changes other than same length substitutions (e.g. Down's Syndrome where it has an insertion of code) would require one code per line for the diffs to work nicely. -Jason -- -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- - - - Jason Pyeron PD Inc. http://www.pdinc.us - - Principal Consultant 10 West 24th Street #100 - - +1 (443) 269-1555 x333 Baltimore, Maryland 21218 - - - -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- This message is copyright PD Inc, subject to license 20080407P00. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files 2014-06-27 19:55 ` Jason Pyeron @ 2014-06-27 20:13 ` Linus Torvalds 2014-06-28 6:51 ` Jarrad Hope 0 siblings, 1 reply; 10+ messages in thread From: Linus Torvalds @ 2014-06-27 20:13 UTC (permalink / raw) To: Jason Pyeron; +Cc: Junio C Hamano, Shawn Pearce, Jarrad Hope, git On Fri, Jun 27, 2014 at 12:55 PM, Jason Pyeron <jpyeron@pdinc.us> wrote: > > The issue will be, if we talk about changes other than same length substitutions > (e.g. Down's Syndrome where it has an insertion of code) would require one code > per line for the diffs to work nicely. Not my area of expertise, but depending on what you are interested in - like protein encoding etc, I really think you don't need to do things character-per-character. You might want to break at interesting sequences (TATA box, and/or known long repeating sequences). So you could basically turn the "one long line" representation into multiple lines, by just looking for particular known interesting (or known particularly *UN*interesting) patterns, and whenever you see the pattern you create a new line, describing the pattern ("TATAAA" or "run of 128 U"), and then continue on the next line. Then you diff those "semantically enriched" streams instead of the raw data. But it probably depends on what you are looking for and at. Sometimes you might be looking at individual base pairs. And sometimes maybe you want to look at the codons, and consider condons that transcribe to the same amino acid to be the same, and not show up as a difference. So I could well imagine that you might want to have multiple different ways to generate these diffs. No? Linus ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files 2014-06-27 20:13 ` Linus Torvalds @ 2014-06-28 6:51 ` Jarrad Hope 0 siblings, 0 replies; 10+ messages in thread From: Jarrad Hope @ 2014-06-28 6:51 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jason Pyeron, Junio C Hamano, Shawn Pearce, git Thank-you all for replying, It's just as Jason suggests - Genbank, FASTA & EMBL are more or less the defacto standards, I suspect FASTA will be phased out because (to my knowledge) it does not support gene annotation, nevertheless, they are all text based. These formats usually insert linebreaks around 80 characters (a cultural/human readability relic, whatever terminal output they had the time) I tried to find a Penguin genome sequence for you, The best I can find is the complete penguin mitochrondrian dna, as you can see, fairly small. http://www.ncbi.nlm.nih.gov/nuccore/558603183?report=fasta http://www.ncbi.nlm.nih.gov/nuccore/558603183?report=genbank If you would like to checkout the source for a Human, please see ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/ Don't ask me for a Makefile :) in near future you'll be able to print sequences of this length, today we're limited to small sequences (such as bacteria/virus) at ~30cents per basepair Each chromosome packs quite well ~80MB packed, ~240MB unpacked However these formats allow you to repesent multiple sequences in one file ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz <- ~850MB packed Sidenote, Humans aren't that particulary more complicated than Rice (in terms of genome size) http://www.plantgdb.org/XGDB/phplib/download.php?GDB=Os Other animal sequences - http://www.ensembl.org/index.html Git is already being used very successfully for SBML, Synthetic Biology Markup Language, an XML dialect for cell modelling. I would show an example git repo of some open source cancer treatments (various oncolytic viruses) I've been working on, unfortunately it's not finished yet, but you can imagine something the size of penguin mitochrondrial dna with essentially just text being deleted (gene deletions) as commits. I hope that helps - With the advancement of Synthetic and Systems Biology, I really see these sequences benefiting from git. On Sat, Jun 28, 2014 at 3:13 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, Jun 27, 2014 at 12:55 PM, Jason Pyeron <jpyeron@pdinc.us> wrote: >> >> The issue will be, if we talk about changes other than same length substitutions >> (e.g. Down's Syndrome where it has an insertion of code) would require one code >> per line for the diffs to work nicely. > > Not my area of expertise, but depending on what you are interested in > - like protein encoding etc, I really think you don't need to do > things character-per-character. You might want to break at interesting > sequences (TATA box, and/or known long repeating sequences). > > So you could basically turn the "one long line" representation into > multiple lines, by just looking for particular known interesting (or > known particularly *UN*interesting) patterns, and whenever you see the > pattern you create a new line, describing the pattern ("TATAAA" or > "run of 128 U"), and then continue on the next line. > > Then you diff those "semantically enriched" streams instead of the raw data. > > But it probably depends on what you are looking for and at. Sometimes > you might be looking at individual base pairs. And sometimes maybe you > want to look at the codons, and consider condons that transcribe to > the same amino acid to be the same, and not show up as a difference. > So I could well imagine that you might want to have multiple different > ways to generate these diffs. No? > > Linus ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files 2014-06-27 19:38 ` Linus Torvalds 2014-06-27 19:47 ` Linus Torvalds 2014-06-27 19:55 ` Jason Pyeron @ 2014-06-30 12:56 ` Jakub Narębski 2014-08-10 21:45 ` Øyvind A. Holm 2 siblings, 1 reply; 10+ messages in thread From: Jakub Narębski @ 2014-06-30 12:56 UTC (permalink / raw) To: Linus Torvalds, Junio C Hamano; +Cc: Shawn Pearce, Jarrad Hope, git Linus Torvalds wrote: > On Fri, Jun 27, 2014 at 10:48 AM, Junio C Hamano <gitster@pobox.com> wrote: >> >> Even though the original question mentioned "delta discovery", I >> think what was being asked is not "delta" in the Git sense (which >> your answer is about) but is "can we diff two long sequences of text >> (that happens to consist of only 4-letter alphabet but that is a >> irrelevant detail) without holding both in-core in their entirety?", >> which is a more relevant question/desire from the application point >> of view. > > .. even there, there's another issue. With enough memory, the diff > itself should be fairly reasonable to do, but we do not have any sane > *format* for diffing those kinds of things. > > The regular textual diff is line-based, and is not amenable to > comparing two long lines. You'll just get a diff that says "the two > really long lines are different". > > The binary diff option should work, but it is a horrible output > format, and not very helpful. It contains all the relevant data ("copy > this chunk from here to here"), but it's then shown in a binary > encoding that isn't really all that useful if you want to say "what > are the differences between these two chromosomes". There is also --word-diff[=<mode>] word-based textual diff, and I think one can abuse --word-diff-regex=<regex> for character-based diff... or maybe not, as <regex> specifies word characters, not words or word separators. -- Jakub Narębski ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files 2014-06-30 12:56 ` Jakub Narębski @ 2014-08-10 21:45 ` Øyvind A. Holm 0 siblings, 0 replies; 10+ messages in thread From: Øyvind A. Holm @ 2014-08-10 21:45 UTC (permalink / raw) To: Jakub Narębski Cc: Linus Torvalds, Junio C Hamano, Shawn Pearce, Jarrad Hope, git On 30 June 2014 14:56, Jakub Narębski <jnareb@gmail.com> wrote: > Linus Torvalds wrote: > > .. even there, there's another issue. With enough memory, the diff > > itself should be fairly reasonable to do, but we do not have any > > sane *format* for diffing those kinds of things. > > > > The regular textual diff is line-based, and is not amenable to > > comparing two long lines. You'll just get a diff that says "the two > > really long lines are different". > > > > The binary diff option should work, but it is a horrible output > > format, and not very helpful. It contains all the relevant data > > ("copy this chunk from here to here"), but it's then shown in a > > binary encoding that isn't really all that useful if you want to say > > "what are the differences between these two chromosomes". > > There is also --word-diff[=<mode>] word-based textual diff, and I > think one can abuse --word-diff-regex=<regex> for character-based > diff... or maybe not, as <regex> specifies word characters, not words > or word separators. Yes, I have this alias defined: dww = diff --word-diff --word-diff-regex=. It creates nice diffs on a character level. Sometimes specifying --patience to this helps. -- Øyvind ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2014-08-10 21:45 UTC | newest] Thread overview: 10+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2014-06-27 8:45 Tackling Git Limitations with Singular Large Line-seperated Plaintext files Jarrad Hope 2014-06-27 15:45 ` Shawn Pearce 2014-06-27 17:48 ` Junio C Hamano 2014-06-27 19:38 ` Linus Torvalds 2014-06-27 19:47 ` Linus Torvalds 2014-06-27 19:55 ` Jason Pyeron 2014-06-27 20:13 ` Linus Torvalds 2014-06-28 6:51 ` Jarrad Hope 2014-06-30 12:56 ` Jakub Narębski 2014-08-10 21:45 ` Øyvind A. Holm
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).