git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jarrad Hope <me@jarradhope.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jason Pyeron <jpyeron@pdinc.us>,
	Junio C Hamano <gitster@pobox.com>,
	Shawn Pearce <spearce@spearce.org>, git <git@vger.kernel.org>
Subject: Re: Tackling Git Limitations with Singular Large Line-seperated Plaintext files
Date: Sat, 28 Jun 2014 13:51:07 +0700	[thread overview]
Message-ID: <CAJoVafdyFWjnaKbz47n12ykLAn28TSFDxLvbWfT51Rim7SXLsA@mail.gmail.com> (raw)
In-Reply-To: <CA+55aFyaQJDq4dvPyS3oLJp57J_zEmqbXA5UxzL8fdgAaHpJOA@mail.gmail.com>

Thank-you all for replying,

It's just as Jason suggests - Genbank, FASTA & EMBL are more or less
the defacto standards, I suspect FASTA will be phased out because (to
my knowledge) it does not support gene annotation, nevertheless, they
are all text based.

These formats usually insert linebreaks around 80 characters (a
cultural/human readability relic, whatever terminal output they had
the time)

I tried to find a Penguin genome sequence for you, The best I can find
is the complete penguin mitochrondrian dna, as you can see, fairly
small.
http://www.ncbi.nlm.nih.gov/nuccore/558603183?report=fasta
http://www.ncbi.nlm.nih.gov/nuccore/558603183?report=genbank

If you would like to checkout the source for a Human, please see
ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/
Don't ask me for a Makefile :) in near future you'll be able to print
sequences of this length, today we're limited to small sequences (such
as bacteria/virus) at ~30cents per basepair

Each chromosome packs quite well ~80MB packed, ~240MB unpacked
However these formats allow you to repesent multiple sequences in one file
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz
<- ~850MB packed

Sidenote, Humans aren't that particulary more complicated than Rice
(in terms of genome size)
http://www.plantgdb.org/XGDB/phplib/download.php?GDB=Os

Other animal sequences - http://www.ensembl.org/index.html

Git is already being used very successfully for SBML, Synthetic
Biology Markup Language, an XML dialect for cell modelling.

I would show an example git repo of some open source cancer treatments
(various oncolytic viruses) I've been working on, unfortunately it's
not finished yet, but you can imagine something the size of penguin
mitochrondrial dna with essentially just text being deleted (gene
deletions) as commits.

I hope that helps - With the advancement of Synthetic and Systems
Biology, I really see these sequences benefiting from git.


On Sat, Jun 28, 2014 at 3:13 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Jun 27, 2014 at 12:55 PM, Jason Pyeron <jpyeron@pdinc.us> wrote:
>>
>> The issue will be, if we talk about changes other than same length substitutions
>> (e.g. Down's Syndrome where it has an insertion of code) would require one code
>> per line for the diffs to work nicely.
>
> Not my area of expertise, but depending on what you are interested in
> - like protein encoding etc, I really think you don't need to do
> things character-per-character. You might want to break at interesting
> sequences (TATA box, and/or known long repeating sequences).
>
> So you could basically turn the "one long line" representation into
> multiple lines, by just looking for particular known interesting (or
> known particularly *UN*interesting) patterns, and whenever you see the
> pattern you create a new line, describing the pattern ("TATAAA" or
> "run of 128 U"), and then continue on the next line.
>
> Then you diff those "semantically enriched" streams instead of the raw data.
>
> But it probably depends on what you are looking for and at. Sometimes
> you might be looking at individual base pairs. And sometimes maybe you
> want to look at the codons, and consider condons that transcribe to
> the same amino acid to be the same, and not show up as a difference.
> So I could well imagine that you might want to have multiple different
> ways to generate these diffs. No?
>
>                Linus

  reply	other threads:[~2014-06-28  6:51 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-06-27  8:45 Tackling Git Limitations with Singular Large Line-seperated Plaintext files Jarrad Hope
2014-06-27 15:45 ` Shawn Pearce
2014-06-27 17:48   ` Junio C Hamano
2014-06-27 19:38     ` Linus Torvalds
2014-06-27 19:47       ` Linus Torvalds
2014-06-27 19:55       ` Jason Pyeron
2014-06-27 20:13         ` Linus Torvalds
2014-06-28  6:51           ` Jarrad Hope [this message]
2014-06-30 12:56       ` Jakub Narębski
2014-08-10 21:45         ` Øyvind A. Holm

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAJoVafdyFWjnaKbz47n12ykLAn28TSFDxLvbWfT51Rim7SXLsA@mail.gmail.com \
    --to=me@jarradhope.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jpyeron@pdinc.us \
    --cc=spearce@spearce.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).