* A Python script to put CTAN into git (from DVDs)
@ 2011-11-06 15:17 Jonathan Fine
2011-11-06 16:42 ` Jakub Narebski
[not found] ` <mailman.2464.1320597747.27778.python-list@python.org>
0 siblings, 2 replies; 7+ messages in thread
From: Jonathan Fine @ 2011-11-06 15:17 UTC (permalink / raw)
To: python-list; +Cc: git
Hi
This is to let you know that I'm writing (in Python) a script that
places the content of CTAN into a git repository.
https://bitbucket.org/jfine/python-ctantools
I'm working from the TeX Collection DVDs that are published each year by
the TeX user groups and contain a snapshot of CTAN (about 100,000
files occupying 4GB), which means I have to unzip folders and do a few
other things.
CTAN is the Comprehensive TeX Archive Network. CTAN keeps only the
latest version of each file, but old CTAN snapshots will provide many
earlier versions.
I'm working on putting old CTAN files into modern version control.
Martin Scharrer is working in the other direction. He's putting new
files added to CTAN into Mercurial.
http://ctanhg.scharrer-online.de/
My script works already as a proof of concept, but needs more work (and
documentation) before it becomes useful. I've requested that follow up
goes to comp.text.tex.
Longer term goals are to use git as
* http://en.wikipedia.org/wiki/Content-addressable_storage
* a resource editing and linking system
If you didn't know, a git tree is much like an immutable JSON object,
except that it does not have arrays or numbers.
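That analogy can be sketched in a few lines of Python (the file names
below are invented for illustration, not taken from CTAN):

```python
# A git tree maps names to blobs (file contents) or further trees --
# like a JSON object whose values may only be strings or nested
# objects, never arrays or numbers.
ctan_tree = {
    "README": "Comprehensive TeX Archive Network snapshot",  # a blob
    "macros": {                                              # a subtree
        "latex": {
            "base": {"latex.ltx": "...file content..."},
        },
    },
}

def matches_analogy(tree):
    """True if every value is a string (blob) or a dict (subtree)."""
    return all(
        matches_analogy(v) if isinstance(v, dict) else isinstance(v, str)
        for v in tree.values()
    )
```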
If my project interests you, reply to this message or contact me
directly (or both).
--
Jonathan
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: A Python script to put CTAN into git (from DVDs)
2011-11-06 15:17 A Python script to put CTAN into git (from DVDs) Jonathan Fine
@ 2011-11-06 16:42 ` Jakub Narebski
[not found] ` <mailman.2464.1320597747.27778.python-list@python.org>
1 sibling, 0 replies; 7+ messages in thread
From: Jakub Narebski @ 2011-11-06 16:42 UTC (permalink / raw)
To: Jonathan Fine; +Cc: python-list, git
Jonathan Fine <jfine@pytex.org> writes:
> Hi
>
> This is to let you know that I'm writing (in Python) a script that
> places the content of CTAN into a git repository.
> https://bitbucket.org/jfine/python-ctantools
I hope that you meant "repositories" (plural) here, one per tool,
rather than putting all of CTAN into a single Git repository.
> I'm working from the TeX Collection DVDs that are published each year
> by the TeX user groups, which contain a snapshot of CTAN (about
> 100,000 files occupying 4Gb), which means I have to unzip folders and
> do a few other things.
There is 'contrib/fast-import/import-zips.py' in the git.git repository.
If you are not using it, or its equivalent, it might be worth checking
out.
> CTAN is the Comprehensive TeX Archive Network. CTAN keeps only the
> latest version of each file, but old CTAN snapshots will provide many
> earlier versions.
There was a similar effort to put CPAN (the Comprehensive _Perl_
Archive Network) into Git, hosting the repositories on GitHub[1], under
the name gitPAN; see e.g.:
"The gitPAN Import is Complete"
http://perlisalive.com/articles/36
[1]: https://github.com/gitpan
> I'm working on putting old CTAN files into modern version
> control. Martin Scharrer is working in the other direction. He's
> putting new files added to CTAN into Mercurial.
> http://ctanhg.scharrer-online.de/
Nb. thanks to tools such as git-hg and fast-import / fast-export
we have quite good interoperability and convertability between
Git and Mercurial.
P.S. I'd point to the reposurgeon tool, which can be used to do fixups
after import, but it probably won't work on such a large (set of)
repositories.
P.P.S. Can you forward it to comp.text.tex?
--
Jakub Narębski
* Re: A Python script to put CTAN into git (from DVDs)
[not found] ` <mailman.2464.1320597747.27778.python-list@python.org>
@ 2011-11-06 18:19 ` Jonathan Fine
2011-11-06 20:29 ` Jakub Narebski
0 siblings, 1 reply; 7+ messages in thread
From: Jonathan Fine @ 2011-11-06 18:19 UTC (permalink / raw)
To: Jakub Narebski; +Cc: python-list, git
On 06/11/11 16:42, Jakub Narebski wrote:
> Jonathan Fine<jfine@pytex.org> writes:
>
>> Hi
>>
>> This is to let you know that I'm writing (in Python) a script that
>> places the content of CTAN into a git repository.
>> https://bitbucket.org/jfine/python-ctantools
>
> I hope that you meant "repositories" (plural) here, one per tool,
> rather than putting all of CTAN into a single Git repository.
There are complex dependencies among LaTeX macro packages, and TeX is
often distributed and installed from a DVD. So it makes sense here to
put *all* the content of a DVD into a repository.
Once you've done that, it is then possible and sensible to select
suitable interesting subsets, such as releases of a particular package.
Users could even define their own subsets, such as "all resources needed
to process this file, exactly as it processes on my machine".
In addition, many TeX users have a TeX DVD. If they import it into a
git repository (using for example my script) then the update from 2011
to 2012 would require much less bandwidth.
Finally, I'd rather be working within git than a modified copy of the
ISO when doing the subsetting. I'm pretty sure that I can manage to
pull the small repositories from the big git-CTAN repository.
But as I proceed, perhaps I'll change my mind (smile).
>> I'm working from the TeX Collection DVDs that are published each year
>> by the TeX user groups, which contain a snapshot of CTAN (about
>> 100,000 files occupying 4Gb), which means I have to unzip folders and
>> do a few other things.
>
> There is 'contrib/fast-import/import-zips.py' in the git.git repository.
> If you are not using it, or its equivalent, it might be worth checking
> out.
Well, I didn't know about that. I took a look, and it doesn't do what I
want. I need to walk the tree (on a mounted ISO) and unpack some (but
not all) zip files as I come across them. For details see:
https://bitbucket.org/jfine/python-ctantools/src/tip/ctantools/filetools.py
In addition, I don't want to make a commit. I just want to make a ref
at the end of building the tree. This is because I want the import of a
TeX DVD to give effectively identical results for all users, and so any
commit information would be effectively constant.
>> CTAN is the Comprehensive TeX Archive Network. CTAN keeps only the
>> latest version of each file, but old CTAN snapshots will provide many
>> earlier versions.
>
> There was a similar effort to put CPAN (the Comprehensive _Perl_
> Archive Network) into Git, hosting the repositories on GitHub[1], under
> the name gitPAN; see e.g.:
>
> "The gitPAN Import is Complete"
> http://perlisalive.com/articles/36
>
> [1]: https://github.com/gitpan
This is really good to know!!! Not only has this been done already, for
similar reasons, but GitHub is hosting it. Life is easier when there is
a good example to follow.
>> I'm working on putting old CTAN files into modern version
>> control. Martin Scharrer is working in the other direction. He's
>> putting new files added to CTAN into Mercurial.
>> http://ctanhg.scharrer-online.de/
>
> Nb. thanks to tools such as git-hg and fast-import / fast-export
> we have quite good interoperability and convertability between
> Git and Mercurial.
>
> P.S. I'd point to reposurgeon tool, which can be used to do fixups
> after import, but it would probably won't work on such large (set of)
> repositories.
Thank you for the pointer to reposurgeon. My approach is a bit
different: first, get all the files into git, then 'edit the tree'
to create new trees, and then commit worthwhile new trees.
As I recall the first 'commit' to the git repository for the Linux
kernel was just a tree, with a reference to that tree as a tag. But no
commit.
> P.P.S. Can you forward it to comp.text.tex?
Done.
--
Jonathan
* Re: A Python script to put CTAN into git (from DVDs)
2011-11-06 18:19 ` Jonathan Fine
@ 2011-11-06 20:29 ` Jakub Narebski
2011-11-07 20:21 ` Jonathan Fine
0 siblings, 1 reply; 7+ messages in thread
From: Jakub Narebski @ 2011-11-06 20:29 UTC (permalink / raw)
To: Jonathan Fine; +Cc: python-list, git
The following message is a courtesy copy of an article
that has been posted to comp.lang.python,comp.text.tex as well.
Jonathan Fine <jfine@pytex.org> writes:
> On 06/11/11 16:42, Jakub Narebski wrote:
>> Jonathan Fine<jfine@pytex.org> writes:
>>
>>> This is to let you know that I'm writing (in Python) a script that
>>> places the content of CTAN into a git repository.
>>> https://bitbucket.org/jfine/python-ctantools
>>
>> I hope that you meant "repositories" (plural) here, one per tool,
>> rather than putting all of CTAN into a single Git repository.
[moved]
>> There was a similar effort to put CPAN (the Comprehensive _Perl_
>> Archive Network) into Git, hosting the repositories on GitHub[1], under
>> the name gitPAN; see e.g.:
>>
>> "The gitPAN Import is Complete"
>> http://perlisalive.com/articles/36
>>
>> [1]: https://github.com/gitpan
[/moved]
> There are complex dependencies among LaTeX macro packages, and TeX is
> often distributed and installed from a DVD. So it makes sense here to
> put *all* the content of a DVD into a repository.
Note that for gitPAN each "distribution" (usually but not always
corresponding to a single Perl module) is in a separate repository.
The dependencies are handled by the CPAN / CPANPLUS / cpanm client
(i.e. during install).
Putting the whole DVD (is it the "TeX Live" DVD, by the way?) into a
single repository would put quite a bit of stress on git; it was
created for software development (although admittedly of large projects
like the Linux kernel), not for 4GB+ trees.
> Once you've done that, it is then possible and sensible to select
> suitable interesting subsets, such as releases of a particular
> package. Users could even define their own subsets, such as "all
> resources needed to process this file, exactly as it processes on my
> machine".
This could be handled using submodules, by having a superrepository
that consists solely of references to other repositories by way of
submodules... plus perhaps some administrative files (like a README for
the whole of CTAN, or a search tool, or a DVD installer, etc.)
This could then be used to get, for example, the contents of the DVD
from 2010.
But even though submodules (cf. Subversion's svn:externals, Mercurial's
forest extension, etc.) have been in Git for quite some time, they
don't have the best user interface.
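For illustration, one entry of such a superrepository's .gitmodules
might read as follows (the path and URL are hypothetical, not an
existing repository):

```
[submodule "macros/latex/contrib/beamer"]
	path = macros/latex/contrib/beamer
	url = https://example.org/ctan-git/beamer.git
```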
> In addition, many TeX users have a TeX DVD. If they import it into a
> git repository (using for example my script) then the update from 2011
> to 2012 would require much less bandwidth.
???
> Finally, I'd rather be working within git than a modified copy of the
> ISO when doing the subsetting. I'm pretty sure that I can manage to
> pull the small repositories from the big git-CTAN repository.
No, you cannot. It is all or nothing; there is no support for partial
_clone_ (yet), and it looks like it is a hard problem.
Nb. there is support for partial _checkout_, but that is something
different.
> But as I proceed, perhaps I'll change my mind (smile).
>
>>> I'm working from the TeX Collection DVDs that are published each year
>>> by the TeX user groups, which contain a snapshot of CTAN (about
>>> 100,000 files occupying 4Gb), which means I have to unzip folders and
>>> do a few other things.
>>
>> There is 'contrib/fast-import/import-zips.py' in the git.git repository.
>> If you are not using it, or its equivalent, it might be worth checking
>> out.
>
> Well, I didn't know about that. I took a look, and it doesn't do what
> I want. I need to walk the tree (on a mounted ISO) and unpack some
> (but not all) zip files as I come across them. For details see:
> https://bitbucket.org/jfine/python-ctantools/src/tip/ctantools/filetools.py
>
> In addition, I don't want to make a commit. I just want to make a ref
> at the end of building the tree. This is because I want the import of
> a TeX DVD to give effectively identical results for all users, and so
> any commit information would be effectively constant.
Commit = tree + parent + metadata.
I think you would very much want to have a linear sequence of trees,
ordered via a DAG of commits. "Naked" trees are rather a bad idea, I think.
> As I recall the first 'commit' to the git repository for the Linux
> kernel was just a tree, with a reference to that tree as a tag. But
> no commit.
It was by bad accident that there is a tag pointing directly to the
tree of the _initial import_; it is not something to copy.
--
Jakub Narębski
* Re: A Python script to put CTAN into git (from DVDs)
2011-11-06 20:29 ` Jakub Narebski
@ 2011-11-07 20:21 ` Jonathan Fine
2011-11-07 21:50 ` Jakub Narebski
0 siblings, 1 reply; 7+ messages in thread
From: Jonathan Fine @ 2011-11-07 20:21 UTC (permalink / raw)
To: Jakub Narebski; +Cc: python-list, git
On 06/11/11 20:28, Jakub Narebski wrote:
> Note that for gitPAN each "distribution" (usually but not always
> corresponding to a single Perl module) is in a separate repository.
> The dependencies are handled by the CPAN / CPANPLUS / cpanm client
> (i.e. during install).
Thank you for your interest, Jakub, and also for this information. With
TeX there's a difficulty which Perl, I think, does not have. With TeX
we process documents, which may demand specific versions of packages.
LaTeX users are concerned that moving on to a later version will cause
documents to break.
> Putting the whole DVD (is it the "TeX Live" DVD, by the way?) into a
> single repository would put quite a bit of stress on git; it was
> created for software development (although admittedly of large
> projects like the Linux kernel), not for 4GB+ trees.
I'm impressed by how well git manages it. It took about 15 minutes to
build the 4GB tree, and it was disk speed rather than CPU which was the
bottleneck.
>> Once you've done that, it is then possible and sensible to select
>> suitable interesting subsets, such as releases of a particular
>> package. Users could even define their own subsets, such as "all
>> resources needed to process this file, exactly as it processes on my
>> machine".
>
> This could be handled using submodules, by having a superrepository
> that consists solely of references to other repositories by way of
> submodules... plus perhaps some administrative files (like a README
> for the whole of CTAN, or a search tool, or a DVD installer, etc.)
>
> This could then be used to get, for example, the contents of the DVD
> from 2010.
We may be at cross purposes. My first task is to get the DVD tree into
git, performing necessary transformations, such as expanding zip files,
along the way. Breaking the content into submodules can, I believe, be
done afterwards.
With DVDs from several years it could take several hours to load
everything into git. For myself, I'd like to do that once, more or less
as a batch process, and then move on to the more interesting topics.
Getting the DVD contents into git is already a significant piece of work.
Once done, I can then move on to what you're interested in, which is
organising the material. And I hope that others in the TeX community
will get involved with that, because I'm not building this repository
just for myself.
> But even though submodules (cf. Subversion's svn:externals, Mercurial's
> forest extension, etc.) have been in Git for quite some time, they
> don't have the best user interface.
>
>> In addition, many TeX users have a TeX DVD. If they import it into a
>> git repository (using for example my script) then the update from 2011
>> to 2012 would require much less bandwidth.
>
> ???
A quick way to bring your TeX distribution up to date is to do a delta
with a later distribution, and download the difference. That's what git
does, and it does it well. So I'm keen to convert a TeX DVD into a git
repository, and then differences can be downloaded.
>> Finally, I'd rather be working within git than a modified copy of the
>> ISO when doing the subsetting. I'm pretty sure that I can manage to
>> pull the small repositories from the big git-CTAN repository.
>
> No, you cannot. It is all or nothing; there is no support for partial
> _clone_ (yet), and it looks like it is a hard problem.
>
> Nb. there is support for partial _checkout_, but that is something
> different.
From what I know, I'm confident that I can achieve what I want using
git. I'm also confident that my approach is not closing off any
possible approaches. But if I'm wrong, you'll be able to say: I told
you so.
> Commit = tree + parent + metadata.
Actually, any number of parents, including none. What metadata do I
have to provide? At this time nothing, I think, beyond that provided by
the name of a reference (to the root of a tree).
> I think you would very much want to have a linear sequence of trees,
> ordered via a DAG of commits. "Naked" trees are rather a bad idea, I think.
>
>> As I recall the first 'commit' to the git repository for the Linux
>> kernel was just a tree, with a reference to that tree as a tag. But
>> no commit.
>
> It was by bad accident that there is a tag pointing directly to the
> tree of the _initial import_; it is not something to copy.
Because git is a distributed version control system, anyone who wants to
can create such a directed acyclic graph of commits. And if it's useful
I'll gladly add it to my copy of the repository.
best regards
Jonathan
* Re: A Python script to put CTAN into git (from DVDs)
2011-11-07 20:21 ` Jonathan Fine
@ 2011-11-07 21:50 ` Jakub Narebski
2011-11-07 22:03 ` Jonathan Fine
0 siblings, 1 reply; 7+ messages in thread
From: Jakub Narebski @ 2011-11-07 21:50 UTC (permalink / raw)
To: Jonathan Fine; +Cc: python-list, git
The following message is a courtesy copy of an article
that has been posted to comp.text.tex as well.
Jonathan Fine <jfine@pytex.org> writes:
> On 06/11/11 20:28, Jakub Narebski wrote:
>
> > Note that for gitPAN each "distribution" (usually but not always
> > corresponding to a single Perl module) is in a separate repository.
> > The dependencies are handled by the CPAN / CPANPLUS / cpanm client
> > (i.e. during install).
>
> Thank you for your interest, Jakub, and also for this information.
> With TeX there's a difficulty which Perl, I think, does not have. With
> TeX we process documents, which may demand specific versions of
> packages. LaTeX users are concerned that moving on to a later version
> will cause documents to break.
How can you demand a specific version of a package?
In the "\usepackage[options]{package}[version]" LaTeX command the
<version> argument specifies the _minimal_ (oldest) version. The same
is true for Perl's "use Module VERSION LIST".
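For example (the package and date here are arbitrary):

```latex
% The trailing optional argument is a *minimum* release date: this
% loads graphicx and raises an error only if the installed copy is
% dated before 2011/01/01. It cannot pin an exact version.
\usepackage{graphicx}[2011/01/01]
```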
Nevertheless, while with "use Module VERSION" / "use Module VERSION LIST"
you can only request a minimal version of a Perl module, the META
build-time spec can include a requirement for an exact version of a
required package:
Version Ranges
~~~~~~~~~~~~~~
Some fields (prereq, optional_features) indicate the particular
version(s) of some other module that may be required as a
prerequisite. This section details the Version Range type used to
provide this information.
The simplest format for a Version Range is just the version number
itself, e.g. 2.4. This means that *at least* version 2.4 must be
present. To indicate that *any* version of a prerequisite is okay,
even if the prerequisite doesn't define a version at all, use the
version 0.
Alternatively, a version range *may* use the operators < (less than),
<= (less than or equal), > (greater than), >= (greater than or
equal), == (equal), and != (not equal). For example, the
specification < 2.0 means that any version of the prerequisite less
than 2.0 is suitable.
For more complicated situations, version specifications *may* be
AND-ed together using commas. The specification >= 1.2, != 1.5, <
2.0 indicates a version that must be *at least* 1.2, *less than* 2.0,
and *not equal to* 1.5.
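As a made-up META fragment using the range quoted above (the module
name is invented for illustration):

```json
{
   "prereqs" : {
      "runtime" : {
         "requires" : {
            "Some::Module" : ">= 1.2, != 1.5, < 2.0"
         }
      }
   }
}
```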
> > Putting the whole DVD (is it the "TeX Live" DVD, by the way?) into a
> > single repository would put quite a bit of stress on git; it was
> > created for software development (although admittedly of large
> > projects like the Linux kernel), not for 4GB+ trees.
>
> I'm impressed by how well git manages it. It took about 15 minutes to
> build the 4GB tree, and it was disk speed rather than CPU which was
> the bottleneck.
I still think that using a modified contrib/fast-import/import-zips.py
(or import-tars.perl, or import-directories.perl) would be a better
solution here...
[...]
> We may be at cross purposes. My first task is to get the DVD tree into
> git, performing necessary transformations, such as expanding zip files,
> along the way. Breaking the content into submodules can, I believe,
> be done afterwards.
'reposurgeon' might help there... or might not. The same goes for the
git-subtree tool.
But now I understand that you are just building tree objects, and
creating references to them (with implicit ordering given by names,
I guess). This is to be the start of further work, isn't it?
> With DVDs from several years it could take several hours to load
> everything into git. For myself, I'd like to do that once, more or
> less as a batch process, and then move on to the more interesting
> topics. Getting the DVD contents into git is already a significant
> piece of work.
>
> Once done, I can then move on to what you're interested in, which is
> organising the material. And I hope that others in the TeX community
> will get involved with that, because I'm not building this repository
> just for myself.
[...]
> > > In addition, many TeX users have a TeX DVD. If they import it into a
> > > git repository (using for example my script) then the update from 2011
> > > to 2012 would require much less bandwidth.
> >
> > ???
>
> A quick way to bring your TeX distribution up to date is to do a delta
> with a later distribution, and download the difference. That's what
> git does, and it does it well. So I'm keen to convert a TeX DVD into
> a git repository, and then differences can be downloaded.
Here perhaps you should take a look at the git-based 'bup' backup
system.
Anyway, I am not sure whether, for git to be able to generate deltas
well, you have to have a DAG of commits, so that Git can notice what
you have and what you have not. Trees might not be enough here. (!)
> > Commit = tree + parent + metadata.
>
> Actually, any number of parents, including none. What metadata do I
> have to provide? At this time nothing, I think, beyond that provided
> by the name of a reference (to the root of a tree).
Metadata = commit message (here you can e.g. put the official name of
the DVD), plus author and committer info (name, email, date and time,
timezone; the date and time you can get from the mtime / creation time
of the DVD).
[cut]
--
Jakub Narębski
* Re: A Python script to put CTAN into git (from DVDs)
2011-11-07 21:50 ` Jakub Narebski
@ 2011-11-07 22:03 ` Jonathan Fine
0 siblings, 0 replies; 7+ messages in thread
From: Jonathan Fine @ 2011-11-07 22:03 UTC (permalink / raw)
To: Jakub Narebski; +Cc: python-list, git
On 07/11/11 21:49, Jakub Narebski wrote:
[snip]
> But now I understand that you are just building tree objects, and
> creating references to them (with implicit ordering given by names,
> I guess). This is to be the start of further work, isn't it?
Yes, that's exactly the point, and my apologies if I was not clear enough.
I'll post again when I've finished the script and placed several years
of DVDs into git. Then the discussion will be more concrete - we have
this tree, how do we make it more useful?
Thank you for your contributions, particularly for telling me about gitPAN.
--
Jonathan
end of thread, other threads:[~2011-11-07 22:06 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-11-06 15:17 A Python script to put CTAN into git (from DVDs) Jonathan Fine
2011-11-06 16:42 ` Jakub Narebski
[not found] ` <mailman.2464.1320597747.27778.python-list@python.org>
2011-11-06 18:19 ` Jonathan Fine
2011-11-06 20:29 ` Jakub Narebski
2011-11-07 20:21 ` Jonathan Fine
2011-11-07 21:50 ` Jakub Narebski
2011-11-07 22:03 ` Jonathan Fine