Introduction and Wikipedia and Git Blame

* Introduction and Wikipedia and Git Blame
@ 2009-10-16  9:07 jamesmikedupont
  2009-10-16 11:26 ` Johannes Schindelin
  0 siblings, 1 reply; 15+ messages in thread
From: jamesmikedupont @ 2009-10-16  9:07 UTC (permalink / raw)
  To: git

Hi all,

I would like to say Hi! Git is great.

I made a hack to import the wikipedia changelogs into git, it is free
software and all checked in. I will be improving it to keep the git
repo in sync.

Here is the discussion on foundation-l :
http://www.gossamer-threads.com/lists/wiki/foundation/181163

the question is, is there a blame tool that we can use for multiple
horizontal diffs on the same line that will be needed for wikipedia
articles?

If not, I would work on this, if you give me some pointers.

thanks,
mike

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Introduction and Wikipedia and Git Blame
  2009-10-16  9:07 Introduction and Wikipedia and Git Blame jamesmikedupont
@ 2009-10-16 11:26 ` Johannes Schindelin
  2009-10-16 11:38   ` Martin Langhoff
  2009-10-16 11:43   ` jamesmikedupont
  0 siblings, 2 replies; 15+ messages in thread
From: Johannes Schindelin @ 2009-10-16 11:26 UTC (permalink / raw)
  To: jamesmikedupont@googlemail.com; +Cc: git

Hi,

On Fri, 16 Oct 2009, jamesmikedupont@googlemail.com wrote:

> I made a hack to import the wikipedia changelogs into git, it is free
> software and all checked in. I will be improving it to keep the git
> repo in sync.

This is cool!  I actually wanted this for quite some time, and could not 
find the time to do it myself.

> Here is the discussion on foundation-l :
> http://www.gossamer-threads.com/lists/wiki/foundation/181163

I found the link to the bazaar repository there, but do you have a Git 
repository, too?

> the question is, is there a blame tool that we can use for multiple 
> horizontal diffs on the same line that will be needed for wikipedia 
> articles?

I am not quite sure what you want to do horizontally there... Can you 
explain what you want to see?

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Introduction and Wikipedia and Git Blame
  2009-10-16 11:26 ` Johannes Schindelin
@ 2009-10-16 11:38   ` Martin Langhoff
  2009-10-16 11:43   ` jamesmikedupont
  1 sibling, 0 replies; 15+ messages in thread
From: Martin Langhoff @ 2009-10-16 11:38 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: jamesmikedupont@googlemail.com, git

On Fri, Oct 16, 2009 at 1:26 PM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
> I am not quite sure what you want to do horizontally there... Can you
> explain what you want to see?

Highlight the changed bits on the line. Example - the red-bold highlight in:

http://en.wikipedia.org/w/index.php?title=David_Letterman&action=historysubmit&diff=320061135&oldid=320060840


m
-- 
 martin.langhoff@gmail.com
 martin@laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Introduction and Wikipedia and Git Blame
  2009-10-16 11:26 ` Johannes Schindelin
  2009-10-16 11:38   ` Martin Langhoff
@ 2009-10-16 11:43   ` jamesmikedupont
  2009-10-16 14:11     ` Johannes Schindelin
  1 sibling, 1 reply; 15+ messages in thread
From: jamesmikedupont @ 2009-10-16 11:43 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git

On Fri, Oct 16, 2009 at 1:26 PM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
>> Here is the discussion on foundation-l :
>> http://www.gossamer-threads.com/lists/wiki/foundation/181163
>
> I found the link to the bazaar repository there, but do you have a Git
> repository, too?

Not yet. Where should I put it?  Any suggestions?

>> the question is, is there a blame tool that we can use for multiple
>> horizontal diffs on the same line that will be needed for wikipedia
>> articles?
>
> I am not quite sure what you want to do horizontally there... Can you
> explain what you want to see?

Yes, I would like to see all the contributors to each word or line.

Basically one line of blame per contributor, so many lines of output.
Ideally we would have something that is usable in an HTML display. Let's
say, just a blame attribute for each word. So, on one line:

This is a line with two changes first change Second change  end of line

It would look like this in html :
This is a line with two changes <span blame=revisionid>first
change</span><span blame=revisionid>Second change</span> end of line

The blame edit could look like this :
REVISION ID 1    48     :  This is a line with two changes first
change first change \
REVISION ID 2  48 C:   Second change end of line
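
A minimal Python sketch of the HTML form proposed above, assuming the blame
data is already available as (revision id, text) pairs. The function name
render_blame_html and the revision ids 101 and 102 are made up for
illustration; only the blame attribute comes from the proposal itself.

```python
from html import escape


def render_blame_html(spans):
    """Render (revision_id, text) pairs as HTML.

    A revision_id of None marks text that carries no blame span.
    """
    parts = []
    for revision_id, text in spans:
        if revision_id is None:
            parts.append(escape(text))
        else:
            parts.append(
                '<span blame="%s">%s</span>'
                % (escape(str(revision_id), quote=True), escape(text))
            )
    return "".join(parts)


# Hypothetical input: the example line from the mail, with invented ids.
example = [
    (None, "This is a line with two changes "),
    (101, "first change"),
    (102, "Second change"),
    (None, " end of line"),
]
print(render_blame_html(example))
```

Note that blame is not a standard HTML attribute; a real page would
probably want something like data-blame instead.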


let me see if I can find an online example.

Here is a blame tool with links to the edits:
http://hewgill.com/journal/entries/461-wikipedia-blame

here is the wikitrust tool that could be interesting :
http://wikitrust.soe.ucsc.edu/
http://wikitrust.collaborativetrust.com/screenshots

Thanks,
mike

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Introduction and Wikipedia and Git Blame
  2009-10-16 11:43   ` jamesmikedupont
@ 2009-10-16 14:11     ` Johannes Schindelin
  2009-10-16 14:23       ` jamesmikedupont
  2009-10-16 17:04       ` Junio C Hamano
  0 siblings, 2 replies; 15+ messages in thread
From: Johannes Schindelin @ 2009-10-16 14:11 UTC (permalink / raw)
  To: jamesmikedupont@googlemail.com; +Cc: git

Hi,

On Fri, 16 Oct 2009, jamesmikedupont@googlemail.com wrote:

> On Fri, Oct 16, 2009 at 1:26 PM, Johannes Schindelin
> <Johannes.Schindelin@gmx.de> wrote:
> >> Here is the discussion on foundation-l :
> >> http://www.gossamer-threads.com/lists/wiki/foundation/181163
> >
> > I found the link to the bazaar repository there, but do you have a Git
> > repository, too?
> 
> Not yet. Where should I put it?  Any suggestions.

github.com has a nice interface.

BTW after reading some of the code, I am a bit surprised that you did not 
do it as a .php script outputting fast-import capable text...

> >> the question is, is there a blame tool that we can use for multiple 
> >> horizontal diffs on the same line that will be needed for wikipedia 
> >> articles?
> >
> > I am not quite sure what you want to do horizontally there... Can you
> > explain what you want to see?
> 
> Yes, I would like to see all the contributors to each word or line.
> 
> Basically one line of blame per contributor, so many lines of output.
> Ideally we would have something that is usable in a html display. Lets
> say, just an blame attribute for each word. so on one line :
> 
> This is a line with two changes first change Second change  end of line
> 
> It would look like this in html :
> This is a line with two changes <span blame=revisionid>first
> change</span><span blame=revisionid>Second change</span> end of line
> 
> The blame edit could look like this :
> REVISION ID 1    48     :  This is a line with two changes first
> change first change \
> REVISION ID 2  48 C:   Second change end of line

Okay, so basically you want to analyze the text on a word-by-word basis 
rather than line-by-line.

Or maybe even better: you want to analyze the text character-by-character.  
That would also nicely sidestep having to specify just what makes a word a 
word (the subject of a lot of heated discussion during the design of the 
--color-words=<regex> patch).

Basically, if I had to implement that, I would not try to modify 
builtin-blame.c, but write a new program linking to libgit.a, calling the 
revision walker on the file you want to calculate the blame for.  (One of 
the best examples is probably in builtin-shortlog.c.)

Then I would introduce a linked-list structure which will hold the blamed 
regions in this form:

	struct region {
		int start;
		struct region *next;
	};

Initially, this would have a start element with the start offset 0 
pointing to the end element with start offset being set to the size of the 
blob.

Most likely you will have to add members to this struct, such as the 
original offsets (as you will have to adjust the offsets to the different 
file revisions while you go back in time), and the commit it was 
attributed to.

Then I would make modified "texts" from the blob of the file in the 
current revision and its parent revision, by inserting newlines after 
every single byte (probably replacing the original newlines by other 
values, such as \x01).

The reason for this touchup is that the diff machinery in Git only handles 
line-based diffs.
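
The touchup described above could be sketched in Python roughly as follows.
This is only an illustration: the function names are invented, and it
assumes that the stand-in byte \x01 never occurs in the original blob (if it
does, the round trip is no longer exact and another stand-in is needed).

```python
NEWLINE_STAND_IN = b"\x01"  # assumed not to occur in the input


def to_byte_per_line(blob: bytes) -> bytes:
    """Put every byte on its own line, replacing real newlines by \\x01."""
    out = bytearray()
    for value in blob:
        byte = bytes([value])
        out += NEWLINE_STAND_IN if byte == b"\n" else byte
        out += b"\n"
    return bytes(out)


def from_byte_per_line(text: bytes) -> bytes:
    """Undo to_byte_per_line()."""
    out = bytearray()
    # Every record is exactly two bytes: the payload byte and a newline.
    for i in range(0, len(text), 2):
        byte = text[i:i + 1]
        out += b"\n" if byte == NEWLINE_STAND_IN else byte
    return bytes(out)


print(to_byte_per_line(b"ab\nc"))
```

With this layout, line N of the converted text corresponds to byte offset
N - 1 of the original blob, which is what makes the hunk headers usable as
offsets.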

Then you can parse the hunk headers, adjust the offsets accordingly, and 
attribute the +++ regions to the current commit (by construction, the 
offsets are equal to the line number in the hunk header).  Here it is most 
likely necessary to split the regions.

You should also keep a counter of how many regions are still unattributed, 
so you can stop early.
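
As a rough illustration of the bookkeeping, here is a Python sketch. It is
not the C design above: it swaps the linked list for a sorted list of start
offsets, counts unattributed bytes rather than regions, and leaves out the
offset adjustment between file revisions. The class and method names are
invented.

```python
import bisect


class RegionList:
    """Regions of a blob, each either unattributed (None) or owned by a commit."""

    def __init__(self, size):
        self.starts = [0, size]   # the last entry is the end sentinel
        self.commits = [None]     # commits[i] owns [starts[i], starts[i + 1])
        self.unattributed = size  # bytes that no commit has claimed yet

    def _split(self, offset):
        """Make sure a region boundary exists at offset; return its index."""
        i = bisect.bisect_left(self.starts, offset)
        if self.starts[i] != offset:
            self.starts.insert(i, offset)
            self.commits.insert(i, self.commits[i - 1])
        return i

    def attribute(self, start, end, commit):
        """Give the still-unclaimed bytes in [start, end) to commit."""
        lo = self._split(start)
        hi = self._split(end)
        for i in range(lo, hi):
            if self.commits[i] is None:
                self.commits[i] = commit
                self.unattributed -= self.starts[i + 1] - self.starts[i]

    def regions(self):
        return [(self.starts[i], self.starts[i + 1], commit)
                for i, commit in enumerate(self.commits)]


regions = RegionList(10)
regions.attribute(3, 6, "newer-commit")   # bytes added by the newest commit
regions.attribute(0, 10, "older-commit")  # everything left goes further back
print(regions.regions(), regions.unattributed)
```

Once unattributed reaches zero, the walk over older revisions can stop.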

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Introduction and Wikipedia and Git Blame
  2009-10-16 14:11     ` Johannes Schindelin
@ 2009-10-16 14:23       ` jamesmikedupont
  2009-10-16 17:04       ` Junio C Hamano
  1 sibling, 0 replies; 15+ messages in thread
From: jamesmikedupont @ 2009-10-16 14:23 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git

Johannes,
Thanks for your input,
comments below.
mfg,
mike

On Fri, Oct 16, 2009 at 4:11 PM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
> Hi,
>
> On Fri, 16 Oct 2009, jamesmikedupont@googlemail.com wrote:
>
>> On Fri, Oct 16, 2009 at 1:26 PM, Johannes Schindelin
>> <Johannes.Schindelin@gmx.de> wrote:
>> >> Here is the discussion on foundation-l :
>> >> http://www.gossamer-threads.com/lists/wiki/foundation/181163
>> >
>> > I found the link to the bazaar repository there, but do you have a Git
>> > repository, too?
>>
>> Not yet. Where should I put it?  Any suggestions.
>
> github.com has a nice interface.
>
> BTW after reading some of the code, I am a bit surprised that you did not
> do it as a .php script outputting fast-import capable text...

I don't really know PHP, and I don't have a debugger or any tools for it.
I really cannot understand how people can work in such an environment.

I have done all my hacking work as Perl scripts.
These can be rewritten in C later on.


> Okay, so basically you want to analyze the text on a word-by-word basis
> rather than line-by-line.
yes.

>
> Or maybe even better: you want to analyze the text character-by-character.
> That would also nicely sidestep having to specify just what makes a word a word
> (subject for a lot of heated discussion during the design of the
> --color-words=<regex> patch).

Yes. Someone on IRC suggested reviewing the color-words code; I have the
source code now and will be looking into that.

>
> Basically, if I had to implement that, I would not try to modify
> builtin-blame.c, but write a new program linking to libgit.a, calling the
> revision walker on the file you want to calculate the blame for.  (One of
> the best examples is probably in builtin-shortlog.c.)
>
> Then I would introduce a linked-list structure which will hold the blamed
> regions in this form:
>
>        struct region {
>                int start;
>                struct region *next;
>        };
>
> Initially, this would have a start element with the start offset 0
> pointing to the end element with start offset being set to the size of the
> blob.
>
> Most likely you will have to add members to this struct, such as the
> original offsets (as you will have to adjust the offsets to the different
> file revisions while you go back in time), and the commit it was
> attributed to.
>
> Then I would make modified "texts" from the blob of the file in the
> current revision and its parent revision, by inserting newlines after
> every single byte (probably replacing the original newlines by other
> values, such as \x01).
>
> The reason for this touchup is that the diff machinery in Git only handles
> line-based diffs.
>
> Then you can parse the hunk headers, adjust the offsets accordingly, and
> attribute the +++ regions to the current commit (by construction, the
> offsets are equal to the line number in the hunk header).  Here it is most
> likely necessary to split the regions.
>
> You should also have a counter how many regions are still unattributed so
> you can stop early.

Ok this sounds like a plan. I think that will be a good outline to
start some work.
I will let you know when I have made some progress.
thanks,
mike

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Introduction and Wikipedia and Git Blame
  2009-10-16 14:11     ` Johannes Schindelin
  2009-10-16 14:23       ` jamesmikedupont
@ 2009-10-16 17:04       ` Junio C Hamano
  2009-10-16 18:00         ` jamesmikedupont
  1 sibling, 1 reply; 15+ messages in thread
From: Junio C Hamano @ 2009-10-16 17:04 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: jamesmikedupont@googlemail.com, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> Then I would make modified "texts" from the blob of the file in the 
> current revision and its parent revision, by inserting newlines after 
> every single byte (probably replacing the original newlines by other 
> values, such as \x01).
>
> The reason for this touchup is that the diff machinery in Git only handles 
> line-based diffs.
>
> Then you can parse the hunk headers, adjust the offsets accordingly,...

I would agree that text converted to "byte-per-line" format would be the
easiest way to re-use the diff engine, but if you go one more step, you
can even reuse the blame engine as well.  You convert the text into
"byte-in-hex-and-lf" (e.g. "AB C\n" becomes "41\n42\n20\n43\n0a\n") and
feed it into existing blame and have it produce script-readable output,
instead of feeding that to your reinvention of blame using diff engine.

You would need to postprocess the computed result (either by diff or
blame) to lay out the final text output in either case anyway, and making
the existing blame engine do the work for you would be a better approach,
I think.
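
The "byte-in-hex-and-lf" conversion could look like this in Python. The
function names are invented for illustration; the expected result is the
example given above.

```python
def to_hex_lines(blob: bytes) -> str:
    """Write each byte as two lowercase hex digits on its own line."""
    return "".join("%02x\n" % value for value in blob)


def from_hex_lines(text: str) -> bytes:
    """Undo to_hex_lines()."""
    return bytes(int(line, 16) for line in text.splitlines())


print(repr(to_hex_lines(b"AB C\n")))
```

Because every byte, including the newline itself, becomes a printable
two-digit line, no stand-in value for the original newlines is needed.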

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Introduction and Wikipedia and Git Blame
  2009-10-16 17:04       ` Junio C Hamano
@ 2009-10-16 18:00         ` jamesmikedupont
  2009-10-16 19:00           ` Junio C Hamano
  0 siblings, 1 reply; 15+ messages in thread
From: jamesmikedupont @ 2009-10-16 18:00 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Johannes Schindelin, git

On Fri, Oct 16, 2009 at 7:04 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
>> Then I would make modified "texts" from the blob of the file in the
>> current revision and its parent revision, by inserting newlines after
>> every single byte (probably replacing the original newlines by other
>> values, such as \x01).
>>
>> The reason for this touchup is that the diff machinery in Git only handles
>> line-based diffs.
>>
>> Then you can parse the hunk headers, adjust the offsets accordingly,...
>
> I would agree that text converted to "byte-per-line" format would be the
> easiest way to re-use the diff engine, but if you go one more step, you
> can even reuse the blame engine as well.  You convert the text into
> "byte-in-hex-and-lf" (e.g. "AB C\n" becomes "41\n42\n20\n43\n0a\n") and
> feed it into existing blame and have it produce script-readable output,
> instead of feeding that to your reinvention of blame using diff engine.
>
> You would need to postprocess the computed result (either by diff or
> blame) to lay out the final text output in either case anyway, and making
> the existing blame engine do the work for you would be a better approach,
> I think.

Can you please tell me what the basic algorithm of the blame engine is?
I will have to start reading the code.
How can it tell the author of a given line? I like the idea of one
line per char; even the newlines would be encoded that way. If it is a
Unicode char, it might be multibyte.

The script would get the blame per byte and then recode that into
something visible.

od, the octal dump utility, comes to mind:
od -t x1 -w1 (GNU od) will output the file one byte per line.

Now what about the ability to just pipe the file via some tool and
then run blame on that. It would just start the line with the byte
offset and blame would emit the blame for that offset and emit the
text that is following it.

so for example (two-byte octal words, one per line):
od -w2 somefile :
///////////////////////////////
Offset       value
======= ======
0052752 065347
0052754 030356
0052756 035741
0052760 136302
0052762 035346

Here we see the lines are 0052762 - 0052760 = 2 bytes apart.

and then if you want wider diffs :
od somefile
////////////////////////////////////////////
Offset       values
======= ====== ====== ====== ====== ====== ====== ====== ======
0074520 051754 162613 057705 155520 047032 043654 175550 062704
0074540 164400 060340 123434 030350 040457 136010 042270 170525
0074560 165053 124677 125776 031370 000006 102076 060060 052434
0074600 176452 140240 074007 130113 100424 020010 130773 103467
0074620 052776 052421 021544 101357 120035 107562 072641 053636

Here we see the lines are 0074540 - 0074520 = 20 (octal), i.e. 16 bytes, apart.

That way the blame tool will not be concerned with the formatting or
content, the users can write filters like they want, and blame would
only expect a byte offset...

That way, we could write something like this :
grep -b x Test.xml
0:<?xml version="1.0" encoding="UTF-8"?>
39:<gpx
107:  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
then we would get blames for those byte offsets, very simple.

We could reduce this down to: make blame take a list of byte positions.
grep -b '' Test.gpx (the byte offset of every line) would be the standard
behavior: emit the blame per newline.
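
For illustration, the default list of byte positions (the start of every
line, as grep -b prints them) could be computed like this in Python. The
function name is invented and the sample is just the first two lines of the
XML example above.

```python
def line_start_offsets(blob: bytes):
    """Return the byte offset at which each line of blob starts."""
    offsets = [0] if blob else []
    for index, value in enumerate(blob):
        # A new line starts right after each newline, unless the newline
        # is the very last byte of the blob.
        if value == 0x0A and index + 1 < len(blob):
            offsets.append(index + 1)
    return offsets


sample = b'<?xml version="1.0" encoding="UTF-8"?>\n<gpx\n'
print(line_start_offsets(sample))
```

Any other filter (words, sentences, XML elements) would simply produce a
different list of offsets for the same blob.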

mike

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Introduction and Wikipedia and Git Blame
  2009-10-16 18:00         ` jamesmikedupont
@ 2009-10-16 19:00           ` Junio C Hamano
  2009-10-16 20:05             ` Junio C Hamano
  0 siblings, 1 reply; 15+ messages in thread
From: Junio C Hamano @ 2009-10-16 19:00 UTC (permalink / raw)
  To: jamesmikedupont@googlemail.com; +Cc: Johannes Schindelin, git

"jamesmikedupont@googlemail.com" <jamesmikedupont@googlemail.com> writes:

>> You would need to postprocess the computed result (either by diff or
>> blame) to lay out the final text output in either case anyway, and making
>> the existing blame engine do the work for you would be a better approach,
>> I think.
>
> Please can you tell me what is the basic algorithm of the blame engine?

I think this is one of the most comprehensive write-ups on the algorithm:

  http://thread.gmane.org/gmane.comp.version-control.git/28826/focus=28895

The whole thread (at least what I wrote in it) is worth reading if you
want to understand what the current code does.  The first message in the
thread talks about "NEEDSWORK" label on an unimplemented part of the code,
and says "we could", but these gaps were since filled.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Introduction and Wikipedia and Git Blame
  2009-10-16 19:00           ` Junio C Hamano
@ 2009-10-16 20:05             ` Junio C Hamano
  2009-10-16 21:19               ` jamesmikedupont
  0 siblings, 1 reply; 15+ messages in thread
From: Junio C Hamano @ 2009-10-16 20:05 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: jamesmikedupont@googlemail.com, Johannes Schindelin, git

Junio C Hamano <gitster@pobox.com> writes:

> "jamesmikedupont@googlemail.com" <jamesmikedupont@googlemail.com> writes:
>
>>> You would need to postprocess the computed result (either by diff or
>>> blame) to lay out the final text output in either case anyway, and making
>>> the existing blame engine do the work for you would be a better approach,
>>> I think.
>>
>> Please can you tell me what is the basic algorithm of the blame engine?
>
> I think this is one of the most comprehensive write-ups on the algorithm:
>
>   http://thread.gmane.org/gmane.comp.version-control.git/28826/focus=28895
>
> The whole thread (at least what I wrote in it) is worth reading if you
> want to understand what the current code does.  The first message in the
> thread talks about "NEEDSWORK" label on an unimplemented part of the code,
> and says "we could", but these gaps were since filled.

Ah, nevermind.  The thread is the definitive description of the blame
algorithm, but I agree with Dscho that in this case, you either have to
change blame itself to do this "byte-wise" comparison internally between
versions, or re-do the blame logic yourself like Dscho suggests.  Dscho is
right in this case; an unmodified blame engine, unless you feed a history
that is converted to use the byte-per-line format, won't help you at all.

So the choice is between rolling a custom byte-wise blame algorithm
yourself and teaching a new byte-wise mode to the existing blame engine.
Sorry for making the task sound much easier than it would be.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Introduction and Wikipedia and Git Blame
  2009-10-16 20:05             ` Junio C Hamano
@ 2009-10-16 21:19               ` jamesmikedupont
  2009-10-16 23:25                 ` Junio C Hamano
  0 siblings, 1 reply; 15+ messages in thread
From: jamesmikedupont @ 2009-10-16 21:19 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Johannes Schindelin, git

What do you think of my idea to create blames along specific
user-defined byte positions?
Please review my suggestion and comment.

mike

On Fri, Oct 16, 2009 at 10:05 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Junio C Hamano <gitster@pobox.com> writes:
>
>> "jamesmikedupont@googlemail.com" <jamesmikedupont@googlemail.com> writes:
>>
>>>> You would need to postprocess the computed result (either by diff or
>>>> blame) to lay out the final text output in either case anyway, and making
>>>> the existing blame engine do the work for you would be a better approach,
>>>> I think.
>>>
>>> Please can you tell me what is the basic algorithm of the blame engine?
>>
>> I think this is one of the most comprehensive write-ups on the algorithm:
>>
>>   http://thread.gmane.org/gmane.comp.version-control.git/28826/focus=28895
>>
>> The whole thread (at least what I wrote in it) is worth reading if you
>> want to understand what the current code does.  The first message in the
>> thread talks about "NEEDSWORK" label on an unimplemented part of the code,
>> and says "we could", but these gaps were since filled.
>
> Ah, nevermind.  The thread is the definitive description of the blame
> algorithm, but I agree with Dscho that in this case, you either have to
> change blame itself to do this "byte-wise" comparison internally between
> versions, or re-do the blame logic yourself like Dscho suggests.  Dscho is
> right in this case; an unmodified blame engine, unless you feed a history
> that is converted to use the byte-per-line format, won't help you at all.
>
> So it would be either between rolling a custom byte-wise blame algorithm
> yourself and teaching a new byte-wise mode to existing blame engine.
> Sorry for making the task sound much easier than it would be.
>
>
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Introduction and Wikipedia and Git Blame
  2009-10-16 21:19               ` jamesmikedupont
@ 2009-10-16 23:25                 ` Junio C Hamano
  2009-10-17  6:50                   ` jamesmikedupont
  0 siblings, 1 reply; 15+ messages in thread
From: Junio C Hamano @ 2009-10-16 23:25 UTC (permalink / raw)
  To: jamesmikedupont@googlemail.com; +Cc: Junio C Hamano, Johannes Schindelin, git

"jamesmikedupont@googlemail.com" <jamesmikedupont@googlemail.com> writes:

> What do you think of my idea to create blames along a specific user
> defined byte positions ?

Overly complicated and not enough time for _review_.  If you are blaming
one-byte (or one-char) per line, wouldn't it be enough to consider the
line number in the output as byte (or char) position when reconstituting
the original text?
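
A small Python sketch of that reconstitution step, assuming the per-line
blame output has already been reduced to one commit id per line (that is,
per byte). The function name and the commit ids are invented.

```python
from itertools import groupby


def blame_runs(commits, blob: bytes):
    """Group per-byte blame into (commit, start_offset, text) runs.

    commits[i] is the commit blamed for byte i of blob; since blame was run
    on a one-byte-per-line file, that is line i + 1 of its output.
    """
    if len(commits) != len(blob):
        raise ValueError("need exactly one blamed commit per byte")
    runs = []
    offset = 0
    for commit, group in groupby(commits):
        length = sum(1 for _ in group)
        runs.append((commit, offset, blob[offset:offset + length]))
        offset += length
    return runs


print(blame_runs(["a", "a", "b", "b", "b", "a"], b"AB C\nX"))
```

Each run carries its starting byte offset, so no separate list of byte
positions has to be passed to blame.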

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Introduction and Wikipedia and Git Blame
  2009-10-16 23:25                 ` Junio C Hamano
@ 2009-10-17  6:50                   ` jamesmikedupont
  2009-10-17 16:42                     ` jamesmikedupont
  0 siblings, 1 reply; 15+ messages in thread
From: jamesmikedupont @ 2009-10-17  6:50 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Johannes Schindelin, git

Thank you very much for your input and advice,
I have a lot to learn about this great tool.
I am working on learning how the existing blame tool runs now.
Will report back when I have some code.
mike

On Sat, Oct 17, 2009 at 1:25 AM, Junio C Hamano <gitster@pobox.com> wrote:
> "jamesmikedupont@googlemail.com" <jamesmikedupont@googlemail.com> writes:
>
>> What do you think of my idea to create blames along a specific user
>> defined byte positions ?
>
> Overly complicated and not enough time for _review_.  If you are blaming
> one-byte (or one-char) per line, wouldn't it be enough to consider the
> line number in the output as byte (or char) position when reconstituting
> the original text?
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Introduction and Wikipedia and Git Blame
  2009-10-17  6:50                   ` jamesmikedupont
@ 2009-10-17 16:42                     ` jamesmikedupont
  2009-10-22  6:41                       ` jamesmikedupont
  0 siblings, 1 reply; 15+ messages in thread
From: jamesmikedupont @ 2009-10-17 16:42 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Johannes Schindelin, git

I have done a workaround hack.
Today I attempted to hack the blame code, but it did not work; I need to
do more research.

But I did get a new version of the import script running and word-level
blame going.

http://fmtyewtk.blogspot.com/2009/10/mediawiki-git-word-level-blaming-one.html

Next step is ready :

1. I have a single script that will pull a given article and check
the revisions into Git.
It is not perfect, but it works.

http://bazaar.launchpad.net/~jamesmikedupont/+junk/wikiatransfer/revision/8
You run it like this, from inside a Git repo:

perl GetRevisions.pl "Article_Name"

git blame Article_Name/Article.xml
git push origin master

The code that splits up the line is in Process File; it turns every
space into a backslash followed by a newline.
That way we get a word-level blame.

    if ($insidetext) {
        # Replace every space with a backslash followed by a newline,
        # so that each word ends up on its own line.
        s/ /\\\n/g;

        print OUT $_;
    }
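
For readers who do not use Perl, the same substitution and its inverse can
be sketched in Python. The function names are invented, and the inverse is
only exact as long as the article text does not itself contain a backslash
directly followed by a newline.

```python
def split_words(text: str) -> str:
    """Replace every space with a backslash followed by a newline."""
    return text.replace(" ", "\\\n")


def join_words(text: str) -> str:
    """Undo split_words(); assumes no pre-existing backslash-newline pairs."""
    return text.replace("\\\n", " ")


print(split_words("one two three"))
```

The trailing backslash marks which line breaks were spaces in the original,
so the article can be put back together after git blame has run.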

The Article is here:
http://github.com/h4ck3rm1k3/KosovoWikipedia/blob/master/Wiki/2008_Kosovo_declaration_of_independence/article.xml

here are the blame results.
http://github.com/h4ck3rm1k3/KosovoWikipedia/blob/master/Wiki/2008_Kosovo_declaration_of_independence/wordblame.txt

The problem is that GitHub does not like this amount of processor power
being used and kills the process; you can do a local git blame instead.

Now we have the tool to easily create a repository from wikipedia, or
any other export enabled mediawiki.

mike

On Sat, Oct 17, 2009 at 8:50 AM, jamesmikedupont@googlemail.com
<jamesmikedupont@googlemail.com> wrote:
> Thank you very much for your input and advice,
> I have a lot to learn about this great tool.
> I am working on learning how the existing blame tool runs now.
> Will report back when I have some code.
> mike
>
> On Sat, Oct 17, 2009 at 1:25 AM, Junio C Hamano <gitster@pobox.com> wrote:
>> "jamesmikedupont@googlemail.com" <jamesmikedupont@googlemail.com> writes:
>>
>>> What do you think of my idea to create blames along a specific user
>>> defined byte positions ?
>>
>> Overly complicated and not enough time for _review_.  If you are blaming
>> one-byte (or one-char) per line, wouldn't it be enough to consider the
>> line number in the output as byte (or char) position when reconstituting
>> the original text?
>>
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Introduction and Wikipedia and Git Blame
  2009-10-17 16:42                     ` jamesmikedupont
@ 2009-10-22  6:41                       ` jamesmikedupont
  0 siblings, 0 replies; 15+ messages in thread
From: jamesmikedupont @ 2009-10-22  6:41 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Johannes Schindelin, git

Hi all,
I have created a group here, mediawiki-vcs, and you are invited to join.
It will be for creating a Git/VCS backend for MediaWiki.

http://groups.google.com/group/mediawiki-vcs/browse_thread/thread/ad3e0a194c8ac1d5#

Also, I have started to document the Git internal structure, with the
idea of a GitBus, a D-Bus-like system for doing RPC calls over Git for
expensive and repeatable operations.
http://github.com/h4ck3rm1k3/GitBus

thanks,
mike

^ permalink raw reply	[flat|nested] 15+ messages in thread
