Graph sloc tool for git repos

* Graph sloc tool for git repos
@ 2016-03-12 11:00 Kai Hendry
  2016-03-14  2:35 ` Jeff King
  0 siblings, 1 reply; 3+ messages in thread
From: Kai Hendry @ 2016-03-12 11:00 UTC (permalink / raw)
  To: git

Hi there,

I penned a script to plot SLOC of a git project using GNUplot & I
thought the fastest way to count code fluctuations was via `git show
--numstat`.

However that requires some awk counting of the lines:
https://github.com/kaihendry/graphsloc/blob/5f31e388e9b655e1801f13885f4311d221663a19/collect-stats.sh#L32

Is there a better way I missed? I think there is a bug since my graph was
a factor of 10 out whilst graphing Linux:
https://twitter.com/kaihendry/status/706627679924174848

(Though the shape looks right)

The good news is that it's generating graphs for smaller projects just fine:
http://s.natalian.org/2016-03-12/dwm-3465bed.csv.svg

Anyway, would love to get your feedback on
https://github.com/kaihendry/graphsloc

Kind regards from Petaling Jaya,


* Re: Graph sloc tool for git repos
  2016-03-12 11:00 Graph sloc tool for git repos Kai Hendry
@ 2016-03-14  2:35 ` Jeff King
  2016-03-14  2:50   ` Jeff King
  0 siblings, 1 reply; 3+ messages in thread
From: Jeff King @ 2016-03-14  2:35 UTC (permalink / raw)
  To: Kai Hendry; +Cc: git

On Sat, Mar 12, 2016 at 07:00:26PM +0800, Kai Hendry wrote:

> I penned a script to plot SLOC of a git project using GNUplot & I
> thought the fastest way to count code fluctuations was via `git show
> --numstat`.
> 
> However that requires some awk counting of the lines:
> https://github.com/kaihendry/graphsloc/blob/5f31e388e9b655e1801f13885f4311d221663a19/collect-stats.sh#L32
> 
> Is there a better way I missed? I think there is a bug since my graph was
> a factor of 10 out whilst graphing Linux:
> https://twitter.com/kaihendry/status/706627679924174848

I think you'll always need to post-process the --numstat output to count
up lines. But that's fairly minor.

The bigger problem, I think, is in how you handle merges. Imagine I have
two branches, each of which touches the same code:

  git init
  echo base >file
  git add file
  git commit -m base

  echo master >>file
  git commit -am master

  git checkout -b side HEAD^
  echo side >>file
  git commit -am side

and now I merge them, resolving the conflict in favor of one side:

  git merge master
  { echo base; echo master; } >file
  git commit -am resolved

What does --numstat say?

  $ git log --cc --oneline --numstat
  989c6f7 resolved
  1       1       file
  b9bbaf9 side
  1       0       file
  087b294 master
  1       0       file
  09037ef base
  1       0       file

If we add these up, it looks like 3 lines were added. But the end result
has only 2 lines! We double-counted the additions on the two branches,
even though one stomped on the other. And then the merge resolution
looks neutral (one line gone, one line added), even though it was where
we did the stomping.

To be honest, I am not sure _what_ the "--cc --numstat" is showing
there (I added --cc because that is used by "git show", which is what
your script uses). The actual "--cc" (and "-c") patches show nothing,
which is right. It kind of looks like we are just showing the diffstat
against the first parent. I'm not sure anyone ever designed what a
"combined" diffstat would look like.

Let's redo our merge and resolve in favor of "side":

  git reset --hard HEAD^
  git merge master
  { echo base; echo side; } >file
  git commit -am resolved

Now the numstat is blank. I think it _is_ just showing the first-parent
diffstat. So that means each merge is introducing errors into your
count, and you'll drift away from accurate.

Another, related problem is that you are adding up numbers along
multiple simultaneous branches. So going back to my output above, at the
point we read the "master" commit, we might say that the total
sloc-count is 2.  That's correct. And then we read "side", and say there
are 3 lines. But that's not right. It doesn't build on master, so we
still have only 2 lines. So even _if_ we merged them and the end result
had 3 lines (or if we somehow accounted for the double-counting when
examining the merge), you'd have inaccuracies through your dataset.

Another way of thinking about it is that your graph wants to represent a
single linear history, with the sloc-count changing as time moves to the
right. But that's not what really happened; at any given time, there are
_several_ sloc-counts, depending on which branch you're following.

But that can also give us a clue about one solution[1].  For your
purposes, you don't care about hitting every commit. You just want a
bunch of linear samples of the form [timestamp, sloc] to feed to
gnuplot. We can use "--first-parent" to walk _a_ linear history and see
a strict progression. And then use "-m" to tell git to just show merges
as the diff against that first parent (i.e., summarizing everything that
happened along the side-branch we are not following).

Like (this is back on the "we resolved as master" version of my example,
to illustrate how the merge is shown):

  $ git log --first-parent -m --numstat --oneline
  4244c8a resolved
  1       1       file
  b9bbaf9 side
  1       0       file
  09037ef base
  1       0       file

And that count is right. We had one line in our base, one on the
"side" commit, and then the merge didn't change our sloc-count at all
(we dropped the "side" line in favor of "master").
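
A rough sketch of that arithmetic, using only the two sample logs quoted
in this message (rebuilt here with printf so the tab separators that
--numstat emits survive). The helper name net_lines is hypothetical and
not part of git; it simply sums added minus deleted over every numstat
line it sees:

```shell
#!/bin/sh
# Sum (added - deleted) over numstat lines read from stdin.
# numstat lines are tab-separated "<added>\t<deleted>\t<path>"; commit
# subject lines and binary entries ("-\t-\t<path>") do not match and are skipped.
net_lines() {
    awk -F'\t' 'NF >= 3 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { total += $1 - $2 }
                END { print total + 0 }'
}

# Sample data copied from the "git log --cc --oneline --numstat" output.
all_commits=$(printf '989c6f7 resolved\n1\t1\tfile\nb9bbaf9 side\n1\t0\tfile\n087b294 master\n1\t0\tfile\n09037ef base\n1\t0\tfile\n')

# Sample data copied from the "git log --first-parent -m --numstat --oneline" output.
first_parent=$(printf '4244c8a resolved\n1\t1\tfile\nb9bbaf9 side\n1\t0\tfile\n09037ef base\n1\t0\tfile\n')

naive=$(printf '%s\n' "$all_commits" | net_lines)
linear=$(printf '%s\n' "$first_parent" | net_lines)

# Walking every commit double-counts the two branches; the first-parent
# walk matches the 2 lines actually present in the file.
echo "all commits: $naive, first-parent: $linear"
```

On a real repository the same helper could be fed directly from the
"git log --first-parent -m --numstat" command shown above.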

Finally, I'll note one other thing in my examples above. Note that I
used a single "git log" invocation, whereas your script reads "rev-list"
output to start a series of N "git show" invocations. I'll bet that took
quite a long time to run on the kernel. :)  Doing it all in one git-log
means you avoid the process startup overhead, and I'd expect it to run
~50 times as fast (at least that was what I saw from a quick experiment
on a much smaller repo).

-Peff

[1] The other solution I thought of was to actually count the SLOC of
    each tree at each commit, rather than worrying about diffs. You can
    make it efficient by caching the SLOC of blob and tree sha1s, so
    you don't count the same files over and over. That gives you
    accurate SLOC-counts at any given point in time, but doesn't address
    the multiple-branches thing (so you'd see your SLOC bounce up and
    down in time as you saw commits for various branches).
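
    A minimal sketch of that caching idea, under some loudly stated
    assumptions: the cache is a hypothetical directory (SLOC_CACHE) with
    one small file per blob id, only blobs are cached (not trees), and
    "SLOC" just means newline count, so binary blobs are counted too. The
    function names are made up for illustration:

```shell
#!/bin/sh
# Hypothetical cache location: one file per blob id, holding its line count.
SLOC_CACHE=${SLOC_CACHE:-.sloc-cache}

# Count the lines of a single blob; this is the expensive step we want to avoid repeating.
count_blob_lines() {
    git cat-file blob "$1" | wc -l
}

# Print the line count for a blob id, computing and storing it only on a cache miss.
cached_blob_lines() {
    mkdir -p "$SLOC_CACHE"
    if [ ! -f "$SLOC_CACHE/$1" ]; then
        count_blob_lines "$1" >"$SLOC_CACHE/$1"
    fi
    cat "$SLOC_CACHE/$1"
}

# Print the total line count of every blob reachable from a commit's tree.
# "git ls-tree -r" prints "<mode> <type> <id>\t<path>"; submodule entries
# (type "commit") are skipped.
tree_sloc() {
    total=0
    for blob in $(git ls-tree -r "$1" | awk '$2 == "blob" { print $3 }'); do
        total=$((total + $(cached_blob_lines "$blob")))
    done
    echo "$total"
}
```

    Calling tree_sloc once per commit would then only run "git cat-file"
    for blobs not seen in an earlier commit, since unchanged files keep
    the same blob id.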


* Re: Graph sloc tool for git repos
  2016-03-14  2:35 ` Jeff King
@ 2016-03-14  2:50   ` Jeff King
  0 siblings, 0 replies; 3+ messages in thread
From: Jeff King @ 2016-03-14  2:50 UTC (permalink / raw)
  To: Kai Hendry; +Cc: git

On Sun, Mar 13, 2016 at 10:35:45PM -0400, Jeff King wrote:

> Like (this is back on the "we resolved as master" version of my example,
> to illustrate how the merge is shown):
> 
>   $ git log --first-parent -m --numstat --oneline
>   4244c8a resolved
>   1       1       file
>   b9bbaf9 side
>   1       0       file
>   09037ef base
>   1       0       file

You'd probably want --reverse, of course, since the point is to build up
the count in the same order as time flows.

So this is the working version I came up with:

    git log --reverse --first-parent -m --format=%ct --numstat |
    perl -lne '
      if (/^\d+$/) {
        if (defined $time) {
          print "$time $total"
        }
        $time = $&;
      } elsif (/^(\d+)\s+(\d+)\s/) {
        $total += $1 - $2;
      }
      END {
        # flush last entry
        print "$time $total";
      }
    '

For my git.git repo, the final line it produces is:

    1457666843 789457

which should be the final sloc-count right now.  If I count the lines in
HEAD, it's close but not quite the same:

   $ git ls-tree -r HEAD |
     awk '{print $3}' |
     xargs -n1 git cat-file blob |
     wc -l
   790437

I'd guess that the difference comes from a few files which are treated
as binary (and thus get "-" in their numstat output, but happen to have
newlines which cause "wc" to increment its count).
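
One way to test that guess is to list the paths that numstat reports as
binary. The sketch below is only an illustration: binary_paths is a
hypothetical helper, and the sample input is invented rather than taken
from git.git:

```shell
#!/bin/sh
# Read numstat output on stdin and print the paths git treated as binary,
# i.e. those whose added and deleted columns are both "-".
binary_paths() {
    awk -F'\t' '$1 == "-" && $2 == "-" { print $3 }'
}

# Invented sample: two text files and one binary file.
sample=$(printf '10\t0\tREADME\n-\t-\tlogo.png\n3\t0\tmain.c\n')
printf '%s\n' "$sample" | binary_paths

# Against a real repository, diffing the empty tree against HEAD lists
# every file as an addition:
#   git diff --numstat "$(git hash-object -t tree /dev/null)" HEAD | binary_paths
```

Running "wc -l" over just those paths would show how much of the gap
between the two totals they account for, if the guess is right.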

-Peff

