* [GSoC] Improving parallelism
From: Felipe Tanus @ 2012-03-17 22:18 UTC (permalink / raw)
To: git
Hi,
I'm looking forward to joining Google Summer of Code through git. A few
words about me: I'm an undergraduate student of Computer Science
at UFRGS, Brazil, and I also work part time from home for the
onthegosystems company. I participated in GSoC in the past two years:
first with Boost, and then with MacPorts; in both years
my projects were evaluated as successful. I have been a git user for nearly 3
years now, and this is my first e-mail to this mailing list. If you
want to know more about me or check some references, please visit the
site in my signature. Feel free to ask any questions.
My proposal will most likely follow one of the proposed ideas, entitled
"Improving parallelism in various commands". I'm very used to C
programming, and pthreads is my friend, so I'm the right guy for this
job. The downside is that I have never looked at the git source code
before, and I expect the most challenging step of the project to be
finding where parallelism can be further exploited. For this, I count on
my skill in C programming and on a good mentor to help me go through the
code and evaluate my ideas.
I find the idea of the proposal straightforward, and I have no doubts
about it, except for which commands I can work on. The idea described
in the wiki says that the commands "git grep --cached" and "git grep
COMMIT" need this improvement, and most likely "git diff" and "git log
-p" need it too. That is a good start, but if you already know of other
commands that might benefit from this parallelism, please tell me so
that I can include them in my proposal. I also plan to use the
community bonding period to look deeper into the code, searching for
what can be improved, and in my schedule I plan to reserve some time at
the start of the coding phase to keep looking into the code and decide with
my mentor which commands will need to be touched.
If you have any idea that could make this project better, or just some
advice for my application, please share it on the list, so that
other people can keep collaborating.
Regards,
--
Felipe de Oliveira Tanus
E-mail: fotanus@gmail.com
Site: http://www.inf.ufrgs.br/~fotanus/
-----
"All we have to decide is what to do with the time that is given us." - Gandalf
* Re: [GSoC] Improving parallelism
From: Nguyen Thai Ngoc Duy @ 2012-03-18 4:42 UTC (permalink / raw)
To: Felipe Tanus; +Cc: git
On Sun, Mar 18, 2012 at 5:18 AM, Felipe Tanus <fotanus@gmail.com> wrote:
> I find the idea of the proposal straight-forward, and no doubts pop up
> in my mind, except on what commands can I work on. The idea described
> in the wiki tells that the commands "git grep --cached" and "git grep
> COMMIT" need this improvement, and most likely "git diff" and "git log
Note that if you improve the diff machinery, many commands will benefit
(add, apply, checkout, merge, status).
> -p" need too. That is a good start, but if you know already other
> commands that might benefit from this parallelism, please tell me in
> order for me to include in my proposal.
"git blame" (I think, I don't use this command much) and "git fsck".
"git index-pack" is getting multithread support soon (you can search
the mail archive), but even then I think there's still room for further
improvements (i.e. parallelize the hashing code in the first phase of
checking the pack).
If that's not enough, you may want to investigate whether multithread
support can speed up "git rev-list --objects --all" without adding too
much complexity. Speeding this up can also be achieved by implementing
pack format version 4 (current version is 3). But that's a bigger piece of work
and may need more time to land.
--
Duy
* Re: [GSoC] Improving parallelism
From: Felipe Tanus @ 2012-03-18 4:58 UTC (permalink / raw)
To: Nguyen Thai Ngoc Duy; +Cc: git
On Sun, Mar 18, 2012 at 1:42 AM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> On Sun, Mar 18, 2012 at 5:18 AM, Felipe Tanus <fotanus@gmail.com> wrote:
>> but if you know already other
>> commands that might benefit from this parallelism, please tell me in
>> order for me to include in my proposal.
>
> "git blame" (I think, I don't use this command much) and "git fsck".
> "git index-pack" is getting multithread support soon (you can search
> mail archive), but even then I think there's still room for further
> improvements (i.e. parallelize the hashing code in the first phase of
> checking the pack).
>
> If that's not enough, you may want to investigate whether multithread
> support can speed up "git rev-list --objects --all" without adding too
> much complexity. Speeding up this can also be achieved by implementing
> pack format version 4 (current version is 3). But that's a bigger work
> and may need more time to land.
> --
> Duy
Hi Duy,
Thanks for the answer, it was very helpful. I'll check these commands and
see what I can complete in time to add to the proposal.
Regards,
--
Felipe de Oliveira Tanus
E-mail: fotanus@gmail.com
Site: http://www.inf.ufrgs.br/~fotanus/
-----
"All we have to decide is what to do with the time that is given us." - Gandalf
* Re: [GSoC] Improving parallelism
From: Thomas Rast @ 2012-03-21 12:45 UTC (permalink / raw)
To: Felipe Tanus; +Cc: git
Felipe Tanus <fotanus@gmail.com> writes:
> My proposal will most likely follow one of the proposed idea entitled
> "Improving parallelism in various commands". I'm very used to C
> programming, and pthreads is my friend, so I'm the right guy for this
> job. The downside is that I never looked at the git source code
> before, and I expect the most challenging step from the project is to
> find where parallelism can be further explored. For this, I count on
> my skill in C programming, a good mentor to help me to go through the
> code and evaluate my ideas.
>
> I find the idea of the proposal straight-forward, and no doubts pop up
> in my mind, except on what commands can I work on. The idea described
> in the wiki tells that the commands "git grep --cached" and "git grep
> COMMIT" need this improvement, and most likely "git diff" and "git log
> -p" need too. That is a good start, but if you know already other
> commands that might benefit from this parallelism, please tell me in
> order for me to include in my proposal.
As the ideas page says the steps are (the original wording was that it
would have 2.5 steps, hence "the half-step"):
0. In preparation (the half-step): identify commands that could benefit
from parallelism. git grep --cached and git grep COMMIT come to
mind, but most likely also git diff and git log -p. You can probably
find more.
1. Rework the pack access mechanisms to allow the maximum possible
parallel access.
2. Rework the commands found in the first step to use parallel pack
access if possible. Along the way, document the improvements with
performance tests.
I think (1.) is the most important part simply because without (1.) the
other two are totally meaningless. So I'd rather you not focus too hard
on the command list. However, correctly identifying more commands where
pack access is the hotspot, and backing that up with numbers, may be a
good way to show your understanding of the matter.
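To illustrate why step 1 matters, here is a minimal sketch in C with
pthreads. It is not git code: `struct store`, `read_blob()` and
`count_hits()` are hypothetical names, and the in-memory array of strings
only stands in for pack data. The assumption it models is that every object
read has to take one global lock, so only the search itself runs in
parallel.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16

/* Hypothetical stand-in for an object store that is not thread-safe. */
struct store {
	const char **blobs;
	size_t nr;
	pthread_mutex_t read_lock;	/* serializes every read */
};

struct job {
	struct store *store;
	char needle;
	size_t next;			/* next blob index to hand out */
	size_t hits;			/* matches found so far */
	pthread_mutex_t lock;		/* protects next and hits */
};

/* Every read goes through one lock: the bottleneck that step 1 targets. */
static const char *read_blob(struct store *s, size_t i)
{
	const char *data;

	pthread_mutex_lock(&s->read_lock);
	data = s->blobs[i];
	pthread_mutex_unlock(&s->read_lock);
	return data;
}

static void *worker(void *arg)
{
	struct job *job = arg;

	for (;;) {
		size_t i, local = 0;
		const char *p;

		/* Claim the next unprocessed blob, if any is left. */
		pthread_mutex_lock(&job->lock);
		i = job->next;
		if (i < job->store->nr)
			job->next++;
		pthread_mutex_unlock(&job->lock);
		if (i >= job->store->nr)
			break;

		/* The search runs without holding any lock. */
		for (p = read_blob(job->store, i); *p; p++)
			if (*p == job->needle)
				local++;

		pthread_mutex_lock(&job->lock);
		job->hits += local;
		pthread_mutex_unlock(&job->lock);
	}
	return NULL;
}

/* Count occurrences of needle in all blobs using up to nr_threads threads. */
size_t count_hits(const char **blobs, size_t nr, char needle, int nr_threads)
{
	struct store store;
	struct job job;
	pthread_t threads[MAX_THREADS];
	int i, started = 0;

	assert(nr_threads > 0 && nr_threads <= MAX_THREADS);
	store.blobs = blobs;
	store.nr = nr;
	pthread_mutex_init(&store.read_lock, NULL);
	job.store = &store;
	job.needle = needle;
	job.next = 0;
	job.hits = 0;
	pthread_mutex_init(&job.lock, NULL);

	for (i = 0; i < nr_threads; i++)
		if (!pthread_create(&threads[started], NULL, worker, &job))
			started++;
	if (!started)
		worker(&job);	/* no thread could start: run in this one */
	for (i = 0; i < started; i++)
		pthread_join(threads[i], NULL);

	pthread_mutex_destroy(&job.lock);
	pthread_mutex_destroy(&store.read_lock);
	return job.hits;
}
```

Calling `count_hits(blobs, nr, 'a', 4)` gives the same total as a
single-threaded run. If the time spent inside `read_blob()` dominated, as it
might when pack access is the hotspot, adding threads would gain little
until that lock is made finer-grained or removed.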
For further reading, you should start with the discussions surrounding
git-grep threading around
http://thread.gmane.org/gmane.comp.version-control.git/185932/focus=186217
http://thread.gmane.org/gmane.comp.version-control.git/186618
http://thread.gmane.org/gmane.comp.version-control.git/188701/focus=189592
etc.
--
Thomas Rast
trast@{inf,student}.ethz.ch
* Re: [GSoC] Improving parallelism
From: Felipe Tanus @ 2012-03-21 18:06 UTC (permalink / raw)
To: Thomas Rast; +Cc: git
On Wed, Mar 21, 2012 at 9:45 AM, Thomas Rast <trast@student.ethz.ch> wrote:
> Felipe Tanus <fotanus@gmail.com> writes:
[...]
>
> 1. Rework the pack access mechanisms to allow the maximum possible
> parallel access.
>
[...]
> I think (1.) is the most important part simply because without (1.) the
> other two are totally meaningless. So I'd rather you not focus too hard
> on the command list. However, correctly identifying more commands where
> pack access is the hotspot, and backing that up with numbers, may be a
> good way to show your understanding of the matter.
>
> For further reading, you should start with the discussions surrounding
> git-grep threading around
>
[...]
Hi Thomas, thank you for your answer, and more than that, thank you for
pointing me in the right direction :-)
I followed the discussions you pointed out and they were very helpful. If
I understand correctly, what you expect to see in the proposal is an analysis
of what should be modified in the pack access mechanism, because it
will be used by the commands we will work on, and so it should be
thread safe and perform better. To achieve this I'll have
to go through the sha1_file.c file and come up with an idea of how to
improve its parallelism. Is that what you expect for good work
this summer?
Regards,
--
Felipe de Oliveira Tanus
E-mail: fotanus@gmail.com
Site: http://www.inf.ufrgs.br/~fotanus/
-----
"All we have to decide is what to do with the time that is given us." - Gandalf