RFC: One strange idea inspired by the SSSA

linux-sparse.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* RFC: One strange idea inspired by the SSSA
@ 2017-08-21 13:28 Christopher Li
  2017-08-21 14:14 ` Luc Van Oostenryck
  0 siblings, 1 reply; 9+ messages in thread
From: Christopher Li @ 2017-08-21 13:28 UTC (permalink / raw)
  To: Linux-Sparse; +Cc: Luc Van Oostenryck

The SSSA paper idea is interesting. It get the simple case
handle very fast. However, I see it has two draw backs.
1) Creating the redundant phi node and using per pseudo
    ptrmap[block] -> pesudo mapping has over head in terms
    of memory allocation and execution.
2) it does not handle the case access memory via pointers.
    This draw back is a major one.
    This means we need to have other means of promoting
    memory access to pseudo, (may be using classing SSA
    conversion?)

My have a hunch that the SSA promoting from the instruction
level is kind of unavoidable. e.g. some promotion might happen
after CSE.

If we take that as an assumption. Then I have this idea similar
to the SSSA paper in the sense that it take advantage of the
source file that is reducible graph. Instead of inserting the phi
instruction on all possible block join point. We can do some thing
else that can be beneficial to the classic SSA promotion algorithm.
We can mark the DF local defined in the Appel "morden
compiler implementation in java" page 406.

DF local(n) define as: The successor of n that is not strictly
dominated by n.

I have this observation that, for the function without goto,
(it can have break/continue/return,  I need to think more about
switch, case is like goto)

the DF local(n) is actually trivial without goto. When we do
the linearization. We already know where the DF local(n) are.

Let's image the source code has been properly format by
indent. Without goto, the code block that has indentation
level 1 (top level statements inside the function)  will dominate
the rest of the code(which have a bigger line number) in that function.

For statement if else, for/while loop. The body of those statement
will have a indentation level > 1. It means variable define there
will not dominate the rest of code. But we know exactly where the
domination ends. It is the end of this indentation level with in the
scope.

So if we mark those DF local(n) for a block during linearization.
Then we don't need to go through the dominating *tree* to find
DF(n). We know that during linearization. This can serve as
some thing similar to the "single store short cut", except that
it can place the phi node properly.

Another thing we can do is that, we can try to handle the
common case using goto for bail out. In that case it
is a DAG does not create loops. The DF local(n) might still
be trivial. My guess is that swith/case statement can be handle
in similar way. Case statement  is a special kind of goto that
does not create loops. I need to think more about that.

Of course that is the optimal case the function does not have
goto, or only have goto that jump to higher level indentation
in the same coding block. That is the short cut path.

If the function has not friendly gotos. we let the classic
SSA algorithm handle it. Which means it will build DF without
assumption on the CFG. If there is a level 1 indentation block
and there is no goto fly over it. That block can still serve as
splitting point. The function can be split as the top half and
bottom half. Calculate dominating tree  in the two smaller graph.
This process can happen recursively. This will break the
very long function into chain of much smaller CFG, which are
easier to calculate.  This will avoid the long chain problem (if
possible) in paper "A Simple, Fast Dominance Algorithm" by
Cooper et al.

This is my original idea, inspired by the SSSA paper.  I am sure
my first write up did not explain this idea well. Please ask question
if you have one.

Feed back?

Chris

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: RFC: One strange idea inspired by the SSSA
  2017-08-21 13:28 RFC: One strange idea inspired by the SSSA Christopher Li
@ 2017-08-21 14:14 ` Luc Van Oostenryck
  2017-08-21 16:40   ` Christopher Li
  0 siblings, 1 reply; 9+ messages in thread
From: Luc Van Oostenryck @ 2017-08-21 14:14 UTC (permalink / raw)
  To: Christopher Li; +Cc: Linux-Sparse

On Mon, Aug 21, 2017 at 3:28 PM, Christopher Li <sparse@chrisli.org> wrote:
> The SSSA paper idea is interesting. It get the simple case
> handle very fast. However, I see it has two draw backs.
> 1) Creating the redundant phi node and using per pseudo
>     ptrmap[block] -> pesudo mapping has over head in terms
>     of memory allocation and execution.

I've checked yesterday evening on some big functions and the
ptrmap use 26 times less memory than the ptrlist. I then added
the call needed to free all ptrmap as soon as they're not needed.
I really think this memory for ptrmap is a non-issue.

> 2) it does not handle the case access memory via pointers.
>     This draw back is a major one.
>     This means we need to have other means of promoting
>     memory access to pseudo, (may be using classing SSA
>     conversion?)
>
>
> My have a hunch that the SSA promoting from the instruction
> level is kind of unavoidable. e.g. some promotion might happen
> after CSE.

Right.

> If we take that as an assumption. Then I have this idea similar
> to the SSSA paper in the sense that it take advantage of the
> source file that is reducible graph. Instead of inserting the phi
> instruction on all possible block join point.

Mind to explain what you mean by 'all possible block join point' and
why you believe the method do this?

> We can do some thing
> else that can be beneficial to the classic SSA promotion algorithm.
> We can mark the DF local defined in the Appel "morden
> compiler implementation in java" page 406.
>
> DF local(n) define as: The successor of n that is not strictly
> dominated by n.

Don't you need the whole CFG for this (when you have
a loop, for example)?

> ...
>
> So if we mark those DF local(n) for a block during linearization.
> Then we don't need to go through the dominating *tree* to find
> DF(n). We know that during linearization. This can serve as
> some thing similar to the "single store short cut", except that
> it can place the phi node properly.

The SSSA method already do something very close to the
'single-store shortcut'. I strongly encourage you to play a bit
with 'test-linearize -fdump-linearize=only' on some examples.

Also, we must not forget that it's all a question of trade-off.
The advantage of this method is that it can do a good part
of the job on-the-fly, without needing the CFG. With the CFG
you can do some smarter things but then you first need to build
everything blindly and only then you can begin to be smarter.

It's the same if you want pruned SSA. Then you only have
to do the conversion for live variables but ... you first need
to calculate the liveness information. Idem about the
dominance graph.

-- Luc

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: RFC: One strange idea inspired by the SSSA
  2017-08-21 14:14 ` Luc Van Oostenryck
@ 2017-08-21 16:40   ` Christopher Li
  2017-08-21 16:57     ` Luc Van Oostenryck
  0 siblings, 1 reply; 9+ messages in thread
From: Christopher Li @ 2017-08-21 16:40 UTC (permalink / raw)
  To: Luc Van Oostenryck; +Cc: Linux-Sparse

On Mon, Aug 21, 2017 at 10:14 AM, Luc Van Oostenryck
<luc.vanoostenryck@gmail.com> wrote:
> On Mon, Aug 21, 2017 at 3:28 PM, Christopher Li <sparse@chrisli.org> wrote:
>> The SSSA paper idea is interesting. It get the simple case
>> handle very fast. However, I see it has two draw backs.
>> 1) Creating the redundant phi node and using per pseudo
>>     ptrmap[block] -> pesudo mapping has over head in terms
>>     of memory allocation and execution.
>
> I've checked yesterday evening on some big functions and the
> ptrmap use 26 times less memory than the ptrlist. I then added
> the call needed to free all ptrmap as soon as they're not needed.
> I really think this memory for ptrmap is a non-issue.

Do you have insight then why the SSSA is 2% slower than
the RC5 without the short cut case? I expect with simpler
effective approach it should be at least similar or faster.
It do less work than RC5 in terms of finding dorminator etc.

>> My have a hunch that the SSA promoting from the instruction
>> level is kind of unavoidable. e.g. some promotion might happen
>> after CSE.
>
> Right.

Good

>
>> If we take that as an assumption. Then I have this idea similar
>> to the SSSA paper in the sense that it take advantage of the
>> source file that is reducible graph. Instead of inserting the phi
>> instruction on all possible block join point.
>
> Mind to explain what you mean by 'all possible block join point' and
> why you believe the method do this?

I am reading the paper there seems to be recursive go through
the parents. My understanding might be wrong. I am trying to
think why SSSA is slower.

readVariable(variable, block):
    if currentDef[variable] contains block:
    # local value numbering
        return currentDef[variable][block]
    # global value numbering
    return readVariableRecursive(variable, block)  <==============

readVariableRecursive(variable, block):
   if block not in sealedBlocks:
        # Incomplete CFG
        val ← new Phi(block)
        incompletePhis[block][variable] ← val
   else if |block.preds| = 1:
       # Optimize the common case of one predecessor: No phi needed
       val ← readVariable(variable, block.preds[0]) <========== recursion?
   else:
       # Break potential cycles with operandless phi
       val ← new Phi(block)
       writeVariable(variable, block, val)
   val ← addPhiOperands(variable, val)
   writeVariable(variable, block, val)
   return val

>
>> We can do some thing
>> else that can be beneficial to the classic SSA promotion algorithm.
>> We can mark the DF local defined in the Appel "morden
>> compiler implementation in java" page 406.
>>
>> DF local(n) define as: The successor of n that is not strictly
>> dominated by n.
>
> Don't you need the whole CFG for this (when you have
> a loop, for example)?

I haven't play out all the detail yet. The idea about the loop:

while (expr_a()) {
      loop_body_blocks;
}
after_loop_blok;

We known that every variable declare in the loop_body_blocks
with indentation level == 2 is going to dominate the rest of the
loop_body_blocks where line number > the current define place.

And we know that that define domination ends at the last block
in side loop_body_blocks for the same indentation level.

We can have a stack system while we recursive linearize
the statement. That is the idea. I haven't plan out the detail.

>
>> ...
>>
>> So if we mark those DF local(n) for a block during linearization.
>> Then we don't need to go through the dominating *tree* to find
>> DF(n). We know that during linearization. This can serve as
>> some thing similar to the "single store short cut", except that
>> it can place the phi node properly.
>
> The SSSA method already do something very close to the
> 'single-store shortcut'. I strongly encourage you to play a bit
> with 'test-linearize -fdump-linearize=only' on some examples.
>
> Also, we must not forget that it's all a question of trade-off.
> The advantage of this method is that it can do a good part
> of the job on-the-fly, without needing the CFG. With the CFG

Why is SSSA slower that is part I don't understand.
In the method I describe, there is no recursion more than
what is needed for the linearization. It should do better than
SSSA in micro benchmarks if that idea is implementable.

> you can do some smarter things but then you first need to build
> everything blindly and only then you can begin to be smarter.

You don't need to be blindly, that is my strange idea.
When you are building the loop body blocks. You can already
see in this indentation level, where the domination ends.
When you recursive to deeper indentation level, you can
save your context of the understanding into a stack. Each
indentation level have their own context stack. They can
reference to the upper level's context for bail out etc.

So if there is no gotos. Every linearization should already
know where the domination terminates (from the context stack).

> It's the same if you want pruned SSA. Then you only have
> to do the conversion for live variables but ... you first need
> to calculate the liveness information. Idem about the
> dominance graph.

My strange idea, if work out, can actually help build the
dominance graph on the fly. The only weakness is that
if there is crazy goto jump to other indentation level not
even in the context stack, then we need to do the dominator
tree the old fashion way.

Chris

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: RFC: One strange idea inspired by the SSSA
  2017-08-21 16:40   ` Christopher Li
@ 2017-08-21 16:57     ` Luc Van Oostenryck
  2017-08-21 17:42       ` Christopher Li
  0 siblings, 1 reply; 9+ messages in thread
From: Luc Van Oostenryck @ 2017-08-21 16:57 UTC (permalink / raw)
  To: Christopher Li; +Cc: Linux-Sparse

On Mon, Aug 21, 2017 at 6:40 PM, Christopher Li <sparse@chrisli.org> wrote:
>
> Do you have insight then why the SSSA is 2% slower than
> the RC5 without the short cut case? I expect with simpler
> effective approach it should be at least similar or faster.
> It do less work than RC5 in terms of finding dorminator etc.

I just made a small change to the ptrmap and I'm back to slightly
faster than rc5. With all March's pending fixes it's even a bit faster.
We're talking of something like twice 2%.
With smaller ptrlist it's yet another 2%.


>>> If we take that as an assumption. Then I have this idea similar
>>> to the SSSA paper in the sense that it take advantage of the
>>> source file that is reducible graph. Instead of inserting the phi
>>> instruction on all possible block join point.
>>
>> Mind to explain what you mean by 'all possible block join point' and
>> why you believe the method do this?
>
> I am reading the paper there seems to be recursive go through
> the parents.

Otherwise you could only process block local vars.
But going recursively through the parents is not something
expensive, especially since very few is done at each steps.

>> Don't you need the whole CFG for this (when you have
>> a loop, for example)?
>
> I haven't play out all the detail yet. The idea about the loop:
>
> while (expr_a()) {
>       loop_body_blocks;
> }
> after_loop_blok;
>
> We known that every variable declare in the loop_body_blocks
> with indentation level == 2 is going to dominate the rest of the
> loop_body_blocks where line number > the current define place.
>
> And we know that that define domination ends at the last block
> in side loop_body_blocks for the same indentation level.
>
> We can have a stack system while we recursive linearize
> the statement. That is the idea. I haven't plan out the detail.

Sorry, but for my part I'll give my trust to a published and well
received method.

-- Luc

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: RFC: One strange idea inspired by the SSSA
  2017-08-21 16:57     ` Luc Van Oostenryck
@ 2017-08-21 17:42       ` Christopher Li
  2017-08-21 19:02         ` Luc Van Oostenryck
  0 siblings, 1 reply; 9+ messages in thread
From: Christopher Li @ 2017-08-21 17:42 UTC (permalink / raw)
  To: Luc Van Oostenryck; +Cc: Linux-Sparse

On Mon, Aug 21, 2017 at 12:57 PM, Luc Van Oostenryck
<luc.vanoostenryck@gmail.com> wrote:
> I just made a small change to the ptrmap and I'm back to slightly
> faster than rc5. With all March's pending fixes it's even a bit faster.
> We're talking of something like twice 2%.

Do you have a link for testing? I can run that on my compile server
and report back with testing my numbers.

> With smaller ptrlist it's yet another 2%.

I think the RC5 with the short cut removed is all already a low
starting point. It should be faster.  A cytron el al should be faster
than RC5 too.

> Sorry, but for my part I'll give my trust to a published and well
> received method.

I understand you don't want to think some thing that is new
and unproven. But that is all new ideas are about.

In your reasoning we might want to consider Cytron et al then.
It is published and much more well known in the field. It give
minimal SSA at smaller overhead than SSSA (if we want minimal
SSA from SSSA). It does not have the 2) problem SSSA has.

Also variance of Cytron el al has been implemented in LLVM and
gcc. Due to the 2) problem of SSSA, we might need to implement
it any way.

Chris

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: RFC: One strange idea inspired by the SSSA
  2017-08-21 17:42       ` Christopher Li
@ 2017-08-21 19:02         ` Luc Van Oostenryck
  2017-08-21 19:27           ` Christopher Li
  2017-08-21 19:50           ` Christopher Li
  0 siblings, 2 replies; 9+ messages in thread
From: Luc Van Oostenryck @ 2017-08-21 19:02 UTC (permalink / raw)
  To: Christopher Li; +Cc: Linux-Sparse

On Mon, Aug 21, 2017 at 7:42 PM, Christopher Li <sparse@chrisli.org> wrote:
> On Mon, Aug 21, 2017 at 12:57 PM, Luc Van Oostenryck wrote:
>> I just made a small change to the ptrmap and I'm back to slightly
>> faster than rc5. With all March's pending fixes it's even a bit faster.
>> We're talking of something like twice 2%.
>
> Do you have a link for testing? I can run that on my compile server
> and report back with testing my numbers.

I'll send this very soon.

>> With smaller ptrlist it's yet another 2%.
>
> I think the RC5 with the short cut removed is all already a low
> starting point. It should be faster.

Do you have some numbers for rc4 vs. rc5 ?

> A cytron el al should be faster
> than RC5 too.
>
>> Sorry, but for my part I'll give my trust to a published and well
>> received method.
>
> I understand you don't want to think some thing that is new
> and unproven. But that is all new ideas are about.
>
> In your reasoning we might want to consider Cytron et al then.

Sorry but comparing Braun & Hack's method with Cytron's
is not exactly the same as comparing one of these methods
with some new method by yourself.

Plans, ideas, 'should', 'ideal', 'optimal' and ponies are all good
but at some point you need to implement this and have good
confidence that it's working correctly.

-- Luc

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: RFC: One strange idea inspired by the SSSA
  2017-08-21 19:02         ` Luc Van Oostenryck
@ 2017-08-21 19:27           ` Christopher Li
  2017-08-21 19:50           ` Christopher Li
  1 sibling, 0 replies; 9+ messages in thread
From: Christopher Li @ 2017-08-21 19:27 UTC (permalink / raw)
  To: Luc Van Oostenryck; +Cc: Linux-Sparse

On Mon, Aug 21, 2017 at 3:02 PM, Luc Van Oostenryck
<luc.vanoostenryck@gmail.com> wrote:
> On Mon, Aug 21, 2017 at 7:42 PM, Christopher Li <sparse@chrisli.org> wrote:
>> On Mon, Aug 21, 2017 at 12:57 PM, Luc Van Oostenryck wrote:
>>> I just made a small change to the ptrmap and I'm back to slightly
>>> faster than rc5. With all March's pending fixes it's even a bit faster.
>>> We're talking of something like twice 2%.
>>
>> Do you have a link for testing? I can run that on my compile server
>> and report back with testing my numbers.
>
> I'll send this very soon.

Great. Looking forward to it.

> Do you have some numbers for rc4 vs. rc5 ?

I will do it now.

>> In your reasoning we might want to consider Cytron et al then.
>
> Sorry but comparing Braun & Hack's method with Cytron's
> is not exactly the same as comparing one of these methods
> with some new method by yourself.

Of course. That is just *ideas*. I am trying to see if there is
some serious fault in the ideas that it is not implementable.
Consider it as some brain storming.

I am not purposing do that idea right now. The development plan
I have in mind was:
1) merge your SSSA patches.
2) enable the old SSA promotion code path (with bugs) to cover the
    case not handle by SSSA.
3) Fix the bug in old SSA promotion path, possibly implement Cytron
   el al.

> Plans, ideas, 'should', 'ideal', 'optimal' and ponies are all good
> but at some point you need to implement this and have good
> confidence that it's working correctly.

Of course. We can implement Cytron first and use that as
verification path for other idea to optimized it. That is why I want
to have debug version of sparse as well.

Chris

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: RFC: One strange idea inspired by the SSSA
  2017-08-21 19:02         ` Luc Van Oostenryck
  2017-08-21 19:27           ` Christopher Li
@ 2017-08-21 19:50           ` Christopher Li
  2017-08-21 20:37             ` Luc Van Oostenryck
  1 sibling, 1 reply; 9+ messages in thread
From: Christopher Li @ 2017-08-21 19:50 UTC (permalink / raw)
  To: Luc Van Oostenryck; +Cc: Linux-Sparse

On Mon, Aug 21, 2017 at 3:02 PM, Luc Van Oostenryck
<luc.vanoostenryck@gmail.com> wrote:
>
> Do you have some numbers for rc4 vs. rc5 ?
>
============rc4 ====================
1166.81user 437.91system 1:14.20elapsed 2162%CPU (0avgtext+0avgdata
238036maxresident)k
8inputs+12776outputs (0major+128364597minor)pagefaults 0swaps
1166.99user 436.69system 1:14.23elapsed 2160%CPU (0avgtext+0avgdata
238100maxresident)k
0inputs+12776outputs (0major+128364667minor)pagefaults 0swaps

============rc5====================
1175.79user 438.58system 1:14.59elapsed 2164%CPU (0avgtext+0avgdata
238028maxresident)k
0inputs+12768outputs (0major+128881556minor)pagefaults 0swaps
1175.76user 439.47system 1:14.55elapsed 2166%CPU (0avgtext+0avgdata
238024maxresident)k
0inputs+12768outputs (0major+128881539minor)pagefaults 0swaps
1174.53user 438.69system 1:14.54elapsed 2164%CPU (0avgtext+0avgdata
238112maxresident)k
0inputs+12768outputs (0major+128880926minor)pagefaults 0swaps

I really enjoy this server. The timing run is very precise.

Chris

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: RFC: One strange idea inspired by the SSSA
  2017-08-21 19:50           ` Christopher Li
@ 2017-08-21 20:37             ` Luc Van Oostenryck
  0 siblings, 0 replies; 9+ messages in thread
From: Luc Van Oostenryck @ 2017-08-21 20:37 UTC (permalink / raw)
  To: Christopher Li; +Cc: Linux-Sparse

On Mon, Aug 21, 2017 at 9:50 PM, Christopher Li <sparse@chrisli.org> wrote:
> On Mon, Aug 21, 2017 at 3:02 PM, Luc Van Oostenryck wrote:
>>
>> Do you have some numbers for rc4 vs. rc5 ?
>>
> ============rc4 ====================
> 1166.81user 437.91system 1:14.20elapsed 2162%CPU (0avgtext+0avgdata
> 238036maxresident)k
> 8inputs+12776outputs (0major+128364597minor)pagefaults 0swaps
> 1166.99user 436.69system 1:14.23elapsed 2160%CPU (0avgtext+0avgdata
> 238100maxresident)k
> 0inputs+12776outputs (0major+128364667minor)pagefaults 0swaps
>
> ============rc5====================
> 1175.79user 438.58system 1:14.59elapsed 2164%CPU (0avgtext+0avgdata
> 238028maxresident)k
> 0inputs+12768outputs (0major+128881556minor)pagefaults 0swaps
> 1175.76user 439.47system 1:14.55elapsed 2166%CPU (0avgtext+0avgdata
> 238024maxresident)k
> 0inputs+12768outputs (0major+128881539minor)pagefaults 0swaps
> 1174.53user 438.69system 1:14.54elapsed 2164%CPU (0avgtext+0avgdata
> 238112maxresident)k
> 0inputs+12768outputs (0major+128880926minor)pagefaults 0swaps

Thanks!
So that make rc5 taking 0.4 - 0.7% more time than rc4.

> I really enjoy this server. The timing run is very precise.

Yes, it's certainly quite useful to have very repeatable measurements.

-- Luc

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2017-08-21 20:37 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-08-21 13:28 RFC: One strange idea inspired by the SSSA Christopher Li
2017-08-21 14:14 ` Luc Van Oostenryck
2017-08-21 16:40   ` Christopher Li
2017-08-21 16:57     ` Luc Van Oostenryck
2017-08-21 17:42       ` Christopher Li
2017-08-21 19:02         ` Luc Van Oostenryck
2017-08-21 19:27           ` Christopher Li
2017-08-21 19:50           ` Christopher Li
2017-08-21 20:37             ` Luc Van Oostenryck

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).