synchronize with a non-atomic flag

Discussions of the Parallel Programming book

* synchronize with a non-atomic flag
@ 2017-10-06  5:52 Yubin Ruan
  2017-10-06 12:03 ` Akira Yokosawa
  2017-10-08  9:12 ` Yubin Ruan
  0 siblings, 2 replies; 13+ messages in thread
From: Yubin Ruan @ 2017-10-06  5:52 UTC (permalink / raw)
  To: perfbook

Hi,
I have seen lots of discussions on the web about possible races when
synchronizing multiple threads/processes with locks or atomic
operations[1][2]. From my point of view, most of them are over-worrying.
But I want to raise one particular case here to see whether
anyone has anything to say.

Imagine two processes communicating through only a uint32_t variable in
shared memory, like this:

    // uint32_t variable in shared memory
    uint32_t flag = 0;

    //process 1
    while(1) {
        if(READ_ONCE(flag) == 0) {
            do_something();
            WRITE_ONCE(flag, 1); // let another process to run
        } else {
            continue;
        }
    }

    //process 2
    while(1) {
        if(READ_ONCE(flag) == 1) {
            printf("process 2 running...\n");
            WRITE_ONCE(flag, 0); // let another process to run
        } else {
            continue;
        }
    }

On X86 or X64, I expect this code to run correctly, that is, I will
get the two printf() calls printing one after the other. That is because:

    1) on X86/X64, loads and stores of a 32-bit variable are atomic
    2) I use READ_ONCE/WRITE_ONCE to prevent possibly harmful compiler
optimizations on `flag'.
    3) I use only one variable to communicate between the two processes,
so there is no need for any kind of barrier.
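
For concreteness: READ_ONCE/WRITE_ONCE are kernel macros, so in a
userspace program they have to be supplied. Below is a minimal sketch of
what I have in mind, assuming GCC or Clang (for __typeof__). The macro
bodies are simplified stand-ins, not the kernel's actual definitions, and
try_turn() is a hypothetical helper that performs one iteration of the
loops above. The volatile cast only constrains the compiler; it emits no
CPU barrier.

```c
#include <stdint.h>

/*
 * Simplified userspace stand-ins for the kernel macros (assumption: the
 * real kernel definitions are more elaborate). The volatile cast makes
 * the compiler emit exactly one access; it adds no CPU barrier.
 */
#define READ_ONCE(x)     (*(volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

static uint32_t flag; /* would live in shared memory in the real setup */

/* Take one turn if the flag says it is ours; return 1 if we ran. */
static int try_turn(uint32_t mine, uint32_t next)
{
    if (READ_ONCE(flag) != mine)
        return 0;
    /* do_something() or printf() would go here */
    WRITE_ONCE(flag, next);
    return 1;
}
```

Process 1 would loop on try_turn(0, 1) and process 2 on try_turn(1, 0).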

Does anyone have any objection to that?

I know that using a lock or an atomic operation would save me a lot of
argument, but I think those are unnecessary in this case, and it
matters where performance matters, so I am being picky here...

Yubin

[1]: https://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
[2]: https://www.usenix.org/conference/osdi10/ad-hoc-synchronization-considered-harmful


* Re: synchronize with a non-atomic flag
  2017-10-06  5:52 synchronize with a non-atomic flag Yubin Ruan
@ 2017-10-06 12:03 ` Akira Yokosawa
  2017-10-06 12:35   ` Yubin Ruan
  2017-10-08  9:12 ` Yubin Ruan
  1 sibling, 1 reply; 13+ messages in thread
From: Akira Yokosawa @ 2017-10-06 12:03 UTC (permalink / raw)
  To: Yubin Ruan; +Cc: perfbook, Paul E. McKenney, Akira Yokosawa

Hi Yubin,

On 2017/10/06 14:52, Yubin Ruan wrote:
> Hi,
> I saw lots of discussions on the web about possible race when doing
> synchronization between multiple threads/processes with lock or atomic
> operations[1][2]. From my point of view most them are over-worrying.
> But I want to point out some particular issue here to see whether
> anyone have anything to say.
> 
> Imagine two processes communicate using only a uint32_t variable in
> shared memory, like this:
> 
>     // uint32_t variable in shared memory
>     uint32_t flag = 0;
> 
>     //process 1
>     while(1) {
>         if(READ_ONCE(flag) == 0) {
>             do_something();
>             WRITE_ONCE(flag, 1); // let another process to run
>         } else {
>             continue;
>         }
>     }
> 
>     //process 2
>     while(1) {
>         if(READ_ONCE(flag) == 1) {
>             printf("process 2 running...\n");
>             WRITE_ONCE(flag, 0); // let another process to run
>         } else {
>             continue;
>         }
>     }
> 
> On X86 or X64, I expect this code to run correctly, that is, I will
> got the two `printf' to printf one after one.

Well, I see only one printf() above.
Do you mean:

    //process 1
    while(1) {
        if(READ_ONCE(flag) == 0) {
            printf("process 1 running...\n");
            WRITE_ONCE(flag, 1); // let another process to run
        } else {
            continue;
        }
    }

    //process 2
    while(1) {
        if(READ_ONCE(flag) == 1) {
            printf("process 2 running...\n");
            WRITE_ONCE(flag, 0); // let another process to run
        } else {
            continue;
        }
    }

?

Then the printf()s can be a problem.
They partially negate your claim 3).
Without a memory barrier, there is no guarantee that the result of
WRITE_ONCE() becomes visible to the other thread only after printf()'s
memory accesses complete. I/O operations in printf() might make the
situation trickier.

In a more realistic case where you do something meaningful in
do_something() in both threads:

    //process 1
    while(1) {
        if(READ_ONCE(flag) == 0) {
            do_something();
            WRITE_ONCE(flag, 1); // let another process to run
        } else {
            continue;
        }
    }

    //process 2
    while(1) {
        if(READ_ONCE(flag) == 1) {
            do_something();
            WRITE_ONCE(flag, 0); // let another process to run
        } else {
            continue;
        }
    }

and if do_something() uses some shared variables other than "flag",
you need a couple of memory barriers to ensure the ordering of
READ_ONCE(), do_something(), and WRITE_ONCE(), something like:

    //process 1
    while(1) {
        if(READ_ONCE(flag) == 0) {
            smp_rmb();
            do_something();
            smp_wmb();
            WRITE_ONCE(flag, 1); // let another process to run
        } else {
            continue;
        }
    }

    //process 2
    while(1) {
        if(READ_ONCE(flag) == 1) {
            smp_rmb();
            do_something();
            smp_wmb();
            WRITE_ONCE(flag, 0); // let another process to run
        } else {
            continue;
        }
    }

In the Linux kernel memory model, you can use acquire/release APIs instead:

    //process 1
    while(1) {
        if(smp_load_acquire(&flag) == 0) {
            do_something();
            smp_store_release(&flag, 1); // let another process to run
        } else {
            continue;
        }
    }

    //process 2
    while(1) {
        if(smp_load_acquire(&flag) == 1) {
            do_something();
            smp_store_release(&flag, 0); // let another process to run
        } else {
            continue;
        }
    }

The intention of the code is easier to see when you use well-defined APIs.
Just my two cents.
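
Outside the kernel, smp_load_acquire()/smp_store_release() are not
available. As a rough userspace counterpart, here is a minimal sketch
using C11 <stdatomic.h> and pthreads. It uses threads rather than
processes to stay self-contained, and ROUNDS, worker(), run_ping_pong()
and shared_counter are hypothetical names for illustration;
shared_counter stands in for whatever do_something() touches.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define ROUNDS 1000

static _Atomic uint32_t flag;   /* 0: worker 0's turn, 1: worker 1's turn */
static int shared_counter;      /* plain data handed over via the flag */

static void *worker(void *arg)
{
    uint32_t me = (uint32_t)(uintptr_t)arg;

    for (int i = 0; i < ROUNDS; i++) {
        /* acquire: accesses below cannot move above this load */
        while (atomic_load_explicit(&flag, memory_order_acquire) != me)
            sched_yield();      /* be polite while waiting for our turn */
        shared_counter++;       /* stands in for do_something() */
        /* release: accesses above cannot move below this store */
        atomic_store_explicit(&flag, 1 - me, memory_order_release);
    }
    return NULL;
}

/* Run both workers to completion and return the final counter. */
static int run_ping_pong(void)
{
    pthread_t t0, t1;

    atomic_store(&flag, 0);
    shared_counter = 0;
    pthread_create(&t0, NULL, worker, (void *)(uintptr_t)0);
    pthread_create(&t1, NULL, worker, (void *)(uintptr_t)1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return shared_counter;
}
```

Because each increment is ordered by the acquire/release pair on "flag",
run_ping_pong() should return 2 * ROUNDS even though shared_counter is a
plain int.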

              Thanks, Akira

>                                                That is because:
> 
>     1) on X86/X64, load/store on 32-bits variable are atomic
>     2) I use READ_ONCE/WRITE_ONCE to prevent possibly harmful compiler
> optimization on `flag'.
>     3) I use only one variable to communicate between two processes,
> so there is no need for any kind of barrier.
> 
> Does anyone have any objection at that?
> 
> I know using a lock or atomic operation will save me a lot of
> argument, but I think those things are unnecessary at this
> circumstance, and it matter where performance matter, so I am picky
> here...
> 
> Yubin
> 
> [1]: https://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
> [2]: https://www.usenix.org/conference/osdi10/ad-hoc-synchronization-considered-harmful



* Re: synchronize with a non-atomic flag
  2017-10-06 12:03 ` Akira Yokosawa
@ 2017-10-06 12:35   ` Yubin Ruan
  2017-10-06 19:12     ` Paul E. McKenney
  0 siblings, 1 reply; 13+ messages in thread
From: Yubin Ruan @ 2017-10-06 12:35 UTC (permalink / raw)
  To: Akira Yokosawa; +Cc: perfbook, Paul E. McKenney

2017-10-06 20:03 GMT+08:00 Akira Yokosawa <akiyks@gmail.com>:
> Hi Yubin,
>
> On 2017/10/06 14:52, Yubin Ruan wrote:
>> Hi,
>> I saw lots of discussions on the web about possible race when doing
>> synchronization between multiple threads/processes with lock or atomic
>> operations[1][2]. From my point of view most them are over-worrying.
>> But I want to point out some particular issue here to see whether
>> anyone have anything to say.
>>
>> Imagine two processes communicate using only a uint32_t variable in
>> shared memory, like this:
>>
>>     // uint32_t variable in shared memory
>>     uint32_t flag = 0;
>>
>>     //process 1
>>     while(1) {
>>         if(READ_ONCE(flag) == 0) {
>>             do_something();
>>             WRITE_ONCE(flag, 1); // let another process to run
>>         } else {
>>             continue;
>>         }
>>     }
>>
>>     //process 2
>>     while(1) {
>>         if(READ_ONCE(flag) == 1) {
>>             printf("process 2 running...\n");
>>             WRITE_ONCE(flag, 0); // let another process to run
>>         } else {
>>             continue;
>>         }
>>     }
>>
>> On X86 or X64, I expect this code to run correctly, that is, I will
>> got the two `printf' to printf one after one.
>
> Well, I see only one printf() above.
> Do you mean:

yes. sorry about the typo.

>     //process 1
>     while(1) {
>         if(READ_ONCE(flag) == 0) {
>             printf("process 1 running...\n");
>             WRITE_ONCE(flag, 1); // let another process to run
>         } else {
>             continue;
>         }
>     }
>
>     //process 2
>     while(1) {
>         if(READ_ONCE(flag) == 1) {
>             printf("process 2 running...\n");
>             WRITE_ONCE(flag, 0); // let another process to run
>         } else {
>             continue;
>         }
>     }
>
> ?
>
> Then printf()s can be a problem.
> It partially negates your claim 3).
> Without using memory barrier, there is no guarantee that the results of
> WRITE_ONCE() are visible to the other thread after the printf()'s
> memory accesses complete.

But on X86/X64, where we have cache coherence, the result of
WRITE_ONCE() should be visible to the other thread (maybe not
immediately, but eventually it will be visible).

> I/O operations in printf() might make the situation trickier.

printf(3) is claimed to be thread-safe, so I think this issue will not
concern us.

> In a more realistic case where you do something meaningful in
> do_something() in both threads:
>
>     //process 1
>     while(1) {
>         if(READ_ONCE(flag) == 0) {
>             do_something();
>             WRITE_ONCE(flag, 1); // let another process to run
>         } else {
>             continue;
>         }
>     }
>
>     //process 2
>     while(1) {
>         if(READ_ONCE(flag) == 1) {
>             do_something();
>             WRITE_ONCE(flag, 0); // let another process to run
>         } else {
>             continue;
>         }
>     }
>
> and if do_something() uses some shared variables other than "flag",
> you need a couple of memory barriers to ensure the ordering of
> READ_ONCE(), do_something(), and WRITE_ONCE() something like:
>
>     //process 1
>     while(1) {
>         if(READ_ONCE(flag) == 0) {
>             smp_rmb();
>             do_something();
>             smp_wmb();
>             WRITE_ONCE(flag, 1); // let another process to run
>         } else {
>             continue;
>         }
>     }
>
>     //process 2
>     while(1) {
>         if(READ_ONCE(flag) == 1) {
>             smp_rmb();
>             do_something();
>             smp_wmb();
>             WRITE_ONCE(flag, 0); // let another process to run
>         } else {
>             continue;
>         }
>     }
>
> In Linux kernel memory model, you can use acquire/release APIs instead:
>
>     //process 1
>     while(1) {
>         if(smp_load_acquire(&flag) == 0) {
>             do_something();
>             smp_store_release(&flag, 1); // let another process to run
>         } else {
>             continue;
>         }
>     }
>
>     //process 2
>     while(1) {
>         if(smp_load_acquire(&flag) == 1) {
>             do_something();
>             smp_store_release(&flag, 0); // let another process to run
>         } else {
>             continue;
>         }
>     }

Yes, it could be tricky when `do_something()' really does something that
involves other shared variables.

Yubin

> The intention of the code is easier to see when you use well-defined APIs.
> Just my two cents.
>
>               Thanks, Akira
>
>>                                                That is because:
>>
>>     1) on X86/X64, load/store on 32-bits variable are atomic
>>     2) I use READ_ONCE/WRITE_ONCE to prevent possibly harmful compiler
>> optimization on `flag'.
>>     3) I use only one variable to communicate between two processes,
>> so there is no need for any kind of barrier.
>>
>> Does anyone have any objection at that?
>>
>> I know using a lock or atomic operation will save me a lot of
>> argument, but I think those things are unnecessary at this
>> circumstance, and it matter where performance matter, so I am picky
>> here...
>>
>> Yubin
>>
>> [1]: https://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
>> [2]: https://www.usenix.org/conference/osdi10/ad-hoc-synchronization-considered-harmful



* Re: synchronize with a non-atomic flag
  2017-10-06 12:35   ` Yubin Ruan
@ 2017-10-06 19:12     ` Paul E. McKenney
  2017-10-07  7:04       ` Yubin Ruan
  0 siblings, 1 reply; 13+ messages in thread
From: Paul E. McKenney @ 2017-10-06 19:12 UTC (permalink / raw)
  To: Yubin Ruan; +Cc: Akira Yokosawa, perfbook

On Fri, Oct 06, 2017 at 08:35:00PM +0800, Yubin Ruan wrote:
> 2017-10-06 20:03 GMT+08:00 Akira Yokosawa <akiyks@gmail.com>:
> > On 2017/10/06 14:52, Yubin Ruan wrote:

[ . . . ]

> > I/O operations in printf() might make the situation trickier.
> 
> printf(3) is claimed to be thread-safe, so I think this issue will not
> concern us.
> 
> > In a more realistic case where you do something meaningful in
> > do_something() in both threads:
> >
> >     //process 1
> >     while(1) {
> >         if(READ_ONCE(flag) == 0) {
> >             do_something();
> >             WRITE_ONCE(flag, 1); // let another process to run
> >         } else {
> >             continue;
> >         }
> >     }
> >
> >     //process 2
> >     while(1) {
> >         if(READ_ONCE(flag) == 1) {
> >             do_something();
> >             WRITE_ONCE(flag, 0); // let another process to run
> >         } else {
> >             continue;
> >         }
> >     }

In the Linux kernel, there is control-dependency ordering between
the READ_ONCE(flag) and any stores in either the then-clause or
the else-clause.  However, I see no ordering between do_something()
and the WRITE_ONCE().

> > and if do_something() uses some shared variables other than "flag",
> > you need a couple of memory barriers to ensure the ordering of
> > READ_ONCE(), do_something(), and WRITE_ONCE() something like:
> >
> >     //process 1
> >     while(1) {
> >         if(READ_ONCE(flag) == 0) {
> >             smp_rmb();
> >             do_something();
> >             smp_wmb();
> >             WRITE_ONCE(flag, 1); // let another process to run
> >         } else {
> >             continue;
> >         }
> >     }
> >
> >     //process 2
> >     while(1) {
> >         if(READ_ONCE(flag) == 1) {
> >             smp_rmb();
> >             do_something();
> >             smp_wmb();
> >             WRITE_ONCE(flag, 0); // let another process to run
> >         } else {
> >             continue;
> >         }
> >     }

Here, the control dependency again orders the READ_ONCE() against later
stores, and the smp_rmb() orders the READ_ONCE() against any later
loads.  The smp_wmb() orders do_something()'s writes (but not its reads!)
against the WRITE_ONCE().
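
To make the barrier placement concrete outside the kernel, here is a
rough userspace sketch using C11 fences. This is an approximation under
stated assumptions, not the kernel API: a C11 release fence is stronger
than smp_wmb() (it also orders earlier loads against the later store) and
an acquire fence is stronger than smp_rmb() (it also orders later stores).
The names payload, ready, publish() and try_consume() are hypothetical.

```c
#include <stdatomic.h>

static int payload;             /* plain shared data */
static atomic_int ready;        /* plays the role of "flag" */

/* Writer: store the data, then publish it. */
static void publish(int value)
{
    payload = value;
    /* roughly where smp_wmb() sits; also orders earlier loads */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Reader: return 1 and fill *out if data was published, else 0. */
static int try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return 0;
    /* roughly where smp_rmb() sits; also orders later stores */
    atomic_thread_fence(memory_order_acquire);
    *out = payload;
    return 1;
}
```

A reader that sees ready == 1 and then passes the acquire fence is
guaranteed to see the payload written before the release fence.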

> > In Linux kernel memory model, you can use acquire/release APIs instead:
> >
> >     //process 1
> >     while(1) {
> >         if(smp_load_acquire(&flag) == 0) {
> >             do_something();
> >             smp_store_release(&flag, 1); // let another process to run
> >         } else {
> >             continue;
> >         }
> >     }
> >
> >     //process 2
> >     while(1) {
> >         if(smp_load_acquire(&flag) == 1) {
> >             do_something();
> >             smp_store_release(&flag, 0); // let another process to run
> >         } else {
> >             continue;
> >         }
> >     }

This is probably the most straightforward of the above approaches.

That said, if you really want a series of things to execute in a
particular order, why not just put them into the same process?

							Thanx, Paul

> Yes it could be tricky when `do_something()' really do something that
> involved other shared variable.
> 
> Yubin
> 
> > The intention of the code is easier to see when you use well-defined APIs.
> > Just my two cents.
> >
> >               Thanks, Akira
> >
> >>                                                That is because:
> >>
> >>     1) on X86/X64, load/store on 32-bits variable are atomic
> >>     2) I use READ_ONCE/WRITE_ONCE to prevent possibly harmful compiler
> >> optimization on `flag'.
> >>     3) I use only one variable to communicate between two processes,
> >> so there is no need for any kind of barrier.
> >>
> >> Does anyone have any objection at that?
> >>
> >> I know using a lock or atomic operation will save me a lot of
> >> argument, but I think those things are unnecessary at this
> >> circumstance, and it matter where performance matter, so I am picky
> >> here...
> >>
> >> Yubin
> >>
> >> [1]: https://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
> >> [2]: https://www.usenix.org/conference/osdi10/ad-hoc-synchronization-considered-harmful



* Re: synchronize with a non-atomic flag
  2017-10-06 19:12     ` Paul E. McKenney
@ 2017-10-07  7:04       ` Yubin Ruan
  2017-10-07 11:40         ` Akira Yokosawa
  0 siblings, 1 reply; 13+ messages in thread
From: Yubin Ruan @ 2017-10-07  7:04 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Akira Yokosawa, perfbook

Thanks Paul and Akira,

2017-10-07 3:12 GMT+08:00 Paul E. McKenney <paulmck@linux.vnet.ibm.com>:
> On Fri, Oct 06, 2017 at 08:35:00PM +0800, Yubin Ruan wrote:
>> 2017-10-06 20:03 GMT+08:00 Akira Yokosawa <akiyks@gmail.com>:
>> > On 2017/10/06 14:52, Yubin Ruan wrote:
>
> [ . . . ]
>
>> > I/O operations in printf() might make the situation trickier.
>>
>> printf(3) is claimed to be thread-safe, so I think this issue will not
>> concern us.

so now I can pretty much confirm this.

>> > In a more realistic case where you do something meaningful in
>> > do_something() in both threads:
>> >
>> >     //process 1
>> >     while(1) {
>> >         if(READ_ONCE(flag) == 0) {
>> >             do_something();
>> >             WRITE_ONCE(flag, 1); // let another process to run
>> >         } else {
>> >             continue;
>> >         }
>> >     }
>> >
>> >     //process 2
>> >     while(1) {
>> >         if(READ_ONCE(flag) == 1) {
>> >             do_something();
>> >             WRITE_ONCE(flag, 0); // let another process to run
>> >         } else {
>> >             continue;
>> >         }
>> >     }
>
> In the Linux kernel, there is control-dependency ordering between
> the READ_ONCE(flag) and any stores in either the then-clause or
> the else-clause.  However, I see no ordering between do_something()
> and the WRITE_ONCE().

I was not aware of "control-dependency" ordering in the
Linux kernel before. Does it hold for all architectures?

But anyway, the ordering between READ_ONCE(flag) and any subsequent
stores is guaranteed on X86/X64, so we don't need any memory barrier
here.

>> > and if do_something() uses some shared variables other than "flag",
>> > you need a couple of memory barriers to ensure the ordering of
>> > READ_ONCE(), do_something(), and WRITE_ONCE() something like:
>> >
>> >     //process 1
>> >     while(1) {
>> >         if(READ_ONCE(flag) == 0) {
>> >             smp_rmb();
>> >             do_something();
>> >             smp_wmb();
>> >             WRITE_ONCE(flag, 1); // let another process to run
>> >         } else {
>> >             continue;
>> >         }
>> >     }
>> >
>> >     //process 2
>> >     while(1) {
>> >         if(READ_ONCE(flag) == 1) {
>> >             smp_rmb();
>> >             do_something();
>> >             smp_wmb();
>> >             WRITE_ONCE(flag, 0); // let another process to run
>> >         } else {
>> >             continue;
>> >         }
>> >     }
>
> Here, the control dependency again orders the READ_ONCE() against later
> stores, and the smp_rmb() orders the READ_ONCE() against any later
> loads.

Understood and agreed.

> The smp_wmb() orders do_something()'s writes (but not its reads!)
> against the WRITE_ONCE().

Understood and agreed. But do we really need the smp_wmb() on X86/64?
As far as I know, on X86/64 stores are not reordered with other
stores...[1]

>> > In Linux kernel memory model, you can use acquire/release APIs instead:
>> >
>> >     //process 1
>> >     while(1) {
>> >         if(smp_load_acquire(&flag) == 0) {
>> >             do_something();
>> >             smp_store_release(&flag, 1); // let another process to run
>> >         } else {
>> >             continue;
>> >         }
>> >     }
>> >
>> >     //process 2
>> >     while(1) {
>> >         if(smp_load_acquire(&flag) == 1) {
>> >             do_something();
>> >             smp_store_release(&flag, 0); // let another process to run
>> >         } else {
>> >             continue;
>> >         }
>> >     }
>
> This is probably the most straightforward of the above approaches.
>
> That said, if you really want a series of things to execute in a
> particular order, why not just put them into the same process?

I would be very happy to if I could. But sometimes we just have to deal
with issues involving multiple processes...

[1]: One thing that confuses me a little is that some people claim that
on x86/64 there are several guarantees[2]:
    1) Loads are not reordered with other loads.
    2) Stores are not reordered with other stores.
    3) Stores are not reordered with older loads.
(note that loads may still be reordered with older stores to different
locations)
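
The one reordering that remains possible (a load passing an older store
to a different location) is what the classic store-buffering test
probes. Here is a minimal sketch in C11 with pthreads, with hypothetical
names throughout. With the full fences shown, C11 forbids the outcome
r0 == 0 && r1 == 0; as far as I understand, without them that outcome is
allowed on x86.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int x, y;
static int r0, r1;

static void *thread0(void *arg)
{
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    /* full fence: keeps the load below from passing the store above */
    atomic_thread_fence(memory_order_seq_cst);
    r0 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

static void *thread1(void *arg)
{
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    r1 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

/* Run the test `iterations` times; count runs where both loads saw 0. */
static int count_both_zero(int iterations)
{
    int hits = 0;

    for (int i = 0; i < iterations; i++) {
        pthread_t a, b;

        atomic_store(&x, 0);
        atomic_store(&y, 0);
        pthread_create(&a, NULL, thread0, NULL);
        pthread_create(&b, NULL, thread1, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r0 == 0 && r1 == 0)
            hits++;
    }
    return hits;
}
```

With the fences in place, count_both_zero() should return 0 however many
iterations are run.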

So, if 1) and 2) are true, why do we have "lfence" and "sfence"
instructions at all?

[2]: I found those claims here, but I am not sure whether they
are true: https://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/



* Re: synchronize with a non-atomic flag
  2017-10-07  7:04       ` Yubin Ruan
@ 2017-10-07 11:40         ` Akira Yokosawa
  2017-10-07 13:43           ` Yubin Ruan
  0 siblings, 1 reply; 13+ messages in thread
From: Akira Yokosawa @ 2017-10-07 11:40 UTC (permalink / raw)
  To: Yubin Ruan; +Cc: Paul E. McKenney, perfbook, Akira Yokosawa

On 2017/10/07 15:04:50 +0800, Yubin Ruan wrote:
> Thanks Paul and Akira,
> 
> 2017-10-07 3:12 GMT+08:00 Paul E. McKenney <paulmck@linux.vnet.ibm.com>:
>> On Fri, Oct 06, 2017 at 08:35:00PM +0800, Yubin Ruan wrote:
>>> 2017-10-06 20:03 GMT+08:00 Akira Yokosawa <akiyks@gmail.com>:
>>>> On 2017/10/06 14:52, Yubin Ruan wrote:
>>
>> [ . . . ]
>>
>>>> I/O operations in printf() might make the situation trickier.
>>>
>>> printf(3) is claimed to be thread-safe, so I think this issue will not
>>> concern us.
> 
> so now I can pretty much confirm this.

Yes. I now see that POSIX.1c requires stdio functions to be MT-safe.
Being MT-safe here means that one call to printf() won't be disturbed by
other racing function calls that write to stdout.

I had been thrown off by the following description of MT-Safe in the
attributes(7) man page:

    Being MT-Safe does not imply a function is atomic, nor  that  it
    uses  any of the memory synchronization mechanisms POSIX exposes
    to users. [...]

Excerpt from a white paper at http://www.unix.org/whitepapers/reentrant.html:

    The POSIX.1 and C-language functions that operate on character streams
    (represented by pointers to objects of type FILE) are required by POSIX.1c
    to be implemented in such a way that reentrancy is achieved (see ISO/IEC
    9945:1-1996, §8.2). This requirement has a drawback; it imposes
    substantial performance penalties because of the synchronization that
    must be built into the implementations of the functions for the sake of
    reentrancy. [...]
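
Note that this guarantee is per call. To keep several calls together on
one stream, POSIX provides flockfile()/funlockfile(). A minimal sketch
follows; print_record() is a hypothetical helper and the message text is
only illustrative.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>

/*
 * Emit a two-line record on `out` without letting other threads'
 * output land between the two lines. Returns the number of
 * characters written, or -1 if either fprintf() call failed.
 */
static int print_record(FILE *out, int id)
{
    int n1, n2;

    flockfile(out);             /* take the stream's (recursive) lock */
    n1 = fprintf(out, "process %d running...\n", id);
    n2 = fprintf(out, "process %d done\n", id);
    funlockfile(out);           /* release it */

    return (n1 < 0 || n2 < 0) ? -1 : n1 + n2;
}
```

This serializes only the stdio stream among threads of one process. It
says nothing about the ordering of other shared-memory accesses.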

Yubin, thank you for giving me the chance to realize this.

> 
>>>> In a more realistic case where you do something meaningful in
>>>> do_something() in both threads:
>>>>
>>>>     //process 1
>>>>     while(1) {
>>>>         if(READ_ONCE(flag) == 0) {
>>>>             do_something();
>>>>             WRITE_ONCE(flag, 1); // let another process to run
>>>>         } else {
>>>>             continue;
>>>>         }
>>>>     }
>>>>
>>>>     //process 2
>>>>     while(1) {
>>>>         if(READ_ONCE(flag) == 1) {
>>>>             do_something();
>>>>             WRITE_ONCE(flag, 0); // let another process to run
>>>>         } else {
>>>>             continue;
>>>>         }
>>>>     }
>>
>> In the Linux kernel, there is control-dependency ordering between
>> the READ_ONCE(flag) and any stores in either the then-clause or
>> the else-clause.  However, I see no ordering between do_something()
>> and the WRITE_ONCE().
> 
> I was not aware of the "control-dependency" ordering issue in the
> Linux kernel before. Is it true for all architectures?
> 
> But anyway, the ordering between READ_ONCE(flag) and any subsequent
> stores are guaranteed on X86/X64, so we didn't need any memory barrier
> here.
> 
>>>> and if do_something() uses some shared variables other than "flag",
>>>> you need a couple of memory barriers to ensure the ordering of
>>>> READ_ONCE(), do_something(), and WRITE_ONCE() something like:
>>>>
>>>>     //process 1
>>>>     while(1) {
>>>>         if(READ_ONCE(flag) == 0) {
>>>>             smp_rmb();
>>>>             do_something();
>>>>             smp_wmb();
>>>>             WRITE_ONCE(flag, 1); // let another process to run
>>>>         } else {
>>>>             continue;
>>>>         }
>>>>     }
>>>>
>>>>     //process 2
>>>>     while(1) {
>>>>         if(READ_ONCE(flag) == 1) {
>>>>             smp_rmb();
>>>>             do_something();
>>>>             smp_wmb();
>>>>             WRITE_ONCE(flag, 0); // let another process to run
>>>>         } else {
>>>>             continue;
>>>>         }
>>>>     }
>>
>> Here, the control dependency again orders the READ_ONCE() against later
>> stores, and the smp_rmb() orders the READ_ONCE() against any later
>> loads.
> 
> Understand and agree.
> 
>> The smp_wmb() orders do_something()'s writes (but not its reads!)
>> against the WRITE_ONCE().
> 
> Understand and agree. But do we really need the smp_rmb() on X86/64?
> As far as I know, on X86/64 stores are not reordered with other
> stores...[1]
> 
>>>> In Linux kernel memory model, you can use acquire/release APIs instead:
>>>>
>>>>     //process 1
>>>>     while(1) {
>>>>         if(smp_load_acquire(&flag) == 0) {
>>>>             do_something();
>>>>             smp_store_release(&flag, 1); // let another process to run
>>>>         } else {
>>>>             continue;
>>>>         }
>>>>     }
>>>>
>>>>     //process 2
>>>>     while(1) {
>>>>         if(smp_load_acquire(&flag) == 1) {
>>>>             do_something();
>>>>             smp_store_release(&flag, 0); // let another process to run
>>>>         } else {
>>>>             continue;
>>>>         }
>>>>     }
>>
>> This is probably the most straightforward of the above approaches.
>>
>> That said, if you really want a series of things to execute in a
>> particular order, why not just put them into the same process?
> 
> I will be very happy if I can. But sometimes we just have to deal with
> issues concerning multiple processes...
> 
> [1]: One thing I got a little confused is that some people claim that
> on x86/64 there are several guarantees[2]:
>     1) Loads are not reordered with other loads.
>     2) Stores are not reordered with other stores.
>     3) Stores are not reordered with older loads.
> (note that Loads may still be reordered with older stores to different
> locations)
> 
> So, if 1) and 2) are true, why do we have "lfence" and "sfence"
> instructions at all?

Excerpt from the Intel 64 and IA-32 Architectures Software Developer's
Manual, Vol. 3A, Section 8.2.5:

    [...] Despite the fact that Pentium 4, Intel Xeon, and P6 family
    processors support processor ordering, Intel does not guarantee
    that future processors will support this model. To make software
    portable to future processors, it is recommended that operating systems
    provide critical region and resource control constructs and API's
    (application program interfaces) based on I/O, locking, and/or
    serializing instructions be used to synchronize access to shared
    areas of memory in multiple-processor systems. [...]

So the answer seems to be "to make software portable to future processors".

        Thanks, Akira

> 
> [2]: I found those claims here, but not so sure whether or not they
> are true: https://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
> 



* Re: synchronize with a non-atomic flag
  2017-10-07 11:40         ` Akira Yokosawa
@ 2017-10-07 13:43           ` Yubin Ruan
  2017-10-07 14:36             ` Akira Yokosawa
  0 siblings, 1 reply; 13+ messages in thread
From: Yubin Ruan @ 2017-10-07 13:43 UTC (permalink / raw)
  To: Akira Yokosawa; +Cc: Paul E. McKenney, perfbook

2017-10-07 19:40 GMT+08:00 Akira Yokosawa <akiyks@gmail.com>:
> On 2017/10/07 15:04:50 +0800, Yubin Ruan wrote:
>> Thanks Paul and Akira,
>>
>> 2017-10-07 3:12 GMT+08:00 Paul E. McKenney <paulmck@linux.vnet.ibm.com>:
>>> On Fri, Oct 06, 2017 at 08:35:00PM +0800, Yubin Ruan wrote:
>>>> 2017-10-06 20:03 GMT+08:00 Akira Yokosawa <akiyks@gmail.com>:
>>>>> On 2017/10/06 14:52, Yubin Ruan wrote:
>>>
>>> [ . . . ]
>>>
>>>>> I/O operations in printf() might make the situation trickier.
>>>>
>>>> printf(3) is claimed to be thread-safe, so I think this issue will not
>>>> concern us.
>>
>> so now I can pretty much confirm this.
>
> Yes. Now I recognize that POSIX.1c requires stdio functions to be MT-safe.
> By MT-safe, one call to printf() won't be disturbed by other racy function
> calls involving output to stdout.
>
> I was disturbed by the following description of MT-Safe in attributes(7)
> man page:
>
>     Being MT-Safe does not imply a function is atomic, nor  that  it
>     uses  any of the memory synchronization mechanisms POSIX exposes
>     to users. [...]
>
> Excerpt from a white paper at http://www.unix.org/whitepapers/reentrant.html:
>
>     The POSIX.1 and C-language functions that operate on character streams
>     (represented by pointers to objects of type FILE) are required by POSIX.1c
>     to be implemented in such a way that reentrancy is achieved (see ISO/IEC
>     9945:1-1996, §8.2). This requirement has a drawback; it imposes
>     substantial performance penalties because of the synchronization that
>     must be built into the implementations of the functions for the sake of
>     reentrancy. [...]
>
> Yubin, thank you for giving me the chance to realize this.
>
>>
>>>>> In a more realistic case where you do something meaningful in
>>>>> do_something() in both threads:
>>>>>
>>>>>     //process 1
>>>>>     while(1) {
>>>>>         if(READ_ONCE(flag) == 0) {
>>>>>             do_something();
>>>>>             WRITE_ONCE(flag, 1); // let another process to run
>>>>>         } else {
>>>>>             continue;
>>>>>         }
>>>>>     }
>>>>>
>>>>>     //process 2
>>>>>     while(1) {
>>>>>         if(READ_ONCE(flag) == 1) {
>>>>>             do_something();
>>>>>             WRITE_ONCE(flag, 0); // let another process to run
>>>>>         } else {
>>>>>             continue;
>>>>>         }
>>>>>     }
>>>
>>> In the Linux kernel, there is control-dependency ordering between
>>> the READ_ONCE(flag) and any stores in either the then-clause or
>>> the else-clause.  However, I see no ordering between do_something()
>>> and the WRITE_ONCE().
>>
>> I was not aware of the "control-dependency" ordering issue in the
>> Linux kernel before. Is it true for all architectures?
>>
>> But anyway, the ordering between READ_ONCE(flag) and any subsequent
>> stores are guaranteed on X86/X64, so we didn't need any memory barrier
>> here.
>>
>>>>> and if do_something() uses some shared variables other than "flag",
>>>>> you need a couple of memory barriers to ensure the ordering of
>>>>> READ_ONCE(), do_something(), and WRITE_ONCE() something like:
>>>>>
>>>>>     //process 1
>>>>>     while(1) {
>>>>>         if(READ_ONCE(flag) == 0) {
>>>>>             smp_rmb();
>>>>>             do_something();
>>>>>             smp_wmb();
>>>>>             WRITE_ONCE(flag, 1); // let another process to run
>>>>>         } else {
>>>>>             continue;
>>>>>         }
>>>>>     }
>>>>>
>>>>>     //process 2
>>>>>     while(1) {
>>>>>         if(READ_ONCE(flag) == 1) {
>>>>>             smp_rmb();
>>>>>             do_something();
>>>>>             smp_wmb();
>>>>>             WRITE_ONCE(flag, 0); // let another process to run
>>>>>         } else {
>>>>>             continue;
>>>>>         }
>>>>>     }
>>>
>>> Here, the control dependency again orders the READ_ONCE() against later
>>> stores, and the smp_rmb() orders the READ_ONCE() against any later
>>> loads.
>>
>> Understand and agree.
>>
>>> The smp_wmb() orders do_something()'s writes (but not its reads!)
>>> against the WRITE_ONCE().
>>
>> Understand and agree. But do we really need the smp_rmb() on X86/64?
>> As far as I know, on X86/64 stores are not reordered with other
>> stores...[1]
>>
>>>>> In Linux kernel memory model, you can use acquire/release APIs instead:
>>>>>
>>>>>     //process 1
>>>>>     while(1) {
>>>>>         if(smp_load_acquire(&flag) == 0) {
>>>>>             do_something();
>>>>>             smp_store_release(&flag, 1); // let another process to run
>>>>>         } else {
>>>>>             continue;
>>>>>         }
>>>>>     }
>>>>>
>>>>>     //process 2
>>>>>     while(1) {
>>>>>         if(smp_load_acquire(&flag) == 1) {
>>>>>             do_something();
>>>>>             smp_store_release(&flag, 0); // let another process to run
>>>>>         } else {
>>>>>             continue;
>>>>>         }
>>>>>     }
>>>
>>> This is probably the most straightforward of the above approaches.
>>>
>>> That said, if you really want a series of things to execute in a
>>> particular order, why not just put them into the same process?
>>
>> I will be very happy if I can. But sometimes we just have to deal with
>> issues concerning multiple processes...
>>
>> [1]: One thing I got a little confused is that some people claim that
>> on x86/64 there are several guarantees[2]:
>>     1) Loads are not reordered with other loads.
>>     2) Stores are not reordered with other stores.
>>     3) Stores are not reordered with older loads.
>> (note that Loads may still be reordered with older stores to different
>> locations)
>>
>> So, if 1) and 2) are true, why do we have "lfence" and "sfence"
>> instructions at all?
>
> Excerpt from Intel 64 and IA-32 Architectures Developer's Manual: Vol. 3A
> Section 8.2.5
>
>     [...] Despite the fact that Pentium 4, Intel Xeon, and P6 family
>     processors support processor ordering, Intel does not guarantee
>     that future processors will support this model. To make software
>     portable to future processors, it is recommended that operating systems
>     provide critical region and resource control constructs and API's
>     (application program interfaces) based on I/O, locking, and/or
>     serializing instructions be used to synchronize access to shared
>     areas of memory in multiple-processor systems. [...]
>
> So the answer seems "to make software portable to future processors".

Hmm...so these instructions are currently effectively nops?

Yubin

>
>>
>> [2]: I found those claims here, but not so sure whether or not they
>> are true: https://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
>>
>



* Re: synchronize with a non-atomic flag
  2017-10-07 13:43           ` Yubin Ruan
@ 2017-10-07 14:36             ` Akira Yokosawa
  2017-10-07 20:20               ` Paul E. McKenney
  0 siblings, 1 reply; 13+ messages in thread
From: Akira Yokosawa @ 2017-10-07 14:36 UTC (permalink / raw)
  To: Yubin Ruan; +Cc: Paul E. McKenney, perfbook, Akira Yokosawa

On 2017/10/07 21:43:53 +0800, Yubin Ruan wrote:
> 2017-10-07 19:40 GMT+08:00 Akira Yokosawa <akiyks@gmail.com>:
>> On 2017/10/07 15:04:50 +0800, Yubin Ruan wrote:
>>> Thanks Paul and Akira,
>>>
>>> 2017-10-07 3:12 GMT+08:00 Paul E. McKenney <paulmck@linux.vnet.ibm.com>:
>>>> On Fri, Oct 06, 2017 at 08:35:00PM +0800, Yubin Ruan wrote:
>>>>> 2017-10-06 20:03 GMT+08:00 Akira Yokosawa <akiyks@gmail.com>:
>>>>>> On 2017/10/06 14:52, Yubin Ruan wrote:
>>>>
>>>> [ . . . ]
>>>>
>>>>>> I/O operations in printf() might make the situation trickier.
>>>>>
>>>>> printf(3) is claimed to be thread-safe, so I think this issue will not
>>>>> concern us.
>>>
>>> so now I can pretty much confirm this.
>>
>> Yes. Now I recognize that POSIX.1c requires stdio functions to be MT-safe.
>> By MT-safe, one call to printf() won't be disturbed by other racy function
>> calls involving output to stdout.
>>
>> I was disturbed by the following description of MT-Safe in attributes(7)
>> man page:
>>
>>     Being MT-Safe does not imply a function is atomic, nor  that  it
>>     uses  any of the memory synchronization mechanisms POSIX exposes
>>     to users. [...]
>>
>> Excerpt from a white paper at http://www.unix.org/whitepapers/reentrant.html:
>>
>>     The POSIX.1 and C-language functions that operate on character streams
>>     (represented by pointers to objects of type FILE) are required by POSIX.1c
>>     to be implemented in such a way that reentrancy is achieved (see ISO/IEC
>>     9945:1-1996, §8.2). This requirement has a drawback; it imposes
>>     substantial performance penalties because of the synchronization that
>>     must be built into the implementations of the functions for the sake of
>>     reentrancy. [...]
>>
>> Yubin, thank you for giving me the chance to realize this.
>>
>>>
>>>>>> In a more realistic case where you do something meaningful in
>>>>>> do_something() in both threads:
>>>>>>
>>>>>>     //process 1
>>>>>>     while(1) {
>>>>>>         if(READ_ONCE(flag) == 0) {
>>>>>>             do_something();
>>>>>>             WRITE_ONCE(flag, 1); // let another process to run
>>>>>>         } else {
>>>>>>             continue;
>>>>>>         }
>>>>>>     }
>>>>>>
>>>>>>     //process 2
>>>>>>     while(1) {
>>>>>>         if(READ_ONCE(flag) == 1) {
>>>>>>             do_something();
>>>>>>             WRITE_ONCE(flag, 0); // let another process to run
>>>>>>         } else {
>>>>>>             continue;
>>>>>>         }
>>>>>>     }
>>>>
>>>> In the Linux kernel, there is control-dependency ordering between
>>>> the READ_ONCE(flag) and any stores in either the then-clause or
>>>> the else-clause.  However, I see no ordering between do_something()
>>>> and the WRITE_ONCE().
>>>
>>> I was not aware of the "control-dependency" ordering issue in the
>>> Linux kernel before. Is it true for all architectures?
>>>
>>> But anyway, the ordering between READ_ONCE(flag) and any subsequent
>>> stores are guaranteed on X86/X64, so we didn't need any memory barrier
>>> here.
>>>
>>>>>> and if do_something() uses some shared variables other than "flag",
>>>>>> you need a couple of memory barriers to ensure the ordering of
>>>>>> READ_ONCE(), do_something(), and WRITE_ONCE() something like:
>>>>>>
>>>>>>     //process 1
>>>>>>     while(1) {
>>>>>>         if(READ_ONCE(flag) == 0) {
>>>>>>             smp_rmb();
>>>>>>             do_something();
>>>>>>             smp_wmb();
>>>>>>             WRITE_ONCE(flag, 1); // let another process to run
>>>>>>         } else {
>>>>>>             continue;
>>>>>>         }
>>>>>>     }
>>>>>>
>>>>>>     //process 2
>>>>>>     while(1) {
>>>>>>         if(READ_ONCE(flag) == 1) {
>>>>>>             smp_rmb();
>>>>>>             do_something();
>>>>>>             smp_wmb();
>>>>>>             WRITE_ONCE(flag, 0); // let another process to run
>>>>>>         } else {
>>>>>>             continue;
>>>>>>         }
>>>>>>     }
>>>>
>>>> Here, the control dependency again orders the READ_ONCE() against later
>>>> stores, and the smp_rmb() orders the READ_ONCE() against any later
>>>> loads.
>>>
>>> Understand and agree.
>>>
>>>> The smp_wmb() orders do_something()'s writes (but not its reads!)
>>>> against the WRITE_ONCE().
>>>
>>> Understand and agree. But do we really need the smp_rmb() on X86/64?
>>> As far as I know, on X86/64 stores are not reordered with other
>>> stores...[1]
>>>
>>>>>> In Linux kernel memory model, you can use acquire/release APIs instead:
>>>>>>
>>>>>>     //process 1
>>>>>>     while(1) {
>>>>>>         if(smp_load_acquire(&flag) == 0) {
>>>>>>             do_something();
>>>>>>             smp_store_release(&flag, 1); // let another process to run
>>>>>>         } else {
>>>>>>             continue;
>>>>>>         }
>>>>>>     }
>>>>>>
>>>>>>     //process 2
>>>>>>     while(1) {
>>>>>>         if(smp_load_acquire(&flag) == 1) {
>>>>>>             do_something();
>>>>>>             smp_store_release(&flag, 0); // let another process to run
>>>>>>         } else {
>>>>>>             continue;
>>>>>>         }
>>>>>>     }
>>>>
>>>> This is probably the most straightforward of the above approaches.
>>>>
>>>> That said, if you really want a series of things to execute in a
>>>> particular order, why not just put them into the same process?
>>>
>>> I will be very happy if I can. But sometimes we just have to deal with
>>> issues concerning multiple processes...
>>>
>>> [1]: One thing I got a little confused is that some people claim that
>>> on x86/64 there are several guarantees[2]:
>>>     1) Loads are not reordered with other loads.
>>>     2) Stores are not reordered with other stores.
>>>     3) Stores are not reordered with older loads.
>>> (note that Loads may still be reordered with older stores to different
>>> locations)
>>>
>>> So, if 1) and 2) are true, why do we have "lfence" and "sfence"
>>> instructions at all?
>>
>> Excerpt from Intel 64 and IA-32 Architectures Developer's Manual: Vol. 3A
>> Section 8.2.5
>>
>>     [...] Despite the fact that Pentium 4, Intel Xeon, and P6 family
>>     processors support processor ordering, Intel does not guarantee
>>     that future processors will support this model. To make software
>>     portable to future processors, it is recommended that operating systems
>>     provide critical region and resource control constructs and API's
>>     (application program interfaces) based on I/O, locking, and/or
>>     serializing instructions be used to synchronize access to shared
>>     areas of memory in multiple-processor systems. [...]
>>
>> So the answer seems "to make software portable to future processors".
> 
> Hmm...so these instructions are currently effectively nops?
> 

According to perfbook's Section 14.4.9 "x86" (as of current master),

    However, note that some SSE instructions are weakly ordered (clflush
    and non-temporal move instructions [Int04a]). CPUs that have SSE can
    use mfence for smp_mb(), lfence for smp_rmb(), and sfence for smp_wmb().

So as long as you don't use SSE extensions, I guess they are effectively
nops. But I'm not sure.
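
For the one case the perfbook excerpt names (non-temporal moves), a
minimal sketch of where sfence matters might look like the following.
The names payload, ready, publish() and try_consume() are made up for
illustration, and the C11 release/acquire pair merely stands in for
smp_store_release()/smp_load_acquire(). The non-temporal path is only
compiled when the compiler defines __SSE2__; otherwise an ordinary
store is used.

```c
#include <assert.h>
#include <stdatomic.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_stream_si32(), _mm_sfence() */
#endif

static_assert(sizeof(int) == 4, "_mm_stream_si32() stores a 32-bit int");

int payload;             /* hypothetical data being handed over */
atomic_int ready;        /* hypothetical "data is valid" flag   */

/* Publish value, then raise the flag. */
void publish(int value)
{
#if defined(__SSE2__)
    _mm_stream_si32(&payload, value);  /* weakly ordered non-temporal store */
    _mm_sfence();                      /* order it before the flag store    */
#else
    payload = value;                   /* fallback: ordinary store          */
#endif
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Returns 1 and fills *out once the payload is published, else 0. */
int try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return 0;
    *out = payload;
    return 1;
}
```

With only ordinary stores, the x86 store-store guarantee already
orders payload before ready; the explicit sfence is there for the
non-temporal store.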

Paul, could you enlighten us?

Akira

> Yubin
> 
>>
>>>
>>> [2]: I found those claims here, but not so sure whether or not they
>>> are true: https://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
>>>
>>
> 



* Re: synchronize with a non-atomic flag
  2017-10-07 14:36             ` Akira Yokosawa
@ 2017-10-07 20:20               ` Paul E. McKenney
  0 siblings, 0 replies; 13+ messages in thread
From: Paul E. McKenney @ 2017-10-07 20:20 UTC (permalink / raw)
  To: Akira Yokosawa; +Cc: Yubin Ruan, perfbook

On Sat, Oct 07, 2017 at 11:36:45PM +0900, Akira Yokosawa wrote:
> On 2017/10/07 21:43:53 +0800, Yubin Ruan wrote:
> > 2017-10-07 19:40 GMT+08:00 Akira Yokosawa <akiyks@gmail.com>:
> >> On 2017/10/07 15:04:50 +0800, Yubin Ruan wrote:
> >>> Thanks Paul and Akira,
> >>>
> >>> 2017-10-07 3:12 GMT+08:00 Paul E. McKenney <paulmck@linux.vnet.ibm.com>:
> >>>> On Fri, Oct 06, 2017 at 08:35:00PM +0800, Yubin Ruan wrote:
> >>>>> 2017-10-06 20:03 GMT+08:00 Akira Yokosawa <akiyks@gmail.com>:
> >>>>>> On 2017/10/06 14:52, Yubin Ruan wrote:
> >>>>
> >>>> [ . . . ]
> >>>>
> >>>>>> I/O operations in printf() might make the situation trickier.
> >>>>>
> >>>>> printf(3) is claimed to be thread-safe, so I think this issue will not
> >>>>> concern us.
> >>>
> >>> so now I can pretty much confirm this.
> >>
> >> Yes. Now I recognize that POSIX.1c requires stdio functions to be MT-safe.
> >> By MT-safe, one call to printf() won't be disturbed by other racy function
> >> calls involving output to stdout.
> >>
> >> I was disturbed by the following description of MT-Safe in attributes(7)
> >> man page:
> >>
> >>     Being MT-Safe does not imply a function is atomic, nor  that  it
> >>     uses  any of the memory synchronization mechanisms POSIX exposes
> >>     to users. [...]
> >>
> >> Excerpt from a white paper at http://www.unix.org/whitepapers/reentrant.html:
> >>
> >>     The POSIX.1 and C-language functions that operate on character streams
> >>     (represented by pointers to objects of type FILE) are required by POSIX.1c
> >>     to be implemented in such a way that reentrancy is achieved (see ISO/IEC
> >>     9945:1-1996, §8.2). This requirement has a drawback; it imposes
> >>     substantial performance penalties because of the synchronization that
> >>     must be built into the implementations of the functions for the sake of
> >>     reentrancy. [...]
> >>
> >> Yubin, thank you for giving me the chance to realize this.
> >>
> >>>
> >>>>>> In a more realistic case where you do something meaningful in
> >>>>>> do_something() in both threads:
> >>>>>>
> >>>>>>     //process 1
> >>>>>>     while(1) {
> >>>>>>         if(READ_ONCE(flag) == 0) {
> >>>>>>             do_something();
> >>>>>>             WRITE_ONCE(flag, 1); // let another process to run
> >>>>>>         } else {
> >>>>>>             continue;
> >>>>>>         }
> >>>>>>     }
> >>>>>>
> >>>>>>     //process 2
> >>>>>>     while(1) {
> >>>>>>         if(READ_ONCE(flag) == 1) {
> >>>>>>             do_something();
> >>>>>>             WRITE_ONCE(flag, 0); // let another process to run
> >>>>>>         } else {
> >>>>>>             continue;
> >>>>>>         }
> >>>>>>     }
> >>>>
> >>>> In the Linux kernel, there is control-dependency ordering between
> >>>> the READ_ONCE(flag) and any stores in either the then-clause or
> >>>> the else-clause.  However, I see no ordering between do_something()
> >>>> and the WRITE_ONCE().
> >>>
> >>> I was not aware of the "control-dependency" ordering issue in the
> >>> Linux kernel before. Is it true for all architectures?

It is true for all architectures that the Linux kernel supports.
But beware, control dependencies are quite fragile because compilers
break them easily.  See the control dependencies section of the book
for some guidelines for using them -- and for the advice to avoid
using them where feasible.

> >>> But anyway, the ordering between READ_ONCE(flag) and any subsequent
> >>> stores are guaranteed on X86/X64, so we didn't need any memory barrier
> >>> here.

But you do need something to keep the compiler from messing with you,
much the same as if you were using a control dependency.
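
As a rough illustration of "something to keep the compiler from
messing with you", here is a simplified userspace stand-in for
READ_ONCE()/WRITE_ONCE().  This is an assumption for the sake of the
example: the kernel's real definitions are more elaborate, and
__typeof__ is a GCC/Clang extension.  wait_for_flag() and pass_turn()
are hypothetical helpers, not anything from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/*
 * The volatile cast tells the compiler that it may not omit, duplicate,
 * or hoist the access, for example by caching flag in a register across
 * loop iterations.  It does not order the access against ordinary
 * (non-volatile) accesses, and it emits no CPU barrier.
 */
#define READ_ONCE(x)      (*(volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v)  (*(volatile __typeof__(x) *)&(x) = (v))

uint32_t flag;   /* would live in shared memory in the real setup */

/* Poll flag at most max_spins times; return 1 if it equals want, else 0. */
int wait_for_flag(uint32_t want, unsigned long max_spins)
{
    while (max_spins--) {
        if (READ_ONCE(flag) == want)
            return 1;
    }
    return 0;
}

/* Hand the turn to the other side; the protocol only uses 0 and 1. */
void pass_turn(uint32_t next)
{
    assert(next == 0 || next == 1);
    WRITE_ONCE(flag, next);
}
```

Without the volatile access, a compiler would be within its rights to
read flag once before the loop and spin forever on the stale value.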

> >>>>>> and if do_something() uses some shared variables other than "flag",
> >>>>>> you need a couple of memory barriers to ensure the ordering of
> >>>>>> READ_ONCE(), do_something(), and WRITE_ONCE() something like:
> >>>>>>
> >>>>>>     //process 1
> >>>>>>     while(1) {
> >>>>>>         if(READ_ONCE(flag) == 0) {
> >>>>>>             smp_rmb();
> >>>>>>             do_something();
> >>>>>>             smp_wmb();
> >>>>>>             WRITE_ONCE(flag, 1); // let another process to run
> >>>>>>         } else {
> >>>>>>             continue;
> >>>>>>         }
> >>>>>>     }
> >>>>>>
> >>>>>>     //process 2
> >>>>>>     while(1) {
> >>>>>>         if(READ_ONCE(flag) == 1) {
> >>>>>>             smp_rmb();
> >>>>>>             do_something();
> >>>>>>             smp_wmb();
> >>>>>>             WRITE_ONCE(flag, 0); // let another process to run
> >>>>>>         } else {
> >>>>>>             continue;
> >>>>>>         }
> >>>>>>     }
> >>>>
> >>>> Here, the control dependency again orders the READ_ONCE() against later
> >>>> stores, and the smp_rmb() orders the READ_ONCE() against any later
> >>>> loads.
> >>>
> >>> Understand and agree.
> >>>
> >>>> The smp_wmb() orders do_something()'s writes (but not its reads!)
> >>>> against the WRITE_ONCE().
> >>>
> >>> Understand and agree. But do we really need the smp_rmb() on X86/64?
> >>> As far as I know, on X86/64 stores are not reordered with other
> >>> stores...[1]

Give or take SSE instructions.  But the constraints on these are rumored
to have tightened, so I need to check up on this.

> >>>>>> In Linux kernel memory model, you can use acquire/release APIs instead:
> >>>>>>
> >>>>>>     //process 1
> >>>>>>     while(1) {
> >>>>>>         if(smp_load_acquire(&flag) == 0) {
> >>>>>>             do_something();
> >>>>>>             smp_store_release(&flag, 1); // let another process to run
> >>>>>>         } else {
> >>>>>>             continue;
> >>>>>>         }
> >>>>>>     }
> >>>>>>
> >>>>>>     //process 2
> >>>>>>     while(1) {
> >>>>>>         if(smp_load_acquire(&flag) == 1) {
> >>>>>>             do_something();
> >>>>>>             smp_store_release(&flag, 0); // let another process to run
> >>>>>>         } else {
> >>>>>>             continue;
> >>>>>>         }
> >>>>>>     }
> >>>>
> >>>> This is probably the most straightforward of the above approaches.
> >>>>
> >>>> That said, if you really want a series of things to execute in a
> >>>> particular order, why not just put them into the same process?
> >>>
> >>> I will be very happy if I can. But sometimes we just have to deal with
> >>> issues concerning multiple processes...
> >>>
> >>> [1]: One thing I got a little confused is that some people claim that
> >>> on x86/64 there are several guarantees[2]:
> >>>     1) Loads are not reordered with other loads.
> >>>     2) Stores are not reordered with other stores.
> >>>     3) Stores are not reordered with older loads.
> >>> (note that Loads may still be reordered with older stores to different
> >>> locations)
> >>>
> >>> So, if 1) and 2) are true, why do we have "lfence" and "sfence"
> >>> instructions at all?
> >>
> >> Excerpt from Intel 64 and IA-32 Architectures Developer's Manual: Vol. 3A
> >> Section 8.2.5
> >>
> >>     [...] Despite the fact that Pentium 4, Intel Xeon, and P6 family
> >>     processors support processor ordering, Intel does not guarantee
> >>     that future processors will support this model. To make software
> >>     portable to future processors, it is recommended that operating systems
> >>     provide critical region and resource control constructs and API's
> >>     (application program interfaces) based on I/O, locking, and/or
> >>     serializing instructions be used to synchronize access to shared
> >>     areas of memory in multiple-processor systems. [...]
> >>
> >> So the answer seems "to make software portable to future processors".
> > 
> > > Hmm...so these instructions are currently effectively nops?
> 
> According to perfbook's Section 14.4.9 "x86" (as of current master),
> 
>     However, note that some SSE instructions are weakly ordered (clflush
>     and non-temporal move instructions [Int04a]). CPUs that have SSE can
>     use mfence for smp_mb(), lfence for smp_rmb(), and sfence for smp_wmb().
> 
> So as long as you don't use SSE extensions, I guess they are effectively
> nops. But I'm not sure.
> 
> Paul, could you enlighten us?

I do need to update that section.  My current understanding is that the
rules on SSE extensions have been tightened to allow reordering within
the sequence of memory operations making up the SSE sequence in question,
but that the CPU is not permitted to reorder the SSE accesses with
surrounding accesses, with the usual exception for reordering prior
stores with later loads.

But I do need to read a current manual and update that section.

In other words, I will be happy to enlighten you guys, but must
enlighten myself first.  ;-)

							Thanx, Paul

> Akira
> 
> > Yubin
> > 
> >>
> >>>
> >>> [2]: I found those claims here, but not so sure whether or not they
> >>> are true: https://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
> >>>
> >>
> > 
> 



* Re: synchronize with a non-atomic flag
  2017-10-06  5:52 synchronize with a non-atomic flag Yubin Ruan
  2017-10-06 12:03 ` Akira Yokosawa
@ 2017-10-08  9:12 ` Yubin Ruan
  2017-10-08 16:07   ` Paul E. McKenney
  1 sibling, 1 reply; 13+ messages in thread
From: Yubin Ruan @ 2017-10-08  9:12 UTC (permalink / raw)
  To: perfbook, Akira Yokosawa, Paul E. McKenney

2017-10-06 13:52 GMT+08:00 Yubin Ruan <ablacktshirt@gmail.com>:
> Hi,
> I saw lots of discussions on the web about possible race when doing
> synchronization between multiple threads/processes with lock or atomic
> operations[1][2]. From my point of view most them are over-worrying.
> But I want to point out some particular issue here to see whether
> anyone have anything to say.
>
> Imagine two processes communicate using only a uint32_t variable in
> shared memory, like this:
>
>     // uint32_t variable in shared memory
>     uint32_t flag = 0;
>
>     //process 1
>     while(1) {
>         if(READ_ONCE(flag) == 0) {
>             do_something();
>             WRITE_ONCE(flag, 1); // let another process to run
>         } else {
>             continue;
>         }
>     }
>
>     //process 2
>     while(1) {
>         if(READ_ONCE(flag) == 1) {
>             printf("process 2 running...\n");
>             WRITE_ONCE(flag, 0); // let another process to run
>         } else {
>             continue;
>         }
>     }
>
> On X86 or X64, I expect this code to run correctly, that is, I will
> got the two `printf' to printf one after one. That is because:
>
>     1) on X86/X64, load/store on 32-bits variable are atomic

Ah...this assumption is wrong in the first place. Atomic access to a
4-byte integer is guaranteed only when the integer is aligned on a
4-byte memory address boundary...
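
A small sketch of how the alignment condition can be checked, and of
how a packed layout can silently break it.  The struct names and
is_aligned() are invented for this example, and
__attribute__((packed)) is a GCC/Clang extension.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* On the usual x86 ABIs the compiler aligns uint32_t to 4 bytes. */
static_assert(alignof(uint32_t) == 4,
              "sketch assumes 4-byte alignment of uint32_t");

/* Default layout: 3 bytes of padding put the flag at offset 4. */
struct good_layout {
    uint8_t  pad;
    uint32_t flag;
};

/* Packed layout: the flag lands at offset 1, so it is misaligned. */
struct __attribute__((packed)) bad_layout {
    uint8_t  pad;
    uint32_t flag;
};

/* Returns 1 if p is a multiple of align bytes, else 0. */
int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p % align) == 0;
}
```

When the flag sits in a shared-memory segment, the same check applies
to the address obtained from the mapping plus the flag's offset.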

Yubin

>     2) I use READ_ONCE/WRITE_ONCE to prevent possibly harmful compiler
> optimization on `flag'.
>     3) I use only one variable to communicate between two processes,
> so there is no need for any kind of barrier.
>
> Does anyone have any objection at that?
>
> I know using a lock or atomic operation will save me a lot of
> argument, but I think those things are unnecessary at this
> circumstance, and it matter where performance matter, so I am picky
> here...
>
> Yubin
>
> [1]: https://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
> [2]: https://www.usenix.org/conference/osdi10/ad-hoc-synchronization-considered-harmful



* Re: synchronize with a non-atomic flag
  2017-10-08  9:12 ` Yubin Ruan
@ 2017-10-08 16:07   ` Paul E. McKenney
  2017-10-09  8:40     ` Yubin Ruan
  0 siblings, 1 reply; 13+ messages in thread
From: Paul E. McKenney @ 2017-10-08 16:07 UTC (permalink / raw)
  To: Yubin Ruan; +Cc: perfbook, Akira Yokosawa

On Sun, Oct 08, 2017 at 05:12:18PM +0800, Yubin Ruan wrote:
> 2017-10-06 13:52 GMT+08:00 Yubin Ruan <ablacktshirt@gmail.com>:
> > Hi,
> > I saw lots of discussions on the web about possible race when doing
> > synchronization between multiple threads/processes with lock or atomic
> > operations[1][2]. From my point of view most them are over-worrying.
> > But I want to point out some particular issue here to see whether
> > anyone have anything to say.
> >
> > Imagine two processes communicate using only a uint32_t variable in
> > shared memory, like this:
> >
> >     // uint32_t variable in shared memory
> >     uint32_t flag = 0;
> >
> >     //process 1
> >     while(1) {
> >         if(READ_ONCE(flag) == 0) {
> >             do_something();
> >             WRITE_ONCE(flag, 1); // let another process to run
> >         } else {
> >             continue;
> >         }
> >     }
> >
> >     //process 2
> >     while(1) {
> >         if(READ_ONCE(flag) == 1) {
> >             printf("process 2 running...\n");
> >             WRITE_ONCE(flag, 0); // let another process to run
> >         } else {
> >             continue;
> >         }
> >     }
> >
> > On X86 or X64, I expect this code to run correctly, that is, I will
> > got the two `printf' to printf one after one. That is because:
> >
> >     1) on X86/X64, load/store on 32-bits variable are atomic
> 
> Ah...this assumption is wrong in the first place. Atomic access to a
> 4-byte integer is guaranteed only when the integer is aligned on a
> 4-byte memory address boundary...

Indeed, accesses crossing cachelines normally won't guarantee you
much of anything other than painful debugging sessions.  ;-)
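
To make the cacheline point concrete, a minimal check for whether an
access straddles two lines could look like this.  The 64-byte line
size is an assumption (typical for current x86 parts), and
crosses_cache_line() is a hypothetical helper; size must be nonzero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u   /* assumption: typical x86 line size */

static_assert((CACHE_LINE_SIZE & (CACHE_LINE_SIZE - 1)) == 0,
              "line size must be a power of two");

/* Returns 1 if the size-byte access starting at addr touches two lines. */
int crosses_cache_line(uintptr_t addr, size_t size)
{
    uintptr_t first = addr / CACHE_LINE_SIZE;              /* line of first byte */
    uintptr_t last  = (addr + size - 1) / CACHE_LINE_SIZE; /* line of last byte  */
    return first != last;
}
```

A naturally aligned 4-byte value can never cross a 64-byte line,
because 4 divides 64; only a misaligned one can.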

						Thanx, Paul

> Yubin
> 
> >     2) I use READ_ONCE/WRITE_ONCE to prevent possibly harmful compiler
> > optimization on `flag'.
> >     3) I use only one variable to communicate between two processes,
> > so there is no need for any kind of barrier.
> >
> > Does anyone have any objection at that?
> >
> > I know using a lock or atomic operation will save me a lot of
> > argument, but I think those things are unnecessary at this
> > circumstance, and it matter where performance matter, so I am picky
> > here...
> >
> > Yubin
> >
> > [1]: https://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
> > [2]: https://www.usenix.org/conference/osdi10/ad-hoc-synchronization-considered-harmful
> 



* Re: synchronize with a non-atomic flag
  2017-10-08 16:07   ` Paul E. McKenney
@ 2017-10-09  8:40     ` Yubin Ruan
  2017-10-09  2:14       ` Paul E. McKenney
  0 siblings, 1 reply; 13+ messages in thread
From: Yubin Ruan @ 2017-10-09  8:40 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: perfbook, Akira Yokosawa

On Sun, Oct 08, 2017 at 09:07:38AM -0700, Paul E. McKenney wrote:
> On Sun, Oct 08, 2017 at 05:12:18PM +0800, Yubin Ruan wrote:
> > 2017-10-06 13:52 GMT+08:00 Yubin Ruan <ablacktshirt@gmail.com>:
> > > Hi,
> > > I saw lots of discussions on the web about possible race when doing
> > > synchronization between multiple threads/processes with lock or atomic
> > > operations[1][2]. From my point of view most them are over-worrying.
> > > But I want to point out some particular issue here to see whether
> > > anyone have anything to say.
> > >
> > > Imagine two processes communicate using only a uint32_t variable in
> > > shared memory, like this:
> > >
> > >     // uint32_t variable in shared memory
> > >     uint32_t flag = 0;
> > >
> > >     //process 1
> > >     while(1) {
> > >         if(READ_ONCE(flag) == 0) {
> > >             do_something();
> > >             WRITE_ONCE(flag, 1); // let another process to run
> > >         } else {
> > >             continue;
> > >         }
> > >     }
> > >
> > >     //process 2
> > >     while(1) {
> > >         if(READ_ONCE(flag) == 1) {
> > >             printf("process 2 running...\n");
> > >             WRITE_ONCE(flag, 0); // let another process to run
> > >         } else {
> > >             continue;
> > >         }
> > >     }
> > >
> > > On X86 or X64, I expect this code to run correctly, that is, I will
> > > got the two `printf' to printf one after one. That is because:
> > >
> > >     1) on X86/X64, load/store on 32-bits variable are atomic
> > 
> > Ah...this assumption is wrong in the first place. Atomic access to a
> > 4-byte integer is guaranteed only when the integer is aligned on a
> > 4-byte memory address boundary...
> 
> Indeed, accesses crossing cachelines normally won't guarantee you
> much of anything other than painful debugging sessions.  ;-) 

I see similar interfaces in the Linux kernel source[1]:

	#define atomic_set(v, i)	((v)->counter = (i))
	#define atomic_read(v)	((v)->counter)

which 'atomically' set and read an atomic variable, and by `atomic' they
simply mean:

    The setting is atomic in that the return values of the atomic operations by
    all threads are guaranteed to be correct reflecting either the value that
    has been set with this operation or set with another operation.

    The read is atomic in that the return value is guaranteed to be one of the
    values initialized or modified with the interface operations if a proper
    implicit or explicit memory barrier is used after possible runtime
    initialization by any other thread and the value is modified only with the
    interface operations.
(though the compare-and-swap operations still involve locking)

Are those operations atomic because `atomic_t' is defined as a struct

	typedef struct { int counter; } atomic_t;

and therefore proper alignment and atomicity are guaranteed by the
compiler and the CPU? If I do something like this:

    atomic_t v = ATOMIC_INIT(0); // globally visible

    atomic_set(&v, 1); //process 1

    atomic_set(&v, 2); //process 2

    int i = atomic_read(&v); // process 3

will process 3 see any intermediate value between 1 and 2?
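
One way to experiment with this question in userspace is with C11
atomics, sketched below.  This is an analogue, not the kernel's
atomic_t: in the C11 model an atomic load returns a value written by
some store to that object (or its initial value), never a mixture of
two.  The function names are made up, and the thread-style signatures
are only there so the writers can be handed to pthread_create().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

static_assert(ATOMIC_INT_LOCK_FREE == 2,
              "sketch assumes a lock-free atomic_int");

atomic_int v;   /* zero-initialized; stands in for the kernel's atomic_t */

/* "Process 1": one relaxed atomic store of 1. */
void *set_one(void *arg)
{
    (void)arg;
    atomic_store_explicit(&v, 1, memory_order_relaxed);
    return NULL;
}

/* "Process 2": one relaxed atomic store of 2. */
void *set_two(void *arg)
{
    (void)arg;
    atomic_store_explicit(&v, 2, memory_order_relaxed);
    return NULL;
}

/* "Process 3": load v rounds times, counting values outside {0, 1, 2}. */
long count_unexpected(long rounds)
{
    long bad = 0;

    for (long i = 0; i < rounds; i++) {
        int seen = atomic_load_explicit(&v, memory_order_relaxed);

        if (seen != 0 && seen != 1 && seen != 2)
            bad++;
    }
    return bad;
}
```

Running set_one() and set_two() in two threads while a third calls
count_unexpected() should report zero unexpected values; which of 0,
1 or 2 the reader sees at any moment is a separate question.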

Yubin

[1]: Documentation/core-api/atomic_ops.rst



* Re: synchronize with a non-atomic flag
  2017-10-09  8:40     ` Yubin Ruan
@ 2017-10-09  2:14       ` Paul E. McKenney
  0 siblings, 0 replies; 13+ messages in thread
From: Paul E. McKenney @ 2017-10-09  2:14 UTC (permalink / raw)
  To: Yubin Ruan; +Cc: perfbook, Akira Yokosawa

On Mon, Oct 09, 2017 at 04:40:11PM +0800, Yubin Ruan wrote:
> On Sun, Oct 08, 2017 at 09:07:38AM -0700, Paul E. McKenney wrote:
> > On Sun, Oct 08, 2017 at 05:12:18PM +0800, Yubin Ruan wrote:
> > > 2017-10-06 13:52 GMT+08:00 Yubin Ruan <ablacktshirt@gmail.com>:
> > > > Hi,
> > > > I saw lots of discussions on the web about possible race when doing
> > > > synchronization between multiple threads/processes with lock or atomic
> > > > operations[1][2]. From my point of view most them are over-worrying.
> > > > But I want to point out some particular issue here to see whether
> > > > anyone have anything to say.
> > > >
> > > > Imagine two processes communicate using only a uint32_t variable in
> > > > shared memory, like this:
> > > >
> > > >     // uint32_t variable in shared memory
> > > >     uint32_t flag = 0;
> > > >
> > > >     //process 1
> > > >     while(1) {
> > > >         if(READ_ONCE(flag) == 0) {
> > > >             do_something();
> > > >             WRITE_ONCE(flag, 1); // let another process to run
> > > >         } else {
> > > >             continue;
> > > >         }
> > > >     }
> > > >
> > > >     //process 2
> > > >     while(1) {
> > > >         if(READ_ONCE(flag) == 1) {
> > > >             printf("process 2 running...\n");
> > > >             WRITE_ONCE(flag, 0); // let another process to run
> > > >         } else {
> > > >             continue;
> > > >         }
> > > >     }
> > > >
> > > > On X86 or X64, I expect this code to run correctly, that is, I will
> > > > see the two `printf' calls print one after the other. That is because:
> > > >
> > > >     1) on X86/X64, loads/stores on 32-bit variables are atomic
> > > 
> > > Ah...this assumption is wrong in the first place. Atomic access to
> > > 4-byte integers is guaranteed only when the integer is aligned on a
> > > 4-byte memory address boundary...
> > 
> > Indeed, accesses crossing cachelines normally won't guarantee you
> > much of anything other than painful debugging sessions.  ;-) 
> 
> I see similar interfaces in the Linux kernel source[1]:
> 
> 	#define atomic_set(v, i)	((v)->counter = (i))
> 	#define atomic_read(v)	((v)->counter)
> 
> which set and read 'atomically' from an atomic variable, and by `atomic', they
> simply mean:
> 
>     The setting is atomic in that the return values of the atomic operations by
>     all threads are guaranteed to be correct reflecting either the value that
>     has been set with this operation or set with another operation.
> 
>     The read is atomic in that the return value is guaranteed to be one of the
>     values initialized or modified with the interface operations if a proper
>     implicit or explicit memory barrier is used after possible runtime
>     initialization by any other thread and the value is modified only with the
>     interface operations.
> (though the compare-and-swap operations still involve a lock)
> 
> Are those operations atomic because the `atomic_t' is defined as a struct
> 
> 	typedef struct { int counter; } atomic_t;
> 
> and therefore proper alignment and atomicity are guaranteed by the
> compiler and the CPU?

Yes, unless you take explicit action to force misalignment, usually
by allocating a block of memory and constructing an unaligned pointer
to the middle of it, but this is almost never a good thing to do.
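
As a rough illustration of the alignment point, here is a minimal
userspace sketch, not kernel code. It redefines a struct with the same
shape as atomic_t and checks addresses against _Alignof. The helper
names (is_aligned, natural_object_is_aligned, offset_pointer_is_aligned)
and the buffer are made up for this example, and it assumes a C11
compiler on a platform where int alignment is greater than 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same shape as the kernel's atomic_t; redefined here for illustration. */
typedef struct { int counter; } atomic_t;

/* A struct is at least as strictly aligned as its members. */
static_assert(_Alignof(atomic_t) >= _Alignof(int),
	      "atomic_t must be at least int-aligned");

/* Hypothetical helper: 1 if p is a multiple of align (a power of two). */
static int is_aligned(const void *p, size_t align)
{
	return ((uintptr_t)p & (align - 1)) == 0;
}

/* An ordinary object: the compiler places it at an aligned address. */
static atomic_t v;

/* Backing storage used only to compute a misaligned address. */
static _Alignas(atomic_t) unsigned char buf[2 * sizeof(atomic_t)];

static int natural_object_is_aligned(void)
{
	return is_aligned(&v, _Alignof(atomic_t));
}

static int offset_pointer_is_aligned(void)
{
	/*
	 * buf is aligned, so buf + 1 is misaligned whenever
	 * _Alignof(atomic_t) > 1.  The address is only inspected,
	 * never dereferenced as an atomic_t.
	 */
	return is_aligned(buf + 1, _Alignof(atomic_t));
}
```

The second helper shows the kind of "pointer to the middle of a block"
that gives up the guarantee: only the address is computed, since
actually accessing an atomic_t through it would be undefined behavior
in C.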

>                       If I do something like this:
> 
>     atomic_t v = ATOMIC_INIT(0); // globally visible
> 
>     atomic_set(&v, 1); //process 1
> 
>     atomic_set(&v, 2); //process 2
> 
>     int i = atomic_read(&v); // process 3
> 
> will process 3 see any intermediate value between 1 and 2?

Given this code, process 3 should see only the values 0, 1, and 2.
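
A userspace sketch of the same experiment, using threads instead of
processes and C11 atomic_int with relaxed ordering as a stand-in for
the kernel's atomic_set()/atomic_read() (an approximation, not the
kernel API). The function names and the iteration count are invented
for this example. A clean run does not prove atomicity; it only shows
what "no intermediate value" means in practice.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Shared variable; atomic_int stands in for the kernel's atomic_t. */
static atomic_int v = 0;

/* Writer thread: repeatedly stores its own value (1 or 2). */
static void *writer(void *arg)
{
	int val = *(int *)arg;

	assert(val == 1 || val == 2);
	for (int n = 0; n < 100000; n++)
		atomic_store_explicit(&v, val, memory_order_relaxed);
	return NULL;
}

/*
 * Reader: returns how many loads saw something other than 0, 1, or 2,
 * or -1 if a thread could not be created.
 */
static int count_unexpected_values(void)
{
	pthread_t t1, t2;
	int one = 1, two = 2, bad = 0;

	atomic_store_explicit(&v, 0, memory_order_relaxed);
	if (pthread_create(&t1, NULL, writer, &one))
		return -1;
	if (pthread_create(&t2, NULL, writer, &two)) {
		pthread_join(t1, NULL);
		return -1;
	}

	for (int n = 0; n < 100000; n++) {
		int i = atomic_load_explicit(&v, memory_order_relaxed);

		if (i < 0 || i > 2)
			bad++;
	}

	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return bad;
}
```

Calling count_unexpected_values() should return 0: each load returns
the initial value or a value that some store wrote in full, never a
mixture of two stores.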

							Thanx, Paul

> Yubin
> 
> [1]: Documentation/core-api/atomic_ops.rst
> 



end of thread

Thread overview: 13+ messages
2017-10-06  5:52 synchronize with a non-atomic flag Yubin Ruan
2017-10-06 12:03 ` Akira Yokosawa
2017-10-06 12:35   ` Yubin Ruan
2017-10-06 19:12     ` Paul E. McKenney
2017-10-07  7:04       ` Yubin Ruan
2017-10-07 11:40         ` Akira Yokosawa
2017-10-07 13:43           ` Yubin Ruan
2017-10-07 14:36             ` Akira Yokosawa
2017-10-07 20:20               ` Paul E. McKenney
2017-10-08  9:12 ` Yubin Ruan
2017-10-08 16:07   ` Paul E. McKenney
2017-10-09  8:40     ` Yubin Ruan
2017-10-09  2:14       ` Paul E. McKenney
