linux-kernel.vger.kernel.org archive mirror
* __rcu_process_callbacks() in Linux 2.6
@ 2007-11-26 18:28 Manfred Spraul
  0 siblings, 0 replies; 11+ messages in thread
From: Manfred Spraul @ 2007-11-26 18:28 UTC (permalink / raw)
  To: James Huang; +Cc: Paul E. McKenney, linux-kernel

Hi James,

If I understand the issue correctly, then the race is:

step 1: cpu 1: starts a new rcu batch (i.e. rcp->cur++, smp_mb())

step 2: cpu 2: passes its quiescent state
step 3: cpu 2: reads pointer 0x123 (a pointer to an rcu-protected struct)

step 4: cpu 3: call_rcu(0x123): rcu protected struct added to rdp->nxtlist
step 5: cpu 3: moves the new callbacks into rdp->curlist, sets rdp->batch = rcp->cur + 1.
xxxxxxxxxxxxxxx Problem: where is the smp_rmb() that guarantees that the
xxxxxxxxxxxxxxx   update to rcp->cur from step 1 is seen by cpu 3?
step 6: cpu 3: passes its quiescent state
step 7: cpu 3: struct 0x123 destroyed

step 8: cpu 2: accesses pointer 0x123, but the struct is already destroyed

James: Is that the race?
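
In code form, the window in question is roughly this (a sketch of the relevant
lines only, not verbatim source):

    CPU 1 (rcu_start_batch):        CPU 3 (__rcu_process_callbacks):
      rcp->next_pending = 0;
      smp_wmb();
      rcp->cur++;     /* step 1 */
                                      rdp->batch = rcp->cur + 1;  /* step 5 */
                                      smp_rmb();
                                      if (!rcp->next_pending) ...

Note that the smp_rmb() on cpu 3 comes after the read of rcp->cur, which is
exactly why the question arises.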

I agree with Paul: there are implied smp_rmb()s on cpu 3 between step 1 and step 5:
either the test_and_set_bit() in tasklet_action() that runs rcu_process_callbacks,
if step 4 happens before the tasklet, or somewhere in the irq handler
path, if step 4 happens in an irq handler that interrupted
rcu_process_callbacks.

Thus, in theory, no additional smp_rmb() should be necessary.
What is missing is proper documentation.
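
For reference, the barrier in the tasklet_action() path comes from the atomic
RMW in tasklet_trylock(); roughly (paraphrased from include/linux/interrupt.h
of that era, a sketch rather than verbatim code):

    /*
     * tasklet_action() takes this lock before running the tasklet.
     * test_and_set_bit() is an atomic op that returns a value, so per
     * Documentation/atomic_ops.txt it implies a full memory barrier
     * before and after the operation.
     */
    static inline int tasklet_trylock(struct tasklet_struct *t)
    {
            return !test_and_set_bit(TASKLET_STATE_RUN, &(t)->state);
    }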

I'm analyzing the code right now:
Is it really true that a cpu typically completes callbacks only in every other
rcu cycle, i.e. that most structures sit in the rcu callback lists until two
quiescent states have passed?

I've tried to track the values of rcp->cur and rdp->batch.
If next_pending is set, then cpu_quiet() immediately starts
the next rcu cycle, and a cpu cannot both complete the currently
pending rcu callbacks and add new callbacks to that next cycle;
thus a cpu takes part in only every other rcu cycle.
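
For reference, cpu_quiet() (paraphrased from kernel/rcupdate.c of that era,
not verbatim) does roughly:

    static void cpu_quiet(int cpu, struct rcu_ctrlblk *rcp)
    {
            cpu_clear(cpu, rcp->cpumask);
            if (cpus_empty(rcp->cpumask)) {
                    /* batch completed! */
                    rcp->completed = rcp->cur;
                    /* starts the next cycle at once if next_pending is set */
                    rcu_start_batch(rcp);
            }
    }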

The OpenOffice Calc file (and a pdf export) are at
http://www.colorfullife.com/~manfred/rcu.ods
http://www.colorfullife.com/~manfred/rcu.pdf

Is that analysis correct? Perhaps the whole code should be rewritten?

--
	Manfred


* Re: __rcu_process_callbacks() in Linux 2.6
@ 2007-11-21 19:57 James Huang
  2007-11-21 21:25 ` Paul E. McKenney
  0 siblings, 1 reply; 11+ messages in thread
From: James Huang @ 2007-11-21 19:57 UTC (permalink / raw)
  To: paulmck; +Cc: linux-kernel, manfred, dipankar, james.huang

Paul,

       I am not sure I understand your answer about using test_and_set_bit() in tasklet_schedule() as a 
memory barrier in this case.
 
       Yes, tasklet_schedule() includes a test_and_set_bit(TASKLET_STATE_SCHED, &t->state) on a tasklet, but 
in this case the tasklet is a per CPU tasklet.
 
       According to Documentation/atomic_ops.txt, an atomic op that returns a value has the semantics of 
"explicit memory barriers performed before and after the operation".  
If I understand it correctly, this means that, for example,
 
           atomic_t aa = ATOMIC_INIT(0);
           int X = 1;
           int Y = 2;
 
 CPU 0:
           X = 100;
           atomic_inc_return(&aa);
           Y = 200;
 
Then CPU 1 will always see X = 100 before it sees the new value of aa (1), and CPU 1 will always 
see the new value of aa (1) before it sees Y = 200.
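
(For that ordering to hold, CPU 1's own loads must be ordered as well; a
minimal reader-side sketch, with illustrative variable names:)

 CPU 1:
           a = atomic_read(&aa);
           smp_rmb();       /* order the load of aa before the load of X */
           x = X;           /* if a == 1 here, then x must be 100 */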
 
This ordering semantics does not apply to the scenario in our discussion.
For one thing, the rcu tasklet is a per CPU tasklet, so obviously no other CPU will ever read its t->state.
 
Am I still missing something?
By the way, which tasklet_schedule() are you referring to in your previous email? 
Is it the one called by CPU 0, CPU 1, or CPU 2?
 

Thanks,
James Huang



----- Original Message ----
From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
To: James Huang <jamesclhuang@yahoo.com>
Cc: linux-kernel@vger.kernel.org; manfred@colorfullife.com; dipankar@in.ibm.com
Sent: Wednesday, November 21, 2007 8:54:15 AM
Subject: Re: __rcu_process_callbacks() in Linux 2.6

On Tue, Nov 20, 2007 at 07:43:09PM -0800, James Huang wrote:
> Please disregard the previous email.
> 
> 
> In the latest Linux 2.6 RCU implementation, __rcu_process_callbacks() is coded as follows: 
> 
> 
> 422 static void __rcu_process_callbacks(struct rcu_ctrlblk *rcp,
> 423                                        struct rcu_data *rdp)
> 424 {
> 425        if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch)) {
> 426                *rdp->donetail = rdp->curlist;
> 427                rdp->donetail = rdp->curtail;
> 428                rdp->curlist = NULL;
> 429                rdp->curtail = &rdp->curlist;
> 430        }
> 431 
> 432        if (rdp->nxtlist && !rdp->curlist) {
> 433                local_irq_disable();
> 434                rdp->curlist = rdp->nxtlist;
> 435                rdp->curtail = rdp->nxttail;
> 436                rdp->nxtlist = NULL;
> 437                rdp->nxttail = &rdp->nxtlist;
> 438                local_irq_enable();
> 439 
> 440                /*
> 441                  * start the next batch of callbacks
> 442                  */
> 443 
> 444                /* determine batch number */
> 445                rdp->batch = rcp->cur + 1;
> 446                /* see the comment and corresponding wmb() in
> 447                  * the rcu_start_batch()
> 448                  */
> 449                smp_rmb();
> 450 
> 451                if (!rcp->next_pending) {
> 452                        /* and start it/schedule start if it's a new batch */
> 453                        spin_lock(&rcp->lock);
> 454                        rcp->next_pending = 1;
> 455                        rcu_start_batch(rcp);
> 456                        spin_unlock(&rcp->lock);
> 457                }
> 458        }
> 459 
> 460        rcu_check_quiescent_state(rcp, rdp);
> 461        if (rdp->donelist)
> 462                rcu_do_batch(rdp);
> 463 }
> 
> 
> The question is: how is the read of rcp->cur at line 445 guaranteed to observe the most recent value of rcp->cur?
> The issue is that there is no memory barrier or locking before line 445.
> So I think the following sequence of events in chronological order is possible:
> 
> Assume that initially rcp->cur = 100, that this current batch value is visible to every CPU, and that batch 100 has completed
> (i.e. rcp->completed = 100, rcp->next_pending = 0, rcp->cpumask = 0, and for each CPU, rdp->quiescbatch = 100, rdp->qs_pending = 0, rdp->passed_quiesc = 1).


> CPU 0: 
> --------- 
>              call_rcu(): a callback inserted into rdp->nxtlist;                    
> 
>              timer interrupt 
>                call rcu_pending(), return true  ( ! rdp->curlist && rdp->nxtlist)
>                call rcu_check_callbacks() 
>                      schedule per CPU rcu_tasklet
> 
>                __rcu_process_callbacks()
>                            move callbacks from nxtlist to curlist;
>                            rdp->batch = 101
>                            lock rcp->lock
>                            rcp->next_pending = 1
>                            call rcu_start_batch()
>                                  find the current batch has completed and next batch pending;
>                                  rcp->next_pending = 0
>                                  update rcp->cur to 101 and initialize rcp->cpumask;                  <----- time t1
>                                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>                            unlock rcp->lock
>                              
> CPU 1:
> ---------  
>              timer interrupt 
>                  call rcu_pending(), return true (assume observing rcp->cur = 101 != rdp->quiescbatch)
>                  
>                  call rcu_check_callbacks() 
>                        schedule per CPU rcu_tasklet
> 
>                  __rcu_process_callbacks()
>                        call rcu_check_quiescent_state()
>                              find rdp->quiescbatch != rcp->cur
>                              set rdp->qs_pending = 1
>                              set rdp->passed_quiesc = 0
>                              set rdp->quiescbatch = 101 (rcp->cur)
> 
>              Another timer interrupt
>                  call rcu_pending(), return true (rdp->qs_pending == 1)  
>                  call rcu_check_callbacks() 
>                        (assume in user mode)                        <-- time t2  pass quiescent state
>                        ~~~~~~~~~~~~~~~~~~                            ~~~~~~~~~~~~~~~~~~~~~~
>                        rdp->passed_quiesc = 1
>                        schedule per CPU rcu_tasklet
>            
>                  __rcu_process_callbacks()
>                        call rcu_check_quiescent_state()
>                              find rdp->qs_pending == 1 && rdp->passed_quiesc == 1
>                              set rdp->qs_pending = 0
>                              lock rcp->lock
>                              call cpu_quiet()
>                                    clear bit in the rcp->cpumask set up by CPU 0 at time t1
>                                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>                              unlock rcp->lock
> 
> CPU 2:
> ---------  
>              call_rcu(): a callback inserted into rdp->nxtlist;    <--- time t3
>              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>              
>              timer interrupt 
>                  call rcu_pending(), return true ( ! rdp->curlist && rdp->nxtlist)
>                  call rcu_check_callbacks() 
>                      schedule per CPU rcu_tasklet

The tasklet_schedule() code calls test_and_set_bit(), which is an
atomic operation that returns a value.  It therefore implies a memory
barrier, so that if the prior call_rcu() at time t3 "happens after"
CPU 1's quiescent state at time t2 (which, due to locking, must follow
the update to rcp->cur at time t1), then the access to rcp->cur at time
t4 must see the new value of rcp->cur.
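
Spelled out as a timeline (a sketch of the argument, not code from the tree):

    /* t1: CPU 0: rcp->cur = 101            (under rcp->lock)
     * t2: CPU 1: quiescent state           (follows t1, via the lock)
     * t3: CPU 2: call_rcu()                ("happens after" t2)
     *     CPU 2: tasklet_schedule()
     *              -> test_and_set_bit()   (atomic RMW, implies full barrier)
     * t4: CPU 2: rdp->batch = rcp->cur + 1 (must therefore see cur == 101)
     */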

This is admittedly a bit subtle and perhaps a bit fragile.  Manfred,
Dipankar, would it be worth putting an explicit memory barrier somewhere
in this sequence, or is there another memory barrier that I am forgetting?

                            Thanx, Paul

>                  calls __rcu_process_callbacks()
>                      move callbacks from nxtlist to curlist; (including the callback inserted at time t3)
>                      rdp->batch = rcp->cur + 1;                        <--- time t4
>                      ~~~~~~~~~~~~~~~~~~~~
> 
> At time t4, CPU 2 might observe rcp->cur as 100 and set rdp->batch to 101.
> So CPU 2 essentially started its batch 101 (which includes the callback inserted at time t3) after CPU 1 passed its quiescent state (at time t2) for batch 101.
> 
> Isn't this an issue?  Am I missing something?
> Thanks for pointing me in the right direction.
> 
> 
> -- James Huang
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

* Re: __rcu_process_callbacks() in Linux 2.6
@ 2007-11-21  3:43 James Huang
  2007-11-21 16:54 ` Paul E. McKenney
  0 siblings, 1 reply; 11+ messages in thread
From: James Huang @ 2007-11-21  3:43 UTC (permalink / raw)
  To: linux-kernel; +Cc: jamesclhuang

Please disregard the previous email.


In the latest Linux 2.6 RCU implementation, __rcu_process_callbacks() is coded as follows: 


422 static void __rcu_process_callbacks(struct rcu_ctrlblk *rcp,
423                                        struct rcu_data *rdp)
424 {
425        if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch)) {
426                *rdp->donetail = rdp->curlist;
427                rdp->donetail = rdp->curtail;
428                rdp->curlist = NULL;
429                rdp->curtail = &rdp->curlist;
430        }
431 
432        if (rdp->nxtlist && !rdp->curlist) {
433                local_irq_disable();
434                rdp->curlist = rdp->nxtlist;
435                rdp->curtail = rdp->nxttail;
436                rdp->nxtlist = NULL;
437                rdp->nxttail = &rdp->nxtlist;
438                local_irq_enable();
439 
440                /*
441                  * start the next batch of callbacks
442                  */
443 
444                /* determine batch number */
445                rdp->batch = rcp->cur + 1;
446                /* see the comment and corresponding wmb() in
447                  * the rcu_start_batch()
448                  */
449                smp_rmb();
450 
451                if (!rcp->next_pending) {
452                        /* and start it/schedule start if it's a new batch */
453                        spin_lock(&rcp->lock);
454                        rcp->next_pending = 1;
455                        rcu_start_batch(rcp);
456                        spin_unlock(&rcp->lock);
457                }
458        }
459 
460        rcu_check_quiescent_state(rcp, rdp);
461        if (rdp->donelist)
462                rcu_do_batch(rdp);
463 }
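
For reference, the rcu_start_batch() that the comment at lines 446-448 refers
to looks roughly like this (paraphrased, not verbatim):

    static void rcu_start_batch(struct rcu_ctrlblk *rcp)
    {
            if (rcp->next_pending &&
                            rcp->completed == rcp->cur) {
                    rcp->next_pending = 0;
                    /*
                     * next_pending == 0 must be visible in
                     * __rcu_process_callbacks() before it can see the new
                     * value of rcp->cur.
                     */
                    smp_wmb();
                    rcp->cur++;
                    /* mark all online cpus as needing a quiescent state */
                    cpus_andnot(rcp->cpumask, cpu_online_map, nohz_cpu_mask);
            }
    }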


The question is: how is the read of rcp->cur at line 445 guaranteed to observe the most recent value of rcp->cur?
The issue is that there is no memory barrier or locking before line 445.
So I think the following sequence of events in chronological order is possible:

Assume that initially rcp->cur = 100, that this current batch value is visible to every CPU, and that batch 100 has completed
(i.e. rcp->completed = 100, rcp->next_pending = 0, rcp->cpumask = 0, and for each CPU, rdp->quiescbatch = 100, rdp->qs_pending = 0, rdp->passed_quiesc = 1).
 
 
CPU 0: 
--------- 
             call_rcu(): a callback inserted into rdp->nxtlist;                     
 
             timer interrupt 
                call rcu_pending(), return true  ( ! rdp->curlist && rdp->nxtlist)
                call rcu_check_callbacks() 
                     schedule per CPU rcu_tasklet
 
                __rcu_process_callbacks()
                           move callbacks from nxtlist to curlist;
                           rdp->batch = 101
                           lock rcp->lock
                           rcp->next_pending = 1
                           call rcu_start_batch()
                                 find the current batch has completed and next batch pending;
                                 rcp->next_pending = 0
                                 update rcp->cur to 101 and initialize rcp->cpumask;                  <----- time t1
                                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                           unlock rcp->lock
                              
CPU 1:
---------  
             timer interrupt 
                  call rcu_pending(), return true (assume observing rcp->cur = 101 != rdp->quiescbatch)
                  
                  call rcu_check_callbacks() 
                       schedule per CPU rcu_tasklet
 
                 __rcu_process_callbacks()
                       call rcu_check_quiescent_state()
                             find rdp->quiescbatch != rcp->cur
                             set rdp->qs_pending = 1
                             set rdp->passed_quiesc = 0
                             set rdp->quiescbatch = 101 (rcp->cur)
 
             Another timer interrupt
                 call rcu_pending(), return true (rdp->qs_pending == 1)  
                 call rcu_check_callbacks() 
                       (assume in user mode)                         <-- time t2  pass quiescent state
                       ~~~~~~~~~~~~~~~~~~                             ~~~~~~~~~~~~~~~~~~~~~~
                       rdp->passed_quiesc = 1
                       schedule per CPU rcu_tasklet
            
                 __rcu_process_callbacks()
                       call rcu_check_quiescent_state()
                             find rdp->qs_pending == 1 && rdp->passed_quiesc == 1
                             set rdp->qs_pending = 0
                             lock rcp->lock
                             call cpu_quiet()
                                    clear bit in the rcp->cpumask set up by CPU 0 at time t1
                                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                             unlock rcp->lock

CPU 2:
---------  
             call_rcu(): a callback inserted into rdp->nxtlist;     <--- time t3
             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
             
             timer interrupt 
                 call rcu_pending(), return true ( ! rdp->curlist && rdp->nxtlist)
                 call rcu_check_callbacks() 
                      schedule per CPU rcu_tasklet
                                
                 calls __rcu_process_callbacks()
                     move callbacks from nxtlist to curlist; (including the callback inserted at time t3)
                     rdp->batch = rcp->cur + 1;                         <--- time t4
                     ~~~~~~~~~~~~~~~~~~~~

At time t4, CPU 2 might observe rcp->cur as 100 and set rdp->batch to 101.
So CPU 2 essentially started its batch 101 (which includes the callback inserted at time t3) after CPU 1 passed its quiescent state (at time t2) for batch 101.

Isn't this an issue?  Am I missing something?
Thanks for pointing me in the right direction.


-- James Huang

