linux-kernel.vger.kernel.org archive mirror
* __rcu_process_callbacks() in Linux 2.6
@ 2007-11-26 18:28 Manfred Spraul
  0 siblings, 0 replies; 11+ messages in thread
From: Manfred Spraul @ 2007-11-26 18:28 UTC (permalink / raw)
  To: James Huang; +Cc: Paul E. McKenney, linux-kernel

Hi James,

If I understand the issue correctly, then the race is:

step 1: cpu 1: starts a new rcu batch (i.e. rcp->cur++, smp_mb())

step 2: cpu 2: passes its quiescent state
step 3: cpu 2: reads pointer 0x123 (a pointer to an rcu-protected struct)

step 4: cpu 3: call_rcu(0x123): rcu protected struct added to rdp->nxtlist
step 5: cpu 3: moves the new callbacks into rdp->curlist, sets rdp->batch = rcp->cur + 1.
xxxxxxxxxxxxxxx Problem: where is the smp_rmb() that guarantees that the
xxxxxxxxxxxxxxx   update to rcp->cur from step 1 is seen by cpu 3?
step 6: cpu 3: passes its quiescent state
step 7: cpu 3: struct 0x123 destroyed

step 8: cpu 2: accesses pointer 0x123, but the struct is already destroyed

James: Is that the race?
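
In code form, the window in question is roughly this (a sketch of the relevant
lines only, not verbatim source):

    CPU 1 (rcu_start_batch):        CPU 3 (__rcu_process_callbacks):
      rcp->next_pending = 0;
      smp_wmb();
      rcp->cur++;     /* step 1 */
                                      rdp->batch = rcp->cur + 1;  /* step 5 */
                                      smp_rmb();
                                      if (!rcp->next_pending) ...

Note that the smp_rmb() on cpu 3 comes after the read of rcp->cur, which is
exactly why the question arises.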

I agree with Paul: there are implied smp_rmb()s on cpu 3 between step 1 and step 5:
either the test_and_set_bit() in tasklet_action() that runs rcu_process_callbacks,
if step 4 happens before the tasklet, or somewhere in the irq handler
path, if step 4 happens in an irq handler that interrupted
rcu_process_callbacks.

Thus, in theory, no additional smp_rmb() should be necessary.
What is missing is proper documentation.
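
For reference, the barrier in the tasklet_action() path comes from the atomic
RMW in tasklet_trylock(); roughly (paraphrased from include/linux/interrupt.h
of that era, a sketch rather than verbatim code):

    /*
     * tasklet_action() takes this lock before running the tasklet.
     * test_and_set_bit() is an atomic op that returns a value, so per
     * Documentation/atomic_ops.txt it implies a full memory barrier
     * before and after the operation.
     */
    static inline int tasklet_trylock(struct tasklet_struct *t)
    {
            return !test_and_set_bit(TASKLET_STATE_RUN, &(t)->state);
    }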

I'm analyzing the code right now:
Is it really true that a cpu typically completes callbacks only in every other
rcu cycle, i.e. that most structures sit in the rcu callback lists until two
quiescent states have passed?

I've tried to track the values of rcp->cur and rdp->batch.
If next_pending is set, then cpu_quiet() immediately starts
the next rcu cycle, and a cpu cannot both complete the currently
pending rcu callbacks and add new callbacks to that next cycle;
thus a cpu takes part in only every other rcu cycle.
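
For reference, cpu_quiet() (paraphrased from kernel/rcupdate.c of that era,
not verbatim) does roughly:

    static void cpu_quiet(int cpu, struct rcu_ctrlblk *rcp)
    {
            cpu_clear(cpu, rcp->cpumask);
            if (cpus_empty(rcp->cpumask)) {
                    /* batch completed! */
                    rcp->completed = rcp->cur;
                    /* starts the next cycle at once if next_pending is set */
                    rcu_start_batch(rcp);
            }
    }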

The OpenOffice Calc file (and a pdf export) are at
http://www.colorfullife.com/~manfred/rcu.ods
http://www.colorfullife.com/~manfred/rcu.pdf

Is that analysis correct? Perhaps the whole code should be rewritten?

--
	Manfred


* Re: __rcu_process_callbacks() in Linux 2.6
@ 2007-11-21 19:57 James Huang
  2007-11-21 21:25 ` Paul E. McKenney
  0 siblings, 1 reply; 11+ messages in thread
From: James Huang @ 2007-11-21 19:57 UTC (permalink / raw)
  To: paulmck; +Cc: linux-kernel, manfred, dipankar, james.huang

Paul,

       I am not sure I understand your answer about using test_and_set_bit() in tasklet_schedule() as a 
memory barrier in this case.
 
       Yes, tasklet_schedule() includes a test_and_set_bit(TASKLET_STATE_SCHED, &t->state) on a tasklet, but 
in this case the tasklet is a per CPU tasklet.
 
       According to Documentation/atomic_ops.txt, an atomic op that returns a value has the semantics of 
"explicit memory barriers performed before and after the operation".  
If I understand it correctly, this means that, for example,
 
           atomic_t aa = ATOMIC_INIT(0);
           int X = 1;
           int Y = 2;
 
 CPU 0:
           X = 100;
           atomic_inc_return(&aa);
           Y = 200;
 
Then CPU 1 will always see X = 100 before it sees the new value of aa (1), and CPU 1 will always 
see the new value of aa (1) before it sees Y = 200.
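
(For that ordering to hold, CPU 1's own loads must be ordered as well; a
minimal reader-side sketch, with illustrative variable names:)

 CPU 1:
           a = atomic_read(&aa);
           smp_rmb();       /* order the load of aa before the load of X */
           x = X;           /* if a == 1 here, then x must be 100 */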
 
This ordering semantics does not apply to the scenario in our discussion.
For one thing, the rcu tasklet is a per CPU tasklet, so obviously no other CPU will ever read its t->state.
 
Am I still missing something?
By the way, which tasklet_schedule() are you referring to in your previous email? 
Is it the one called by CPU 0, CPU 1, or CPU 2?
 

Thanks,
James Huang



----- Original Message ----
From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
To: James Huang <jamesclhuang@yahoo.com>
Cc: linux-kernel@vger.kernel.org; manfred@colorfullife.com; dipankar@in.ibm.com
Sent: Wednesday, November 21, 2007 8:54:15 AM
Subject: Re: __rcu_process_callbacks() in Linux 2.6

On Tue, Nov 20, 2007 at 07:43:09PM -0800, James Huang wrote:
> Please disregard the previous email.
> 
> 
> In the latest Linux 2.6 RCU implementation, __rcu_process_callbacks() is coded as follows: 
> 
> 
> 422 static void __rcu_process_callbacks(struct rcu_ctrlblk *rcp,
> 423                                        struct rcu_data *rdp)
> 424 {
> 425        if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch)) {
> 426                *rdp->donetail = rdp->curlist;
> 427                rdp->donetail = rdp->curtail;
> 428                rdp->curlist = NULL;
> 429                rdp->curtail = &rdp->curlist;
> 430        }
> 431 
> 432        if (rdp->nxtlist && !rdp->curlist) {
> 433                local_irq_disable();
> 434                rdp->curlist = rdp->nxtlist;
> 435                rdp->curtail = rdp->nxttail;
> 436                rdp->nxtlist = NULL;
> 437                rdp->nxttail = &rdp->nxtlist;
> 438                local_irq_enable();
> 439 
> 440                /*
> 441                  * start the next batch of callbacks
> 442                  */
> 443 
> 444                /* determine batch number */
> 445                rdp->batch = rcp->cur + 1;
> 446                /* see the comment and corresponding wmb() in
> 447                  * the rcu_start_batch()
> 448                  */
> 449                smp_rmb();
> 450 
> 451                if (!rcp->next_pending) {
> 452                        /* and start it/schedule start if it's a new batch */
> 453                        spin_lock(&rcp->lock);
> 454                        rcp->next_pending = 1;
> 455                        rcu_start_batch(rcp);
> 456                        spin_unlock(&rcp->lock);
> 457                }
> 458        }
> 459 
> 460        rcu_check_quiescent_state(rcp, rdp);
> 461        if (rdp->donelist)
> 462                rcu_do_batch(rdp);
> 463 }
> 
> 
> The question is: how is the read of rcp->cur at line 445 guaranteed to observe the most recent value of rcp->cur?
> The issue is that there is no memory barrier or locking before line 445.
> So I think the following sequence of events in chronological order is possible:
> 
> Assume that initially rcp->cur = 100, that this current batch value is visible to every CPU, and that batch 100 has completed
> (i.e. rcp->completed = 100, rcp->next_pending = 0, rcp->cpumask = 0, and for each CPU, rdp->quiescbatch = 100, rdp->qs_pending = 0, rdp->passed_quiesc = 1).


> CPU 0: 
> --------- 
>              call_rcu(): a callback inserted into rdp->nxtlist;                    
> 
>              timer interrupt 
>                call rcu_pending(), return true  ( ! rdp->curlist && rdp->nxtlist)
>                call rcu_check_callbacks() 
>                      schedule per CPU rcu_tasklet
> 
>                __rcu_process_callbacks()
>                            move callbacks from nxtlist to curlist;
>                            rdp->batch = 101
>                            lock rcp->lock
>                            rcp->next_pending = 1
>                            call rcu_start_batch()
>                                  find the current batch has completed and next batch pending;
>                                  rcp->next_pending = 0
>                                  update rcp->cur to 101 and initialize rcp->cpumask;                  <----- time t1
>                                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>                            unlock rcp->lock
>                              
> CPU 1:
> ---------  
>              timer interrupt 
>                  call rcu_pending(), return true (assume observing rcp->cur = 101 != rdp->quiescbatch)
>                  
>                  call rcu_check_callbacks() 
>                        schedule per CPU rcu_tasklet
> 
>                  __rcu_process_callbacks()
>                        call rcu_check_quiescent_state()
>                              find rdp->quiescbatch != rcp->cur
>                              set rdp->qs_pending = 1
>                              set rdp->passed_quiesc = 0
>                              set rdp->quiescbatch = 101 (rcp->cur)
> 
>              Another timer interrupt
>                  call rcu_pending(), return true (rdp->qs_pending == 1)  
>                  call rcu_check_callbacks() 
>                        (assume in user mode)                        <-- time t2  pass quiescent state
>                        ~~~~~~~~~~~~~~~~~~                            ~~~~~~~~~~~~~~~~~~~~~~
>                        rdp->passed_quiesc = 1
>                        schedule per CPU rcu_tasklet
>            
>                  __rcu_process_callbacks()
>                        call rcu_check_quiescent_state()
>                              find rdp->qs_pending == 1 && rdp->passed_quiesc == 1
>                              set rdp->qs_pending = 0
>                              lock rcp->lock
>                              call cpu_quiet()
>                                    clear bit in the rcp->cpumask set up by CPU 0 at time t1
>                                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>                              unlock rcp->lock
> 
> CPU 2:
> ---------  
>              call_rcu(): a callback inserted into rdp->nxtlist;    <--- time t3
>              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>              
>              timer interrupt 
>                  call rcu_pending(), return true ( ! rdp->curlist && rdp->nxtlist)
>                  call rcu_check_callbacks() 
>                      schedule per CPU rcu_tasklet

The tasklet_schedule() code calls test_and_set_bit(), which is an
atomic operation that returns a value.  It therefore implies a memory
barrier, so that if the prior call_rcu() at time t3 "happens after"
CPU 1's quiescent state at time t2 (which, due to locking, must follow
the update to rcp->cur at time t1), then the access to rcp->cur at time
t4 must see the new value of rcp->cur.
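
Spelled out as a timeline (a sketch of the argument, not code from the tree):

    /* t1: CPU 0: rcp->cur = 101            (under rcp->lock)
     * t2: CPU 1: quiescent state           (follows t1, via the lock)
     * t3: CPU 2: call_rcu()                ("happens after" t2)
     *     CPU 2: tasklet_schedule()
     *              -> test_and_set_bit()   (atomic RMW, implies full barrier)
     * t4: CPU 2: rdp->batch = rcp->cur + 1 (must therefore see cur == 101)
     */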

This is admittedly a bit subtle and perhaps a bit fragile.  Manfred,
Dipankar, would it be worth putting an explicit memory barrier somewhere
in this sequence, or is there another memory barrier that I am forgetting?

                            Thanx, Paul

>                  calls __rcu_process_callbacks()
>                      move callbacks from nxtlist to curlist; (including the callback inserted at time t3)
>                      rdp->batch = rcp->cur + 1;                        <--- time t4
>                      ~~~~~~~~~~~~~~~~~~~~
> 
> At time t4, CPU 2 might observe rcp->cur as 100 and set rdp->batch to 101.
> So CPU 2 essentially started its batch 101 (which includes the callback inserted at time t3) after CPU 1 passed its quiescent state (at time t2) for batch 101.
> 
> Isn't this an issue?  Am I missing something?
> Thanks for pointing me in the right direction.
> 
> 
> -- James Huang
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

* Re: __rcu_process_callbacks() in Linux 2.6
@ 2007-11-21  3:43 James Huang
  2007-11-21 16:54 ` Paul E. McKenney
  0 siblings, 1 reply; 11+ messages in thread
From: James Huang @ 2007-11-21  3:43 UTC (permalink / raw)
  To: linux-kernel; +Cc: jamesclhuang

Please disregard the previous email.


In the latest Linux 2.6 RCU implementation, __rcu_process_callbacks() is coded as follows: 


422 static void __rcu_process_callbacks(struct rcu_ctrlblk *rcp,
423                                        struct rcu_data *rdp)
424 {
425        if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch)) {
426                *rdp->donetail = rdp->curlist;
427                rdp->donetail = rdp->curtail;
428                rdp->curlist = NULL;
429                rdp->curtail = &rdp->curlist;
430        }
431 
432        if (rdp->nxtlist && !rdp->curlist) {
433                local_irq_disable();
434                rdp->curlist = rdp->nxtlist;
435                rdp->curtail = rdp->nxttail;
436                rdp->nxtlist = NULL;
437                rdp->nxttail = &rdp->nxtlist;
438                local_irq_enable();
439 
440                /*
441                  * start the next batch of callbacks
442                  */
443 
444                /* determine batch number */
445                rdp->batch = rcp->cur + 1;
446                /* see the comment and corresponding wmb() in
447                  * the rcu_start_batch()
448                  */
449                smp_rmb();
450 
451                if (!rcp->next_pending) {
452                        /* and start it/schedule start if it's a new batch */
453                        spin_lock(&rcp->lock);
454                        rcp->next_pending = 1;
455                        rcu_start_batch(rcp);
456                        spin_unlock(&rcp->lock);
457                }
458        }
459 
460        rcu_check_quiescent_state(rcp, rdp);
461        if (rdp->donelist)
462                rcu_do_batch(rdp);
463 }
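
For reference, the rcu_start_batch() that the comment at lines 446-448 refers
to looks roughly like this (paraphrased, not verbatim):

    static void rcu_start_batch(struct rcu_ctrlblk *rcp)
    {
            if (rcp->next_pending &&
                            rcp->completed == rcp->cur) {
                    rcp->next_pending = 0;
                    /*
                     * next_pending == 0 must be visible in
                     * __rcu_process_callbacks() before it can see the new
                     * value of rcp->cur.
                     */
                    smp_wmb();
                    rcp->cur++;
                    /* mark all online cpus as needing a quiescent state */
                    cpus_andnot(rcp->cpumask, cpu_online_map, nohz_cpu_mask);
            }
    }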


The question is: how is the read of rcp->cur at line 445 guaranteed to observe the most recent value of rcp->cur?
The issue is that there is no memory barrier or locking before line 445.
So I think the following sequence of events in chronological order is possible:

Assume that initially rcp->cur = 100, that this current batch value is visible to every CPU, and that batch 100 has completed
(i.e. rcp->completed = 100, rcp->next_pending = 0, rcp->cpumask = 0, and for each CPU, rdp->quiescbatch = 100, rdp->qs_pending = 0, rdp->passed_quiesc = 1).
 
 
CPU 0: 
--------- 
             call_rcu(): a callback inserted into rdp->nxtlist;                     
 
             timer interrupt 
                call rcu_pending(), return true  ( ! rdp->curlist && rdp->nxtlist)
                call rcu_check_callbacks() 
                     schedule per CPU rcu_tasklet
 
                __rcu_process_callbacks()
                           move callbacks from nxtlist to curlist;
                           rdp->batch = 101
                           lock rcp->lock
                           rcp->next_pending = 1
                           call rcu_start_batch()
                                 find the current batch has completed and next batch pending;
                                 rcp->next_pending = 0
                                 update rcp->cur to 101 and initialize rcp->cpumask;                  <----- time t1
                                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                           unlock rcp->lock
                              
CPU 1:
---------  
             timer interrupt 
                  call rcu_pending(), return true (assume observing rcp->cur = 101 != rdp->quiescbatch)
                  
                  call rcu_check_callbacks() 
                       schedule per CPU rcu_tasklet
 
                 __rcu_process_callbacks()
                       call rcu_check_quiescent_state()
                             find rdp->quiescbatch != rcp->cur
                             set rdp->qs_pending = 1
                             set rdp->passed_quiesc = 0
                             set rdp->quiescbatch = 101 (rcp->cur)
 
             Another timer interrupt
                 call rcu_pending(), return true (rdp->qs_pending == 1)  
                 call rcu_check_callbacks() 
                       (assume in user mode)                         <-- time t2  pass quiescent state
                       ~~~~~~~~~~~~~~~~~~                             ~~~~~~~~~~~~~~~~~~~~~~
                       rdp->passed_quiesc = 1
                       schedule per CPU rcu_tasklet
            
                 __rcu_process_callbacks()
                       call rcu_check_quiescent_state()
                             find rdp->qs_pending == 1 && rdp->passed_quiesc == 1
                             set rdp->qs_pending = 0
                             lock rcp->lock
                             call cpu_quiet()
                                    clear bit in the rcp->cpumask set up by CPU 0 at time t1
                                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                             unlock rcp->lock

CPU 2:
---------  
             call_rcu(): a callback inserted into rdp->nxtlist;     <--- time t3
             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
             
             timer interrupt 
                 call rcu_pending(), return true ( ! rdp->curlist && rdp->nxtlist)
                 call rcu_check_callbacks() 
                      schedule per CPU rcu_tasklet
                                
                 calls __rcu_process_callbacks()
                     move callbacks from nxtlist to curlist; (including the callback inserted at time t3)
                     rdp->batch = rcp->cur + 1;                         <--- time t4
                     ~~~~~~~~~~~~~~~~~~~~

At time t4, CPU 2 might observe rcp->cur as 100 and set rdp->batch to 101.
So CPU 2 essentially started its batch 101 (which includes the callback inserted at time t3) after CPU 1 passed its quiescent state (at time t2) for batch 101.

Isn't this an issue?  Am I missing something?
Thanks for pointing me in the right direction.


-- James Huang

