* [PATCH 2.5.52] Use __set_current_state() instead of current->state = (take 1)
@ 2002-12-18 23:56 Inaky Perez-Gonzalez
2002-12-19 0:11 ` Robert Love
0 siblings, 1 reply; 11+ messages in thread
From: Inaky Perez-Gonzalez @ 2002-12-18 23:56 UTC (permalink / raw)
To: torvalds, linux-kernel
Hi all
In fs/*.c, many functions manually set the task state directly
accessing current->state, or with a macro, kind of
inconsistently. This patch changes all of them to use
[__]set_current_state().
Changelog:
- Ported forward to 2.5.52
diff -u fs/dquot.c:1.1.1.4 fs/dquot.c:1.1.1.1.6.2
--- fs/dquot.c:1.1.1.4 Wed Dec 11 11:13:35 2002
+++ fs/dquot.c Wed Dec 18 13:20:24 2002
@@ -264,7 +264,7 @@
goto repeat;
}
remove_wait_queue(&dquot->dq_wait_lock, &wait);
- current->state = TASK_RUNNING;
+ __set_current_state(TASK_RUNNING);
}
static inline void wait_on_dquot(struct dquot *dquot)
@@ -298,7 +298,7 @@
goto repeat;
}
remove_wait_queue(&dquot->dq_wait_free, &wait);
- current->state = TASK_RUNNING;
+ __set_current_state(TASK_RUNNING);
}
/* Wait for all duplicated dquot references to be dropped */
@@ -314,7 +314,7 @@
goto repeat;
}
remove_wait_queue(&dquot->dq_wait_free, &wait);
- current->state = TASK_RUNNING;
+ __set_current_state(TASK_RUNNING);
}
static int read_dqblk(struct dquot *dquot)
diff -u fs/exec.c:1.1.1.8 fs/exec.c:1.1.1.1.6.2
--- fs/exec.c:1.1.1.8 Mon Dec 16 18:44:31 2002
+++ fs/exec.c Wed Dec 18 13:20:24 2002
@@ -587,7 +587,7 @@
count = 1;
while (atomic_read(&oldsig->count) > count) {
oldsig->group_exit_task = current;
- current->state = TASK_UNINTERRUPTIBLE;
+ __set_current_state(TASK_UNINTERRUPTIBLE);
spin_unlock_irq(&oldsig->siglock);
schedule();
spin_lock_irq(&oldsig->siglock);
diff -u fs/inode.c:1.1.1.6 fs/inode.c:1.1.1.1.6.2
--- fs/inode.c:1.1.1.6 Mon Dec 16 18:44:31 2002
+++ fs/inode.c Wed Dec 18 13:20:24 2002
@@ -1195,7 +1195,7 @@
goto repeat;
}
remove_wait_queue(wq, &wait);
- current->state = TASK_RUNNING;
+ __set_current_state(TASK_RUNNING);
}
void wake_up_inode(struct inode *inode)
diff -u fs/locks.c:1.1.1.6 fs/locks.c:1.1.1.1.6.2
--- fs/locks.c:1.1.1.6 Wed Dec 11 11:13:35 2002
+++ fs/locks.c Wed Dec 18 13:20:24 2002
@@ -571,7 +571,7 @@
int result = 0;
DECLARE_WAITQUEUE(wait, current);
- current->state = TASK_INTERRUPTIBLE;
+ __set_current_state (TASK_INTERRUPTIBLE);
add_wait_queue(fl_wait, &wait);
if (timeout == 0)
schedule();
@@ -580,7 +580,7 @@
if (signal_pending(current))
result = -ERESTARTSYS;
remove_wait_queue(fl_wait, &wait);
- current->state = TASK_RUNNING;
+ __set_current_state (TASK_RUNNING);
return result;
}
diff -u fs/namei.c:1.1.1.6 fs/namei.c:1.1.1.1.6.2
--- fs/namei.c:1.1.1.6 Wed Dec 11 11:13:35 2002
+++ fs/namei.c Wed Dec 18 13:20:24 2002
@@ -410,7 +410,7 @@
if (current->total_link_count >= 40)
goto loop;
if (need_resched()) {
- current->state = TASK_RUNNING;
+ __set_current_state(TASK_RUNNING);
schedule();
}
err = security_inode_follow_link(dentry, nd);
diff -u fs/select.c:1.1.1.3 fs/select.c:1.1.1.1.6.2
--- fs/select.c:1.1.1.3 Wed Dec 11 11:10:14 2002
+++ fs/select.c Wed Dec 18 13:20:24 2002
@@ -235,7 +235,7 @@
}
__timeout = schedule_timeout(__timeout);
}
- current->state = TASK_RUNNING;
+ __set_current_state (TASK_RUNNING);
poll_freewait(&table);
@@ -417,7 +417,7 @@
break;
timeout = schedule_timeout(timeout);
}
- current->state = TASK_RUNNING;
+ __set_current_state (TASK_RUNNING);
return count;
}
--
Inaky Perez-Gonzalez -- Not speaking for Intel - opinions are my own [or my fault]
^ permalink raw reply [flat|nested] 11+ messages in thread* Re: [PATCH 2.5.52] Use __set_current_state() instead of current->state = (take 1)
2002-12-18 23:56 [PATCH 2.5.52] Use __set_current_state() instead of current->state = (take 1) Inaky Perez-Gonzalez
@ 2002-12-19 0:11 ` Robert Love
0 siblings, 0 replies; 11+ messages in thread
From: Robert Love @ 2002-12-19 0:11 UTC (permalink / raw)
To: Inaky Perez-Gonzalez; +Cc: torvalds, linux-kernel
On Wed, 2002-12-18 at 18:56, Inaky Perez-Gonzalez wrote:
> In fs/*.c, many functions manually set the task state directly
> accessing current->state, or with a macro, kind of
> inconsistently. This patch changes all of them to use
> [__]set_current_state().
Some of these should probably be set_current_state(). I realize the
current code is equivalent to __set_current_state() but it might as well
be done right.
> diff -u fs/locks.c:1.1.1.6 fs/locks.c:1.1.1.1.6.2
> --- fs/locks.c:1.1.1.6 Wed Dec 11 11:13:35 2002
> +++ fs/locks.c Wed Dec 18 13:20:24 2002
> @@ -571,7 +571,7 @@
> int result = 0;
> DECLARE_WAITQUEUE(wait, current);
>
> - current->state = TASK_INTERRUPTIBLE;
> + __set_current_state (TASK_INTERRUPTIBLE);
> add_wait_queue(fl_wait, &wait);
> if (timeout == 0)
At least this guy should be set_current_state(), on quick glance.
When in doubt just use set_current_state()..
Robert Love
^ permalink raw reply [flat|nested] 11+ messages in thread
* RE: [PATCH 2.5.52] Use __set_current_state() instead of current-> state = (take 1)
@ 2002-12-19 0:46 Perez-Gonzalez, Inaky
2002-12-19 1:03 ` Robert Love
0 siblings, 1 reply; 11+ messages in thread
From: Perez-Gonzalez, Inaky @ 2002-12-19 0:46 UTC (permalink / raw)
To: 'Robert Love'; +Cc: torvalds, linux-kernel
> > In fs/*.c, many functions manually set the task state directly
> > accessing current->state, or with a macro, kind of
> > inconsistently. This patch changes all of them to use
> > [__]set_current_state().
>
> Some of these should probably be set_current_state(). I
> realize the current code is equivalent to __set_current_state()
> but it might as well be done right.
Agreed; however, I also don't want to introduce unnecessary
bloat, so I need to understand first what cases need it - it
is kind of hard for me. Care to let me know some gotchas?
> > - current->state = TASK_INTERRUPTIBLE;
> > + __set_current_state (TASK_INTERRUPTIBLE);
> > add_wait_queue(fl_wait, &wait);
> > if (timeout == 0)
>
> At least this guy should be set_current_state(), on quick glance.
Is that because it is called lockless? ... grunt, in some areas
is kind of very obscure to guess if it is or not.
Inaky Perez-Gonzalez -- Not speaking for Intel - opinions are my own [or my
fault]
^ permalink raw reply [flat|nested] 11+ messages in thread
* RE: [PATCH 2.5.52] Use __set_current_state() instead of current-> state = (take 1)
2002-12-19 0:46 [PATCH 2.5.52] Use __set_current_state() instead of current-> state " Perez-Gonzalez, Inaky
@ 2002-12-19 1:03 ` Robert Love
0 siblings, 0 replies; 11+ messages in thread
From: Robert Love @ 2002-12-19 1:03 UTC (permalink / raw)
To: Perez-Gonzalez, Inaky; +Cc: torvalds, linux-kernel
On Wed, 2002-12-18 at 19:46, Perez-Gonzalez, Inaky wrote:
> > Some of these should probably be set_current_state(). I
> > realize the current code is equivalent to __set_current_state()
> > but it might as well be done right.
>
> Agreed; however, I also don't want to introduce unnecessary
> bloat, so I need to understand first what cases need it - it
> is kind of hard for me. Care to let me know some gotchas?
set_current_state() includes a write barrier to ensure the setting of
the state is flushed before any further instructions. This is to
provide a memory barrier for weak-ordering processors that can and will
rearrange the writes.
Not all processors like those made by your employer are strongly-ordered
:)
Anyhow, the problem is when the setting of the state is set to
TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE. In those cases, it may be
possible for the state to actually be flushed to memory AFTER the wake
up event has occurred.
So you have code like this:
__set_current_state(TASK_INTERRUPTIBLE);
add_waitqueue(...);
but the processor ends up storing the current->state write after the
task is added to the waitqueue. In between those events, the wake up
event occurs and the task is woken up. Then the write to current->state
is flushed. So you end up with:
add_waitqueue(...);
interrupt or whatever occurs and wakes up the wait queue
task is now woken up and running
current->state = TASK_INTERRUPTIBLE /* BOOM! */
the task is marked sleeping now, but its wait event has already occurred
-- its in for a long sleep...
So to ensure the write is flushed then and there, and that the processor
(or compile, but we do not really worry about it because the call to
add_waitqueue will act as a compiler barrier, for example) does not move
the write to after the potential wake up, we use the write barrier.
In short, set_current_state() uses a memory barrier to ensure the state
setting occurs before any potential wake up activity, eliminating a
potential race and process deadlock.
It sounds complicated but its pretty simple... my explanation was
probably way too long - better than any I have seen here before on why
we have these things, at least. Hope it helps.
If in doubt, just make all of them set_current_state(). That is the
standard API, and its at least safe.
Robert Love
^ permalink raw reply [flat|nested] 11+ messages in thread
* RE: [PATCH 2.5.52] Use __set_current_state() instead of current-> state = (take 1)
@ 2002-12-19 1:53 Perez-Gonzalez, Inaky
2002-12-19 2:04 ` Robert Love
0 siblings, 1 reply; 11+ messages in thread
From: Perez-Gonzalez, Inaky @ 2002-12-19 1:53 UTC (permalink / raw)
To: 'Robert Love', 'torvalds@transmeta.com'; +Cc: linux-kernel
> > Agreed; however, I also don't want to introduce unnecessary
> > bloat, so I need to understand first what cases need it - it
> > is kind of hard for me. Care to let me know some gotchas?
>
> set_current_state() includes a write barrier to ensure the setting of
> the state is flushed before any further instructions. This is to
> provide a memory barrier for weak-ordering processors that
> can and will rearrange the writes.
It is what I was expecting, given xchg() being in the equation;
then it is reduced to a problem of guessing what can be the
delay for the flushing of the write ... beautiful ...
So, in that scenario, it means that:
- any setting before a return should be barriered unless we
return to a place[s] known to be harmless
- any setting to TASK_RUNNING should be kind of safe
- exec.c:de_thread(),
while (atomic_read(&oldsig->count) > count) {
oldsig->group_exit_task = current;
- current->state = TASK_UNINTERRUPTIBLE;
+ __set_current_state(TASK_UNINTERRUPTIBLE);
spin_unlock_irq(&oldsig->siglock);
Should be safe, as spin_unlock_irq() will do memory clobber
on sti() [undependant from UP/SMP].
- namei.c:do_follow_link()
if (current->total_link_count >= 40)
goto loop;
if (need_resched()) {
- current->state = TASK_RUNNING;
+ __set_current_state(TASK_RUNNING);
schedule();
}
err = security_inode_follow_link(dentry, nd);
There is a function for it, cond_resched().
So, sending an updated patch right now
> Not all processors like those made by your employer are
> strongly-ordered :)
You'd be surprised how little about those gory details I do know :]
Inaky Perez-Gonzalez -- Not speaking for Intel - opinions are my own [or my
fault]
^ permalink raw reply [flat|nested] 11+ messages in thread* RE: [PATCH 2.5.52] Use __set_current_state() instead of current-> state = (take 1)
2002-12-19 1:53 Perez-Gonzalez, Inaky
@ 2002-12-19 2:04 ` Robert Love
0 siblings, 0 replies; 11+ messages in thread
From: Robert Love @ 2002-12-19 2:04 UTC (permalink / raw)
To: Perez-Gonzalez, Inaky; +Cc: torvalds, linux-kernel
On Wed, 2002-12-18 at 20:53, Perez-Gonzalez, Inaky wrote:
> - any setting before a return should be barriered unless we
> return to a place[s] known to be harmless
Not sure.
> - any setting to TASK_RUNNING should be kind of safe
Yes, I agree. It may race, but with what?
> - exec.c:de_thread(),
>
> while (atomic_read(&oldsig->count) > count) {
> oldsig->group_exit_task = current;
> - current->state = TASK_UNINTERRUPTIBLE;
> + __set_current_state(TASK_UNINTERRUPTIBLE);
> spin_unlock_irq(&oldsig->siglock);
>
> Should be safe, as spin_unlock_irq() will do memory clobber
> on sti() [undependant from UP/SMP].
The memory clobber only acts as a compiler barrier and insures the
compiler does not reorder the statements from the order in the C code.
What we need is a memory barrier to ensure the processor does not
reorder statements. In other words, the processor can completely
rearrange loads and stores as they are issued to it, as long as it does
not break obvious data dependencies. On a weakly ordered processor,
sans memory barrier, there is no telling when and where a store will
actually reach memory. This is regardless of the order of the C code or
anything else.
That said, I do not know if the above example is a problem or not. On a
very quick glance, the only issue I saw is the one I pointed out
earlier, and you fixed it.
Robert Love
^ permalink raw reply [flat|nested] 11+ messages in thread
* RE: [PATCH 2.5.52] Use __set_current_state() instead of current-> state = (take 1)
@ 2002-12-19 2:40 Perez-Gonzalez, Inaky
2002-12-19 3:19 ` Robert Love
0 siblings, 1 reply; 11+ messages in thread
From: Perez-Gonzalez, Inaky @ 2002-12-19 2:40 UTC (permalink / raw)
To: 'Robert Love'; +Cc: torvalds, linux-kernel
> > - any setting before a return should be barriered unless we
> > return to a place[s] known to be harmless
>
> Not sure.
Well, I think it makes kind of sense. If we know we are
returning to some place where nothing bad could happen
with reordering ... well, so be it, don't use __set_...()
> > - any setting to TASK_RUNNING should be kind of safe
>
> Yes, I agree. It may race, but with what?
Hmmm ... I guess it could race, but it would not really
be that important [unless maybe certain constructs], as
the only setting it will go to would be [UN]INTERRUPTIBLE,
where it would be waken up afterwards, as it'd be out of the
rq, STOPPED, same thing, out of the rq, ZOMBIE or DEAD [gone].
I guess in this case, the only one I can come up with,
there would be an inconsistency between the actual state
of the task in task->state and the real one [or its position
in whatever rq or wq].
Could that have ugly consequences? I dunno ...
> > - current->state = TASK_UNINTERRUPTIBLE;
> > + __set_current_state(TASK_UNINTERRUPTIBLE);
> > spin_unlock_irq(&oldsig->siglock);
> >
> > Should be safe, as spin_unlock_irq() will do memory clobber
> > on sti() [undependant from UP/SMP].
>
> The memory clobber only acts as a compiler barrier and insures the
> compiler does not reorder the statements from the order in the C code.
while (1)
printk ("Inaky, remember, clobber is a hint to gcc\n");
And that would now really work when CONFIG_X86_OOSTORE=1 is required
[after all, it is a write, so it'd need the equivalent of a wmb() or
xchg()].
Okay, changing that one too, just in case.
Inaky Perez-Gonzalez -- Not speaking for Intel - opinions are my own [or my
fault]
^ permalink raw reply [flat|nested] 11+ messages in thread* RE: [PATCH 2.5.52] Use __set_current_state() instead of current-> state = (take 1)
2002-12-19 2:40 Perez-Gonzalez, Inaky
@ 2002-12-19 3:19 ` Robert Love
0 siblings, 0 replies; 11+ messages in thread
From: Robert Love @ 2002-12-19 3:19 UTC (permalink / raw)
To: Perez-Gonzalez, Inaky; +Cc: torvalds, linux-kernel
On Wed, 2002-12-18 at 21:40, Perez-Gonzalez, Inaky wrote:
> Well, I think it makes kind of sense. If we know we are
> returning to some place where nothing bad could happen
> with reordering ... well, so be it, don't use __set_...()
Oh, I see. If it returns to somewhere that immediately e.g. puts it on
a wait queue. In that case, yep: need the barrier version.
> And that would now really work when CONFIG_X86_OOSTORE=1 is required
> [after all, it is a write, so it'd need the equivalent of a wmb() or
> xchg()].
Is this a hint that your employer may have an x86 chip in the future
with weak ordering? :)
> Okay, changing that one too, just in case.
Good - better safe than sorry.
Robert Love
^ permalink raw reply [flat|nested] 11+ messages in thread
* RE: [PATCH 2.5.52] Use __set_current_state() instead of current-> state = (take 1)
@ 2002-12-19 19:04 Perez-Gonzalez, Inaky
2002-12-20 3:08 ` Alan Cox
0 siblings, 1 reply; 11+ messages in thread
From: Perez-Gonzalez, Inaky @ 2002-12-19 19:04 UTC (permalink / raw)
To: 'Robert Love'; +Cc: linux-kernel
> > And that would now really work when CONFIG_X86_OOSTORE=1 is required
> > [after all, it is a write, so it'd need the equivalent of a wmb() or
> > xchg()].
>
> Is this a hint that your employer may have an x86 chip in the future
> with weak ordering? :)
Hmmm ... taking into account that there are some many thousands of
employees in my company and that I know less than one hundred ...
and that they are all software ... well, I don't think I am into
the rumour mill :]
Inaky Perez-Gonzalez -- Not speaking for Intel - opinions are my own [or my
fault]
^ permalink raw reply [flat|nested] 11+ messages in thread
* RE: [PATCH 2.5.52] Use __set_current_state() instead of current-> state = (take 1)
2002-12-19 19:04 Perez-Gonzalez, Inaky
@ 2002-12-20 3:08 ` Alan Cox
2002-12-20 19:36 ` Oliver Xymoron
0 siblings, 1 reply; 11+ messages in thread
From: Alan Cox @ 2002-12-20 3:08 UTC (permalink / raw)
To: Perez-Gonzalez, Inaky; +Cc: 'Robert Love', Linux Kernel Mailing List
On Thu, 2002-12-19 at 19:04, Perez-Gonzalez, Inaky wrote:
>
> > > And that would now really work when CONFIG_X86_OOSTORE=1 is required
> > > [after all, it is a write, so it'd need the equivalent of a wmb() or
> > > xchg()].
> >
> > Is this a hint that your employer may have an x86 chip in the future
> > with weak ordering? :)
>
> Hmmm ... taking into account that there are some many thousands of
> employees in my company and that I know less than one hundred ...
> and that they are all software ... well, I don't think I am into
> the rumour mill :]
Also OOSTORE is there because other vendors already make weak store
order capable x86 processors. One example of this is the Winchip - where
turning off strict store ordering is worth 30% performance.
In addition you have to treat store ordering/locking carefully due to
the pentium pro store fencing errata. (Thats why our < PII kernel
generates lock movb to unlock when in theory the lock isnt needed).
Alan
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 2.5.52] Use __set_current_state() instead of current-> state = (take 1)
2002-12-20 3:08 ` Alan Cox
@ 2002-12-20 19:36 ` Oliver Xymoron
0 siblings, 0 replies; 11+ messages in thread
From: Oliver Xymoron @ 2002-12-20 19:36 UTC (permalink / raw)
To: Alan Cox
Cc: Perez-Gonzalez, Inaky, 'Robert Love',
Linux Kernel Mailing List
On Fri, Dec 20, 2002 at 03:08:28AM +0000, Alan Cox wrote:
> In addition you have to treat store ordering/locking carefully due to
> the pentium pro store fencing errata. (Thats why our < PII kernel
> generates lock movb to unlock when in theory the lock isnt needed).
Except the SMP causality test app I wrote got success reports from every
PPro stepping level.
--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2002-12-20 19:28 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2002-12-18 23:56 [PATCH 2.5.52] Use __set_current_state() instead of current->state = (take 1) Inaky Perez-Gonzalez
2002-12-19 0:11 ` Robert Love
-- strict thread matches above, loose matches on Subject: below --
2002-12-19 0:46 [PATCH 2.5.52] Use __set_current_state() instead of current-> state " Perez-Gonzalez, Inaky
2002-12-19 1:03 ` Robert Love
2002-12-19 1:53 Perez-Gonzalez, Inaky
2002-12-19 2:04 ` Robert Love
2002-12-19 2:40 Perez-Gonzalez, Inaky
2002-12-19 3:19 ` Robert Love
2002-12-19 19:04 Perez-Gonzalez, Inaky
2002-12-20 3:08 ` Alan Cox
2002-12-20 19:36 ` Oliver Xymoron
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox