* [PATCH] Export current_is_keventd() for libphy
  From: Ben Collins @ 2006-12-03  5:50 UTC
  To: linux-kernel
  Cc: Linus Torvalds

Fixes this problem when libphy is compiled as a module:

WARNING: "current_is_keventd" [drivers/net/phy/libphy.ko] undefined!

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 17c2f03..1cf226b 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -608,6 +608,7 @@ int current_is_keventd(void)
 	return ret;
 
 }
+EXPORT_SYMBOL_GPL(current_is_keventd);
 
 #ifdef CONFIG_HOTPLUG_CPU
 /* Take the work from this (downed) CPU. */
* Re: [PATCH] Export current_is_keventd() for libphy
  From: Andrew Morton @ 2006-12-03  9:16 UTC
  To: Ben Collins
  Cc: linux-kernel, Linus Torvalds, Maciej W. Rozycki, Jeff Garzik

On Sun, 03 Dec 2006 00:50:55 -0500
Ben Collins <ben.collins@ubuntu.com> wrote:

> Fixes this problem when libphy is compiled as a module:
>
> WARNING: "current_is_keventd" [drivers/net/phy/libphy.ko] undefined!
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 17c2f03..1cf226b 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -608,6 +608,7 @@ int current_is_keventd(void)
>  	return ret;
>
>  }
> +EXPORT_SYMBOL_GPL(current_is_keventd);
>
>  #ifdef CONFIG_HOTPLUG_CPU
>  /* Take the work from this (downed) CPU. */

wtf?  That code was merged?  This bug has been known for months, and after
several unsuccessful attempts at trying to understand what on earth that
hackery is supposed to be doing I gave up on it and asked Jeff to look
after it.

Maciej, please, in very small words written in a very large font, explain
to us what is going on in phy_stop_interrupts()?  Include pictures.
* Re: [PATCH] Export current_is_keventd() for libphy
  From: Steve Fox @ 2006-12-04 19:17 UTC
  To: Andrew Morton
  Cc: Ben Collins, linux-kernel, Linus Torvalds, macro, afleming

On Sun, 2006-12-03 at 01:16 -0800, Andrew Morton wrote:
> On Sun, 03 Dec 2006 00:50:55 -0500
> Ben Collins <ben.collins@ubuntu.com> wrote:
>
> > Fixes this problem when libphy is compiled as a module:
> >
> > WARNING: "current_is_keventd" [drivers/net/phy/libphy.ko] undefined!
> >
> > diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> > index 17c2f03..1cf226b 100644
> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> > @@ -608,6 +608,7 @@ int current_is_keventd(void)
> >  	return ret;
> >
> >  }
> > +EXPORT_SYMBOL_GPL(current_is_keventd);
> >
> >  #ifdef CONFIG_HOTPLUG_CPU
> >  /* Take the work from this (downed) CPU. */
>
> wtf?  That code was merged?  This bug has been known for months, and after
> several unsuccessful attempts at trying to understand what on earth that
> hackery is supposed to be doing I gave up on it and asked Jeff to look
> after it.
>
> Maciej, please, in very small words written in a very large font, explain
> to us what is going on in phy_stop_interrupts()?  Include pictures.

Unfortunately Maciej didn't get cc'ed correctly on your mail.  Hopefully
I've added the right person to this post, as well as Andy, who has also
recently changed this code.

We're also hitting this on one of the test.kernel.org machines, bl6-13.

-- 
Steve Fox
IBM Linux Technology Center
* Re: [PATCH] Export current_is_keventd() for libphy
  From: Maciej W. Rozycki @ 2006-12-05 18:05 UTC
  To: Steve Fox
  Cc: Andrew Morton, Ben Collins, linux-kernel, Linus Torvalds, afleming

On Mon, 4 Dec 2006, Steve Fox wrote:

> Unfortunately Maciej didn't get cc'ed correctly on your mail.  Hopefully
> I've added the right person to this post, as well as Andy, who has also
> recently changed this code.

 Thanks -- my parser of the LKML does not always trigger.

  Maciej
* Re: [PATCH] Export current_is_keventd() for libphy
  From: Maciej W. Rozycki @ 2006-12-05 17:48 UTC
  To: Andrew Morton
  Cc: Andy Fleming, Ben Collins, linux-kernel, Linus Torvalds, Jeff Garzik

On Sun, 3 Dec 2006, Andrew Morton wrote:

> wtf?  That code was merged?  This bug has been known for months, and after
> several unsuccessful attempts at trying to understand what on earth that
> hackery is supposed to be doing I gave up on it and asked Jeff to look
> after it.

 I am surprised it got merged too -- my understanding from the discussion
was that it was to be sorted out somehow within libphy, one way or
another.

> Maciej, please, in very small words written in a very large font, explain
> to us what is going on in phy_stop_interrupts()?  Include pictures.

 Would ASCII art qualify as pictures?  Regardless, I am providing a
textual description; feel free to skip to the conclusion below if you are
uninterested in the details.

 Essentially there is a race when disconnecting from a PHY, because
interrupt delivery uses the event queue for processing.  The function
that handles interrupts and is called from the event queue is
phy_change().  It takes a pointer to a structure that is associated with
the PHY.  At the time phy_stop_interrupts() is called there may be one or
more calls to phy_change() still pending on the event queue.  They may
not get processed until after the structure passed to phy_change() has
been freed, at which point calling the function is wrong.

 One way of avoiding this is calling flush_scheduled_work() from
phy_stop_interrupts().  This is fine as long as no caller of
phy_stop_interrupts() (not necessarily the immediate one calling into
libphy) holds the netlink lock.

 If a caller does hold the netlink lock, then a driver effectively
calling phy_stop_interrupts() may instead arrange for the function itself
to be scheduled through the event queue.  This avoids the race as well,
since the queue is processed in order, but it causes more hassle for the
driver.  Hence the choice was left to the driver's author: if a driver
"knows" the netlink lock is not going to be held at that point, it may
call phy_stop_interrupts() directly; otherwise it shall use the event
queue.

 With such an assumption in place the function has to check somehow
whether it has been scheduled through the queue or not and act
accordingly, which is why that "if" clause is there.

 Now I gather the conclusion was that the whole mess was going to be
contained within libphy and not exposed to Ethernet MAC drivers.  This
way the library would schedule both phy_stop_interrupts() and
mdiobus_unregister() (which is actually the function freeing the PHY
device structure) through the event queue as needed, without a MAC driver
having to know.  And the only question that remains is whether it is Andy
(cc-ed) or me who is more competent to implement this change. ;-)

  Maciej
* Re: [PATCH] Export current_is_keventd() for libphy
  From: Linus Torvalds @ 2006-12-05 18:07 UTC
  To: Maciej W. Rozycki
  Cc: Andrew Morton, Andy Fleming, Ben Collins, linux-kernel, Jeff Garzik

On Tue, 5 Dec 2006, Maciej W. Rozycki wrote:
>
>  One way of avoiding this is calling flush_scheduled_work() from
> phy_stop_interrupts().  This is fine as long as no caller of
> phy_stop_interrupts() (not necessarily the immediate one calling into
> libphy) holds the netlink lock.
>
>  If a caller does hold the netlink lock, then a driver effectively
> calling phy_stop_interrupts() may instead arrange for the function
> itself to be scheduled through the event queue.  This avoids the race
> as well, since the queue is processed in order, but it causes more
> hassle for the driver.

I would personally be ok with "flush_scheduled_work()" _itself_ noticing
that it is actually waiting to flush "itself", and just being a no-op in
that case.

However, having outside code do that detection for it seems to be ugly as
hell.  And something like the network driver layer shouldn't know about
keventd details like this.

So if you can move that deadlock avoidance logic into
"flush_scheduled_work()" itself, we'd avoid the stupid export, and we'd
also avoid the driver layer having to care about these things.

It would still be _ugly_, but I think it would be less so.

Hmm?

		Linus
* Re: [PATCH] Export current_is_keventd() for libphy
  From: Andrew Morton @ 2006-12-05 19:31 UTC
  To: Linus Torvalds
  Cc: Maciej W. Rozycki, Andy Fleming, Ben Collins, linux-kernel, Jeff Garzik

On Tue, 5 Dec 2006 10:07:21 -0800 (PST)
Linus Torvalds <torvalds@osdl.org> wrote:

> I would personally be ok with "flush_scheduled_work()" _itself_ noticing
> that it is actually waiting to flush "itself", and just being a no-op in
> that case.

It does do that:

static void flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
{
	if (cwq->thread == current) {
		/*
		 * Probably keventd trying to flush its own queue. So simply run
		 * it by hand rather than deadlocking.
		 */
		run_workqueue(cwq);
* Re: [PATCH] Export current_is_keventd() for libphy
  From: Andy Fleming @ 2006-12-05 18:57 UTC
  To: Maciej W. Rozycki
  Cc: Andrew Morton, Ben Collins, linux-kernel, Linus Torvalds, Jeff Garzik

On Dec 5, 2006, at 11:48, Maciej W. Rozycki wrote:

> Essentially there is a race when disconnecting from a PHY, because
> interrupt delivery uses the event queue for processing.  The function
> that handles interrupts and is called from the event queue is
> phy_change().  It takes a pointer to a structure that is associated
> with the PHY.  At the time phy_stop_interrupts() is called there may be
> one or more calls to phy_change() still pending on the event queue.
> They may not get processed until after the structure passed to
> phy_change() has been freed, at which point calling the function is
> wrong.
>
> One way of avoiding this is calling flush_scheduled_work() from
> phy_stop_interrupts().  This is fine as long as no caller of
> phy_stop_interrupts() (not necessarily the immediate one calling into
> libphy) holds the netlink lock.
>
> If a caller does hold the netlink lock, then a driver effectively
> calling phy_stop_interrupts() may instead arrange for the function
> itself to be scheduled through the event queue.  This avoids the race
> as well, since the queue is processed in order, but it causes more
> hassle for the driver.  Hence the choice was left to the driver's
> author: if a driver "knows" the netlink lock is not going to be held at
> that point, it may call phy_stop_interrupts() directly; otherwise it
> shall use the event queue.

We need to make sure there are no more pending phy_change() invocations
in the work queue, this is true; however, I'm not convinced that this
avoids the problem.  And now that I come back to this email after Linus's
response, let me add that I agree with his suggestion.  I still don't
think it solves the original problem, though.  Unless I'm missing
something, Maciej's suggested fix (having the driver invoke
phy_stop_interrupts() from a work queue) doesn't stop the race:

* Schedule stop_interrupts and freeing of memory.
* An interrupt occurs, and schedules phy_change.
* The work queue triggers, and stop_interrupts is invoked.  It doesn't
  call flush_scheduled_work, because it's being called from keventd.
* The invoker of stop_interrupts (presumably some function in the
  driver) frees up memory, including the phy_device.
* phy_change() is invoked from the work queue, and starts accessing
  freed memory.

Am I misunderstanding how work queues operate?  If I'm right, an ugly
solution would have to disable the PHY interrupts before scheduling the
cleanup.  But is there really no way to tell the kernel to squash all
pending work that came from *this* work queue?

> With such an assumption in place the function has to check somehow
> whether it has been scheduled through the queue or not and act
> accordingly, which is why that "if" clause is there.
>
> Now I gather the conclusion was that the whole mess was going to be
> contained within libphy and not exposed to Ethernet MAC drivers.  This
> way the library would schedule both phy_stop_interrupts() and
> mdiobus_unregister() (which is actually the function freeing the PHY
> device structure) through the event queue as needed, without a MAC
> driver having to know.

I suggested this, mostly so that drivers wouldn't have to be aware of
this.  But I'm not quite sure what happens when you unload a module.
Does some stuff stay behind if needed?  If you unload the ethernet
driver, that will usually remove the bus controller for the PHY, which
would prevent any scheduled code from accessing the PHY.

Andy
* Re: [PATCH] Export current_is_keventd() for libphy
  From: Maciej W. Rozycki @ 2006-12-06 12:31 UTC
  To: Andy Fleming
  Cc: Andrew Morton, Ben Collins, linux-kernel, Linus Torvalds, Jeff Garzik

On Tue, 5 Dec 2006, Andy Fleming wrote:

> We need to make sure there are no more pending phy_change() invocations
> in the work queue, this is true; however, I'm not convinced that this
> avoids the problem.  And now that I come back to this email after
> Linus's response, let me add that I agree with his suggestion.  I still
> don't think it solves the original problem, though.  Unless I'm missing
> something, Maciej's suggested fix (having the driver invoke
> phy_stop_interrupts() from a work queue) doesn't stop the race:
>
> * Schedule stop_interrupts and freeing of memory.
> * An interrupt occurs, and schedules phy_change.
> * The work queue triggers, and stop_interrupts is invoked.  It doesn't
>   call flush_scheduled_work, because it's being called from keventd.
> * The invoker of stop_interrupts (presumably some function in the
>   driver) frees up memory, including the phy_device.
> * phy_change() is invoked from the work queue, and starts accessing
>   freed memory.

 This is not going to happen with my other changes to the file applied.
The reason is that at the time phy_stop_interrupts() is called,
phy_stop() has already run and switched the state of the PHY being
handled to PHY_HALTED.  As a result, any subsequent calls to
phy_interrupt() that might have happened after phy_stop() have not
scheduled calls to phy_change() for this PHY, and neither will any that
may happen up until free_irq() has unregistered the interrupt for the
PHY.

> I suggested this, mostly so that drivers wouldn't have to be aware of
> this.  But I'm not quite sure what happens when you unload a module.
> Does some stuff stay behind if needed?  If you unload the ethernet
> driver, that will usually remove the bus controller for the PHY, which
> would prevent any scheduled code from accessing the PHY.

 Hmm, I am unsure if there is anything that would ensure flushing of the
queue after its stop() call has finished and before a driver is removed
(its module_exit() call is invoked) -- probably nothing -- and that is
why I put an explicit flush_scheduled_work() in
sb1250-mac.c:sbmac_remove(), and the driver's open() call checks whether
a possible previous instance of the structure used by phy_change() has
not been freed yet.  There may be a cleaner way of doing it, but I will
have to think about it.

  Maciej
* Re: [PATCH] Export current_is_keventd() for libphy
  From: Andrew Morton @ 2006-12-05 20:39 UTC
  To: Maciej W. Rozycki
  Cc: Andy Fleming, Ben Collins, linux-kernel, Linus Torvalds, Jeff Garzik

On Tue, 5 Dec 2006 17:48:05 +0000 (GMT)
"Maciej W. Rozycki" <macro@linux-mips.org> wrote:

> Essentially there is a race when disconnecting from a PHY, because
> interrupt delivery uses the event queue for processing.  The function
> that handles interrupts and is called from the event queue is
> phy_change().  It takes a pointer to a structure that is associated
> with the PHY.  At the time phy_stop_interrupts() is called there may be
> one or more calls to phy_change() still pending on the event queue.
> They may not get processed until after the structure passed to
> phy_change() has been freed, at which point calling the function is
> wrong.
>
> One way of avoiding this is calling flush_scheduled_work() from
> phy_stop_interrupts().  This is fine as long as no caller of
> phy_stop_interrupts() (not necessarily the immediate one calling into
> libphy) holds the netlink lock.

So let me try to rephrase...

- phy_change() is the workqueue callback function.  It is executed by
  keventd.

- Something under phy_change() takes rtnl_lock() (but what??)

- phy_stop_interrupts() does flush_scheduled_work().  This has the
  following logic:

  - If I am keventd, run phy_change() directly.

  - If I am not keventd, wait for keventd to run phy_change().

- So if the caller of phy_stop_interrupts() already holds rtnl_lock(),
  and if that caller is keventd, then it will recur onto rtnl_lock() and
  will deadlock.

Problem is, if the caller of phy_stop_interrupts() is *not* keventd,
that caller will still deadlock, because that caller is waiting for
keventd to run phy_change(), and keventd cannot do that, because the
not-keventd process already holds rtnl_lock.

Now, afaict, there are only two callers of phy_stop_interrupts(): the
close() handlers of gianfar.c and fs_enet-main.c (confusingly held in
netdevice.stop (confusingly called by dev_close())).  Via
phy_disconnect().  Did I miss anything?

And the dev_close() caller holds rtnl_lock.

Summary:

a) Please tell us what code under phy_change() wants to take rtnl_lock.

b) I think it should deadlock whether or not the caller of
   phy_stop_interrupts() is keventd.  What am I missing?
* Re: [PATCH] Export current_is_keventd() for libphy
  From: Andy Fleming @ 2006-12-05 20:59 UTC
  To: Andrew Morton
  Cc: Maciej W. Rozycki, Ben Collins, linux-kernel, Linus Torvalds, Jeff Garzik

On Dec 5, 2006, at 14:39, Andrew Morton wrote:

> On Tue, 5 Dec 2006 17:48:05 +0000 (GMT)
> "Maciej W. Rozycki" <macro@linux-mips.org> wrote:
>
> > Essentially there is a race when disconnecting from a PHY, because
> > interrupt delivery uses the event queue for processing.  The function
> > that handles interrupts and is called from the event queue is
> > phy_change().  It takes a pointer to a structure that is associated
> > with the PHY.  At the time phy_stop_interrupts() is called there may
> > be one or more calls to phy_change() still pending on the event
> > queue.  They may not get processed until after the structure passed
> > to phy_change() has been freed, at which point calling the function
> > is wrong.
> >
> > One way of avoiding this is calling flush_scheduled_work() from
> > phy_stop_interrupts().  This is fine as long as no caller of
> > phy_stop_interrupts() (not necessarily the immediate one calling into
> > libphy) holds the netlink lock.
>
> So let me try to rephrase...
>
> - phy_change() is the workqueue callback function.  It is executed by
>   keventd.
>
> - Something under phy_change() takes rtnl_lock() (but what??)

I don't think it's phy_change().  It's something else that may be
scheduled.

> - phy_stop_interrupts() does flush_scheduled_work().  This has the
>   following logic:
>
>   - If I am keventd, run phy_change() directly.
>
>   - If I am not keventd, wait for keventd to run phy_change().
>
> - So if the caller of phy_stop_interrupts() already holds rtnl_lock(),
>   and if that caller is keventd, then it will recur onto rtnl_lock()
>   and will deadlock.
>
> Problem is, if the caller of phy_stop_interrupts() is *not* keventd,
> that caller will still deadlock, because that caller is waiting for
> keventd to run phy_change(), and keventd cannot do that, because the
> not-keventd process already holds rtnl_lock.
>
> Now, afaict, there are only two callers of phy_stop_interrupts(): the
> close() handlers of gianfar.c and fs_enet-main.c (confusingly held in
> netdevice.stop (confusingly called by dev_close())).  Via
> phy_disconnect().  Did I miss anything?

Right now, that's probably about right.

> And the dev_close() caller holds rtnl_lock.

Ok, I think this is the summary:

- phy_change() is the work queue callback function (scheduled when a
  PHY interrupt occurs).

- dev_close() invokes the controller's stop/close/whatever function,
  and it calls phy_disconnect().

- phy_disconnect() calls phy_stop_interrupts().  To prevent any pending
  phy_change() calls from getting confused, phy_stop_interrupts() needs
  to flush the queue.  Otherwise, subsequent memory freeings will leave
  phy_change() hanging.

- If phy_stop_interrupts() calls flush_scheduled_work(), keventd will
  execute its queues while rtnl_lock is held, providing an opportunity
  for other callbacks to deadlock.

- Innocent puppies are slaughtered, and the world mourns.

Maciej's solution is to schedule phy_disconnect() to be called from a
work queue.  That solution should work, but it sounds like it doesn't
require the check for whether keventd is running.

Of course, my objection to it is that it now requires the ethernet
controller to be excessively aware of the details of how the PHY Lib is
handling the PHY interrupts (by scheduling them on a work queue).

Andy
* Re: [PATCH] Export current_is_keventd() for libphy
  From: Andrew Morton @ 2006-12-05 21:26 UTC
  To: Andy Fleming
  Cc: Maciej W. Rozycki, Ben Collins, linux-kernel, Linus Torvalds, Jeff Garzik

On Tue, 5 Dec 2006 14:59:31 -0600
Andy Fleming <afleming@freescale.com> wrote:

> Ok, I think this is the summary:
>
> - phy_change() is the work queue callback function (scheduled when a
>   PHY interrupt occurs).
>
> - dev_close() invokes the controller's stop/close/whatever function,
>   and it calls phy_disconnect().
>
> - phy_disconnect() calls phy_stop_interrupts().  To prevent any pending
>   phy_change() calls from getting confused, phy_stop_interrupts() needs
>   to flush the queue.  Otherwise, subsequent memory freeings will leave
>   phy_change() hanging.
>
> - If phy_stop_interrupts() calls flush_scheduled_work(), keventd will
>   execute its queues while rtnl_lock is held, providing an opportunity
>   for other callbacks to deadlock.
>
> - Innocent puppies are slaughtered, and the world mourns.

ah, OK.  So it's some other queued-up callback which takes rtnl_lock.

But I still don't see what's special about keventd.  If, say, /sbin/ip
is running flush_scheduled_work() under rtnl_lock then it too should
deadlock in this scenario.

> Maciej's solution is to schedule phy_disconnect() to be called from a
> work queue.  That solution should work, but it sounds like it doesn't
> require the check for whether keventd is running.
>
> Of course, my objection to it is that it now requires the ethernet
> controller to be excessively aware of the details of how the PHY Lib
> is handling the PHY interrupts (by scheduling them on a work queue).

So what's a good fix?

a) Ban the calling of flush_scheduled_work() from under rtnl_lock().
   Sounds hard.

b) Ban the queueing of callback functions which take rtnl_lock().  This
   sounds like a plain bad idea - callbacks are low-level things which
   ought to be able to take locks.

c) Cancel the phy_change() callback within phy_stop_interrupts() so we
   don't need to run flush_scheduled_work() at all.

   This will almost work, as long as it's done in workqueue.c with
   appropriate locking.  The bug occurs when some other CPU is running
   phy_change() right now - we'll end up freeing data which that CPU is
   presently playing with.

   But perhaps we can take care of this within workqueue.c.  We need a
   cancel function which will cancel the work and, if its callback is
   presently executing, will block until that execution has completed.

   If we require of the calling subsystem a) that the work will not get
   rescheduled (that means phy_interrupt()) and b) that the callback
   does not rearm the work, then things get simpler.  But still not very
   simple.  It gets ugly with per-CPU workqueues.
* Re: [PATCH] Export current_is_keventd() for libphy
  From: Roland Dreier @ 2006-12-05 21:37 UTC
  To: Andrew Morton
  Cc: Andy Fleming, Maciej W. Rozycki, Ben Collins, linux-kernel, Linus Torvalds, Jeff Garzik

> a) Ban the calling of flush_scheduled_work() from under rtnl_lock().
>    Sounds hard.

Unfortunate if this is happening a lot.  It seems like the most sensible
fix -- flush_scheduled_work() is in effect calling into an unknown and
changeable-in-the-future set of functions (since it waits for them to
finish), and it seems error-prone to hold a lock across such a call.

>    This will almost work, as long as it's done in workqueue.c with
>    appropriate locking.  The bug occurs when some other CPU is running
>    phy_change() right now - we'll end up freeing data which that CPU is
>    presently playing with.
>
>    But perhaps we can take care of this within workqueue.c.  We need a
>    cancel function which will cancel the work and, if its callback is
>    presently executing, will block until that execution has completed.

I may be misunderstanding you, but this seems to deadlock in exactly the
same way: if someone calls this cancel routine holding rtnl_lock, and
the work function that will also take rtnl_lock has just started, it
will get stuck when the work function tries to take rtnl_lock.

 - R.
* Re: [PATCH] Export current_is_keventd() for libphy
  From: Andrew Morton @ 2006-12-05 21:57 UTC
  To: Roland Dreier
  Cc: Andy Fleming, Maciej W. Rozycki, Ben Collins, linux-kernel, Linus Torvalds, Jeff Garzik

On Tue, 05 Dec 2006 13:37:37 -0800
Roland Dreier <rdreier@cisco.com> wrote:

> > a) Ban the calling of flush_scheduled_work() from under rtnl_lock().
> >    Sounds hard.
>
> Unfortunate if this is happening a lot.  It seems like the most
> sensible fix -- flush_scheduled_work() is in effect calling into an
> unknown and changeable-in-the-future set of functions (since it waits
> for them to finish), and it seems error-prone to hold a lock across
> such a call.

yes, I agree.  It's really bad to be calling flush_scheduled_work() with
any locks held at all.  Fragile, hard-to-maintain, source of
once-in-a-blue-moon failures, etc.  I guess lockdep will help.

But running flush_scheduled_work() from within dev_close() is a very
sensible thing to do, and dev_close is called under rtnl_lock().
davem is -> thattaway ;)

> > This will almost work, as long as it's done in workqueue.c with
> > appropriate locking.  The bug occurs when some other CPU is running
> > phy_change() right now - we'll end up freeing data which that CPU is
> > presently playing with.
> >
> > But perhaps we can take care of this within workqueue.c.  We need a
> > cancel function which will cancel the work and, if its callback is
> > presently executing, will block until that execution has completed.
>
> I may be misunderstanding you, but this seems to deadlock in exactly
> the same way: if someone calls this cancel routine holding rtnl_lock,
> and the work function that will also take rtnl_lock has just started,
> it will get stuck when the work function tries to take rtnl_lock.

Ah.  The point is that the phy code doesn't want to flush _all_ pending
callbacks.  It only wants to flush its own one.  And its own one doesn't
take rtnl_lock().

IOW, the phy code has no interest in running some random other
subsystem's callback - it just wants to run its own.  Hence no deadlock.

Maybe the lesson here is that flush_scheduled_work() is a bad function.
It should really be flush_this_work(struct work_struct *w).  That is in
fact what approximately 100% of the flush_scheduled_work() callers
actually want to do.

hmm.
* Re: [PATCH] Export current_is_keventd() for libphy
  From: Roland Dreier @ 2006-12-05 23:49 UTC
  To: Andrew Morton
  Cc: Andy Fleming, Maciej W. Rozycki, Ben Collins, linux-kernel, Linus Torvalds, Jeff Garzik

> But running flush_scheduled_work() from within dev_close() is a very
> sensible thing to do, and dev_close is called under rtnl_lock().

I can't argue with that -- this has actually bitten me in the past.
Hmm, I'll try to understand why we need rtnl_lock() to cover
dev_close...

 - R.
* Re: [PATCH] Export current_is_keventd() for libphy
  From: Roland Dreier @ 2006-12-05 23:52 UTC
  To: Andrew Morton
  Cc: Andy Fleming, Maciej W. Rozycki, Ben Collins, linux-kernel, Linus Torvalds, Jeff Garzik

> Ah.  The point is that the phy code doesn't want to flush _all_ pending
> callbacks.  It only wants to flush its own one.  And its own one doesn't
> take rtnl_lock().

OK, got it.  You're absolutely correct.

> Maybe the lesson here is that flush_scheduled_work() is a bad function.
> It should really be flush_this_work(struct work_struct *w).  That is in
> fact what approximately 100% of the flush_scheduled_work() callers
> actually want to do.

I think flush_this_work() runs into trouble if it means "make sure
everything up to <this work> has completed", because it still syncs with
everything before <this work>, which has the same risk of deadlock.

And I'm not totally sure everyone who does flush_scheduled_work() really
means "cancel my work" -- they might mean "finish up my work".  For
example, I would have to do some archeology to remember exactly why I
needed flush_scheduled_work() when I wrote drivers/infiniband/ulp/ipoib.

 - R.
* Re: [PATCH] Export current_is_keventd() for libphy 2006-12-05 21:57 ` Andrew Morton 2006-12-05 23:49 ` Roland Dreier 2006-12-05 23:52 ` Roland Dreier @ 2006-12-06 15:25 ` Maciej W. Rozycki 2006-12-06 15:57 ` Andrew Morton 2 siblings, 1 reply; 54+ messages in thread From: Maciej W. Rozycki @ 2006-12-06 15:25 UTC (permalink / raw) To: Andrew Morton Cc: Roland Dreier, Andy Fleming, Ben Collins, linux-kernel, Linus Torvalds, Jeff Garzik On Tue, 5 Dec 2006, Andrew Morton wrote: > But running flush_scheduled_work() from within dev_close() is a very > sensible thing to do, and dev_close is called under rtnl_lock(). > davem is -> thattaway ;) And when within dev_close() there is quite a chance there is linkwatch_event() somewhere in the event queue already. ;-) > Ah. The point is that the phy code doesn't want to flush _all_ pending > callbacks. It only wants to flush its own one. And its own one doesn't > take rtnl_lock(). > > IOW, the phy code has no interest in running some random other subsystem's > callback - it just wants to run its own. Hence no deadlock. Both are true. It's linkwatch_event() that's somewhere in the queue already that makes the trouble here. > Maybe the lesson here is that flush_scheduled_work() is a bad function. > It should really be flush_this_work(struct work_struct *w). That is in > fact what approximately 100% of the flush_scheduled_work() callers actually > want to do. There may be cases where flush_scheduled_work() is indeed needed, but likely outside drivers, and I agree such a specific call would be useful and work here. Maciej ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] Export current_is_keventd() for libphy 2006-12-06 15:25 ` Maciej W. Rozycki @ 2006-12-06 15:57 ` Andrew Morton 2006-12-06 17:17 ` Linus Torvalds 0 siblings, 1 reply; 54+ messages in thread From: Andrew Morton @ 2006-12-06 15:57 UTC (permalink / raw) To: Maciej W. Rozycki Cc: Roland Dreier, Andy Fleming, Ben Collins, linux-kernel, Linus Torvalds, Jeff Garzik On Wed, 6 Dec 2006 15:25:22 +0000 (GMT) "Maciej W. Rozycki" <macro@linux-mips.org> wrote: > > Maybe the lesson here is that flush_scheduled_work() is a bad function. > > It should really be flush_this_work(struct work_struct *w). That is in > > fact what approximately 100% of the flush_scheduled_work() callers actually > > want to do. > > There may be cases where flush_scheduled_work() is indeed needed, but > likely outside drivers, and I agree such a specific call would be useful > and work here. I think so too. But it would be imprudent to hang around waiting for me to write it :( ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] Export current_is_keventd() for libphy 2006-12-06 15:57 ` Andrew Morton @ 2006-12-06 17:17 ` Linus Torvalds 2006-12-06 17:43 ` David Howells 2006-12-07 1:21 ` Linus Torvalds 0 siblings, 2 replies; 54+ messages in thread From: Linus Torvalds @ 2006-12-06 17:17 UTC (permalink / raw) To: Andrew Morton, David Howells Cc: Maciej W. Rozycki, Roland Dreier, Andy Fleming, Ben Collins, Linux Kernel Mailing List, Jeff Garzik On Wed, 6 Dec 2006, Andrew Morton wrote: > > I think so too. But it would be imprudent to hang around waiting for me > to write it :( How about something like this? This (a) depends on the just-merged "struct work" cleanup (b) is totally untested (c) probably kills you slowly and painfully (d) may breed frikken sharks with lasers on their frikken heads! (e) does compile, but I don't guarantee anything else. (f) may, in other words, be totally broken And, btw: it may not work. Just in case that wasn't clear. This is a quick hack from me just sitting down and seeing if I can still do kernel programming, or whether I'm just relegated to merge other peoples code. Linus PS. It might be broken. PPS. David Howells added to participant list, hopefully he can double-check all my assumptions, since he's touched the workqueue code last. Tag, you're it! ---- diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c index 4044bb1..e175f39 100644 --- a/drivers/net/phy/phy.c +++ b/drivers/net/phy/phy.c @@ -587,8 +587,7 @@ int phy_stop_interrupts(struct phy_device *phydev) * Finish any pending work; we might have been scheduled * to be called from keventd ourselves, though. 
*/ - if (!current_is_keventd()) - flush_scheduled_work(); + run_scheduled_work(&phydev->phy_queue); free_irq(phydev->irq, phydev); diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h index 4a3ea83..a601ed5 100644 --- a/include/linux/workqueue.h +++ b/include/linux/workqueue.h @@ -160,6 +160,7 @@ extern int queue_delayed_work_on(int cpu, struct workqueue_struct *wq, extern void FASTCALL(flush_workqueue(struct workqueue_struct *wq)); extern int FASTCALL(schedule_work(struct work_struct *work)); +extern int FASTCALL(run_scheduled_work(struct work_struct *work)); extern int FASTCALL(schedule_delayed_work(struct delayed_work *work, unsigned long delay)); extern int schedule_delayed_work_on(int cpu, struct delayed_work *work, unsigned long delay); diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 8d1e7cb..fcacf06 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -103,6 +103,79 @@ static inline void *get_wq_data(struct work_struct *work) return (void *) (work->management & WORK_STRUCT_WQ_DATA_MASK); } +static int __run_work(struct cpu_workqueue_struct *cwq, struct work_struct *work) +{ + int ret = 0; + unsigned long flags; + + spin_lock_irqsave(&cwq->lock, flags); + /* + * We need to re-validate the work info after we've gotten + * the cpu_workqueue lock. We can run the work now iff: + * + * - the wq_data still matches the cpu_workqueue_struct + * - AND the work is still marked pending + * - AND the work is still on a list (which will be this + * workqueue_struct list) + * + * All these conditions are important, because we + * need to protect against the work being run right + * now on another CPU (all but the last one might be + * true if it's currently running and has not been + * released yet, for example). 
+ */ + if (get_wq_data(work) == cwq + && test_bit(WORK_STRUCT_PENDING, &work->management) + && !list_empty(&work->entry)) { + work_func_t f = work->func; + list_del_init(&work->entry); + spin_unlock_irqrestore(&cwq->lock, flags); + + if (!test_bit(WORK_STRUCT_NOAUTOREL, &work->management)) + work_release(work); + f(work); + + spin_lock_irqsave(&cwq->lock, flags); + cwq->remove_sequence++; + wake_up(&cwq->work_done); + ret = 1; + } + spin_unlock_irqrestore(&cwq->lock, flags); + return ret; +} + +/** + * run_scheduled_work - run scheduled work synchronously + * @work: work to run + * + * This checks if the work was pending, and runs it + * synchronously if so. It returns a boolean to indicate + * whether it had any scheduled work to run or not. + * + * NOTE! This _only_ works for normal work_structs. You + * CANNOT use this for delayed work, because the wq data + * for delayed work will not point properly to the per- + * CPU workqueue struct, but will change! + */ +int fastcall run_scheduled_work(struct work_struct *work) +{ + for (;;) { + struct cpu_workqueue_struct *cwq; + + if (!test_bit(WORK_STRUCT_PENDING, &work->management)) + return 0; + if (list_empty(&work->entry)) + return 0; + /* NOTE! This depends intimately on __queue_work! */ + cwq = get_wq_data(work); + if (!cwq) + return 0; + if (__run_work(cwq, work)) + return 1; + } +} +EXPORT_SYMBOL(run_scheduled_work); + /* Preempt must be disabled. */ static void __queue_work(struct cpu_workqueue_struct *cwq, struct work_struct *work) ^ permalink raw reply related [flat|nested] 54+ messages in thread
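The semantics the patch above gives run_scheduled_work() -- run the callback synchronously iff the work is still pending and still queued, otherwise report there was nothing to do -- can be sketched as a single-threaded userspace model. All names below are illustrative stand-ins, not the kernel API; the real version must additionally take the per-CPU workqueue spinlock and re-validate both conditions under it, because another CPU may be running the work concurrently.

```c
#include <assert.h>

/*
 * Single-threaded model of run_scheduled_work():
 * run the callback synchronously iff the work is still marked pending
 * AND still sitting on a queue.  Illustrative sketch only.
 */
struct work_model {
	int pending;			/* models WORK_STRUCT_PENDING */
	int queued;			/* models !list_empty(&work->entry) */
	void (*func)(struct work_model *);
	int runs;			/* how many times func has fired */
};

static void work_fn(struct work_model *w)
{
	w->runs++;
}

static void schedule_work_model(struct work_model *w)
{
	w->pending = 1;
	w->queued = 1;
}

static int run_scheduled_work_model(struct work_model *w)
{
	if (!w->pending || !w->queued)
		return 0;	/* nothing pending, or it is being run elsewhere */
	w->queued = 0;		/* list_del_init(&work->entry) */
	w->pending = 0;		/* work_release(work) */
	w->func(w);		/* run it synchronously, lock dropped */
	return 1;
}
```

Scheduling and then running returns 1 exactly once; a second call finds nothing pending and returns 0 without touching the callback, which is why the phy_stop_interrupts() caller no longer needs the current_is_keventd() special case.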
* Re: [PATCH] Export current_is_keventd() for libphy 2006-12-06 17:17 ` Linus Torvalds @ 2006-12-06 17:43 ` David Howells 2006-12-06 17:50 ` Jeff Garzik 2006-12-06 17:53 ` Linus Torvalds 2006-12-07 1:21 ` Linus Torvalds 1 sibling, 2 replies; 54+ messages in thread From: David Howells @ 2006-12-06 17:43 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, David Howells, Maciej W. Rozycki, Roland Dreier, Andy Fleming, Ben Collins, Linux Kernel Mailing List, Jeff Garzik Linus Torvalds <torvalds@osdl.org> wrote: > How about something like this? At first glance, this looks reasonable. It also looks like it should be used to replace a lot of the cancel_delayed_work() calls that attempt to cancel _undelayed_ work items. That would allow a number of work items to be downgraded from delayed_work to work_struct. Also, the name "run_scheduled_work" sort of suggests that the work *will* be run regardless of whether it was pending or not. Given the confusion over cancel_delayed_work(), I imagine this will rain confusion too. + if (get_wq_data(work) == cwq + && test_bit(WORK_STRUCT_PENDING, &work->management) I wonder if those can be combined, perhaps: + if ((work->management & ~WORK_STRUCT_NOAUTOREL) == + ((unsigned long) cwq | (1 << WORK_STRUCT_PENDING)) Otherwise for i386 the compiler can't combine them because test_bit() is done with inline asm. And: + if (!test_bit(WORK_STRUCT_PENDING, &work->management)) Should possibly be: + if (!work_pending(work)) David ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] Export current_is_keventd() for libphy 2006-12-06 17:43 ` David Howells @ 2006-12-06 17:50 ` Jeff Garzik 2006-12-06 18:07 ` Linus Torvalds 0 siblings, 1 reply; 54+ messages in thread From: Jeff Garzik @ 2006-12-06 17:50 UTC (permalink / raw) To: David Howells Cc: Linus Torvalds, Andrew Morton, Maciej W. Rozycki, Roland Dreier, Andy Fleming, Ben Collins, Linux Kernel Mailing List Honestly, I wonder if some of these situations really want "kill_scheduled_work_unless_it_is_already_running_right_now_if_so_wait_for_it" Since it's during shutdown, usually the task just wants to know that the code in the workqueue won't be touching the hardware or data structures after <this> point. Jeff ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] Export current_is_keventd() for libphy 2006-12-06 17:50 ` Jeff Garzik @ 2006-12-06 18:07 ` Linus Torvalds 0 siblings, 0 replies; 54+ messages in thread From: Linus Torvalds @ 2006-12-06 18:07 UTC (permalink / raw) To: Jeff Garzik Cc: David Howells, Andrew Morton, Maciej W. Rozycki, Roland Dreier, Andy Fleming, Ben Collins, Linux Kernel Mailing List On Wed, 6 Dec 2006, Jeff Garzik wrote: > > Honestly I wonder if some of these situations really want > "kill_scheduled_work_unless_it_is_already_running_right_now_if_so_wait_for_it" We could do that, and the code to do it would be fairly close to what the "run it" code is. Just replace the if (!test_bit(WORK_STRUCT_NOAUTOREL, &work->management)) work_release(work); f(work); with an unconditional work_release(work); instead. However, I think I'd prefer my patch for now, if only because that "work_release()" thing kind of worries me. You're supposed to either do the release yourself in the work function _after_ you've done all book-keeping, or if the thing doesn't need any book-keeping at all, you can do the "autorelease" thing. The "kill without running" breaks that conceptual model. So a "kill_work()" usage should almost always end up being something like if (kill_work(work)) release anything that running the work function would release but then for the (probably common) case where there is nothing that the work function releases, that would obviously boil down to just a kill_work(work); However, my point is that the _workqueue_ logic cannot know what the user of the workqueue wants, so the "run_scheduled_work()" approach is much saner in this respect. NOTE NOTE NOTE! In neither case can we do anything at all about a workqueue entry that is actually _already_ running on another CPU. My suggested patch will simply not run it at all synchronously (and return 0), and a "kill_work()" thing couldn't do anything either. The required synchronization for something like that simply doesn't exist. 
Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] Export current_is_keventd() for libphy 2006-12-06 17:43 ` David Howells 2006-12-06 17:50 ` Jeff Garzik @ 2006-12-06 17:53 ` Linus Torvalds 2006-12-06 17:58 ` Linus Torvalds 2006-12-06 18:02 ` David Howells 1 sibling, 2 replies; 54+ messages in thread From: Linus Torvalds @ 2006-12-06 17:53 UTC (permalink / raw) To: David Howells Cc: Andrew Morton, Maciej W. Rozycki, Roland Dreier, Andy Fleming, Ben Collins, Linux Kernel Mailing List, Jeff Garzik On Wed, 6 Dec 2006, David Howells wrote: > > + if (get_wq_data(work) == cwq > + && test_bit(WORK_STRUCT_PENDING, &work->management) > > I wonder if those can be combined, perhaps: Gcc should do it for us, afaik. I didn't check, but gcc is generally pretty good at combining logical operations like this, because it's very common. > Otherwise for i386 the compiler can't combine them because test_bit() is done > with inline asm. Nope. Look again. test_bit() with a constant number is done very much in C, and very much on purpose. _Exactly_ to allow the compiler to combine these kinds of things. > And: > > + if (!test_bit(WORK_STRUCT_PENDING, &work->management)) > > Should possibly be: > > + if (!work_pending(work)) Yeah, that's a worthy cleanup. Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] Export current_is_keventd() for libphy 2006-12-06 17:53 ` Linus Torvalds @ 2006-12-06 17:58 ` Linus Torvalds 2006-12-06 18:33 ` Linus Torvalds 2006-12-06 18:02 ` David Howells 1 sibling, 1 reply; 54+ messages in thread From: Linus Torvalds @ 2006-12-06 17:58 UTC (permalink / raw) To: David Howells Cc: Andrew Morton, Maciej W. Rozycki, Roland Dreier, Andy Fleming, Ben Collins, Linux Kernel Mailing List, Jeff Garzik On Wed, 6 Dec 2006, Linus Torvalds wrote: > > Gcc should do it for us, afaik. I didn't check, but gcc is generally > pretty good at combining logical operations like this, because it's very > common. Sadly, gcc doesn't do it in this case. Still, I'd rather keep the source clean, and hope that the compiler improves eventually, than to make the code uglier. I replaced both of the "test_bit(WORK_STRUCT_PENDING, &work->management)" with "work_pending(work)" in my tree. Now somebody who knows how to trigger this thing should just _test_ it. Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] Export current_is_keventd() for libphy 2006-12-06 17:58 ` Linus Torvalds @ 2006-12-06 18:33 ` Linus Torvalds 2006-12-06 18:37 ` Linus Torvalds 2006-12-06 18:43 ` David Howells 0 siblings, 2 replies; 54+ messages in thread From: Linus Torvalds @ 2006-12-06 18:33 UTC (permalink / raw) To: David Howells Cc: Andrew Morton, Maciej W. Rozycki, Roland Dreier, Andy Fleming, Ben Collins, Linux Kernel Mailing List, Jeff Garzik, linux-arch On Wed, 6 Dec 2006, Linus Torvalds wrote: > > Sadly, gcc doesn't do it in this case. Still, I'd rather keep the source > clean, and hope that the compiler improves eventually, than to make the > code uglier. Actually, it's our own damn fault. We've long had our arguments declared "const volatile" to test_bit(), which basically means that gcc can't do the optimization. Damn. And they need to be "volatile" not because we _want_ the thing to be volatile, but because these things are occasionally used on volatile objects (so if the function doesn't take a volatile pointer, it would warn). That's why so many of these helper functions use "const volatile" pointers, which on the face of it would seem strange if you don't realize that it's more about a C type issue than about the _actual_ type being either "const" or "volatile". Sad. I guess we could remove the "const volatile" from the _cast_, but the thing is, if the underlying object we're actually looking at really _is_ volatile, we shouldn't do that. GAAH. Really sad. I doubt any of the callers actually want the "volatile" access at all. Things like <linux/page-flags.h> definitely _don't_ want it. I suspect we should just face up to the fact that (a) "volatile" on kernel data is basically always a bug, and you should use locking. "volatile" doesn't help anything at all with memory ordering and friends, so it's insane to think it "solves" anything on its own.
(b) on "iomem" pointers it does make sense, but those need special accessor functions _anyway_, so things like test_bit() wouldn't work on them. (c) if you spin on a value changing, you should use "cpu_relax()" or "barrier()" anyway, which will force gcc to re-load any values from memory over the loop. and remove the "volatile" from all the bitop accessor functions. Or at least from "test_bit()". It's not like it's synchronous _anyway_ (there's no memory barriers etc). Hmm? Comments? linux-arch added to Cc, just in case people care (this particular thing is in <asm-*/bitops.h>, so it _is_ architecture- specific, but we should probably all agree on it first) Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] Export current_is_keventd() for libphy 2006-12-06 18:33 ` Linus Torvalds @ 2006-12-06 18:37 ` Linus Torvalds 2006-12-06 18:43 ` David Howells 1 sibling, 0 replies; 54+ messages in thread From: Linus Torvalds @ 2006-12-06 18:37 UTC (permalink / raw) To: David Howells Cc: Andrew Morton, Maciej W. Rozycki, Roland Dreier, Andy Fleming, Ben Collins, Linux Kernel Mailing List, Jeff Garzik, linux-arch On Wed, 6 Dec 2006, Linus Torvalds wrote: > > and remove the "volatile" from all the bitop accessor functions. It might also be interesting to see if this would change code-size at all. There's a number of things that check different bits in the same word right now, and they just reload the word unnecessarily and do multiple tests. Some of the page flags functions obviously already work around this by doing horrible things by hand instead, eg: (page->flags & ( 1 << PG_lru | 1 << PG_private | 1 << PG_locked | 1 << PG_active | 1 << PG_reclaim | 1 << PG_slab | 1 << PG_swapcache | 1 << PG_writeback | 1 << PG_reserved | 1 << PG_buddy )) in the free_pages_check() thing. It may make sense there, but we really _should_ allow gcc to just do things like this for us, and just use the proper functions to test bits. Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
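The combining being discussed is easy to see with a plain-C, non-volatile test_bit() lookalike. This is only a sketch of the idea -- the kernel's test_bit() is per-architecture and the names below are illustrative -- but it shows that several single-bit tests on the same word are equivalent to one masked load, which is exactly the form free_pages_check() open-codes by hand.

```c
#include <assert.h>

#define BITS_PER_LONG_MODEL (8 * (int)sizeof(unsigned long))

/*
 * Plain-C test_bit() lookalike with no volatile qualifier on addr,
 * so the compiler is free to fold several calls on the same word
 * into a single masked load.
 */
static int test_bit_model(int nr, const unsigned long *addr)
{
	return (addr[nr / BITS_PER_LONG_MODEL] >> (nr % BITS_PER_LONG_MODEL)) & 1;
}

/* Two equivalent ways of asking "is any of bits 1, 3, 5 set?" */
static int check_one_by_one(const unsigned long *flags)
{
	return test_bit_model(1, flags) ||
	       test_bit_model(3, flags) ||
	       test_bit_model(5, flags);
}

static int check_combined(const unsigned long *flags)
{
	return (*flags & ((1UL << 1) | (1UL << 3) | (1UL << 5))) != 0;
}
```

With the volatile qualifier present, the compiler must perform three separate loads in check_one_by_one(); without it, the two functions can compile to the same code.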
* Re: [PATCH] Export current_is_keventd() for libphy 2006-12-06 18:33 ` Linus Torvalds 2006-12-06 18:37 ` Linus Torvalds @ 2006-12-06 18:43 ` David Howells 2006-12-06 19:02 ` Linus Torvalds 1 sibling, 1 reply; 54+ messages in thread From: David Howells @ 2006-12-06 18:43 UTC (permalink / raw) To: Linus Torvalds Cc: David Howells, Andrew Morton, Maciej W. Rozycki, Roland Dreier, Andy Fleming, Ben Collins, Linux Kernel Mailing List, Jeff Garzik, linux-arch Linus Torvalds <torvalds@osdl.org> wrote: > (a) "volatile" on kernel data is basically always a bug, and you should > use locking. But what about when you're building a lock? Actually, I suspect correct usage of asm constraints and memory barriers trumps volatile anyway even there. David ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] Export current_is_keventd() for libphy 2006-12-06 18:43 ` David Howells @ 2006-12-06 19:02 ` Linus Torvalds 0 siblings, 0 replies; 54+ messages in thread From: Linus Torvalds @ 2006-12-06 19:02 UTC (permalink / raw) To: David Howells Cc: Andrew Morton, Maciej W. Rozycki, Roland Dreier, Andy Fleming, Ben Collins, Linux Kernel Mailing List, Jeff Garzik, linux-arch On Wed, 6 Dec 2006, David Howells wrote: > > Linus Torvalds <torvalds@osdl.org> wrote: > > > (a) "volatile" on kernel data is basically always a bug, and you should > > use locking. > > But what about when you're building a lock? Actually, I suspect correct usage > of asm constraints and memory barriers trumps volatile anyway even there. The word you look for is not "suspect". You _cannot_ build a lock using "volatile", unless your CPU is strictly in-order and has an in-order memory subsystem too (so, for example, while all ia64 implementations today are in-order, they do /not/ have an in-order memory subsystem). Only then could you do locking with volatile and some crazy Peterson's algorithm. I don't think any such CPU actually exists. Anyway, we've had this discussion before on linux-kernel, it really boils down to that "volatile" is basically never correct with the exception of flags that don't have any meaning and that you don't actually _care_ about the exact value (the low word of "jiffies" being the canonical example of something where "volatile" is actually fine, and where - as long as you can load it atomically - "volatile" really does make sense). Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] Export current_is_keventd() for libphy 2006-12-06 17:53 ` Linus Torvalds 2006-12-06 17:58 ` Linus Torvalds @ 2006-12-06 18:02 ` David Howells 1 sibling, 0 replies; 54+ messages in thread From: David Howells @ 2006-12-06 18:02 UTC (permalink / raw) To: Linus Torvalds Cc: David Howells, Andrew Morton, Maciej W. Rozycki, Roland Dreier, Andy Fleming, Ben Collins, Linux Kernel Mailing List, Jeff Garzik Linus Torvalds <torvalds@osdl.org> wrote: > test_bit() with a constant number is done very much in C, and very much on > purpose. _Exactly_ to allow the compiler to combine these kinds of things. Ah... I've read that several times, and each time I've assumed it's referring to *addr not nr as being constant. David ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] Export current_is_keventd() for libphy 2006-12-06 17:17 ` Linus Torvalds 2006-12-06 17:43 ` David Howells @ 2006-12-07 1:21 ` Linus Torvalds 2006-12-07 6:42 ` Andrew Morton 2006-12-07 15:28 ` Maciej W. Rozycki 1 sibling, 2 replies; 54+ messages in thread From: Linus Torvalds @ 2006-12-07 1:21 UTC (permalink / raw) To: Andrew Morton, David Howells Cc: Maciej W. Rozycki, Roland Dreier, Andy Fleming, Ben Collins, Linux Kernel Mailing List, Jeff Garzik On Wed, 6 Dec 2006, Linus Torvalds wrote: > > How about something like this? I didn't get any answers on this. I'd like to get this issue resolved, but since I don't even use libphy on my main machine, I need somebody else to test it for me. Just to remind you all, here's the patch again. This is identical to the previous version except for the trivial cleanup to use "work_pending()" instead of open-coding it in two places. Linus ---- diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c index 4044bb1..e175f39 100644 --- a/drivers/net/phy/phy.c +++ b/drivers/net/phy/phy.c @@ -587,8 +587,7 @@ int phy_stop_interrupts(struct phy_device *phydev) * Finish any pending work; we might have been scheduled * to be called from keventd ourselves, though. 
*/ - if (!current_is_keventd()) - flush_scheduled_work(); + run_scheduled_work(&phydev->phy_queue); free_irq(phydev->irq, phydev); diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h index 4a3ea83..a601ed5 100644 --- a/include/linux/workqueue.h +++ b/include/linux/workqueue.h @@ -160,6 +160,7 @@ extern int queue_delayed_work_on(int cpu, struct workqueue_struct *wq, extern void FASTCALL(flush_workqueue(struct workqueue_struct *wq)); extern int FASTCALL(schedule_work(struct work_struct *work)); +extern int FASTCALL(run_scheduled_work(struct work_struct *work)); extern int FASTCALL(schedule_delayed_work(struct delayed_work *work, unsigned long delay)); extern int schedule_delayed_work_on(int cpu, struct delayed_work *work, unsigned long delay); diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 8d1e7cb..36f9b78 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -103,6 +103,79 @@ static inline void *get_wq_data(struct work_struct *work) return (void *) (work->management & WORK_STRUCT_WQ_DATA_MASK); } +static int __run_work(struct cpu_workqueue_struct *cwq, struct work_struct *work) +{ + int ret = 0; + unsigned long flags; + + spin_lock_irqsave(&cwq->lock, flags); + /* + * We need to re-validate the work info after we've gotten + * the cpu_workqueue lock. We can run the work now iff: + * + * - the wq_data still matches the cpu_workqueue_struct + * - AND the work is still marked pending + * - AND the work is still on a list (which will be this + * workqueue_struct list) + * + * All these conditions are important, because we + * need to protect against the work being run right + * now on another CPU (all but the last one might be + * true if it's currently running and has not been + * released yet, for example). 
+ */ + if (get_wq_data(work) == cwq + && work_pending(work) + && !list_empty(&work->entry)) { + work_func_t f = work->func; + list_del_init(&work->entry); + spin_unlock_irqrestore(&cwq->lock, flags); + + if (!test_bit(WORK_STRUCT_NOAUTOREL, &work->management)) + work_release(work); + f(work); + + spin_lock_irqsave(&cwq->lock, flags); + cwq->remove_sequence++; + wake_up(&cwq->work_done); + ret = 1; + } + spin_unlock_irqrestore(&cwq->lock, flags); + return ret; +} + +/** + * run_scheduled_work - run scheduled work synchronously + * @work: work to run + * + * This checks if the work was pending, and runs it + * synchronously if so. It returns a boolean to indicate + * whether it had any scheduled work to run or not. + * + * NOTE! This _only_ works for normal work_structs. You + * CANNOT use this for delayed work, because the wq data + * for delayed work will not point properly to the per- + * CPU workqueue struct, but will change! + */ +int fastcall run_scheduled_work(struct work_struct *work) +{ + for (;;) { + struct cpu_workqueue_struct *cwq; + + if (!work_pending(work)) + return 0; + if (list_empty(&work->entry)) + return 0; + /* NOTE! This depends intimately on __queue_work! */ + cwq = get_wq_data(work); + if (!cwq) + return 0; + if (__run_work(cwq, work)) + return 1; + } +} +EXPORT_SYMBOL(run_scheduled_work); + /* Preempt must be disabled. */ static void __queue_work(struct cpu_workqueue_struct *cwq, struct work_struct *work) ^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: [PATCH] Export current_is_keventd() for libphy 2006-12-07 1:21 ` Linus Torvalds @ 2006-12-07 6:42 ` Andrew Morton 2006-12-07 7:49 ` Andrew Morton 2006-12-07 16:49 ` Linus Torvalds 2006-12-07 15:28 ` Maciej W. Rozycki 1 sibling, 2 replies; 54+ messages in thread From: Andrew Morton @ 2006-12-07 6:42 UTC (permalink / raw) To: Linus Torvalds Cc: David Howells, Maciej W. Rozycki, Roland Dreier, Andy Fleming, Ben Collins, Linux Kernel Mailing List, Jeff Garzik On Wed, 6 Dec 2006 17:21:50 -0800 (PST) Linus Torvalds <torvalds@osdl.org> wrote: > > > On Wed, 6 Dec 2006, Linus Torvalds wrote: > > > > How about something like this? > > I didn't get any answers on this. I'd like to get this issue resolved, but > since I don't even use libphy on my main machine, I need somebody else to > test it for me. > > Just to remind you all, here's the patch again. This is identical to the > previous version except for the trivial cleanup to use "work_pending()" > instead of open-coding it in two places. > > Linus > > ... > > +static int __run_work(struct cpu_workqueue_struct *cwq, struct work_struct *work) > +{ > + int ret = 0; > + unsigned long flags; > + > + spin_lock_irqsave(&cwq->lock, flags); > + /* > + * We need to re-validate the work info after we've gotten > + * the cpu_workqueue lock. We can run the work now iff: > + * > + * - the wq_data still matches the cpu_workqueue_struct > + * - AND the work is still marked pending > + * - AND the work is still on a list (which will be this > + * workqueue_struct list) > + * > + * All these conditions are important, because we > + * need to protect against the work being run right > + * now on another CPU (all but the last one might be > + * true if it's currently running and has not been > + * released yet, for example). 
> + */ > + if (get_wq_data(work) == cwq > + && work_pending(work) > + && !list_empty(&work->entry)) { > + work_func_t f = work->func; > + list_del_init(&work->entry); > + spin_unlock_irqrestore(&cwq->lock, flags); > + > + if (!test_bit(WORK_STRUCT_NOAUTOREL, &work->management)) > + work_release(work); > + f(work); > + > + spin_lock_irqsave(&cwq->lock, flags); > + cwq->remove_sequence++; > + wake_up(&cwq->work_done); > + ret = 1; > + } > + spin_unlock_irqrestore(&cwq->lock, flags); > + return ret; > +} > + > +/** > + * run_scheduled_work - run scheduled work synchronously > + * @work: work to run > + * > + * This checks if the work was pending, and runs it > + * synchronously if so. It returns a boolean to indicate > + * whether it had any scheduled work to run or not. > + * > + * NOTE! This _only_ works for normal work_structs. You > + * CANNOT use this for delayed work, because the wq data > + * for delayed work will not point properly to the per- > + * CPU workqueue struct, but will change! > + */ > +int fastcall run_scheduled_work(struct work_struct *work) > +{ > + for (;;) { > + struct cpu_workqueue_struct *cwq; > + > + if (!work_pending(work)) > + return 0; But this will return to the caller if the callback is presently running on a different CPU. The whole point here is to be able to reliably kill off the pending work so that the caller can free resources. > + if (list_empty(&work->entry)) > + return 0; > + /* NOTE! This depends intimately on __queue_work! */ > + cwq = get_wq_data(work); > + if (!cwq) > + return 0; > + if (__run_work(cwq, work)) > + return 1; > + } > +} > +EXPORT_SYMBOL(run_scheduled_work); Also, I worry that this code can run the callback on the caller's CPU. Users of per-cpu workqueues can legitimately assume that each callback runs on the right CPU. I doubt if many callers _do_ do that - there's schedule_delayed_work_on(), but that's a bit different. A solution to both problems is of course to block the caller if the callback is running. 
We can perhaps borrow cwq->work_done for that. But I wouldn't want to think about an implementation as long as we have that WORK_STRUCT_NOAUTOREL horror in there. Can we just nuke that? Only three drivers need it and I bet they can be modified to use the usual mechanisms. ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] Export current_is_keventd() for libphy 2006-12-07 6:42 ` Andrew Morton @ 2006-12-07 7:49 ` Andrew Morton 2006-12-07 10:29 ` David Howells 2006-12-07 16:49 ` Linus Torvalds 1 sibling, 1 reply; 54+ messages in thread From: Andrew Morton @ 2006-12-07 7:49 UTC (permalink / raw) To: Linus Torvalds, David Howells, Maciej W. Rozycki, Roland Dreier, Andy Fleming, Ben Collins, Linux Kernel Mailing List, Jeff Garzik On Wed, 6 Dec 2006 22:42:07 -0800 Andrew Morton <akpm@osdl.org> wrote: > But I wouldn't want to think about an implementation as long as we have > that WORK_STRUCT_NOAUTOREL horror in there. Can we just nuke that? Only > three drivers need it and I bet they can be modified to use the usual > mechanisms. I guess I don't understand exactly what problem the noautorel stuff is trying to solve. It _seems_ to me that in all cases we can simply stuff the old `data' field in alongside the controlling work_struct or delayed_work which wants to operate on it. Bridge is the simple case.. 
diff -puN net/bridge/br_private.h~bridge-avoid-using-noautorel-workqueues net/bridge/br_private.h --- a/net/bridge/br_private.h~bridge-avoid-using-noautorel-workqueues +++ a/net/bridge/br_private.h @@ -83,6 +83,7 @@ struct net_bridge_port struct timer_list message_age_timer; struct kobject kobj; struct delayed_work carrier_check; + struct net_device *carrier_check_dev; struct rcu_head rcu; }; diff -puN net/bridge/br_if.c~bridge-avoid-using-noautorel-workqueues net/bridge/br_if.c --- a/net/bridge/br_if.c~bridge-avoid-using-noautorel-workqueues +++ a/net/bridge/br_if.c @@ -83,14 +83,11 @@ static void port_carrier_check(struct wo struct net_device *dev; struct net_bridge *br; - dev = container_of(work, struct net_bridge_port, - carrier_check.work)->dev; - work_release(work); - + p = container_of(work, struct net_bridge_port, carrier_check.work); + dev = p->carrier_check_dev; rtnl_lock(); - p = dev->br_port; - if (!p) - goto done; + if (!dev->br_port) + goto done; /* Can this happen? */ br = p->br; if (netif_carrier_ok(dev)) @@ -280,7 +277,8 @@ static struct net_bridge_port *new_nbp(s p->port_no = index; br_init_port(p); p->state = BR_STATE_DISABLED; - INIT_DELAYED_WORK_NAR(&p->carrier_check, port_carrier_check); + p->carrier_check_dev = dev; + INIT_DELAYED_WORK(&p->carrier_check, port_carrier_check); br_stp_port_timer_init(p); kobject_init(&p->kobj); _ What am I missing? ^ permalink raw reply [flat|nested] 54+ messages in thread
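The bridge conversion above follows a general pattern: embed the work item in the object it operates on, stash any extra state (here the device pointer) alongside it, and recover the object in the callback with container_of() instead of relying on a work-layer `data' field. A minimal userspace sketch of that pattern, with illustrative stand-in types rather than the actual bridge structures:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types used in this sketch. */
struct work_struct {
	void (*func)(struct work_struct *);
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct net_device_model {
	int carrier_ok;
};

struct net_bridge_port_model {
	int carrier_checked;
	struct net_device_model *carrier_check_dev; /* old `data', now alongside */
	struct work_struct carrier_check;           /* embedded work item */
};

static void port_carrier_check_model(struct work_struct *work)
{
	/* Recover the containing port from the embedded work item. */
	struct net_bridge_port_model *p =
		container_of(work, struct net_bridge_port_model, carrier_check);

	/* The callback reaches its own state directly; no work_release()
	 * dance is needed to keep a detached data pointer alive. */
	p->carrier_checked = 1;
}
```

The object's lifetime then protects everything the callback touches, which is the point of Andrew's follow-up: whoever frees the containing structure must flush the work first anyway, so the embedded layout is safe wherever the old `data' layout was.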
* Re: [PATCH] Export current_is_keventd() for libphy 2006-12-07 7:49 ` Andrew Morton @ 2006-12-07 10:29 ` David Howells 2006-12-07 10:42 ` Andrew Morton 0 siblings, 1 reply; 54+ messages in thread From: David Howells @ 2006-12-07 10:29 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, David Howells, Maciej W. Rozycki, Roland Dreier, Andy Fleming, Ben Collins, Linux Kernel Mailing List, Jeff Garzik Andrew Morton <akpm@osdl.org> wrote: > I guess I don't understand exactly what problem the noautorel stuff is > trying to solve. It _seems_ to me that in all cases we can simply stuff > the old `data' field in alongside the controlling work_struct or > delayed_work which wants to operate on it. The problem is that you have to be able to guarantee that the data is still accessible once you clear the pending bit. The pending bit is your only guaranteed protection, and once it is clear, the containing structure might be deallocated. I would like to be able to get rid of the NAR bit too, but I'm not confident that in all cases I can. It'll take a bit more study of the code to be able to do that. David ^ permalink raw reply [flat|nested] 54+ messages in thread
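The lifetime rule David describes can be made concrete. Below is a kernel-style sketch of the race the non-auto-release (NAR) bit is meant to close; it is illustrative pseudocode, not real driver code, and `my_dev` and its fields are invented:

```c
/* With auto-release semantics the workqueue core clears
 * WORK_STRUCT_PENDING *before* calling the handler, which permits
 * this interleaving:
 *
 *   worker CPU                         another CPU
 *   ----------                        -----------
 *   clear PENDING bit
 *                                     sees the work as idle,
 *                                     concludes teardown is safe,
 *                                     kfree(dev)
 *   handler dereferences dev     <--  use after free
 *
 * With the NAR variant, PENDING stays set until the handler itself
 * calls work_release(), so the handler can pull out what it needs
 * while the containing structure is still guaranteed to exist:
 */
static void my_handler(struct work_struct *work)
{
	struct my_dev *dev = container_of(work, struct my_dev, work);
	void *data = dev->data;		/* safe: PENDING still set */

	work_release(work);		/* from here on, dev may vanish */
	process(data);			/* must not touch dev again */
}
```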
* Re: [PATCH] Export current_is_keventd() for libphy 2006-12-07 10:29 ` David Howells @ 2006-12-07 10:42 ` Andrew Morton 2006-12-07 17:05 ` Jeff Garzik 0 siblings, 1 reply; 54+ messages in thread From: Andrew Morton @ 2006-12-07 10:42 UTC (permalink / raw) To: David Howells Cc: torvalds, dhowells, macro, rdreier, afleming, ben.collins, linux-kernel, jeff > On Thu, 07 Dec 2006 10:29:39 +0000 David Howells <dhowells@redhat.com> wrote: > Andrew Morton <akpm@osdl.org> wrote: > > > I guess I don't understand exactly what problem the noautorel stuff is > > trying to solve. It _seems_ to me that in all cases we can simply stuff > > the old `data' field in alongside the controlling work_struct or > > delayed_work which wants to operate on it. > > The problem is that you have to be able to guarantee that the data is still > accessible once you clear the pending bit. The pending bit is your only > guaranteed protection, and once it is clear, the containing structure might be > deallocated. > > I would like to be able to get rid of the NAR bit too, but I'm not confident > that in all cases I can. It'll take a bit more study of the code to be able > to do that. > But anyone who is going to free the structure which contains the work_struct would need to run flush_workqueue() beforehand, after having ensured that the work won't reschedule itself. So the struct-which-contains-the-work_struct is safe during the callback's execution. If that's not being done then the code was buggy in 2.6.19, too.. ^ permalink raw reply [flat|nested] 54+ messages in thread
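Andrew's counter-argument is that the pending-bit protection is redundant if owners follow the normal teardown ordering. A hedged sketch of that ordering, using the 2.6.19-era workqueue API (names invented):

```c
static void my_dev_destroy(struct my_dev *dev)
{
	/* 1. Ensure the handler can no longer (re)arm itself. */
	dev->shutting_down = 1;
	cancel_delayed_work(&dev->dwork);

	/* 2. Wait out any instance already executing on another CPU. */
	flush_scheduled_work();

	/* 3. Only now is the structure embedding the work_struct
	 *    guaranteed unreferenced by keventd, so freeing is safe --
	 *    whether or not the pending bit was auto-released. */
	kfree(dev);
}
```

If a driver frees the structure without steps 1 and 2, it was racy regardless of release semantics, which is the sense in which such code "was buggy in 2.6.19, too".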
* Re: [PATCH] Export current_is_keventd() for libphy 2006-12-07 10:42 ` Andrew Morton @ 2006-12-07 17:05 ` Jeff Garzik 2006-12-07 17:57 ` Andrew Morton ` (2 more replies) 0 siblings, 3 replies; 54+ messages in thread From: Jeff Garzik @ 2006-12-07 17:05 UTC (permalink / raw) To: Andrew Morton, torvalds, macro Cc: David Howells, rdreier, afleming, ben.collins, linux-kernel Yes, I merged the code, but looking deeper at phy it's clear I missed some things. Looking into libphy's workqueue stuff, it has the following sequence: disable interrupts schedule_work() ... time passes ... ... workqueue routine is called ... enable interrupts handle interrupt I really have to question whether a workqueue was the best choice of direction for such a sequence. You don't want to put off handling an interrupt, with interrupts disabled, for a potentially unbounded amount of time. Maybe the best course of action is to mark it with CONFIG_BROKEN until it gets fixed. Jeff ^ permalink raw reply [flat|nested] 54+ messages in thread
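For reference, the sequence Jeff is questioning has roughly this shape. This is a condensed sketch of the 2.6.19-era libphy flow, not the literal drivers/net/phy/phy.c code, and it uses the pre-2.6.20 `void *` work-handler convention:

```c
/* Hard-IRQ half: mask the PHY's line, defer the real work. */
static irqreturn_t phy_interrupt(int irq, void *phy_dat)
{
	struct phy_device *phydev = phy_dat;

	disable_irq_nosync(irq);		/* mask this PHY's line...  */
	schedule_work(&phydev->phy_queue);	/* ...and defer to keventd  */
	return IRQ_HANDLED;
}

/* keventd half: runs whenever keventd gets around to it, with the
 * interrupt line still masked the whole time. */
static void phy_change(void *data)
{
	struct phy_device *phydev = data;

	phy_clear_interrupt(phydev);		/* ack the PHY (MDIO I/O) */
	enable_irq(phydev->irq);		/* finally unmask the line */
}
```

The unbounded-latency complaint is the gap between `disable_irq_nosync()` and the eventual `enable_irq()`: the device's line stays masked for as long as keventd is backed up.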
* Re: [PATCH] Export current_is_keventd() for libphy 2006-12-07 17:05 ` Jeff Garzik @ 2006-12-07 17:57 ` Andrew Morton 2006-12-07 18:17 ` Andrew Morton 2006-12-08 16:52 ` [PATCH] group xtime, xtime_lock, wall_to_monotonic, avenrun, calc_load_count fields together in ktimed Eric Dumazet 2006-12-07 18:08 ` [PATCH] Export current_is_keventd() for libphy Maciej W. Rozycki 2006-12-07 18:59 ` Andy Fleming 2 siblings, 2 replies; 54+ messages in thread From: Andrew Morton @ 2006-12-07 17:57 UTC (permalink / raw) To: Jeff Garzik Cc: torvalds, macro, David Howells, rdreier, afleming, ben.collins, linux-kernel On Thu, 07 Dec 2006 12:05:38 -0500 Jeff Garzik <jeff@garzik.org> wrote: > Yes, I merged the code, but looking deeper at phy its clear I missed > some things. > > Looking into libphy's workqueue stuff, it has the following sequence: > > disable interrupts > schedule_work() > > ... time passes ... > ... workqueue routine is called ... > > enable interrupts > handle interrupt > > I really have to question if a workqueue was the best choice of > direction for such a sequence. You don't want to put off handling an > interrupt, with interrupts disabled, for a potentially unbounded amount > of time. That'll lock the box on UP, or if the timer fires on the current CPU? > Maybe the best course of action is to mark it with CONFIG_BROKEN until > it gets fixed. hm, maybe. I wonder if as a short-term palliative we could remove the current_is_keventd() call and drop rtnl_lock. Or export current_is_keventd ;) ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] Export current_is_keventd() for libphy 2006-12-07 17:57 ` Andrew Morton @ 2006-12-07 18:17 ` Andrew Morton 2006-12-08 16:52 ` [PATCH] group xtime, xtime_lock, wall_to_monotonic, avenrun, calc_load_count fields together in ktimed Eric Dumazet 1 sibling, 0 replies; 54+ messages in thread From: Andrew Morton @ 2006-12-07 18:17 UTC (permalink / raw) To: Jeff Garzik, torvalds, macro, David Howells, rdreier, afleming, ben.collins, linux-kernel On Thu, 7 Dec 2006 09:57:15 -0800 Andrew Morton <akpm@osdl.org> wrote: > On Thu, 07 Dec 2006 12:05:38 -0500 > Jeff Garzik <jeff@garzik.org> wrote: > > > Yes, I merged the code, but looking deeper at phy its clear I missed > > some things. > > > > Looking into libphy's workqueue stuff, it has the following sequence: > > > > disable interrupts > > schedule_work() > > > > ... time passes ... > > ... workqueue routine is called ... > > > > enable interrupts > > handle interrupt > > > > I really have to question if a workqueue was the best choice of > > direction for such a sequence. You don't want to put off handling an > > interrupt, with interrupts disabled, for a potentially unbounded amount > > of time. > > That'll lock the box on UP, or if the timer fires on the current CPU? oh. "disable interrupts" == disable_irq(), not local_irq_disable()? Not so bad ;) ^ permalink raw reply [flat|nested] 54+ messages in thread
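The distinction Andrew's follow-up turns on is worth spelling out (kernel-style sketch, illustrative only):

```c
/* local_irq_disable() masks ALL interrupts on the current CPU.  If
 * the caller then waits for keventd with interrupts still off, a UP
 * box deadlocks: the timer tick that drives scheduling is masked
 * too, so the workqueue thread can never run. */
local_irq_disable();
schedule_work(&w);	/* waiting for w here can never make progress */

/* disable_irq(irq) only masks one interrupt LINE at the interrupt
 * controller.  The CPU still takes timer ticks and schedules, so a
 * deferred handler that later calls enable_irq(irq) is merely slow
 * (that device's events are held off), not fatal. */
disable_irq_nosync(dev_irq);
schedule_work(&w);	/* fine: keventd runs, then re-enables dev_irq */
```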
* [PATCH] group xtime, xtime_lock, wall_to_monotonic, avenrun, calc_load_count fields together in ktimed 2006-12-07 17:57 ` Andrew Morton 2006-12-07 18:17 ` Andrew Morton @ 2006-12-08 16:52 ` Eric Dumazet 2006-12-09 5:46 ` Andrew Morton 2006-12-13 21:26 ` [PATCH] Introduce time_data, a new structure to hold jiffies, xtime, xtime_lock, wall_to_monotonic, calc_load_count and avenrun Eric Dumazet 1 sibling, 2 replies; 54+ messages in thread From: Eric Dumazet @ 2006-12-08 16:52 UTC (permalink / raw) To: Andrew Morton; +Cc: Andi Kleen, linux-kernel [-- Attachment #1: Type: text/plain, Size: 1269 bytes --] This patch introduces a new structure called ktimed (Kernel Time Data), where some timekeeping-related variables are put together to share as few cache lines as possible. This avoids some false sharing (since the linker could otherwise put calc_load_count in a *random* cache line, for example). I also optimized calc_load() to not always call count_active_tasks(): it should call it only once every 5 seconds (LOAD_FREQ=5*HZ) Note: x86_64 was using an arch specific placement of __xtime and __xtime_lock (see arch/x86_64/kernel/vmlinux.lds.S). (vsyscall stuff) It is now using a specific placement of __ktimed, since xtime and xtime_lock are now fields of __ktimed. Note: I failed to move jiffies64 into ktimed as well: too many changes needed because of jiffies aliasing (and endianness), but it could be done. 
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- arch/x86_64/kernel/time.c | 5 ++- arch/x86_64/kernel/vmlinux.lds.S | 7 +--- arch/x86_64/kernel/vsyscall.c | 1 include/asm-x86_64/vsyscall.h | 13 +++------ include/linux/sched.h | 1 include/linux/time.h | 18 ++++++++++-- kernel/timer.c | 41 ++++++++++++----------------- 7 files changed, 43 insertions(+), 43 deletions(-) [-- Attachment #2: xtime_s.patch --] [-- Type: text/plain, Size: 6905 bytes --] --- linux-2.6.19/include/linux/time.h 2006-12-08 11:40:46.000000000 +0100 +++ linux-2.6.19-ed/include/linux/time.h 2006-12-08 16:58:57.000000000 +0100 @@ -88,9 +88,21 @@ static inline struct timespec timespec_s #define timespec_valid(ts) \ (((ts)->tv_sec >= 0) && (((unsigned long) (ts)->tv_nsec) < NSEC_PER_SEC)) -extern struct timespec xtime; -extern struct timespec wall_to_monotonic; -extern seqlock_t xtime_lock; +/* + * define a structure to keep all fields close to each others. + */ +struct ktimed_struct { + struct timespec _xtime; + struct timespec wall_to_monotonic; + seqlock_t lock; + unsigned long avenrun[3]; + int calc_load_count; +}; +extern struct ktimed_struct ktimed; +#define xtime ktimed._xtime +#define wall_to_monotonic ktimed.wall_to_monotonic +#define xtime_lock ktimed.lock +#define avenrun ktimed.avenrun void timekeeping_init(void); --- linux-2.6.19/kernel/timer.c 2006-12-08 11:50:11.000000000 +0100 +++ linux-2.6.19-ed/kernel/timer.c 2006-12-08 18:13:24.000000000 +0100 @@ -570,11 +570,13 @@ found: * however, we will ALWAYS keep the tv_nsec part positive so we can use * the usual normalization. 
*/ -struct timespec xtime __attribute__ ((aligned (16))); -struct timespec wall_to_monotonic __attribute__ ((aligned (16))); - -EXPORT_SYMBOL(xtime); - +#ifndef ARCH_HAVE_KTIMED +struct ktimed_struct ktimed __cacheline_aligned = { + .lock = __SEQLOCK_UNLOCKED(ktimed.lock), + .calc_load_count = LOAD_FREQ, +}; +EXPORT_SYMBOL(ktimed); +#endif /* XXX - all of this timekeeping code should be later moved to time.c */ #include <linux/clocksource.h> @@ -995,9 +997,6 @@ static unsigned long count_active_tasks( * * Requires xtime_lock to access. */ -unsigned long avenrun[3]; - -EXPORT_SYMBOL(avenrun); /* * calc_load - given tick count, update the avenrun load estimates. @@ -1006,27 +1005,21 @@ EXPORT_SYMBOL(avenrun); static inline void calc_load(unsigned long ticks) { unsigned long active_tasks; /* fixed-point */ - static int count = LOAD_FREQ; - active_tasks = count_active_tasks(); - for (count -= ticks; count < 0; count += LOAD_FREQ) { - CALC_LOAD(avenrun[0], EXP_1, active_tasks); - CALC_LOAD(avenrun[1], EXP_5, active_tasks); - CALC_LOAD(avenrun[2], EXP_15, active_tasks); + ktimed.calc_load_count -= ticks; + + if (unlikely(ktimed.calc_load_count < 0)) { + active_tasks = count_active_tasks(); + do { + ktimed.calc_load_count += LOAD_FREQ; + CALC_LOAD(avenrun[0], EXP_1, active_tasks); + CALC_LOAD(avenrun[1], EXP_5, active_tasks); + CALC_LOAD(avenrun[2], EXP_15, active_tasks); + } while (ktimed.calc_load_count < 0); } } /* - * This read-write spinlock protects us from races in SMP while - * playing with xtime and avenrun. - */ -#ifndef ARCH_HAVE_XTIME_LOCK -__cacheline_aligned_in_smp DEFINE_SEQLOCK(xtime_lock); - -EXPORT_SYMBOL(xtime_lock); -#endif - -/* * This function runs timers and the timer-tq in bottom half context. 
*/ static void run_timer_softirq(struct softirq_action *h) --- linux-2.6.19/include/linux/sched.h 2006-12-08 12:10:45.000000000 +0100 +++ linux-2.6.19-ed/include/linux/sched.h 2006-12-08 12:10:59.000000000 +0100 @@ -104,7 +104,6 @@ struct futex_pi_state; * the EXP_n values would be 1981, 2034 and 2043 if still using only * 11 bit fractions. */ -extern unsigned long avenrun[]; /* Load averages */ #define FSHIFT 11 /* nr of bits of precision */ #define FIXED_1 (1<<FSHIFT) /* 1.0 as fixed-point */ --- linux-2.6.19/include/asm-x86_64/vsyscall.h 2006-12-08 11:47:03.000000000 +0100 +++ linux-2.6.19-ed/include/asm-x86_64/vsyscall.h 2006-12-08 18:04:29.000000000 +0100 @@ -20,8 +20,7 @@ enum vsyscall_num { #define __section_jiffies __attribute__ ((unused, __section__ (".jiffies"), aligned(16))) #define __section_sys_tz __attribute__ ((unused, __section__ (".sys_tz"), aligned(16))) #define __section_sysctl_vsyscall __attribute__ ((unused, __section__ (".sysctl_vsyscall"), aligned(16))) -#define __section_xtime __attribute__ ((unused, __section__ (".xtime"), aligned(16))) -#define __section_xtime_lock __attribute__ ((unused, __section__ (".xtime_lock"), aligned(16))) +#define __section_ktimed __attribute__ ((unused, __section__ (".ktimed"), aligned(16))) #define VXTIME_TSC 1 #define VXTIME_HPET 2 @@ -45,21 +44,19 @@ struct vxtime_data { /* vsyscall space (readonly) */ extern struct vxtime_data __vxtime; extern int __vgetcpu_mode; -extern struct timespec __xtime; extern volatile unsigned long __jiffies; extern struct timezone __sys_tz; -extern seqlock_t __xtime_lock; +extern struct ktimed_struct __ktimed; +#define __xtime_lock __ktimed.lock +#define __xtime __ktimed._xtime /* kernel space (writeable) */ extern struct vxtime_data vxtime; extern int vgetcpu_mode; extern struct timezone sys_tz; extern int sysctl_vsyscall; -extern seqlock_t xtime_lock; -extern int sysctl_vsyscall; - -#define ARCH_HAVE_XTIME_LOCK 1 +#define ARCH_HAVE_KTIMED 1 #endif /* __KERNEL__ */ --- 
linux-2.6.19/arch/x86_64/kernel/vmlinux.lds.S 2006-12-08 16:03:13.000000000 +0100 +++ linux-2.6.19-ed/arch/x86_64/kernel/vmlinux.lds.S 2006-12-08 17:52:29.000000000 +0100 @@ -94,8 +94,8 @@ SECTIONS __vsyscall_0 = VSYSCALL_VIRT_ADDR; . = ALIGN(CONFIG_X86_L1_CACHE_BYTES); - .xtime_lock : AT(VLOAD(.xtime_lock)) { *(.xtime_lock) } - xtime_lock = VVIRT(.xtime_lock); + .ktimed : AT(VLOAD(.ktimed)) { *(.ktimed) } + ktimed = VVIRT(.ktimed); .vxtime : AT(VLOAD(.vxtime)) { *(.vxtime) } vxtime = VVIRT(.vxtime); @@ -109,9 +109,6 @@ SECTIONS .sysctl_vsyscall : AT(VLOAD(.sysctl_vsyscall)) { *(.sysctl_vsyscall) } sysctl_vsyscall = VVIRT(.sysctl_vsyscall); - .xtime : AT(VLOAD(.xtime)) { *(.xtime) } - xtime = VVIRT(.xtime); - . = ALIGN(CONFIG_X86_L1_CACHE_BYTES); .jiffies : AT(VLOAD(.jiffies)) { *(.jiffies) } jiffies = VVIRT(.jiffies); --- linux-2.6.19/arch/x86_64/kernel/vsyscall.c 2006-11-29 22:57:37.000000000 +0100 +++ linux-2.6.19-ed/arch/x86_64/kernel/vsyscall.c 2006-12-08 18:02:49.000000000 +0100 @@ -44,7 +44,6 @@ #define __vsyscall(nr) __attribute__ ((unused,__section__(".vsyscall_" #nr))) int __sysctl_vsyscall __section_sysctl_vsyscall = 1; -seqlock_t __xtime_lock __section_xtime_lock = SEQLOCK_UNLOCKED; int __vgetcpu_mode __section_vgetcpu_mode; #include <asm/unistd.h> --- linux-2.6.19/arch/x86_64/kernel/time.c 2006-12-08 16:22:15.000000000 +0100 +++ linux-2.6.19-ed/arch/x86_64/kernel/time.c 2006-12-08 18:13:24.000000000 +0100 @@ -77,7 +77,10 @@ unsigned long long monotonic_base; struct vxtime_data __vxtime __section_vxtime; /* for vsyscalls */ volatile unsigned long __jiffies __section_jiffies = INITIAL_JIFFIES; -struct timespec __xtime __section_xtime; +struct ktimed_struct __ktimed __section_ktimed = { + .lock = __SEQLOCK_UNLOCKED(ktimed.lock), + .calc_load_count = LOAD_FREQ, +}; struct timezone __sys_tz __section_sys_tz; /* ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] group xtime, xtime_lock, wall_to_monotonic, avenrun, calc_load_count fields together in ktimed 2006-12-08 16:52 ` [PATCH] group xtime, xtime_lock, wall_to_monotonic, avenrun, calc_load_count fields together in ktimed Eric Dumazet @ 2006-12-09 5:46 ` Andrew Morton 2006-12-09 6:07 ` Randy Dunlap 2006-12-11 20:44 ` Eric Dumazet 2006-12-13 21:26 ` [PATCH] Introduce time_data, a new structure to hold jiffies, xtime, xtime_lock, wall_to_monotonic, calc_load_count and avenrun Eric Dumazet 1 sibling, 2 replies; 54+ messages in thread From: Andrew Morton @ 2006-12-09 5:46 UTC (permalink / raw) To: Eric Dumazet; +Cc: Andi Kleen, linux-kernel On Fri, 8 Dec 2006 17:52:09 +0100 Eric Dumazet <dada1@cosmosbay.com> wrote: > This patch introduces a new structure called ktimed (Kernel Time Data), where > some time keeping related variables are put together to share as few cache > lines as possible. This avoid some false sharing, (since linker could put > calc_load_count in a *random* cache line for example) > > I also optimized calc_load() to not always call count_active_tasks() : > It should call it only once every 5 seconds (LOAD_FREQ=5*HZ) > > Note : x86_64 was using an arch specific placement of __xtime and __xtime_lock > (see arch/x86_64/kernel/vmlinux.lds.S). (vsyscall stuff) > It is now using a specific placement of __ktimed, since xtime and xtime_lock > are now fields from __ktimed. > > Note : I failed to move jiffies64 as well in ktimed : too many changes needed > because of jiffies aliasing (and endianess), but it could be done. > Sounds like you have about three patches there. <save attachment, read from file, s/^/> /> > > -extern struct timespec xtime; > -extern struct timespec wall_to_monotonic; > -extern seqlock_t xtime_lock; > +/* > + * define a structure to keep all fields close to each others. 
> + */ > +struct ktimed_struct { > + struct timespec _xtime; > + struct timespec wall_to_monotonic; > + seqlock_t lock; > + unsigned long avenrun[3]; > + int calc_load_count; > +}; crappy name, but I guess it doesn't matter as nobody will use it at this stage. But... > +extern struct ktimed_struct ktimed; > +#define xtime ktimed._xtime > +#define wall_to_monotonic ktimed.wall_to_monotonic > +#define xtime_lock ktimed.lock > +#define avenrun ktimed.avenrun They'll use these instead. Frankly, I think we'd be better off removing these macros and, longer-term, use write_seqlock(time_data.xtime_lock); The approach you have here would be a good transition-period thing. > void timekeeping_init(void); > > --- linux-2.6.19/kernel/timer.c 2006-12-08 11:50:11.000000000 +0100 > +++ linux-2.6.19-ed/kernel/timer.c 2006-12-08 18:13:24.000000000 +0100 > @@ -570,11 +570,13 @@ found: > * however, we will ALWAYS keep the tv_nsec part positive so we can use > * the usual normalization. > */ > -struct timespec xtime __attribute__ ((aligned (16))); > -struct timespec wall_to_monotonic __attribute__ ((aligned (16))); > - > -EXPORT_SYMBOL(xtime); > - > +#ifndef ARCH_HAVE_KTIMED argh, another ARCH_HAVE_MESS. Due to the x86_64 vsyscall code. Could you please see if we can nuke this by making kernel/timer.c:xtime_lock use attribute(weak)? In a separate patch ;) > +struct ktimed_struct ktimed __cacheline_aligned = { > + .lock = __SEQLOCK_UNLOCKED(ktimed.lock), > + .calc_load_count = LOAD_FREQ, > +}; > +EXPORT_SYMBOL(ktimed); > +#endif > > /* XXX - all of this timekeeping code should be later moved to time.c */ > #include <linux/clocksource.h> > @@ -995,9 +997,6 @@ static unsigned long count_active_tasks( > * > * Requires xtime_lock to access. > */ > -unsigned long avenrun[3]; > - > -EXPORT_SYMBOL(avenrun); > > /* > * calc_load - given tick count, update the avenrun load estimates. 
> @@ -1006,27 +1005,21 @@ EXPORT_SYMBOL(avenrun); > static inline void calc_load(unsigned long ticks) > { > unsigned long active_tasks; /* fixed-point */ > - static int count = LOAD_FREQ; > > - active_tasks = count_active_tasks(); > - for (count -= ticks; count < 0; count += LOAD_FREQ) { > - CALC_LOAD(avenrun[0], EXP_1, active_tasks); > - CALC_LOAD(avenrun[1], EXP_5, active_tasks); > - CALC_LOAD(avenrun[2], EXP_15, active_tasks); > + ktimed.calc_load_count -= ticks; > + > + if (unlikely(ktimed.calc_load_count < 0)) { > + active_tasks = count_active_tasks(); > + do { > + ktimed.calc_load_count += LOAD_FREQ; > + CALC_LOAD(avenrun[0], EXP_1, active_tasks); > + CALC_LOAD(avenrun[1], EXP_5, active_tasks); > + CALC_LOAD(avenrun[2], EXP_15, active_tasks); > + } while (ktimed.calc_load_count < 0); > } > } > > ... > > +extern struct ktimed_struct __ktimed; > +#define __xtime_lock __ktimed.lock > +#define __xtime __ktimed._xtime > > /* kernel space (writeable) */ > extern struct vxtime_data vxtime; > extern int vgetcpu_mode; > extern struct timezone sys_tz; > extern int sysctl_vsyscall; > -extern seqlock_t xtime_lock; > > -extern int sysctl_vsyscall; > - > -#define ARCH_HAVE_XTIME_LOCK 1 > +#define ARCH_HAVE_KTIMED 1 > hm, the patch seems to transform a mess into a mess. I guess it's a messy problem. I agree that aggregating all the time-related things into a struct like this makes some sense. As does aggregating them all into a similar-looking namespace, but that'd probably be too intrusive - too late for that. ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] group xtime, xtime_lock, wall_to_monotonic, avenrun, calc_load_count fields together in ktimed 2006-12-09 5:46 ` Andrew Morton @ 2006-12-09 6:07 ` Randy Dunlap 2006-12-11 20:44 ` Eric Dumazet 1 sibling, 0 replies; 54+ messages in thread From: Randy Dunlap @ 2006-12-09 6:07 UTC (permalink / raw) To: Andrew Morton; +Cc: Eric Dumazet, Andi Kleen, linux-kernel On Fri, 8 Dec 2006 21:46:25 -0800 Andrew Morton wrote: > On Fri, 8 Dec 2006 17:52:09 +0100 > Eric Dumazet <dada1@cosmosbay.com> wrote: [snip] > Sounds like you have about three patches there. > > <save attachment, read from file, s/^/> /> > > > > > -extern struct timespec xtime; > > -extern struct timespec wall_to_monotonic; > > -extern seqlock_t xtime_lock; > > +/* > > + * define a structure to keep all fields close to each others. > > + */ > > +struct ktimed_struct { > > + struct timespec _xtime; > > + struct timespec wall_to_monotonic; > > + seqlock_t lock; > > + unsigned long avenrun[3]; > > + int calc_load_count; > > +}; > > crappy name, but I guess it doesn't matter as nobody will use it at this > stage. But... ktimedata would be better IMO. somethingd looks like a daemon. > > > +extern struct ktimed_struct ktimed; > > +#define xtime ktimed._xtime > > +#define wall_to_monotonic ktimed.wall_to_monotonic > > +#define xtime_lock ktimed.lock > > +#define avenrun ktimed.avenrun > > They'll use these instead. > > Frankly, I think we'd be better off removing these macros and, longer-term, > use > > write_seqlock(time_data.xtime_lock); > > The approach you have here would be a good transition-period thing. [snip] > hm, the patch seems to transform a mess into a mess. I guess it's a messy > problem. > > I agree that aggregating all the time-related things into a struct like > this makes some sense. As does aggregating them all into a similar-looking > namespace, but that'd probably be too intrusive - too late for that. Just curious, is the change measurable (time/performance wise)? 
Patch makes sense anyway. --- ~Randy ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] group xtime, xtime_lock, wall_to_monotonic, avenrun, calc_load_count fields together in ktimed 2006-12-09 5:46 ` Andrew Morton 2006-12-09 6:07 ` Randy Dunlap @ 2006-12-11 20:44 ` Eric Dumazet 2006-12-11 22:00 ` Andrew Morton 1 sibling, 1 reply; 54+ messages in thread From: Eric Dumazet @ 2006-12-11 20:44 UTC (permalink / raw) To: Andrew Morton; +Cc: Andi Kleen, linux-kernel Andrew Morton wrote: > > hm, the patch seems to transform a mess into a mess. I guess it's a messy > problem. > > I agree that aggregating all the time-related things into a struct like > this makes some sense. As does aggregating them all into a similar-looking > namespace, but that'd probably be too intrusive - too late for that. Hi Andrew, thanks for your comments. I sent two patches for the __attribute__((weak)) xtime_lock thing, and the calc_load() optimization, which don't depend on ktimed. Should I now send patches for aggregating things, or is it considered too intrusive? (Sorry if I didn't understand your last sentence) If yes, should I send separate patches to: 1) define an empty ktimed (or with a placeholder for jiffies64, not yet used) 2) move xtime into ktimed 3) move xtime_lock into ktimed 4) move wall_to_monotonic into ktimed 5) move calc_load.count into ktimed 6) move avenrun into ktimed. 7) patches to use ktimed.jiffies64 on various arches (with the problem of aliasing jiffies) Thank you Eric ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] group xtime, xtime_lock, wall_to_monotonic, avenrun, calc_load_count fields together in ktimed 2006-12-11 20:44 ` Eric Dumazet @ 2006-12-11 22:00 ` Andrew Morton 0 siblings, 0 replies; 54+ messages in thread From: Andrew Morton @ 2006-12-11 22:00 UTC (permalink / raw) To: Eric Dumazet; +Cc: Andi Kleen, linux-kernel On Mon, 11 Dec 2006 21:44:34 +0100 Eric Dumazet <dada1@cosmosbay.com> wrote: > Andrew Morton wrote: > > > > hm, the patch seems to transform a mess into a mess. I guess it's a messy > > problem. > > > > I agree that aggregating all the time-related things into a struct like > > this makes some sense. As does aggregating them all into a similar-looking > > namespace, but that'd probably be too intrusive - too late for that. > > > Hi Andrew, thanks for your comments. > > I sent two patches for the __attribute__((weak)) xtime_lock thing, and > calc_load() optimization, which don't depend on ktimed. yup, thanks. > Should I now send patches for aggregating things or is it considered too > intrusive? The previous version didn't look too intrusive. But it would be nice to have a plan to get rid of the macros: #define xtime_lock ktimed.xtime_lock and just open-code this everywhere. > (Sorry if I didn't understand your last sentence) What I meant was: if we're not going to aggregate all these globals like this: ktimed.xtime_lock ktimed.wall_to_monotonic then it would be nice if they were at least aggregated by naming convention: time_management_time_lock time_management_wall_to_monotonic etc so the reader can see that these things are all part of the same subsystem. But the proposed ktimed.xtime_lock achieves that, and has runtime benefits too. Can we please not call it ktimed? That sounds like a kernel thread to me. time_data would be better. 
> If yes, should I send separate patches to : > > 1) define an empty ktimed (or with a placeholder for jiffies64, not yet used) > 2) move xtime into ktimed > 3) move xtime_lock into ktimed > 4) move wall_to_monotonic into ktimed > 5) move calc_load.count into ktimed > 6) move avenrun into ktimed. A single patch there would suffice, I suspect. > 7) patches to use ktimed.jiffies64 on various arches (with the problem of > aliasing jiffies) That might be a sprinkle of per-arch patches, but I'm not sure what is entailed here. ^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH] Introduce time_data, a new structure to hold jiffies, xtime, xtime_lock, wall_to_monotonic, calc_load_count and avenrun 2006-12-08 16:52 ` [PATCH] group xtime, xtime_lock, wall_to_monotonic, avenrun, calc_load_count fields together in ktimed Eric Dumazet 2006-12-09 5:46 ` Andrew Morton @ 2006-12-13 21:26 ` Eric Dumazet 2006-12-15 5:24 ` Andrew Morton 2006-12-15 16:21 ` Eric Dumazet 1 sibling, 2 replies; 54+ messages in thread From: Eric Dumazet @ 2006-12-13 21:26 UTC (permalink / raw) To: Andrew Morton; +Cc: Andi Kleen, linux-kernel [-- Attachment #1: Type: text/plain, Size: 1462 bytes --] This patch introduces a new structure called time_data, where some timekeeping-related variables are put together to share as few cache lines as possible. This should reduce timer interrupt latency and cache line ping-pong on SMP machines. struct time_data { u64 _jiffies_64; struct timespec _xtime; struct timespec _wall_to_monotonic; seqlock_t lock; int calc_load_count; unsigned int _avenrun[3]; }; _jiffies_64 is a placeholder so that arches can (optionally) alias jiffies_64/jiffies in time_data. This patch does this for i386 and x86_64. avenrun, xtime, xtime_lock and wall_to_monotonic are now temporarily defined as macros to keep this patch minimally invasive; future patches can gradually delete these macros. 
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- arch/i386/kernel/vmlinux.lds.S | 3 + arch/s390/appldata/appldata_os.c | 18 ++++----- arch/x86_64/kernel/time.c | 8 +++- arch/x86_64/kernel/vmlinux.lds.S | 14 ++----- arch/x86_64/kernel/vsyscall.c | 15 +++----- include/asm-x86_64/vsyscall.h | 8 +--- include/linux/jiffies.h | 2 - include/linux/sched.h | 10 +++-- include/linux/time.h | 33 +++++++++++++++-- kernel/timer.c | 54 ++++++++++++++--------------- 10 files changed, 92 insertions(+), 73 deletions(-) [-- Attachment #2: time_data.patch --] [-- Type: text/plain, Size: 14108 bytes --] --- linux-2.6.19/include/linux/time.h 2006-12-13 20:21:23.000000000 +0100 +++ linux-2.6.19/include/linux/time.h 2006-12-13 18:06:44.000000000 +0100 @@ -88,21 +88,44 @@ static inline struct timespec timespec_s #define timespec_valid(ts) \ (((ts)->tv_sec >= 0) && (((unsigned long) (ts)->tv_nsec) < NSEC_PER_SEC)) -extern struct timespec xtime; -extern struct timespec wall_to_monotonic; -extern seqlock_t xtime_lock __attribute__((weak)); +/* + * Structure time_data holds most timekeeping related variables + * to make them share as few cache lines as possible. 
+ */ +struct time_data { + /* + * We want to alias jiffies_64 to time_data._jiffies_64, + * in vmlinux.lds.S files (jiffies_64 = time_data) + * so keep _jiffies_64 at the beginning of time_data + */ + u64 _jiffies_64; + struct timespec _xtime; + struct timespec _wall_to_monotonic; + seqlock_t lock; + int calc_load_count; + unsigned int _avenrun[3]; +}; +/* + * time_data has weak attribute because some arches (x86_64) want + * to have a read_only version in their vsyscall page + */ +extern struct time_data time_data __attribute__((weak)); +#define xtime time_data._xtime +#define wall_to_monotonic time_data._wall_to_monotonic +#define xtime_lock time_data.lock +#define avenrun time_data._avenrun void timekeeping_init(void); static inline unsigned long get_seconds(void) { - return xtime.tv_sec; + return time_data._xtime.tv_sec; } struct timespec current_kernel_time(void); #define CURRENT_TIME (current_kernel_time()) -#define CURRENT_TIME_SEC ((struct timespec) { xtime.tv_sec, 0 }) +#define CURRENT_TIME_SEC ((struct timespec) { time_data._xtime.tv_sec, 0 }) extern void do_gettimeofday(struct timeval *tv); extern int do_settimeofday(struct timespec *tv); --- linux-2.6.19/include/linux/jiffies.h 2006-12-12 00:32:00.000000000 +0100 +++ linux-2.6.19/include/linux/jiffies.h 2006-12-13 18:11:30.000000000 +0100 @@ -78,7 +78,7 @@ * without sampling the sequence number in xtime_lock. * get_jiffies_64() will do this for you as appropriate. */ -extern u64 __jiffy_data jiffies_64; +extern u64 __jiffy_data jiffies_64 __attribute__((weak)); extern unsigned long volatile __jiffy_data jiffies; #if (BITS_PER_LONG < 64) --- linux-2.6.19/include/linux/sched.h 2006-12-08 12:10:45.000000000 +0100 +++ linux-2.6.19/include/linux/sched.h 2006-12-13 15:54:29.000000000 +0100 @@ -104,7 +104,6 @@ struct futex_pi_state; * the EXP_n values would be 1981, 2034 and 2043 if still using only * 11 bit fractions. 
*/ -extern unsigned long avenrun[]; /* Load averages */ #define FSHIFT 11 /* nr of bits of precision */ #define FIXED_1 (1<<FSHIFT) /* 1.0 as fixed-point */ @@ -114,9 +113,12 @@ extern unsigned long avenrun[]; /* Load #define EXP_15 2037 /* 1/exp(5sec/15min) */ #define CALC_LOAD(load,exp,n) \ - load *= exp; \ - load += n*(FIXED_1-exp); \ - load >>= FSHIFT; + { \ + unsigned long temp = load; \ + temp *= exp; \ + temp += n*(FIXED_1-exp); \ + load = temp >> FSHIFT; \ + } extern unsigned long total_forks; extern int nr_threads; --- linux-2.6.19/kernel/timer.c 2006-12-13 13:28:11.000000000 +0100 +++ linux-2.6.19/kernel/timer.c 2006-12-13 20:58:53.000000000 +0100 @@ -41,7 +41,7 @@ #include <asm/timex.h> #include <asm/io.h> -u64 jiffies_64 __cacheline_aligned_in_smp = INITIAL_JIFFIES; +__attribute__((weak)) u64 jiffies_64 __cacheline_aligned_in_smp = INITIAL_JIFFIES; EXPORT_SYMBOL(jiffies_64); @@ -570,10 +570,12 @@ found: * however, we will ALWAYS keep the tv_nsec part positive so we can use * the usual normalization. */ -struct timespec xtime __attribute__ ((aligned (16))); -struct timespec wall_to_monotonic __attribute__ ((aligned (16))); - -EXPORT_SYMBOL(xtime); +__attribute__((weak)) struct time_data time_data __cacheline_aligned = { + ._jiffies_64 = INITIAL_JIFFIES, + .lock = __SEQLOCK_UNLOCKED(time_data.lock), + .calc_load_count = LOAD_FREQ, +}; +EXPORT_SYMBOL(time_data); /* XXX - all of this timekeeping code should be later moved to time.c */ @@ -995,9 +997,6 @@ static unsigned long count_active_tasks( * * Requires xtime_lock to access. */ -unsigned long avenrun[3]; - -EXPORT_SYMBOL(avenrun); /* * calc_load - given tick count, update the avenrun load estimates. 
@@ -1006,29 +1005,20 @@ EXPORT_SYMBOL(avenrun); static inline void calc_load(unsigned long ticks) { unsigned long active_tasks; /* fixed-point */ - static int count = LOAD_FREQ; - count -= ticks; - if (unlikely(count < 0)) { + time_data.calc_load_count -= ticks; + if (unlikely(time_data.calc_load_count < 0)) { active_tasks = count_active_tasks(); do { - CALC_LOAD(avenrun[0], EXP_1, active_tasks); - CALC_LOAD(avenrun[1], EXP_5, active_tasks); - CALC_LOAD(avenrun[2], EXP_15, active_tasks); - count += LOAD_FREQ; - } while (count < 0); + CALC_LOAD(time_data._avenrun[0], EXP_1, active_tasks); + CALC_LOAD(time_data._avenrun[1], EXP_5, active_tasks); + CALC_LOAD(time_data._avenrun[2], EXP_15, active_tasks); + time_data.calc_load_count += LOAD_FREQ; + } while (time_data.calc_load_count < 0); } } /* - * This read-write spinlock protects us from races in SMP while - * playing with xtime and avenrun. - */ -__attribute__((weak)) __cacheline_aligned_in_smp DEFINE_SEQLOCK(xtime_lock); - -EXPORT_SYMBOL(xtime_lock); - -/* * This function runs timers and the timer-tq in bottom half context. */ static void run_timer_softirq(struct softirq_action *h) @@ -1253,6 +1243,16 @@ asmlinkage long sys_gettid(void) } /** + * avenrun_to_loads - convert an _avenrun[] to sysinfo.loads[] + * @idx: index (0..2) in time_data._avenrun[] + */ +static inline unsigned long avenrun_to_loads(unsigned int idx) +{ + unsigned long ret = time_data._avenrun[idx]; + return ret << (SI_LOAD_SHIFT - FSHIFT); +} + +/** * sys_sysinfo - fill in sysinfo struct * @info: pointer to buffer to fill */ @@ -1285,9 +1285,9 @@ asmlinkage long sys_sysinfo(struct sysin } val.uptime = tp.tv_sec + (tp.tv_nsec ? 
1 : 0); - val.loads[0] = avenrun[0] << (SI_LOAD_SHIFT - FSHIFT); - val.loads[1] = avenrun[1] << (SI_LOAD_SHIFT - FSHIFT); - val.loads[2] = avenrun[2] << (SI_LOAD_SHIFT - FSHIFT); + val.loads[0] = avenrun_to_loads(0); + val.loads[1] = avenrun_to_loads(1); + val.loads[2] = avenrun_to_loads(2); val.procs = nr_threads; } while (read_seqretry(&xtime_lock, seq)); --- linux-2.6.19/include/asm-x86_64/vsyscall.h 2006-12-13 13:31:50.000000000 +0100 +++ linux-2.6.19/include/asm-x86_64/vsyscall.h 2006-12-13 21:20:18.000000000 +0100 @@ -17,11 +17,9 @@ enum vsyscall_num { #define __section_vxtime __attribute__ ((unused, __section__ (".vxtime"), aligned(16))) #define __section_vgetcpu_mode __attribute__ ((unused, __section__ (".vgetcpu_mode"), aligned(16))) -#define __section_jiffies __attribute__ ((unused, __section__ (".jiffies"), aligned(16))) #define __section_sys_tz __attribute__ ((unused, __section__ (".sys_tz"), aligned(16))) #define __section_sysctl_vsyscall __attribute__ ((unused, __section__ (".sysctl_vsyscall"), aligned(16))) -#define __section_xtime __attribute__ ((unused, __section__ (".xtime"), aligned(16))) -#define __section_xtime_lock __attribute__ ((unused, __section__ (".xtime_lock"), aligned(16))) +#define __section_time_data __attribute__ ((unused, __section__ (".time_data"), aligned(16))) #define VXTIME_TSC 1 #define VXTIME_HPET 2 @@ -43,12 +41,10 @@ struct vxtime_data { #define hpet_writel(d,a) writel(d, (void __iomem *)fix_to_virt(FIX_HPET_BASE) + a) /* vsyscall space (readonly) */ +extern struct time_data __time_data; extern struct vxtime_data __vxtime; extern int __vgetcpu_mode; -extern struct timespec __xtime; -extern volatile unsigned long __jiffies; extern struct timezone __sys_tz; -extern seqlock_t __xtime_lock; /* kernel space (writeable) */ extern struct vxtime_data vxtime; --- linux-2.6.19/arch/x86_64/kernel/vsyscall.c 2006-11-29 22:57:37.000000000 +0100 +++ linux-2.6.19/arch/x86_64/kernel/vsyscall.c 2006-12-13 21:04:03.000000000 +0100 @@ -44,7 
+44,6 @@ #define __vsyscall(nr) __attribute__ ((unused,__section__(".vsyscall_" #nr))) int __sysctl_vsyscall __section_sysctl_vsyscall = 1; -seqlock_t __xtime_lock __section_xtime_lock = SEQLOCK_UNLOCKED; int __vgetcpu_mode __section_vgetcpu_mode; #include <asm/unistd.h> @@ -66,10 +65,10 @@ static __always_inline void do_vgettimeo unsigned long sec, usec; do { - sequence = read_seqbegin(&__xtime_lock); + sequence = read_seqbegin(&__time_data.lock); - sec = __xtime.tv_sec; - usec = __xtime.tv_nsec / 1000; + sec = __time_data._xtime.tv_sec; + usec = __time_data._xtime.tv_nsec / 1000; if (__vxtime.mode != VXTIME_HPET) { t = get_cycles_sync(); @@ -83,7 +82,7 @@ static __always_inline void do_vgettimeo fix_to_virt(VSYSCALL_HPET) + 0xf0) - __vxtime.last) * __vxtime.quot) >> 32; } - } while (read_seqretry(&__xtime_lock, sequence)); + } while (read_seqretry(&__time_data.lock, sequence)); tv->tv_sec = sec + usec / 1000000; tv->tv_usec = usec % 1000000; @@ -131,8 +130,8 @@ time_t __vsyscall(1) vtime(time_t *t) if (!__sysctl_vsyscall) return time_syscall(t); else if (t) - *t = __xtime.tv_sec; - return __xtime.tv_sec; + *t = __time_data._xtime.tv_sec; + return __time_data._xtime.tv_sec; } /* Fast way to get current CPU and node. @@ -157,7 +156,7 @@ vgetcpu(unsigned *cpu, unsigned *node, s We do this here because otherwise user space would do it on its own in a likely inferior way (no access to jiffies). If you don't like it pass NULL. 
*/ - if (tcache && tcache->blob[0] == (j = __jiffies)) { + if (tcache && tcache->blob[0] == (j = __time_data._jiffies_64)) { p = tcache->blob[1]; } else if (__vgetcpu_mode == VGETCPU_RDTSCP) { /* Load per CPU data from RDTSCP */ --- linux-2.6.19/arch/x86_64/kernel/time.c 2006-12-08 16:22:15.000000000 +0100 +++ linux-2.6.19/arch/x86_64/kernel/time.c 2006-12-13 15:47:01.000000000 +0100 @@ -76,8 +76,12 @@ unsigned long long monotonic_base; struct vxtime_data __vxtime __section_vxtime; /* for vsyscalls */ -volatile unsigned long __jiffies __section_jiffies = INITIAL_JIFFIES; -struct timespec __xtime __section_xtime; +struct time_data __time_data __section_time_data = { + ._jiffies_64 = INITIAL_JIFFIES, + .lock = __SEQLOCK_UNLOCKED(time_data.lock), + .calc_load_count = LOAD_FREQ, +}; + struct timezone __sys_tz __section_sys_tz; /* --- linux-2.6.19/arch/x86_64/kernel/vmlinux.lds.S 2006-12-08 16:03:13.000000000 +0100 +++ linux-2.6.19/arch/x86_64/kernel/vmlinux.lds.S 2006-12-13 21:04:03.000000000 +0100 @@ -12,7 +12,6 @@ OUTPUT_FORMAT("elf64-x86-64", "elf64-x86-64", "elf64-x86-64") OUTPUT_ARCH(i386:x86-64) ENTRY(phys_startup_64) -jiffies_64 = jiffies; PHDRS { text PT_LOAD FLAGS(5); /* R_E */ data PT_LOAD FLAGS(7); /* RWE */ @@ -94,8 +93,10 @@ SECTIONS __vsyscall_0 = VSYSCALL_VIRT_ADDR; . = ALIGN(CONFIG_X86_L1_CACHE_BYTES); - .xtime_lock : AT(VLOAD(.xtime_lock)) { *(.xtime_lock) } - xtime_lock = VVIRT(.xtime_lock); + .time_data : AT(VLOAD(.time_data)) { *(.time_data) } + time_data = VVIRT(.time_data); + jiffies = time_data; + jiffies_64 = time_data; .vxtime : AT(VLOAD(.vxtime)) { *(.vxtime) } vxtime = VVIRT(.vxtime); @@ -109,13 +110,6 @@ SECTIONS .sysctl_vsyscall : AT(VLOAD(.sysctl_vsyscall)) { *(.sysctl_vsyscall) } sysctl_vsyscall = VVIRT(.sysctl_vsyscall); - .xtime : AT(VLOAD(.xtime)) { *(.xtime) } - xtime = VVIRT(.xtime); - - . 
= ALIGN(CONFIG_X86_L1_CACHE_BYTES); - .jiffies : AT(VLOAD(.jiffies)) { *(.jiffies) } - jiffies = VVIRT(.jiffies); - .vsyscall_1 ADDR(.vsyscall_0) + 1024: AT(VLOAD(.vsyscall_1)) { *(.vsyscall_1) } .vsyscall_2 ADDR(.vsyscall_0) + 2048: AT(VLOAD(.vsyscall_2)) { *(.vsyscall_2) } .vsyscall_3 ADDR(.vsyscall_0) + 3072: AT(VLOAD(.vsyscall_3)) { *(.vsyscall_3) } --- linux-2.6.19/arch/i386/kernel/vmlinux.lds.S 2006-12-13 16:09:56.000000000 +0100 +++ linux-2.6.19/arch/i386/kernel/vmlinux.lds.S 2006-12-13 20:35:45.000000000 +0100 @@ -12,7 +12,8 @@ OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386") OUTPUT_ARCH(i386) ENTRY(phys_startup_32) -jiffies = jiffies_64; +jiffies = time_data; +jiffies_64 = time_data; PHDRS { text PT_LOAD FLAGS(5); /* R_E */ --- linux-2.6.19/arch/s390/appldata/appldata_os.c 2006-12-13 16:27:59.000000000 +0100 +++ linux-2.6.19/arch/s390/appldata/appldata_os.c 2006-12-13 16:27:59.000000000 +0100 @@ -68,7 +68,7 @@ struct appldata_os_data { u32 nr_running; /* number of runnable threads */ u32 nr_threads; /* number of threads */ - u32 avenrun[3]; /* average nr. of running processes during */ + u32 _avenrun[3]; /* average nr. 
of running processes during */ /* the last 1, 5 and 15 minutes */ /* New in 2.6 */ @@ -98,11 +98,11 @@ static inline void appldata_print_debug( P_DEBUG("nr_threads = %u\n", os_data->nr_threads); P_DEBUG("nr_running = %u\n", os_data->nr_running); P_DEBUG("nr_iowait = %u\n", os_data->nr_iowait); - P_DEBUG("avenrun(int) = %8x / %8x / %8x\n", os_data->avenrun[0], - os_data->avenrun[1], os_data->avenrun[2]); - a0 = os_data->avenrun[0]; - a1 = os_data->avenrun[1]; - a2 = os_data->avenrun[2]; + P_DEBUG("avenrun(int) = %8x / %8x / %8x\n", os_data->_avenrun[0], + os_data->_avenrun[1], os_data->_avenrun[2]); + a0 = os_data->_avenrun[0]; + a1 = os_data->_avenrun[1]; + a2 = os_data->_avenrun[2]; P_DEBUG("avenrun(float) = %d.%02d / %d.%02d / %d.%02d\n", LOAD_INT(a0), LOAD_FRAC(a0), LOAD_INT(a1), LOAD_FRAC(a1), LOAD_INT(a2), LOAD_FRAC(a2)); @@ -145,9 +145,9 @@ static void appldata_get_os_data(void *d os_data->nr_threads = nr_threads; os_data->nr_running = nr_running(); os_data->nr_iowait = nr_iowait(); - os_data->avenrun[0] = avenrun[0] + (FIXED_1/200); - os_data->avenrun[1] = avenrun[1] + (FIXED_1/200); - os_data->avenrun[2] = avenrun[2] + (FIXED_1/200); + os_data->_avenrun[0] = avenrun[0] + (FIXED_1/200); + os_data->_avenrun[1] = avenrun[1] + (FIXED_1/200); + os_data->_avenrun[2] = avenrun[2] + (FIXED_1/200); j = 0; for_each_online_cpu(i) { ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] Introduce time_data, a new structure to hold jiffies, xtime, xtime_lock, wall_to_monotonic, calc_load_count and avenrun 2006-12-13 21:26 ` [PATCH] Introduce time_data, a new structure to hold jiffies, xtime, xtime_lock, wall_to_monotonic, calc_load_count and avenrun Eric Dumazet @ 2006-12-15 5:24 ` Andrew Morton 2006-12-15 11:21 ` Eric Dumazet 2006-12-15 16:21 ` Eric Dumazet 1 sibling, 1 reply; 54+ messages in thread From: Andrew Morton @ 2006-12-15 5:24 UTC (permalink / raw) To: Eric Dumazet; +Cc: Andi Kleen, linux-kernel On Wed, 13 Dec 2006 22:26:26 +0100 Eric Dumazet <dada1@cosmosbay.com> wrote: > This patch introduces a new structure called time_data, where some time > keeping related variables are put together to share as few cache lines as > possible. ia64 refers to xtime_lock from assembly and hence doesn't link. ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] Introduce time_data, a new structure to hold jiffies, xtime, xtime_lock, wall_to_monotonic, calc_load_count and avenrun 2006-12-15 5:24 ` Andrew Morton @ 2006-12-15 11:21 ` Eric Dumazet 0 siblings, 0 replies; 54+ messages in thread
From: Eric Dumazet @ 2006-12-15 11:21 UTC (permalink / raw) To: Andrew Morton Cc: Andi Kleen, linux-kernel, Thomas Gleixner, Ingo Molnar, john stultz, Roman Zippel
[-- Attachment #1: Type: text/plain, Size: 2573 bytes --]
On Friday 15 December 2006 06:24, Andrew Morton wrote: > On Wed, 13 Dec 2006 22:26:26 +0100 > > Eric Dumazet <dada1@cosmosbay.com> wrote: > > This patch introduces a new structure called time_data, where some time > > keeping related variables are put together to share as few cache lines as > > possible. > > ia64 refers to xtime_lock from assembly and hence doesn't link. >
I see, I missed this. In this new version, I define linker aliases on IA64 for xtime, xtime_lock and wall_to_monotonic. An ia64 expert might change arch/ia64/kernel/fsys.S later to use only time_data. Thank you
[PATCH] Introduce time_data, a new structure to hold jiffies, xtime, xtime_lock, wall_to_monotonic, calc_load_count and avenrun
This patch introduces a new structure called time_data, where some timekeeping-related variables are put together to share as few cache lines as possible. This should reduce timer interrupt latency and cache-line ping-pong on SMP machines. struct time_data { u64 _jiffies_64; struct timespec _xtime; struct timespec _wall_to_monotonic; seqlock_t lock; int calc_load_count; unsigned int _avenrun[3]; };
_jiffies_64 is a placeholder so that arches can (optionally) alias jiffies_64/jiffies in time_data. This patch does this for i386, x86_64 and ia64. avenrun, xtime, xtime_lock and wall_to_monotonic are now temporarily defined as macros to keep this patch from being too invasive; future patches can gradually delete these macros.
For ia64, I added aliases for xtime, xtime_lock and wall_to_monotonic because arch/ia64/kernel/fsys.S uses these symbols. This file might be changed to directly access time_data in the future (reducing registers use as well) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- arch/i386/kernel/vmlinux.lds.S | 3 + arch/ia64/kernel/asm-offsets.c | 3 + arch/ia64/kernel/vmlinux.lds.S | 7 +++ arch/s390/appldata/appldata_os.c | 18 ++++----- arch/x86_64/kernel/time.c | 8 +++- arch/x86_64/kernel/vmlinux.lds.S | 14 ++----- arch/x86_64/kernel/vsyscall.c | 15 +++----- include/asm-x86_64/vsyscall.h | 8 +--- include/linux/jiffies.h | 2 - include/linux/sched.h | 10 +++-- include/linux/time.h | 33 +++++++++++++++-- kernel/timer.c | 54 ++++++++++++++--------------- 12 files changed, 101 insertions(+), 74 deletions(-) [-- Attachment #2: time_data.patch --] [-- Type: text/plain, Size: 15514 bytes --] --- linux-2.6.19/include/linux/time.h 2006-12-13 20:21:23.000000000 +0100 +++ linux-2.6.19-ed/include/linux/time.h 2006-12-15 11:22:46.000000000 +0100 @@ -88,21 +88,44 @@ static inline struct timespec timespec_s #define timespec_valid(ts) \ (((ts)->tv_sec >= 0) && (((unsigned long) (ts)->tv_nsec) < NSEC_PER_SEC)) -extern struct timespec xtime; -extern struct timespec wall_to_monotonic; -extern seqlock_t xtime_lock __attribute__((weak)); +/* + * Structure time_data holds most timekeeping related variables + * to make them share as few cache lines as possible. 
+ */ +struct time_data { + /* + * We want to alias jiffies_64 to time_data._jiffies_64, + * in vmlinux.lds.S files (jiffies_64 = time_data) + * so keep _jiffies_64 at the beginning of time_data + */ + u64 _jiffies_64; + struct timespec _xtime; + struct timespec _wall_to_monotonic; + seqlock_t lock; + int calc_load_count; + unsigned int _avenrun[3]; +}; +/* + * time_data has weak attribute because some arches (x86_64) want + * to have a read_only version in their vsyscall page + */ +extern struct time_data time_data __attribute__((weak)); +#define xtime time_data._xtime +#define wall_to_monotonic time_data._wall_to_monotonic +#define xtime_lock time_data.lock +#define avenrun time_data._avenrun void timekeeping_init(void); static inline unsigned long get_seconds(void) { - return xtime.tv_sec; + return time_data._xtime.tv_sec; } struct timespec current_kernel_time(void); #define CURRENT_TIME (current_kernel_time()) -#define CURRENT_TIME_SEC ((struct timespec) { xtime.tv_sec, 0 }) +#define CURRENT_TIME_SEC ((struct timespec) { time_data._xtime.tv_sec, 0 }) extern void do_gettimeofday(struct timeval *tv); extern int do_settimeofday(struct timespec *tv); --- linux-2.6.19/include/linux/jiffies.h 2006-12-12 00:32:00.000000000 +0100 +++ linux-2.6.19-ed/include/linux/jiffies.h 2006-12-13 18:11:30.000000000 +0100 @@ -78,7 +78,7 @@ * without sampling the sequence number in xtime_lock. * get_jiffies_64() will do this for you as appropriate. */ -extern u64 __jiffy_data jiffies_64; +extern u64 __jiffy_data jiffies_64 __attribute__((weak)); extern unsigned long volatile __jiffy_data jiffies; #if (BITS_PER_LONG < 64) --- linux-2.6.19/include/linux/sched.h 2006-12-08 12:10:45.000000000 +0100 +++ linux-2.6.19-ed/include/linux/sched.h 2006-12-13 15:54:29.000000000 +0100 @@ -104,7 +104,6 @@ struct futex_pi_state; * the EXP_n values would be 1981, 2034 and 2043 if still using only * 11 bit fractions. 
*/ -extern unsigned long avenrun[]; /* Load averages */ #define FSHIFT 11 /* nr of bits of precision */ #define FIXED_1 (1<<FSHIFT) /* 1.0 as fixed-point */ @@ -114,9 +113,12 @@ extern unsigned long avenrun[]; /* Load #define EXP_15 2037 /* 1/exp(5sec/15min) */ #define CALC_LOAD(load,exp,n) \ - load *= exp; \ - load += n*(FIXED_1-exp); \ - load >>= FSHIFT; + { \ + unsigned long temp = load; \ + temp *= exp; \ + temp += n*(FIXED_1-exp); \ + load = temp >> FSHIFT; \ + } extern unsigned long total_forks; extern int nr_threads; --- linux-2.6.19/kernel/timer.c 2006-12-13 13:28:11.000000000 +0100 +++ linux-2.6.19-ed/kernel/timer.c 2006-12-13 20:58:53.000000000 +0100 @@ -41,7 +41,7 @@ #include <asm/timex.h> #include <asm/io.h> -u64 jiffies_64 __cacheline_aligned_in_smp = INITIAL_JIFFIES; +__attribute__((weak)) u64 jiffies_64 __cacheline_aligned_in_smp = INITIAL_JIFFIES; EXPORT_SYMBOL(jiffies_64); @@ -570,10 +570,12 @@ found: * however, we will ALWAYS keep the tv_nsec part positive so we can use * the usual normalization. */ -struct timespec xtime __attribute__ ((aligned (16))); -struct timespec wall_to_monotonic __attribute__ ((aligned (16))); - -EXPORT_SYMBOL(xtime); +__attribute__((weak)) struct time_data time_data __cacheline_aligned = { + ._jiffies_64 = INITIAL_JIFFIES, + .lock = __SEQLOCK_UNLOCKED(time_data.lock), + .calc_load_count = LOAD_FREQ, +}; +EXPORT_SYMBOL(time_data); /* XXX - all of this timekeeping code should be later moved to time.c */ @@ -995,9 +997,6 @@ static unsigned long count_active_tasks( * * Requires xtime_lock to access. */ -unsigned long avenrun[3]; - -EXPORT_SYMBOL(avenrun); /* * calc_load - given tick count, update the avenrun load estimates. 
@@ -1006,29 +1005,20 @@ EXPORT_SYMBOL(avenrun); static inline void calc_load(unsigned long ticks) { unsigned long active_tasks; /* fixed-point */ - static int count = LOAD_FREQ; - count -= ticks; - if (unlikely(count < 0)) { + time_data.calc_load_count -= ticks; + if (unlikely(time_data.calc_load_count < 0)) { active_tasks = count_active_tasks(); do { - CALC_LOAD(avenrun[0], EXP_1, active_tasks); - CALC_LOAD(avenrun[1], EXP_5, active_tasks); - CALC_LOAD(avenrun[2], EXP_15, active_tasks); - count += LOAD_FREQ; - } while (count < 0); + CALC_LOAD(time_data._avenrun[0], EXP_1, active_tasks); + CALC_LOAD(time_data._avenrun[1], EXP_5, active_tasks); + CALC_LOAD(time_data._avenrun[2], EXP_15, active_tasks); + time_data.calc_load_count += LOAD_FREQ; + } while (time_data.calc_load_count < 0); } } /* - * This read-write spinlock protects us from races in SMP while - * playing with xtime and avenrun. - */ -__attribute__((weak)) __cacheline_aligned_in_smp DEFINE_SEQLOCK(xtime_lock); - -EXPORT_SYMBOL(xtime_lock); - -/* * This function runs timers and the timer-tq in bottom half context. */ static void run_timer_softirq(struct softirq_action *h) @@ -1253,6 +1243,16 @@ asmlinkage long sys_gettid(void) } /** + * avenrun_to_loads - convert an _avenrun[] to sysinfo.loads[] + * @idx: index (0..2) in time_data._avenrun[] + */ +static inline unsigned long avenrun_to_loads(unsigned int idx) +{ + unsigned long ret = time_data._avenrun[idx]; + return ret << (SI_LOAD_SHIFT - FSHIFT); +} + +/** * sys_sysinfo - fill in sysinfo struct * @info: pointer to buffer to fill */ @@ -1285,9 +1285,9 @@ asmlinkage long sys_sysinfo(struct sysin } val.uptime = tp.tv_sec + (tp.tv_nsec ? 
1 : 0); - val.loads[0] = avenrun[0] << (SI_LOAD_SHIFT - FSHIFT); - val.loads[1] = avenrun[1] << (SI_LOAD_SHIFT - FSHIFT); - val.loads[2] = avenrun[2] << (SI_LOAD_SHIFT - FSHIFT); + val.loads[0] = avenrun_to_loads(0); + val.loads[1] = avenrun_to_loads(1); + val.loads[2] = avenrun_to_loads(2); val.procs = nr_threads; } while (read_seqretry(&xtime_lock, seq)); --- linux-2.6.19/include/asm-x86_64/vsyscall.h 2006-12-13 13:31:50.000000000 +0100 +++ linux-2.6.19-ed/include/asm-x86_64/vsyscall.h 2006-12-13 21:20:18.000000000 +0100 @@ -17,11 +17,9 @@ enum vsyscall_num { #define __section_vxtime __attribute__ ((unused, __section__ (".vxtime"), aligned(16))) #define __section_vgetcpu_mode __attribute__ ((unused, __section__ (".vgetcpu_mode"), aligned(16))) -#define __section_jiffies __attribute__ ((unused, __section__ (".jiffies"), aligned(16))) #define __section_sys_tz __attribute__ ((unused, __section__ (".sys_tz"), aligned(16))) #define __section_sysctl_vsyscall __attribute__ ((unused, __section__ (".sysctl_vsyscall"), aligned(16))) -#define __section_xtime __attribute__ ((unused, __section__ (".xtime"), aligned(16))) -#define __section_xtime_lock __attribute__ ((unused, __section__ (".xtime_lock"), aligned(16))) +#define __section_time_data __attribute__ ((unused, __section__ (".time_data"), aligned(16))) #define VXTIME_TSC 1 #define VXTIME_HPET 2 @@ -43,12 +41,10 @@ struct vxtime_data { #define hpet_writel(d,a) writel(d, (void __iomem *)fix_to_virt(FIX_HPET_BASE) + a) /* vsyscall space (readonly) */ +extern struct time_data __time_data; extern struct vxtime_data __vxtime; extern int __vgetcpu_mode; -extern struct timespec __xtime; -extern volatile unsigned long __jiffies; extern struct timezone __sys_tz; -extern seqlock_t __xtime_lock; /* kernel space (writeable) */ extern struct vxtime_data vxtime; --- linux-2.6.19/arch/x86_64/kernel/vsyscall.c 2006-11-29 22:57:37.000000000 +0100 +++ linux-2.6.19-ed/arch/x86_64/kernel/vsyscall.c 2006-12-13 21:04:03.000000000 +0100 @@ 
-44,7 +44,6 @@ #define __vsyscall(nr) __attribute__ ((unused,__section__(".vsyscall_" #nr))) int __sysctl_vsyscall __section_sysctl_vsyscall = 1; -seqlock_t __xtime_lock __section_xtime_lock = SEQLOCK_UNLOCKED; int __vgetcpu_mode __section_vgetcpu_mode; #include <asm/unistd.h> @@ -66,10 +65,10 @@ static __always_inline void do_vgettimeo unsigned long sec, usec; do { - sequence = read_seqbegin(&__xtime_lock); + sequence = read_seqbegin(&__time_data.lock); - sec = __xtime.tv_sec; - usec = __xtime.tv_nsec / 1000; + sec = __time_data._xtime.tv_sec; + usec = __time_data._xtime.tv_nsec / 1000; if (__vxtime.mode != VXTIME_HPET) { t = get_cycles_sync(); @@ -83,7 +82,7 @@ static __always_inline void do_vgettimeo fix_to_virt(VSYSCALL_HPET) + 0xf0) - __vxtime.last) * __vxtime.quot) >> 32; } - } while (read_seqretry(&__xtime_lock, sequence)); + } while (read_seqretry(&__time_data.lock, sequence)); tv->tv_sec = sec + usec / 1000000; tv->tv_usec = usec % 1000000; @@ -131,8 +130,8 @@ time_t __vsyscall(1) vtime(time_t *t) if (!__sysctl_vsyscall) return time_syscall(t); else if (t) - *t = __xtime.tv_sec; - return __xtime.tv_sec; + *t = __time_data._xtime.tv_sec; + return __time_data._xtime.tv_sec; } /* Fast way to get current CPU and node. @@ -157,7 +156,7 @@ vgetcpu(unsigned *cpu, unsigned *node, s We do this here because otherwise user space would do it on its own in a likely inferior way (no access to jiffies). If you don't like it pass NULL. 
*/ - if (tcache && tcache->blob[0] == (j = __jiffies)) { + if (tcache && tcache->blob[0] == (j = __time_data._jiffies_64)) { p = tcache->blob[1]; } else if (__vgetcpu_mode == VGETCPU_RDTSCP) { /* Load per CPU data from RDTSCP */ --- linux-2.6.19/arch/x86_64/kernel/time.c 2006-12-08 16:22:15.000000000 +0100 +++ linux-2.6.19-ed/arch/x86_64/kernel/time.c 2006-12-13 15:47:01.000000000 +0100 @@ -76,8 +76,12 @@ unsigned long long monotonic_base; struct vxtime_data __vxtime __section_vxtime; /* for vsyscalls */ -volatile unsigned long __jiffies __section_jiffies = INITIAL_JIFFIES; -struct timespec __xtime __section_xtime; +struct time_data __time_data __section_time_data = { + ._jiffies_64 = INITIAL_JIFFIES, + .lock = __SEQLOCK_UNLOCKED(time_data.lock), + .calc_load_count = LOAD_FREQ, +}; + struct timezone __sys_tz __section_sys_tz; /* --- linux-2.6.19/arch/x86_64/kernel/vmlinux.lds.S 2006-12-08 16:03:13.000000000 +0100 +++ linux-2.6.19-ed/arch/x86_64/kernel/vmlinux.lds.S 2006-12-15 13:13:49.000000000 +0100 @@ -12,7 +12,6 @@ OUTPUT_FORMAT("elf64-x86-64", "elf64-x86-64", "elf64-x86-64") OUTPUT_ARCH(i386:x86-64) ENTRY(phys_startup_64) -jiffies_64 = jiffies; PHDRS { text PT_LOAD FLAGS(5); /* R_E */ data PT_LOAD FLAGS(7); /* RWE */ @@ -94,8 +93,10 @@ SECTIONS __vsyscall_0 = VSYSCALL_VIRT_ADDR; . = ALIGN(CONFIG_X86_L1_CACHE_BYTES); - .xtime_lock : AT(VLOAD(.xtime_lock)) { *(.xtime_lock) } - xtime_lock = VVIRT(.xtime_lock); + .time_data : AT(VLOAD(.time_data)) { *(.time_data) } + time_data = VVIRT(.time_data); + jiffies = time_data; + jiffies_64 = time_data; .vxtime : AT(VLOAD(.vxtime)) { *(.vxtime) } vxtime = VVIRT(.vxtime); @@ -109,13 +110,6 @@ SECTIONS .sysctl_vsyscall : AT(VLOAD(.sysctl_vsyscall)) { *(.sysctl_vsyscall) } sysctl_vsyscall = VVIRT(.sysctl_vsyscall); - .xtime : AT(VLOAD(.xtime)) { *(.xtime) } - xtime = VVIRT(.xtime); - - . 
= ALIGN(CONFIG_X86_L1_CACHE_BYTES); - .jiffies : AT(VLOAD(.jiffies)) { *(.jiffies) } - jiffies = VVIRT(.jiffies); - .vsyscall_1 ADDR(.vsyscall_0) + 1024: AT(VLOAD(.vsyscall_1)) { *(.vsyscall_1) } .vsyscall_2 ADDR(.vsyscall_0) + 2048: AT(VLOAD(.vsyscall_2)) { *(.vsyscall_2) } .vsyscall_3 ADDR(.vsyscall_0) + 3072: AT(VLOAD(.vsyscall_3)) { *(.vsyscall_3) } --- linux-2.6.19/arch/i386/kernel/vmlinux.lds.S 2006-12-13 16:09:56.000000000 +0100 +++ linux-2.6.19-ed/arch/i386/kernel/vmlinux.lds.S 2006-12-13 20:35:45.000000000 +0100 @@ -12,7 +12,8 @@ OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386") OUTPUT_ARCH(i386) ENTRY(phys_startup_32) -jiffies = jiffies_64; +jiffies = time_data; +jiffies_64 = time_data; PHDRS { text PT_LOAD FLAGS(5); /* R_E */ --- linux-2.6.19/arch/ia64/kernel/asm-offsets.c 2006-12-15 11:22:46.000000000 +0100 +++ linux-2.6.19-ed/arch/ia64/kernel/asm-offsets.c 2006-12-15 11:27:35.000000000 +0100 @@ -268,4 +268,7 @@ void foo(void) DEFINE(IA64_TIME_SOURCE_MMIO64, TIME_SOURCE_MMIO64); DEFINE(IA64_TIME_SOURCE_MMIO32, TIME_SOURCE_MMIO32); DEFINE(IA64_TIMESPEC_TV_NSEC_OFFSET, offsetof (struct timespec, tv_nsec)); + DEFINE(IA64_XTIME_OFFSET, offsetof (struct time_data, _xtime)); + DEFINE(IA64_WALL_TO_MONOTONIC_OFFSET, offsetof (struct time_data, _wall_to_monotonic)); + DEFINE(IA64_XTIME_LOCK_OFFSET, offsetof (struct time_data, lock)); } --- linux-2.6.19/arch/ia64/kernel/vmlinux.lds.S 2006-12-15 11:00:32.000000000 +0100 +++ linux-2.6.19-ed/arch/ia64/kernel/vmlinux.lds.S 2006-12-15 11:27:35.000000000 +0100 @@ -3,6 +3,7 @@ #include <asm/ptrace.h> #include <asm/system.h> #include <asm/pgtable.h> +#include <asm/asm-offsets.h> #define LOAD_OFFSET (KERNEL_START - KERNEL_TR_PAGE_SIZE) #include <asm-generic/vmlinux.lds.h> @@ -15,7 +16,11 @@ OUTPUT_FORMAT("elf64-ia64-little") OUTPUT_ARCH(ia64) ENTRY(phys_start) -jiffies = jiffies_64; +jiffies = time_data; +jiffies_64 = time_data; +xtime = time_data + IA64_XTIME_OFFSET; +wall_to_monotonic = time_data + 
IA64_WALL_TO_MONOTONIC_OFFSET; +xtime_lock = time_data + IA64_XTIME_LOCK_OFFSET; PHDRS { code PT_LOAD; percpu PT_LOAD; --- linux-2.6.19/arch/s390/appldata/appldata_os.c 2006-12-13 16:27:59.000000000 +0100 +++ linux-2.6.19-ed/arch/s390/appldata/appldata_os.c 2006-12-13 16:27:59.000000000 +0100 @@ -68,7 +68,7 @@ struct appldata_os_data { u32 nr_running; /* number of runnable threads */ u32 nr_threads; /* number of threads */ - u32 avenrun[3]; /* average nr. of running processes during */ + u32 _avenrun[3]; /* average nr. of running processes during */ /* the last 1, 5 and 15 minutes */ /* New in 2.6 */ @@ -98,11 +98,11 @@ static inline void appldata_print_debug( P_DEBUG("nr_threads = %u\n", os_data->nr_threads); P_DEBUG("nr_running = %u\n", os_data->nr_running); P_DEBUG("nr_iowait = %u\n", os_data->nr_iowait); - P_DEBUG("avenrun(int) = %8x / %8x / %8x\n", os_data->avenrun[0], - os_data->avenrun[1], os_data->avenrun[2]); - a0 = os_data->avenrun[0]; - a1 = os_data->avenrun[1]; - a2 = os_data->avenrun[2]; + P_DEBUG("avenrun(int) = %8x / %8x / %8x\n", os_data->_avenrun[0], + os_data->_avenrun[1], os_data->_avenrun[2]); + a0 = os_data->_avenrun[0]; + a1 = os_data->_avenrun[1]; + a2 = os_data->_avenrun[2]; P_DEBUG("avenrun(float) = %d.%02d / %d.%02d / %d.%02d\n", LOAD_INT(a0), LOAD_FRAC(a0), LOAD_INT(a1), LOAD_FRAC(a1), LOAD_INT(a2), LOAD_FRAC(a2)); @@ -145,9 +145,9 @@ static void appldata_get_os_data(void *d os_data->nr_threads = nr_threads; os_data->nr_running = nr_running(); os_data->nr_iowait = nr_iowait(); - os_data->avenrun[0] = avenrun[0] + (FIXED_1/200); - os_data->avenrun[1] = avenrun[1] + (FIXED_1/200); - os_data->avenrun[2] = avenrun[2] + (FIXED_1/200); + os_data->_avenrun[0] = avenrun[0] + (FIXED_1/200); + os_data->_avenrun[1] = avenrun[1] + (FIXED_1/200); + os_data->_avenrun[2] = avenrun[2] + (FIXED_1/200); j = 0; for_each_online_cpu(i) { ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] Introduce time_data, a new structure to hold jiffies, xtime, xtime_lock, wall_to_monotonic, calc_load_count and avenrun 2006-12-13 21:26 ` [PATCH] Introduce time_data, a new structure to hold jiffies, xtime, xtime_lock, wall_to_monotonic, calc_load_count and avenrun Eric Dumazet 2006-12-15 5:24 ` Andrew Morton @ 2006-12-15 16:21 ` Eric Dumazet 1 sibling, 0 replies; 54+ messages in thread
From: Eric Dumazet @ 2006-12-15 16:21 UTC (permalink / raw) To: Andrew Morton; +Cc: Andi Kleen, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 2074 bytes --]
Just in case, I re-based the patch against 2.6.20-rc1-mm1.
[PATCH] Introduce time_data, a new structure to hold jiffies, xtime, xtime_lock, wall_to_monotonic, calc_load_count and avenrun
This patch introduces a new structure called time_data, where some timekeeping-related variables are put together to share as few cache lines as possible. This should reduce timer interrupt latency and cache-line ping-pong on SMP machines. struct time_data { u64 _jiffies_64; struct timespec _xtime; struct timespec _wall_to_monotonic; seqlock_t lock; int calc_load_count; unsigned int _avenrun[3]; };
_jiffies_64 is a placeholder so that arches can (optionally) alias jiffies_64/jiffies in time_data. This patch does this for i386, x86_64 and ia64. avenrun, xtime, xtime_lock and wall_to_monotonic are now temporarily defined as macros to keep this patch from being too invasive; future patches can gradually delete these macros. For ia64, I added aliases for xtime, xtime_lock and wall_to_monotonic because arch/ia64/kernel/fsys.S uses these symbols.
This file might be changed to directly access time_data in the future (reducing registers use as well) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- arch/i386/kernel/vmlinux.lds.S | 3 + arch/ia64/kernel/asm-offsets.c | 3 + arch/ia64/kernel/vmlinux.lds.S | 7 +++ arch/s390/appldata/appldata_os.c | 18 ++++----- arch/x86_64/kernel/time.c | 8 +++- arch/x86_64/kernel/vmlinux.lds.S | 14 ++----- arch/x86_64/kernel/vsyscall.c | 15 +++----- include/asm-x86_64/vsyscall.h | 8 +--- include/linux/jiffies.h | 2 - include/linux/sched.h | 10 +++-- include/linux/time.h | 34 +++++++++++++++--- kernel/timer.c | 54 ++++++++++++++--------------- 12 files changed, 102 insertions(+), 74 deletions(-) [-- Attachment #2: time_data.patch --] [-- Type: text/plain, Size: 15720 bytes --] --- linux-2.6.20-rc1-mm1/include/linux/time.h 2006-12-15 15:46:16.000000000 +0100 +++ linux-2.6.20-rc1-mm1-ed/include/linux/time.h 2006-12-15 17:22:15.000000000 +0100 @@ -88,22 +88,46 @@ static inline struct timespec timespec_s #define timespec_valid(ts) \ (((ts)->tv_sec >= 0) && (((unsigned long) (ts)->tv_nsec) < NSEC_PER_SEC)) -extern struct timespec xtime; -extern struct timespec wall_to_monotonic; -extern seqlock_t xtime_lock __attribute__((weak)); + +/* + * Structure time_data holds most timekeeping related variables + * to make them share as few cache lines as possible. 
+ */ +struct time_data { + /* + * We want to alias jiffies_64 to time_data._jiffies_64, + * in vmlinux.lds.S files (jiffies_64 = time_data) + * so keep _jiffies_64 at the beginning of time_data + */ + u64 _jiffies_64; + struct timespec _xtime; + struct timespec _wall_to_monotonic; + seqlock_t lock; + int calc_load_count; + unsigned int _avenrun[3]; +}; +/* + * time_data has weak attribute because some arches (x86_64) want + * to have a read_only version in their vsyscall page + */ +extern struct time_data time_data __attribute__((weak)); +#define xtime time_data._xtime +#define wall_to_monotonic time_data._wall_to_monotonic +#define xtime_lock time_data.lock +#define avenrun time_data._avenrun extern unsigned long read_persistent_clock(void); void timekeeping_init(void); static inline unsigned long get_seconds(void) { - return xtime.tv_sec; + return time_data._xtime.tv_sec; } struct timespec current_kernel_time(void); #define CURRENT_TIME (current_kernel_time()) -#define CURRENT_TIME_SEC ((struct timespec) { xtime.tv_sec, 0 }) +#define CURRENT_TIME_SEC ((struct timespec) { time_data._xtime.tv_sec, 0 }) extern void do_gettimeofday(struct timeval *tv); extern int do_settimeofday(struct timespec *tv); --- linux-2.6.20-rc1-mm1/include/linux/jiffies.h 2006-12-15 15:46:16.000000000 +0100 +++ linux-2.6.20-rc1-mm1-ed/include/linux/jiffies.h 2006-12-15 17:19:07.000000000 +0100 @@ -78,7 +78,7 @@ * without sampling the sequence number in xtime_lock. * get_jiffies_64() will do this for you as appropriate. */ -extern u64 __jiffy_data jiffies_64; +extern u64 __jiffy_data jiffies_64 __attribute__((weak)); extern unsigned long volatile __jiffy_data jiffies; #if (BITS_PER_LONG < 64) --- linux-2.6.20-rc1-mm1/include/linux/sched.h 2006-12-15 15:46:16.000000000 +0100 +++ linux-2.6.20-rc1-mm1-ed/include/linux/sched.h 2006-12-15 17:19:07.000000000 +0100 @@ -106,7 +106,6 @@ struct bio; * the EXP_n values would be 1981, 2034 and 2043 if still using only * 11 bit fractions. 
*/ -extern unsigned long avenrun[]; /* Load averages */ #define FSHIFT 11 /* nr of bits of precision */ #define FIXED_1 (1<<FSHIFT) /* 1.0 as fixed-point */ @@ -116,9 +115,12 @@ extern unsigned long avenrun[]; /* Load #define EXP_15 2037 /* 1/exp(5sec/15min) */ #define CALC_LOAD(load,exp,n) \ - load *= exp; \ - load += n*(FIXED_1-exp); \ - load >>= FSHIFT; + { \ + unsigned long temp = load; \ + temp *= exp; \ + temp += n*(FIXED_1-exp); \ + load = temp >> FSHIFT; \ + } extern unsigned long total_forks; extern int nr_threads; --- linux-2.6.20-rc1-mm1/kernel/timer.c 2006-12-15 15:46:16.000000000 +0100 +++ linux-2.6.20-rc1-mm1-ed/kernel/timer.c 2006-12-15 17:19:07.000000000 +0100 @@ -42,7 +42,7 @@ #include <asm/timex.h> #include <asm/io.h> -u64 jiffies_64 __cacheline_aligned_in_smp = INITIAL_JIFFIES; +__attribute__((weak)) u64 jiffies_64 __cacheline_aligned_in_smp = INITIAL_JIFFIES; EXPORT_SYMBOL(jiffies_64); @@ -771,10 +771,12 @@ unsigned long next_timer_interrupt(void) * however, we will ALWAYS keep the tv_nsec part positive so we can use * the usual normalization. */ -struct timespec xtime __attribute__ ((aligned (16))); -struct timespec wall_to_monotonic __attribute__ ((aligned (16))); - -EXPORT_SYMBOL(xtime); +__attribute__((weak)) struct time_data time_data __cacheline_aligned = { + ._jiffies_64 = INITIAL_JIFFIES, + .lock = __SEQLOCK_UNLOCKED(time_data.lock), + .calc_load_count = LOAD_FREQ, +}; +EXPORT_SYMBOL(time_data); /* XXX - all of this timekeeping code should be later moved to time.c */ @@ -1238,9 +1240,6 @@ static unsigned long count_active_tasks( * * Requires xtime_lock to access. */ -unsigned long avenrun[3]; - -EXPORT_SYMBOL(avenrun); /* * calc_load - given tick count, update the avenrun load estimates. 
@@ -1249,29 +1248,20 @@ EXPORT_SYMBOL(avenrun); static inline void calc_load(unsigned long ticks) { unsigned long active_tasks; /* fixed-point */ - static int count = LOAD_FREQ; - count -= ticks; - if (unlikely(count < 0)) { + time_data.calc_load_count -= ticks; + if (unlikely(time_data.calc_load_count < 0)) { active_tasks = count_active_tasks(); do { - CALC_LOAD(avenrun[0], EXP_1, active_tasks); - CALC_LOAD(avenrun[1], EXP_5, active_tasks); - CALC_LOAD(avenrun[2], EXP_15, active_tasks); - count += LOAD_FREQ; - } while (count < 0); + CALC_LOAD(time_data._avenrun[0], EXP_1, active_tasks); + CALC_LOAD(time_data._avenrun[1], EXP_5, active_tasks); + CALC_LOAD(time_data._avenrun[2], EXP_15, active_tasks); + time_data.calc_load_count += LOAD_FREQ; + } while (time_data.calc_load_count < 0); } } /* - * This read-write spinlock protects us from races in SMP while - * playing with xtime and avenrun. - */ -__attribute__((weak)) __cacheline_aligned_in_smp DEFINE_SEQLOCK(xtime_lock); - -EXPORT_SYMBOL(xtime_lock); - -/* * This function runs timers and the timer-tq in bottom half context. */ static void run_timer_softirq(struct softirq_action *h) @@ -1497,6 +1487,16 @@ asmlinkage long sys_gettid(void) } /** + * avenrun_to_loads - convert an _avenrun[] to sysinfo.loads[] + * @idx: index (0..2) in time_data._avenrun[] + */ +static inline unsigned long avenrun_to_loads(unsigned int idx) +{ + unsigned long ret = time_data._avenrun[idx]; + return ret << (SI_LOAD_SHIFT - FSHIFT); +} + +/** * sys_sysinfo - fill in sysinfo struct * @info: pointer to buffer to fill */ @@ -1529,9 +1529,9 @@ asmlinkage long sys_sysinfo(struct sysin } val.uptime = tp.tv_sec + (tp.tv_nsec ? 
1 : 0); - val.loads[0] = avenrun[0] << (SI_LOAD_SHIFT - FSHIFT); - val.loads[1] = avenrun[1] << (SI_LOAD_SHIFT - FSHIFT); - val.loads[2] = avenrun[2] << (SI_LOAD_SHIFT - FSHIFT); + val.loads[0] = avenrun_to_loads(0); + val.loads[1] = avenrun_to_loads(1); + val.loads[2] = avenrun_to_loads(2); val.procs = nr_threads; } while (read_seqretry(&xtime_lock, seq)); --- linux-2.6.20-rc1-mm1/include/asm-x86_64/vsyscall.h 2006-12-15 15:46:16.000000000 +0100 +++ linux-2.6.20-rc1-mm1-ed/include/asm-x86_64/vsyscall.h 2006-12-15 17:19:07.000000000 +0100 @@ -18,11 +18,9 @@ enum vsyscall_num { #define __section_vxtime __attribute__ ((unused, __section__ (".vxtime"), aligned(16))) #define __section_vgetcpu_mode __attribute__ ((unused, __section__ (".vgetcpu_mode"), aligned(16))) -#define __section_jiffies __attribute__ ((unused, __section__ (".jiffies"), aligned(16))) #define __section_sys_tz __attribute__ ((unused, __section__ (".sys_tz"), aligned(16))) #define __section_sysctl_vsyscall __attribute__ ((unused, __section__ (".sysctl_vsyscall"), aligned(16))) -#define __section_xtime __attribute__ ((unused, __section__ (".xtime"), aligned(16))) -#define __section_xtime_lock __attribute__ ((unused, __section__ (".xtime_lock"), aligned(16))) +#define __section_time_data __attribute__ ((unused, __section__ (".time_data"), aligned(16))) #define VXTIME_TSC 1 #define VXTIME_HPET 2 @@ -44,12 +42,10 @@ struct vxtime_data { #define hpet_writel(d,a) writel(d, (void __iomem *)fix_to_virt(FIX_HPET_BASE) + a) /* vsyscall space (readonly) */ +extern struct time_data __time_data; extern struct vxtime_data __vxtime; extern int __vgetcpu_mode; -extern struct timespec __xtime; -extern volatile unsigned long __jiffies; extern struct timezone __sys_tz; -extern seqlock_t __xtime_lock; /* kernel space (writeable) */ extern struct vxtime_data vxtime; --- linux-2.6.20-rc1-mm1/arch/x86_64/kernel/vsyscall.c 2006-12-14 02:14:23.000000000 +0100 +++ linux-2.6.20-rc1-mm1-ed/arch/x86_64/kernel/vsyscall.c 
2006-12-15 17:19:07.000000000 +0100 @@ -45,7 +45,6 @@ #define __syscall_clobber "r11","rcx","memory" int __sysctl_vsyscall __section_sysctl_vsyscall = 1; -seqlock_t __xtime_lock __section_xtime_lock = SEQLOCK_UNLOCKED; int __vgetcpu_mode __section_vgetcpu_mode; #include <asm/unistd.h> @@ -67,10 +66,10 @@ static __always_inline void do_vgettimeo unsigned long sec, usec; do { - sequence = read_seqbegin(&__xtime_lock); + sequence = read_seqbegin(&__time_data.lock); - sec = __xtime.tv_sec; - usec = __xtime.tv_nsec / 1000; + sec = __time_data._xtime.tv_sec; + usec = __time_data._xtime.tv_nsec / 1000; if (__vxtime.mode != VXTIME_HPET) { t = get_cycles_sync(); @@ -84,7 +83,7 @@ static __always_inline void do_vgettimeo fix_to_virt(VSYSCALL_HPET) + 0xf0) - __vxtime.last) * __vxtime.quot) >> 32; } - } while (read_seqretry(&__xtime_lock, sequence)); + } while (read_seqretry(&__time_data.lock, sequence)); tv->tv_sec = sec + usec / 1000000; tv->tv_usec = usec % 1000000; @@ -132,8 +131,8 @@ time_t __vsyscall(1) vtime(time_t *t) if (!__sysctl_vsyscall) return time_syscall(t); else if (t) - *t = __xtime.tv_sec; - return __xtime.tv_sec; + *t = __time_data._xtime.tv_sec; + return __time_data._xtime.tv_sec; } /* Fast way to get current CPU and node. @@ -158,7 +157,7 @@ vgetcpu(unsigned *cpu, unsigned *node, s We do this here because otherwise user space would do it on its own in a likely inferior way (no access to jiffies). If you don't like it pass NULL. 
*/ - if (tcache && tcache->blob[0] == (j = __jiffies)) { + if (tcache && tcache->blob[0] == (j = __time_data._jiffies_64)) { p = tcache->blob[1]; } else if (__vgetcpu_mode == VGETCPU_RDTSCP) { /* Load per CPU data from RDTSCP */ --- linux-2.6.20-rc1-mm1/arch/x86_64/kernel/time.c 2006-12-15 15:46:15.000000000 +0100 +++ linux-2.6.20-rc1-mm1-ed/arch/x86_64/kernel/time.c 2006-12-15 17:19:07.000000000 +0100 @@ -76,8 +76,12 @@ unsigned long long monotonic_base; struct vxtime_data __vxtime __section_vxtime; /* for vsyscalls */ -volatile unsigned long __jiffies __section_jiffies = INITIAL_JIFFIES; -struct timespec __xtime __section_xtime; +struct time_data __time_data __section_time_data = { + ._jiffies_64 = INITIAL_JIFFIES, + .lock = __SEQLOCK_UNLOCKED(time_data.lock), + .calc_load_count = LOAD_FREQ, +}; + struct timezone __sys_tz __section_sys_tz; /* --- linux-2.6.20-rc1-mm1/arch/x86_64/kernel/vmlinux.lds.S 2006-12-15 15:46:15.000000000 +0100 +++ linux-2.6.20-rc1-mm1-ed/arch/x86_64/kernel/vmlinux.lds.S 2006-12-15 17:22:15.000000000 +0100 @@ -12,7 +12,6 @@ OUTPUT_FORMAT("elf64-x86-64", "elf64-x86-64", "elf64-x86-64") OUTPUT_ARCH(i386:x86-64) ENTRY(phys_startup_64) -jiffies_64 = jiffies; _proxy_pda = 0; PHDRS { text PT_LOAD FLAGS(5); /* R_E */ @@ -88,8 +87,10 @@ SECTIONS __vsyscall_0 = VSYSCALL_VIRT_ADDR; . = ALIGN(CONFIG_X86_L1_CACHE_BYTES); - .xtime_lock : AT(VLOAD(.xtime_lock)) { *(.xtime_lock) } - xtime_lock = VVIRT(.xtime_lock); + .time_data : AT(VLOAD(.time_data)) { *(.time_data) } + time_data = VVIRT(.time_data); + jiffies = time_data; + jiffies_64 = time_data; .vxtime : AT(VLOAD(.vxtime)) { *(.vxtime) } vxtime = VVIRT(.vxtime); @@ -103,13 +104,6 @@ SECTIONS .sysctl_vsyscall : AT(VLOAD(.sysctl_vsyscall)) { *(.sysctl_vsyscall) } sysctl_vsyscall = VVIRT(.sysctl_vsyscall); - .xtime : AT(VLOAD(.xtime)) { *(.xtime) } - xtime = VVIRT(.xtime); - - . 
= ALIGN(CONFIG_X86_L1_CACHE_BYTES); - .jiffies : AT(VLOAD(.jiffies)) { *(.jiffies) } - jiffies = VVIRT(.jiffies); - .vsyscall_1 ADDR(.vsyscall_0) + 1024: AT(VLOAD(.vsyscall_1)) { *(.vsyscall_1) } .vsyscall_2 ADDR(.vsyscall_0) + 2048: AT(VLOAD(.vsyscall_2)) { *(.vsyscall_2) } .vsyscall_3 ADDR(.vsyscall_0) + 3072: AT(VLOAD(.vsyscall_3)) { *(.vsyscall_3) } --- linux-2.6.20-rc1-mm1/arch/i386/kernel/vmlinux.lds.S 2006-12-15 15:46:15.000000000 +0100 +++ linux-2.6.20-rc1-mm1-ed/arch/i386/kernel/vmlinux.lds.S 2006-12-15 17:22:15.000000000 +0100 @@ -25,7 +25,8 @@ OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386") OUTPUT_ARCH(i386) ENTRY(phys_startup_32) -jiffies = jiffies_64; +jiffies = time_data; +jiffies_64 = time_data; _proxy_pda = 0; PHDRS { --- linux-2.6.20-rc1-mm1/arch/ia64/kernel/asm-offsets.c 2006-12-14 02:14:23.000000000 +0100 +++ linux-2.6.20-rc1-mm1-ed/arch/ia64/kernel/asm-offsets.c 2006-12-15 17:19:07.000000000 +0100 @@ -268,4 +268,7 @@ void foo(void) DEFINE(IA64_TIME_SOURCE_MMIO64, TIME_SOURCE_MMIO64); DEFINE(IA64_TIME_SOURCE_MMIO32, TIME_SOURCE_MMIO32); DEFINE(IA64_TIMESPEC_TV_NSEC_OFFSET, offsetof (struct timespec, tv_nsec)); + DEFINE(IA64_XTIME_OFFSET, offsetof (struct time_data, _xtime)); + DEFINE(IA64_WALL_TO_MONOTONIC_OFFSET, offsetof (struct time_data, _wall_to_monotonic)); + DEFINE(IA64_XTIME_LOCK_OFFSET, offsetof (struct time_data, lock)); } --- linux-2.6.20-rc1-mm1/arch/ia64/kernel/vmlinux.lds.S 2006-12-15 15:46:15.000000000 +0100 +++ linux-2.6.20-rc1-mm1-ed/arch/ia64/kernel/vmlinux.lds.S 2006-12-15 17:19:07.000000000 +0100 @@ -4,6 +4,7 @@ #include <asm/system.h> #include <asm/sparsemem.h> #include <asm/pgtable.h> +#include <asm/asm-offsets.h> #define LOAD_OFFSET (KERNEL_START - KERNEL_TR_PAGE_SIZE) #include <asm-generic/vmlinux.lds.h> @@ -16,7 +17,11 @@ OUTPUT_FORMAT("elf64-ia64-little") OUTPUT_ARCH(ia64) ENTRY(phys_start) -jiffies = jiffies_64; +jiffies = time_data; +jiffies_64 = time_data; +xtime = time_data + IA64_XTIME_OFFSET; 
+wall_to_monotonic = time_data + IA64_WALL_TO_MONOTONIC_OFFSET; +xtime_lock = time_data + IA64_XTIME_LOCK_OFFSET; PHDRS { code PT_LOAD; percpu PT_LOAD; --- linux-2.6.20-rc1-mm1/arch/s390/appldata/appldata_os.c 2006-12-14 02:14:23.000000000 +0100 +++ linux-2.6.20-rc1-mm1-ed/arch/s390/appldata/appldata_os.c 2006-12-15 17:19:07.000000000 +0100 @@ -68,7 +68,7 @@ struct appldata_os_data { u32 nr_running; /* number of runnable threads */ u32 nr_threads; /* number of threads */ - u32 avenrun[3]; /* average nr. of running processes during */ + u32 _avenrun[3]; /* average nr. of running processes during */ /* the last 1, 5 and 15 minutes */ /* New in 2.6 */ @@ -98,11 +98,11 @@ static inline void appldata_print_debug( P_DEBUG("nr_threads = %u\n", os_data->nr_threads); P_DEBUG("nr_running = %u\n", os_data->nr_running); P_DEBUG("nr_iowait = %u\n", os_data->nr_iowait); - P_DEBUG("avenrun(int) = %8x / %8x / %8x\n", os_data->avenrun[0], - os_data->avenrun[1], os_data->avenrun[2]); - a0 = os_data->avenrun[0]; - a1 = os_data->avenrun[1]; - a2 = os_data->avenrun[2]; + P_DEBUG("avenrun(int) = %8x / %8x / %8x\n", os_data->_avenrun[0], + os_data->_avenrun[1], os_data->_avenrun[2]); + a0 = os_data->_avenrun[0]; + a1 = os_data->_avenrun[1]; + a2 = os_data->_avenrun[2]; P_DEBUG("avenrun(float) = %d.%02d / %d.%02d / %d.%02d\n", LOAD_INT(a0), LOAD_FRAC(a0), LOAD_INT(a1), LOAD_FRAC(a1), LOAD_INT(a2), LOAD_FRAC(a2)); @@ -145,9 +145,9 @@ static void appldata_get_os_data(void *d os_data->nr_threads = nr_threads; os_data->nr_running = nr_running(); os_data->nr_iowait = nr_iowait(); - os_data->avenrun[0] = avenrun[0] + (FIXED_1/200); - os_data->avenrun[1] = avenrun[1] + (FIXED_1/200); - os_data->avenrun[2] = avenrun[2] + (FIXED_1/200); + os_data->_avenrun[0] = avenrun[0] + (FIXED_1/200); + os_data->_avenrun[1] = avenrun[1] + (FIXED_1/200); + os_data->_avenrun[2] = avenrun[2] + (FIXED_1/200); j = 0; for_each_online_cpu(i) { ^ permalink raw reply [flat|nested] 54+ messages in thread
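The CALC_LOAD rewrite in the sched.h hunk above replaces the bare macro statements with a block that computes into a temporary, which keeps the update safe now that `time_data._avenrun[]` narrows the storage to `unsigned int`. A standalone sketch of the same fixed-point exponential average (constants copied from include/linux/sched.h; the helper function is ours, not the kernel's):

```c
#include <assert.h>

#define FSHIFT  11                /* bits of fixed-point precision */
#define FIXED_1 (1 << FSHIFT)     /* 1.0 in fixed point (2048) */
#define EXP_1   1884              /* 1/exp(5sec/1min), fixed point */

/*
 * Same arithmetic as the patched CALC_LOAD: widen into a temporary so
 * the intermediate product cannot clobber the stored value mid-update,
 * then shift back down to FSHIFT bits of precision.
 */
static unsigned long calc_load(unsigned long load, unsigned long exp,
			       unsigned long n)
{
	unsigned long temp = load;

	temp *= exp;
	temp += n * (FIXED_1 - exp);
	return temp >> FSHIFT;
}
```

Feeding `n = FIXED_1` (one runnable task) decays the average toward 1.0, and a load of exactly `FIXED_1` is a fixed point of the update, which is a quick sanity check on the constants.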
* Re: [PATCH] Export current_is_keventd() for libphy 2006-12-07 17:05 ` Jeff Garzik 2006-12-07 17:57 ` Andrew Morton @ 2006-12-07 18:08 ` Maciej W. Rozycki 2006-12-07 18:59 ` Andy Fleming 2 siblings, 0 replies; 54+ messages in thread From: Maciej W. Rozycki @ 2006-12-07 18:08 UTC (permalink / raw) To: Jeff Garzik, Andy Fleming Cc: Andrew Morton, Linus Torvalds, David Howells, rdreier, ben.collins, linux-kernel On Thu, 7 Dec 2006, Jeff Garzik wrote: > Looking into libphy's workqueue stuff, it has the following sequence: > > disable interrupts > schedule_work() > > ... time passes ... > ... workqueue routine is called ... > > enable interrupts > handle interrupt > > I really have to question if a workqueue was the best choice of direction for > such a sequence. You don't want to put off handling an interrupt, with > interrupts disabled, for a potentially unbounded amount of time. This is because to ack the interrupt in the device the MDIO bus has to be accessed and I gather for some implementations it may be too obnoxiously slow for the interrupt context to cope with. Note that only the interrupt line used for the PHY is disabled (though obviously with consequences to any sharers). Andy, could you please comment? Maciej ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] Export current_is_keventd() for libphy 2006-12-07 17:05 ` Jeff Garzik 2006-12-07 17:57 ` Andrew Morton 2006-12-07 18:08 ` [PATCH] Export current_is_keventd() for libphy Maciej W. Rozycki @ 2006-12-07 18:59 ` Andy Fleming 2 siblings, 0 replies; 54+ messages in thread From: Andy Fleming @ 2006-12-07 18:59 UTC (permalink / raw) To: Jeff Garzik Cc: Andrew Morton, torvalds, macro, David Howells, rdreier, ben.collins, linux-kernel On Dec 7, 2006, at 11:05, Jeff Garzik wrote: > Yes, I merged the code, but looking deeper at phy its clear I > missed some things. > > Looking into libphy's workqueue stuff, it has the following sequence: > > disable interrupts > schedule_work() > > ... time passes ... > ... workqueue routine is called ... > > enable interrupts > handle interrupt > > I really have to question if a workqueue was the best choice of > direction for such a sequence. You don't want to put off handling > an interrupt, with interrupts disabled, for a potentially unbounded > amount of time. > > Maybe the best course of action is to mark it with CONFIG_BROKEN > until it gets fixed. Yes, this is one of the design choices I made to be able to lock properly around MDIO transactions. 1) MDIO transactions take a long time 2) Some interfaces provide an interrupt to indicate the transaction has completed. So I didn't want an irq-disabling spin lock. It would prevent 2 from ever being used, and would mean all interrupts were disabled for thousands of cycles for MDIO transactions. So I decreed that no MDIO transactions can happen in interrupt context. But the registers to disable the specific PHY's interrupt are only accessible through MDIO, so in order to be able to follow that edict AND ever handle the interrupt, I need to disable the interrupt. But I only disable the PHY's interrupt (which is likely shared among some devices). I agree it's ugly, but I haven't yet figured out another way to do it. I'd love to eliminate the work queue altogether. 
I keep thinking that I could do something with semaphores, or spin_trylocks, but I'm not sure about the best way to let the interrupt handlers do their thing. Andy ^ permalink raw reply [flat|nested] 54+ messages in thread
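The constraint Andy describes — no MDIO transaction may run in interrupt context — is what forces the two-stage sequence Jeff quoted. Reduced to a toy single-threaded model (the function names mirror drivers/net/phy/phy.c, but the bodies are illustrative stand-ins, not kernel code):

```c
#include <assert.h>

static int phy_irq_enabled = 1;   /* state of the PHY's interrupt line */
static int phy_irq_acked;         /* has the PHY's irq register been acked? */
static int work_pending;          /* is phy_change() queued on keventd? */

/* Stand-in for the slow MDIO register access that acks the interrupt. */
static void mdio_ack(void)
{
	phy_irq_acked = 1;
}

/* Hard-irq context: only cheap operations are allowed here. */
static void phy_interrupt(void)
{
	phy_irq_enabled = 0;      /* disable_irq(): mask the line */
	work_pending = 1;         /* schedule_work(): defer the ack */
}

/* Process context, run later by keventd via the workqueue. */
static void phy_change(void)
{
	work_pending = 0;
	mdio_ack();               /* slow MDIO transaction, legal here */
	phy_irq_enabled = 1;      /* enable_irq(): line can fire again */
}
```

Between `phy_interrupt()` and `phy_change()` the line stays masked — the unbounded window Jeff objects to, and the reason tearing the driver down inside that window (the flush_scheduled_work() deadlock earlier in the thread) is so delicate.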
* Re: [PATCH] Export current_is_keventd() for libphy 2006-12-07 6:42 ` Andrew Morton 2006-12-07 7:49 ` Andrew Morton @ 2006-12-07 16:49 ` Linus Torvalds 2006-12-07 17:52 ` Andrew Morton 1 sibling, 1 reply; 54+ messages in thread From: Linus Torvalds @ 2006-12-07 16:49 UTC (permalink / raw) To: Andrew Morton Cc: David Howells, Maciej W. Rozycki, Roland Dreier, Andy Fleming, Ben Collins, Linux Kernel Mailing List, Jeff Garzik On Wed, 6 Dec 2006, Andrew Morton wrote: > > But this will return to the caller if the callback is presently running on > a different CPU. The whole point here is to be able to reliably kill off > the pending work so that the caller can free resources. I mentioned that in one of the emails. We do not _have_ the information to not do that. It simply doesn't exist. We can either wait for _all_ pending entries on the to complete (by waiting for the workqueue counters for added/removed to be the same), or we can have the race. > Also, I worry that this code can run the callback on the caller's CPU. Right. > Users of per-cpu workqueues can legitimately assume that each callback runs > on the right CPU. Not if they use this interface, they can't. They asked for it, they get it. That's the unix philosophy. Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] Export current_is_keventd() for libphy 2006-12-07 16:49 ` Linus Torvalds @ 2006-12-07 17:52 ` Andrew Morton 2006-12-07 18:01 ` Linus Torvalds 0 siblings, 1 reply; 54+ messages in thread From: Andrew Morton @ 2006-12-07 17:52 UTC (permalink / raw) To: Linus Torvalds Cc: David Howells, Maciej W. Rozycki, Roland Dreier, Andy Fleming, Ben Collins, Linux Kernel Mailing List, Jeff Garzik On Thu, 7 Dec 2006 08:49:02 -0800 (PST) Linus Torvalds <torvalds@osdl.org> wrote: > > > On Wed, 6 Dec 2006, Andrew Morton wrote: > > > > But this will return to the caller if the callback is presently running on > > a different CPU. The whole point here is to be able to reliably kill off > > the pending work so that the caller can free resources. > > I mentioned that in one of the emails. > > We do not _have_ the information to not do that. It simply doesn't exist. > We can either wait for _all_ pending entries on the to complete (by > waiting for the workqueue counters for added/removed to be the same), or > we can have the race. Well we'll need to add the infrastructure to be able to do this, won't we? The whole point of calling flush_scheduled_work() (which we're trying to replace/simplify) is to block the caller until it is safe to release resources. It might be a challenge to do this without adding more stuff to work_struct though. umm.. Putting a work_struct* into struct cpu_workqueue_struct and then doing appropriate things with cpu_workqueue_struct.lock might work. <hack, hack> Something along these lines. The keventd-calls-flush_work() case rather sucks though. 
diff -puN kernel/workqueue.c~a kernel/workqueue.c --- a/kernel/workqueue.c~a +++ a/kernel/workqueue.c @@ -323,6 +323,7 @@ static void run_workqueue(struct cpu_wor work_func_t f = work->func; list_del_init(cwq->worklist.next); + cwq->current_work = work; spin_unlock_irqrestore(&cwq->lock, flags); BUG_ON(get_wq_data(work) != cwq); @@ -342,6 +343,7 @@ static void run_workqueue(struct cpu_wor } spin_lock_irqsave(&cwq->lock, flags); + cwq->current_work = NULL; cwq->remove_sequence++; wake_up(&cwq->work_done); } @@ -425,6 +427,64 @@ static void flush_cpu_workqueue(struct c } } +static void wait_on_work(struct cpu_workqueue_struct *cwq, + struct work_struct *work) +{ + DEFINE_WAIT(wait); + + spin_lock_irq(&cwq->lock); + while (cwq->current_work == work) { + prepare_to_wait(&cwq->work_done, &wait, TASK_UNINTERRUPTIBLE); + spin_unlock_irq(&cwq->lock); + schedule(); + spin_lock_irq(&cwq->lock); + } + finish_wait(&cwq->work_done, &wait); + spin_unlock_irq(&cwq->lock); +} + +static void flush_one_work(struct cpu_workqueue_struct *cwq, + struct work_struct *work) +{ + spin_lock_irq(&cwq->lock); + if (test_and_clear_bit(WORK_STRUCT_PENDING, &work->management)) { + list_del_init(&work->entry); + spin_unlock_irq(&cwq->lock); + return; + } + spin_unlock_irq(&cwq->lock); + + /* It's running, or it has completed */ + + if (cwq->thread == current) { + /* This stinks */ + /* + * Probably keventd trying to flush its own queue. So simply run + * it by hand rather than deadlocking. + */ + run_workqueue(cwq); + } else { + wait_on_work(cwq, work); + } +} + +void flush_work(struct work_struct *work) +{ + might_sleep(); + + if (is_single_threaded(wq)) { + /* Always use first cpu's area. 
*/ + flush_one_work(per_cpu_ptr(wq->cpu_wq, singlethread_cpu), work); + } else { + int cpu; + + mutex_lock(&workqueue_mutex); + for_each_online_cpu(cpu) + flush_one_work(per_cpu_ptr(wq->cpu_wq, cpu), work); + mutex_unlock(&workqueue_mutex); + } +} + /** * flush_workqueue - ensure that any scheduled work has run to completion. * @wq: workqueue to flush _ ^ permalink raw reply [flat|nested] 54+ messages in thread
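The cancel-if-pending half of the flush_one_work() sketch above is easy to model in user space. This version keeps only the list manipulation (a deliberate simplification: no locking, no current_work wait, a single queue):

```c
#include <assert.h>
#include <stddef.h>

struct work_struct {
	struct work_struct *next;
	int pending;
};

static struct work_struct *worklist;   /* one CPU's queue, newest first */

static void queue_work(struct work_struct *work)
{
	work->pending = 1;
	work->next = worklist;
	worklist = work;
}

/*
 * Mirrors the first half of flush_one_work(): if the work is still
 * pending, unlink it and return 1 (cancelled before it ever ran).
 * The "already running" half -- sleeping until cwq->current_work no
 * longer points at this work -- is omitted from the model.
 */
static int flush_one_work(struct work_struct *work)
{
	struct work_struct **p;

	if (!work->pending)
		return 0;
	for (p = &worklist; *p; p = &(*p)->next) {
		if (*p == work) {
			*p = work->next;
			work->pending = 0;
			return 1;
		}
	}
	return 0;
}
```

Once flush_one_work() returns, the caller owns the work_struct again and can free the containing object — the flush-then-free pattern the thread is trying to make safe.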
* Re: [PATCH] Export current_is_keventd() for libphy 2006-12-07 17:52 ` Andrew Morton @ 2006-12-07 18:01 ` Linus Torvalds 2006-12-07 18:16 ` Andrew Morton 0 siblings, 1 reply; 54+ messages in thread From: Linus Torvalds @ 2006-12-07 18:01 UTC (permalink / raw) To: Andrew Morton Cc: David Howells, Maciej W. Rozycki, Roland Dreier, Andy Fleming, Ben Collins, Linux Kernel Mailing List, Jeff Garzik On Thu, 7 Dec 2006, Andrew Morton wrote: > > umm.. Putting a work_struct* into struct cpu_workqueue_struct and then > doing appropriate things with cpu_workqueue_struct.lock might work. Yeah, that looks sane. We can't hide anything in "struct work", because we can't trust it any more once it's been dispatched, but adding a pointer to the cpu_workqueue_struct that is only used to compare against another pointer sounds fine. Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] Export current_is_keventd() for libphy 2006-12-07 18:01 ` Linus Torvalds @ 2006-12-07 18:16 ` Andrew Morton 2006-12-07 18:27 ` Linus Torvalds 0 siblings, 1 reply; 54+ messages in thread From: Andrew Morton @ 2006-12-07 18:16 UTC (permalink / raw) To: Linus Torvalds Cc: David Howells, Maciej W. Rozycki, Roland Dreier, Andy Fleming, Ben Collins, Linux Kernel Mailing List, Jeff Garzik On Thu, 7 Dec 2006 10:01:56 -0800 (PST) Linus Torvalds <torvalds@osdl.org> wrote: > > > On Thu, 7 Dec 2006, Andrew Morton wrote: > > > > umm.. Putting a work_struct* into struct cpu_workqueue_struct and then > > doing appropriate things with cpu_workqueue_struct.lock might work. > > Yeah, that looks sane. We can't hide anything in "struct work", because we > can't trust it any more once it's been dispatched, We _can_ trust it in the context of void flush_work(struct work_struct *work) because the caller "owns" the work_struct. It'd be pretty nutty for the caller to pass in a pointer to something which could be freed at any time. Most flush_work_queue() callers do something like: flush_scheduled_work(); kfree(my_object_which_contains_a_work_struct); hopefully libphy follows that model... > but adding a pointer to > the cpu_workqueue_struct that is only used to compare against another > pointer sounds fine. ho-hum. I'll take a look at turning that into something which compiles, then I'll convert a few oft-used flush_scheduled_work() callers over to use it. To do this on a sensible timescale perhaps means that we should export current_is_keventd(), get the howling hordes off our backs. ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] Export current_is_keventd() for libphy 2006-12-07 18:16 ` Andrew Morton @ 2006-12-07 18:27 ` Linus Torvalds 0 siblings, 0 replies; 54+ messages in thread From: Linus Torvalds @ 2006-12-07 18:27 UTC (permalink / raw) To: Andrew Morton Cc: David Howells, Maciej W. Rozycki, Roland Dreier, Andy Fleming, Ben Collins, Linux Kernel Mailing List, Jeff Garzik On Thu, 7 Dec 2006, Andrew Morton wrote: > > We _can_ trust it in the context of > > void flush_work(struct work_struct *work) Yes, but the way the bits are defined, the "pending" bit is not meaningful as a synchronization event, for example - because _other_ users can't trust it once they've dispatched the function. So even in the synchronous run/flush_scheduled_work() kind of situation, you end up having to work with the fact that nobody _else_ can rely on the data structures, and that they are designed to work that way.. > ho-hum. I'll take a look at turning that into something which compiles, > then I'll convert a few oft-used flush_scheduled_work() callers over to use > it. To do this on a sensible timescale perhaps means that we should export > current_is_keventd(), get the howling hordes off our backs. Well, I simply committed my work that doesn't guarantee synchronization - the synchronization can now be added in kernel/workqueue.c any way we want. It's better than what we used to have, for sure, in both compiling and solving the practical problem, but also as a "go forward" point. Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] Export current_is_keventd() for libphy 2006-12-07 1:21 ` Linus Torvalds 2006-12-07 6:42 ` Andrew Morton @ 2006-12-07 15:28 ` Maciej W. Rozycki 1 sibling, 0 replies; 54+ messages in thread From: Maciej W. Rozycki @ 2006-12-07 15:28 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, David Howells, Roland Dreier, Andy Fleming, Ben Collins, Linux Kernel Mailing List, Jeff Garzik On Wed, 6 Dec 2006, Linus Torvalds wrote: > I didn't get any answers on this. I'd like to get this issue resolved, but > since I don't even use libphy on my main machine, I need somebody else to > test it for me. I tested it in the evening with the system I implemented it for originally. It seems to work -- it does not oops as would the original code without the call to flush_scheduled_work() nor does it deadlock as would code with the call (and no special precautions). Thanks a lot (and congrats for still being capable of doing something else from merging). Maciej ^ permalink raw reply [flat|nested] 54+ messages in thread