* [Intel-gfx] How to convert drivers/gpu/drm/i915/ to use local workqueue?
@ 2022-06-10 14:57 ` Tetsuo Handa
0 siblings, 0 replies; 22+ messages in thread
From: Tetsuo Handa @ 2022-06-10 14:57 UTC (permalink / raw)
To: Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin
Cc: Intel Graphics Development, DRI
Hello.
Like commit c4f135d643823a86 ("workqueue: Wrap flush_workqueue() using
a macro") explains, we are removing flush_scheduled_work() calls. And now
drivers/gpu/drm/i915/display/intel_display.c
drivers/gpu/drm/i915/gt/selftest_execlists.c
are the last flush_scheduled_work() callers which have no patch proposed.
I want to make a patch like
https://lkml.kernel.org/r/e9b95132-89cd-5cfc-1a09-966393c5ecb0@I-love.SAKURA.ne.jp
but I couldn't understand how to interpret drivers/gpu/drm/i915/ part.
There are many schedule_work()/schedule_delayed_work() callers within
drivers/gpu/drm/i915/ directory.
intel_modeset_driver_remove_noirq() in intel_display.c says
/* flush any delayed tasks or pending work */
flush_scheduled_work();
but intel_display.c itself does not call schedule_delayed_work().
Then, does this flush_scheduled_work() mean to wait all schedule_work()/schedule_delayed_work()
calls inside drivers/gpu/drm/i915/ directory?
wait_for_reset() in selftest_execlists.c says
flush_scheduled_work();
but selftest_execlists.c itself does not call schedule_work()/schedule_delayed_work().
Then, does this flush_scheduled_work() mean to wait all schedule_work()/schedule_delayed_work()
calls inside drivers/gpu/drm/i915/ directory, by sharing a WQ created for
intel_modeset_driver_remove_noirq() ?
^ permalink raw reply [flat|nested] 22+ messages in thread* How to convert drivers/gpu/drm/i915/ to use local workqueue? @ 2022-06-10 14:57 ` Tetsuo Handa 0 siblings, 0 replies; 22+ messages in thread From: Tetsuo Handa @ 2022-06-10 14:57 UTC (permalink / raw) To: Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin Cc: Intel Graphics Development, DRI Hello. Like commit c4f135d643823a86 ("workqueue: Wrap flush_workqueue() using a macro") explains, we are removing flush_scheduled_work() calls. And now drivers/gpu/drm/i915/display/intel_display.c drivers/gpu/drm/i915/gt/selftest_execlists.c are the last flush_scheduled_work() callers which have no patch proposed. I want to make a patch like https://lkml.kernel.org/r/e9b95132-89cd-5cfc-1a09-966393c5ecb0@I-love.SAKURA.ne.jp but I couldn't understand how to interpret drivers/gpu/drm/i915/ part. There are many schedule_work()/schedule_delayed_work() callers within drivers/gpu/drm/i915/ directory. intel_modeset_driver_remove_noirq() in intel_display.c says /* flush any delayed tasks or pending work */ flush_scheduled_work(); but intel_display.c itself does not call schedule_delayed_work(). Then, does this flush_scheduled_work() mean to wait all schedule_work()/schedule_delayed_work() calls inside drivers/gpu/drm/i915/ directory? wait_for_reset() in selftest_execlists.c says flush_scheduled_work(); but selftest_execlists.c itself does not call schedule_work()/schedule_delayed_work(). Then, does this flush_scheduled_work() mean to wait all schedule_work()/schedule_delayed_work() calls inside drivers/gpu/drm/i915/ directory, by sharing a WQ created for intel_modeset_driver_remove_noirq() ? ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Intel-gfx] How to convert drivers/gpu/drm/i915/ to use local workqueue? 2022-06-10 14:57 ` Tetsuo Handa @ 2022-06-30 4:30 ` Tetsuo Handa -1 siblings, 0 replies; 22+ messages in thread From: Tetsuo Handa @ 2022-06-30 4:30 UTC (permalink / raw) To: Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin Cc: Intel Graphics Development, DRI Ping? On 2022/06/10 23:57, Tetsuo Handa wrote: > Then, does this flush_scheduled_work() mean to wait all schedule_work()/schedule_delayed_work() > calls inside drivers/gpu/drm/i915/ directory? ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: How to convert drivers/gpu/drm/i915/ to use local workqueue? @ 2022-06-30 4:30 ` Tetsuo Handa 0 siblings, 0 replies; 22+ messages in thread From: Tetsuo Handa @ 2022-06-30 4:30 UTC (permalink / raw) To: Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin Cc: Intel Graphics Development, DRI Ping? On 2022/06/10 23:57, Tetsuo Handa wrote: > Then, does this flush_scheduled_work() mean to wait all schedule_work()/schedule_delayed_work() > calls inside drivers/gpu/drm/i915/ directory? ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Intel-gfx] How to convert drivers/gpu/drm/i915/ to use local workqueue? 2022-06-10 14:57 ` Tetsuo Handa @ 2022-06-30 7:46 ` Tvrtko Ursulin -1 siblings, 0 replies; 22+ messages in thread From: Tvrtko Ursulin @ 2022-06-30 7:46 UTC (permalink / raw) To: Tetsuo Handa, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi Cc: Intel Graphics Development, DRI Hi, On 10/06/2022 15:57, Tetsuo Handa wrote: > Hello. > > Like commit c4f135d643823a86 ("workqueue: Wrap flush_workqueue() using > a macro") explains, we are removing flush_scheduled_work() calls. And now > > drivers/gpu/drm/i915/display/intel_display.c > drivers/gpu/drm/i915/gt/selftest_execlists.c > > are the last flush_scheduled_work() callers which have no patch proposed. > I want to make a patch like > https://lkml.kernel.org/r/e9b95132-89cd-5cfc-1a09-966393c5ecb0@I-love.SAKURA.ne.jp > but I couldn't understand how to interpret drivers/gpu/drm/i915/ part. Could you provide some more context please? I did not immediately understand whether the goal is remove flush_schedule_work helper with no arguments, or actually stop drivers using the system work queues. Regards, Tvrtko > > > > There are many schedule_work()/schedule_delayed_work() callers within > drivers/gpu/drm/i915/ directory. > > intel_modeset_driver_remove_noirq() in intel_display.c says > > /* flush any delayed tasks or pending work */ > flush_scheduled_work(); > > but intel_display.c itself does not call schedule_delayed_work(). > Then, does this flush_scheduled_work() mean to wait all schedule_work()/schedule_delayed_work() > calls inside drivers/gpu/drm/i915/ directory? > > wait_for_reset() in selftest_execlists.c says > > flush_scheduled_work(); > > but selftest_execlists.c itself does not call schedule_work()/schedule_delayed_work(). > Then, does this flush_scheduled_work() mean to wait all schedule_work()/schedule_delayed_work() > calls inside drivers/gpu/drm/i915/ directory, by sharing a WQ created for > intel_modeset_driver_remove_noirq() ? ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: How to convert drivers/gpu/drm/i915/ to use local workqueue? @ 2022-06-30 7:46 ` Tvrtko Ursulin 0 siblings, 0 replies; 22+ messages in thread From: Tvrtko Ursulin @ 2022-06-30 7:46 UTC (permalink / raw) To: Tetsuo Handa, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi Cc: Intel Graphics Development, DRI Hi, On 10/06/2022 15:57, Tetsuo Handa wrote: > Hello. > > Like commit c4f135d643823a86 ("workqueue: Wrap flush_workqueue() using > a macro") explains, we are removing flush_scheduled_work() calls. And now > > drivers/gpu/drm/i915/display/intel_display.c > drivers/gpu/drm/i915/gt/selftest_execlists.c > > are the last flush_scheduled_work() callers which have no patch proposed. > I want to make a patch like > https://lkml.kernel.org/r/e9b95132-89cd-5cfc-1a09-966393c5ecb0@I-love.SAKURA.ne.jp > but I couldn't understand how to interpret drivers/gpu/drm/i915/ part. Could you provide some more context please? I did not immediately understand whether the goal is remove flush_schedule_work helper with no arguments, or actually stop drivers using the system work queues. Regards, Tvrtko > > > > There are many schedule_work()/schedule_delayed_work() callers within > drivers/gpu/drm/i915/ directory. > > intel_modeset_driver_remove_noirq() in intel_display.c says > > /* flush any delayed tasks or pending work */ > flush_scheduled_work(); > > but intel_display.c itself does not call schedule_delayed_work(). > Then, does this flush_scheduled_work() mean to wait all schedule_work()/schedule_delayed_work() > calls inside drivers/gpu/drm/i915/ directory? > > wait_for_reset() in selftest_execlists.c says > > flush_scheduled_work(); > > but selftest_execlists.c itself does not call schedule_work()/schedule_delayed_work(). > Then, does this flush_scheduled_work() mean to wait all schedule_work()/schedule_delayed_work() > calls inside drivers/gpu/drm/i915/ directory, by sharing a WQ created for > intel_modeset_driver_remove_noirq() ? ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Intel-gfx] How to convert drivers/gpu/drm/i915/ to use local workqueue? 2022-06-30 7:46 ` Tvrtko Ursulin @ 2022-06-30 8:06 ` Tetsuo Handa -1 siblings, 0 replies; 22+ messages in thread From: Tetsuo Handa @ 2022-06-30 8:06 UTC (permalink / raw) To: Tvrtko Ursulin, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi Cc: Intel Graphics Development, DRI On 2022/06/30 16:46, Tvrtko Ursulin wrote: > > Hi, > > On 10/06/2022 15:57, Tetsuo Handa wrote: >> Hello. >> >> Like commit c4f135d643823a86 ("workqueue: Wrap flush_workqueue() using >> a macro") explains, we are removing flush_scheduled_work() calls. And now >> >> drivers/gpu/drm/i915/display/intel_display.c >> drivers/gpu/drm/i915/gt/selftest_execlists.c >> >> are the last flush_scheduled_work() callers which have no patch proposed. >> I want to make a patch like >> https://lkml.kernel.org/r/e9b95132-89cd-5cfc-1a09-966393c5ecb0@I-love.SAKURA.ne.jp >> but I couldn't understand how to interpret drivers/gpu/drm/i915/ part. > > Could you provide some more context please? I did not immediately understand > whether the goal is remove flush_schedule_work helper with no arguments, or > actually stop drivers using the system work queues. The goal is to remove flush_schedule_work(). Any kernel module is expected to stop using system workqueues if that module needs to call flush_scheduled_work() or flush_workqueue(system_*_wq). Continuing use of system workqueues is OK as long as that module does not need to call flush_scheduled_work() nor flush_workqueue(system_*_wq). All in-tree kernel modules stopped calling flush_workqueue(system_*_wq) in 5.19. Many of in-tree kernel modules already have patches for stop calling flush_scheduled_work(). But gpu/drm/i915 is one of in-tree kernel modules which do not have patches for stop calling flush_scheduled_work(). I want help from those who are familiar with this module. Regards. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: How to convert drivers/gpu/drm/i915/ to use local workqueue? @ 2022-06-30 8:06 ` Tetsuo Handa 0 siblings, 0 replies; 22+ messages in thread From: Tetsuo Handa @ 2022-06-30 8:06 UTC (permalink / raw) To: Tvrtko Ursulin, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi Cc: Intel Graphics Development, DRI On 2022/06/30 16:46, Tvrtko Ursulin wrote: > > Hi, > > On 10/06/2022 15:57, Tetsuo Handa wrote: >> Hello. >> >> Like commit c4f135d643823a86 ("workqueue: Wrap flush_workqueue() using >> a macro") explains, we are removing flush_scheduled_work() calls. And now >> >> drivers/gpu/drm/i915/display/intel_display.c >> drivers/gpu/drm/i915/gt/selftest_execlists.c >> >> are the last flush_scheduled_work() callers which have no patch proposed. >> I want to make a patch like >> https://lkml.kernel.org/r/e9b95132-89cd-5cfc-1a09-966393c5ecb0@I-love.SAKURA.ne.jp >> but I couldn't understand how to interpret drivers/gpu/drm/i915/ part. > > Could you provide some more context please? I did not immediately understand > whether the goal is remove flush_schedule_work helper with no arguments, or > actually stop drivers using the system work queues. The goal is to remove flush_schedule_work(). Any kernel module is expected to stop using system workqueues if that module needs to call flush_scheduled_work() or flush_workqueue(system_*_wq). Continuing use of system workqueues is OK as long as that module does not need to call flush_scheduled_work() nor flush_workqueue(system_*_wq). All in-tree kernel modules stopped calling flush_workqueue(system_*_wq) in 5.19. Many of in-tree kernel modules already have patches for stop calling flush_scheduled_work(). But gpu/drm/i915 is one of in-tree kernel modules which do not have patches for stop calling flush_scheduled_work(). I want help from those who are familiar with this module. Regards. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Intel-gfx] How to convert drivers/gpu/drm/i915/ to use local workqueue? 2022-06-30 8:06 ` Tetsuo Handa @ 2022-06-30 10:17 ` Tvrtko Ursulin -1 siblings, 0 replies; 22+ messages in thread From: Tvrtko Ursulin @ 2022-06-30 10:17 UTC (permalink / raw) To: Tetsuo Handa, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi Cc: Intel Graphics Development, DRI On 30/06/2022 09:06, Tetsuo Handa wrote: > On 2022/06/30 16:46, Tvrtko Ursulin wrote: >> >> Hi, >> >> On 10/06/2022 15:57, Tetsuo Handa wrote: >>> Hello. >>> >>> Like commit c4f135d643823a86 ("workqueue: Wrap flush_workqueue() using >>> a macro") explains, we are removing flush_scheduled_work() calls. And now >>> >>> drivers/gpu/drm/i915/display/intel_display.c >>> drivers/gpu/drm/i915/gt/selftest_execlists.c >>> >>> are the last flush_scheduled_work() callers which have no patch proposed. >>> I want to make a patch like >>> https://lkml.kernel.org/r/e9b95132-89cd-5cfc-1a09-966393c5ecb0@I-love.SAKURA.ne.jp >>> but I couldn't understand how to interpret drivers/gpu/drm/i915/ part. >> >> Could you provide some more context please? I did not immediately understand >> whether the goal is remove flush_schedule_work helper with no arguments, or >> actually stop drivers using the system work queues. > > The goal is to remove flush_schedule_work(). > > Any kernel module is expected to stop using system workqueues if that module > needs to call flush_scheduled_work() or flush_workqueue(system_*_wq). > Continuing use of system workqueues is OK as long as that module does not > need to call flush_scheduled_work() nor flush_workqueue(system_*_wq). Could you give more context on reasoning here please? What is the difference between using the system_wq and flushing it from a random context? Or concern is about flushing from specific contexts? > All in-tree kernel modules stopped calling flush_workqueue(system_*_wq) in 5.19. > > Many of in-tree kernel modules already have patches for stop calling > flush_scheduled_work(). But gpu/drm/i915 is one of in-tree kernel modules > which do not have patches for stop calling flush_scheduled_work(). > > I want help from those who are familiar with this module. On the i915 specifics, the caller in drivers/gpu/drm/i915/gt/selftest_execlists.c I am pretty sure can be removed. It is synchronized with the error capture side of things which is not required for the test to work. I can send a patch for that or you can, as you prefer? The drivers/gpu/drm/i915/display/intel_display.c one will have to be looked at by Jani or someone familiar with display code paths. Regards, Tvrtko ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: How to convert drivers/gpu/drm/i915/ to use local workqueue? @ 2022-06-30 10:17 ` Tvrtko Ursulin 0 siblings, 0 replies; 22+ messages in thread From: Tvrtko Ursulin @ 2022-06-30 10:17 UTC (permalink / raw) To: Tetsuo Handa, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi Cc: Intel Graphics Development, DRI On 30/06/2022 09:06, Tetsuo Handa wrote: > On 2022/06/30 16:46, Tvrtko Ursulin wrote: >> >> Hi, >> >> On 10/06/2022 15:57, Tetsuo Handa wrote: >>> Hello. >>> >>> Like commit c4f135d643823a86 ("workqueue: Wrap flush_workqueue() using >>> a macro") explains, we are removing flush_scheduled_work() calls. And now >>> >>> drivers/gpu/drm/i915/display/intel_display.c >>> drivers/gpu/drm/i915/gt/selftest_execlists.c >>> >>> are the last flush_scheduled_work() callers which have no patch proposed. >>> I want to make a patch like >>> https://lkml.kernel.org/r/e9b95132-89cd-5cfc-1a09-966393c5ecb0@I-love.SAKURA.ne.jp >>> but I couldn't understand how to interpret drivers/gpu/drm/i915/ part. >> >> Could you provide some more context please? I did not immediately understand >> whether the goal is remove flush_schedule_work helper with no arguments, or >> actually stop drivers using the system work queues. > > The goal is to remove flush_schedule_work(). > > Any kernel module is expected to stop using system workqueues if that module > needs to call flush_scheduled_work() or flush_workqueue(system_*_wq). > Continuing use of system workqueues is OK as long as that module does not > need to call flush_scheduled_work() nor flush_workqueue(system_*_wq). Could you give more context on reasoning here please? What is the difference between using the system_wq and flushing it from a random context? Or concern is about flushing from specific contexts? > All in-tree kernel modules stopped calling flush_workqueue(system_*_wq) in 5.19. > > Many of in-tree kernel modules already have patches for stop calling > flush_scheduled_work(). But gpu/drm/i915 is one of in-tree kernel modules > which do not have patches for stop calling flush_scheduled_work(). > > I want help from those who are familiar with this module. On the i915 specifics, the caller in drivers/gpu/drm/i915/gt/selftest_execlists.c I am pretty sure can be removed. It is synchronized with the error capture side of things which is not required for the test to work. I can send a patch for that or you can, as you prefer? The drivers/gpu/drm/i915/display/intel_display.c one will have to be looked at by Jani or someone familiar with display code paths. Regards, Tvrtko ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Intel-gfx] How to convert drivers/gpu/drm/i915/ to use local workqueue? 2022-06-30 10:17 ` Tvrtko Ursulin @ 2022-06-30 11:19 ` Tetsuo Handa -1 siblings, 0 replies; 22+ messages in thread From: Tetsuo Handa @ 2022-06-30 11:19 UTC (permalink / raw) To: Tvrtko Ursulin, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi Cc: Intel Graphics Development, DRI On 2022/06/30 19:17, Tvrtko Ursulin wrote: > Could you give more context on reasoning here please? What is the difference > between using the system_wq and flushing it from a random context? Or concern > is about flushing from specific contexts? Excuse me, but I couldn't catch what you want. Thus, I show three patterns of problems with an example. Commit c4f135d643823a86 ("workqueue: Wrap flush_workqueue() using a macro") says: Tejun Heo commented that it makes no sense at all to call flush_workqueue() on the shared WQs as the caller has no idea what it's gonna end up waiting for. The "Think twice before calling this function! It's very easy to get into trouble if you don't take great care." warning message does not help avoiding problems. Let's change the direction to that developers had better use their local WQs if flush_scheduled_work()/flush_workqueue(system_*_wq) is inevitable. Three patterns of problems are: (a) Flushing from specific contexts (e.g. GFP_NOFS/GFP_NOIO) can cause deadlock (circular locking dependency) problem. (b) Flushing with specific locks (e.g. module_mutex) held can cause deadlock (circular locking dependency) problem. (c) Even if there is no possibility of deadlock, flushing with specific locks held can cause khungtaskd to complain. An example of (a): ext4 filesystem called flush_scheduled_work(), which meant to wait for only work item scheduled by ext4 filesystem, tried to also wait for work item scheduled by 9p filesystem. https://syzkaller.appspot.com/bug?extid=bde0f89deacca7c765b8 Fixed by reverting the problematic patch. An example of (b): It is GFP_KERNEL context when module's __exit function is called. But whether flush_workqueue() is called from restricted context depends on what locks does the module's __exit function hold. If request_module() is called from some work function using one of system-wide WQs, and flush_workqueue() is called on that WQ from module's __exit function, the kernel might deadlock on module_mutex lock. Making sure that flush_workqueue() is not called on system-wide WQs is the safer choice. Commit 1b3ce51dde365296 ("Input: psmouse-smbus - avoid flush_scheduled_work() usage") is for drivers/input/mouse/psmouse-smbus.c . An example of (c): ath9k driver calls schedule_work() via request_firmware_nowait(). https://syzkaller.appspot.com/bug?id=78a242c8f1f4d15752c8ef4efc22974e2c52c833 ath6kl driver calls flush_scheduled_work() which needlessly waits for completion of works scheduled by ath9k driver (i.e. loading firmware used by ath9k driver). https://syzkaller.appspot.com/bug?id=10a1cba59c42d11e12f897644341156eac9bb7ee Commit 4b2b8e748467985c ("ath6kl: avoid flush_scheduled_work() usage") in linux-next.git might be able to mitigate these problems. (Waiting for next merge window...) > On the i915 specifics, the caller in drivers/gpu/drm/i915/gt/selftest_execlists.c > I am pretty sure can be removed. It is synchronized with the error capture side of > things which is not required for the test to work. > > I can send a patch for that or you can, as you prefer? OK. Please send a patch for that, for that patch will go linux-next.git tree via a tree for gpu/drm/i915 driver. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: How to convert drivers/gpu/drm/i915/ to use local workqueue? @ 2022-06-30 11:19 ` Tetsuo Handa 0 siblings, 0 replies; 22+ messages in thread From: Tetsuo Handa @ 2022-06-30 11:19 UTC (permalink / raw) To: Tvrtko Ursulin, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi Cc: Intel Graphics Development, DRI On 2022/06/30 19:17, Tvrtko Ursulin wrote: > Could you give more context on reasoning here please? What is the difference > between using the system_wq and flushing it from a random context? Or concern > is about flushing from specific contexts? Excuse me, but I couldn't catch what you want. Thus, I show three patterns of problems with an example. Commit c4f135d643823a86 ("workqueue: Wrap flush_workqueue() using a macro") says: Tejun Heo commented that it makes no sense at all to call flush_workqueue() on the shared WQs as the caller has no idea what it's gonna end up waiting for. The "Think twice before calling this function! It's very easy to get into trouble if you don't take great care." warning message does not help avoiding problems. Let's change the direction to that developers had better use their local WQs if flush_scheduled_work()/flush_workqueue(system_*_wq) is inevitable. Three patterns of problems are: (a) Flushing from specific contexts (e.g. GFP_NOFS/GFP_NOIO) can cause deadlock (circular locking dependency) problem. (b) Flushing with specific locks (e.g. module_mutex) held can cause deadlock (circular locking dependency) problem. (c) Even if there is no possibility of deadlock, flushing with specific locks held can cause khungtaskd to complain. An example of (a): ext4 filesystem called flush_scheduled_work(), which meant to wait for only work item scheduled by ext4 filesystem, tried to also wait for work item scheduled by 9p filesystem. https://syzkaller.appspot.com/bug?extid=bde0f89deacca7c765b8 Fixed by reverting the problematic patch. An example of (b): It is GFP_KERNEL context when module's __exit function is called. But whether flush_workqueue() is called from restricted context depends on what locks does the module's __exit function hold. If request_module() is called from some work function using one of system-wide WQs, and flush_workqueue() is called on that WQ from module's __exit function, the kernel might deadlock on module_mutex lock. Making sure that flush_workqueue() is not called on system-wide WQs is the safer choice. Commit 1b3ce51dde365296 ("Input: psmouse-smbus - avoid flush_scheduled_work() usage") is for drivers/input/mouse/psmouse-smbus.c . An example of (c): ath9k driver calls schedule_work() via request_firmware_nowait(). https://syzkaller.appspot.com/bug?id=78a242c8f1f4d15752c8ef4efc22974e2c52c833 ath6kl driver calls flush_scheduled_work() which needlessly waits for completion of works scheduled by ath9k driver (i.e. loading firmware used by ath9k driver). https://syzkaller.appspot.com/bug?id=10a1cba59c42d11e12f897644341156eac9bb7ee Commit 4b2b8e748467985c ("ath6kl: avoid flush_scheduled_work() usage") in linux-next.git might be able to mitigate these problems. (Waiting for next merge window...) > On the i915 specifics, the caller in drivers/gpu/drm/i915/gt/selftest_execlists.c > I am pretty sure can be removed. It is synchronized with the error capture side of > things which is not required for the test to work. > > I can send a patch for that or you can, as you prefer? OK. Please send a patch for that, for that patch will go linux-next.git tree via a tree for gpu/drm/i915 driver. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Intel-gfx] How to convert drivers/gpu/drm/i915/ to use local workqueue? 2022-06-30 11:19 ` Tetsuo Handa @ 2022-06-30 13:09 ` Tvrtko Ursulin -1 siblings, 0 replies; 22+ messages in thread From: Tvrtko Ursulin @ 2022-06-30 13:09 UTC (permalink / raw) To: Tetsuo Handa, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi Cc: Intel Graphics Development, DRI On 30/06/2022 12:19, Tetsuo Handa wrote: > On 2022/06/30 19:17, Tvrtko Ursulin wrote: >> Could you give more context on reasoning here please? What is the difference >> between using the system_wq and flushing it from a random context? Or concern >> is about flushing from specific contexts? > > Excuse me, but I couldn't catch what you want. Thus, I show three patterns of > problems with an example. > > Commit c4f135d643823a86 ("workqueue: Wrap flush_workqueue() using a macro") says: > > Tejun Heo commented that it makes no sense at all to call flush_workqueue() > on the shared WQs as the caller has no idea what it's gonna end up waiting for. > > The "Think twice before calling this function! It's very easy to get into trouble > if you don't take great care." warning message does not help avoiding problems. > > Let's change the direction to that developers had better use their local WQs > if flush_scheduled_work()/flush_workqueue(system_*_wq) is inevitable. > > Three patterns of problems are: > > (a) Flushing from specific contexts (e.g. GFP_NOFS/GFP_NOIO) can cause deadlock > (circular locking dependency) problem. > > (b) Flushing with specific locks (e.g. module_mutex) held can cause deadlock > (circular locking dependency) problem. > > (c) Even if there is no possibility of deadlock, flushing with specific locks > held can cause khungtaskd to complain. > > An example of (a): > > ext4 filesystem called flush_scheduled_work(), which meant to wait for only > work item scheduled by ext4 filesystem, tried to also wait for work item > scheduled by 9p filesystem. > https://syzkaller.appspot.com/bug?extid=bde0f89deacca7c765b8 > > Fixed by reverting the problematic patch. > > An example of (b): > > It is GFP_KERNEL context when module's __exit function is called. But whether > flush_workqueue() is called from restricted context depends on what locks does > the module's __exit function hold. > > If request_module() is called from some work function using one of system-wide WQs, > and flush_workqueue() is called on that WQ from module's __exit function, the kernel > might deadlock on module_mutex lock. Making sure that flush_workqueue() is not called > on system-wide WQs is the safer choice. > > Commit 1b3ce51dde365296 ("Input: psmouse-smbus - avoid flush_scheduled_work() usage") > is for drivers/input/mouse/psmouse-smbus.c . > > An example of (c): > > ath9k driver calls schedule_work() via request_firmware_nowait(). > https://syzkaller.appspot.com/bug?id=78a242c8f1f4d15752c8ef4efc22974e2c52c833 > > ath6kl driver calls flush_scheduled_work() which needlessly waits for completion > of works scheduled by ath9k driver (i.e. loading firmware used by ath9k driver). > https://syzkaller.appspot.com/bug?id=10a1cba59c42d11e12f897644341156eac9bb7ee > > Commit 4b2b8e748467985c ("ath6kl: avoid flush_scheduled_work() usage") in linux-next.git > might be able to mitigate these problems. (Waiting for next merge window...) Okay, from 1b3ce51dde365296: "Flushing system-wide workqueues is dangerous and will be forbidden." Thank you, this exactly explains the motivation which is what I was after. I certainly agree there is a possibility for lock coupling via the shared wq so that is fine by me. >> On the i915 specifics, the caller in drivers/gpu/drm/i915/gt/selftest_execlists.c >> I am pretty sure can be removed. It is synchronized with the error capture side of >> things which is not required for the test to work. >> >> I can send a patch for that or you can, as you prefer? > > OK. Please send a patch for that, for that patch will go linux-next.git tree via > a tree for gpu/drm/i915 driver. Patch sent. If I am right the easiest solution was just to remove the flush. If I was wrong though I'll need to create a dedicated wq so we will see what our automated CI will say. Regards, Tvrtko ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: How to convert drivers/gpu/drm/i915/ to use local workqueue? @ 2022-06-30 13:09 ` Tvrtko Ursulin 0 siblings, 0 replies; 22+ messages in thread From: Tvrtko Ursulin @ 2022-06-30 13:09 UTC (permalink / raw) To: Tetsuo Handa, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi Cc: Intel Graphics Development, DRI On 30/06/2022 12:19, Tetsuo Handa wrote: > On 2022/06/30 19:17, Tvrtko Ursulin wrote: >> Could you give more context on reasoning here please? What is the difference >> between using the system_wq and flushing it from a random context? Or concern >> is about flushing from specific contexts? > > Excuse me, but I couldn't catch what you want. Thus, I show three patterns of > problems with an example. > > Commit c4f135d643823a86 ("workqueue: Wrap flush_workqueue() using a macro") says: > > Tejun Heo commented that it makes no sense at all to call flush_workqueue() > on the shared WQs as the caller has no idea what it's gonna end up waiting for. > > The "Think twice before calling this function! It's very easy to get into trouble > if you don't take great care." warning message does not help avoiding problems. > > Let's change the direction to that developers had better use their local WQs > if flush_scheduled_work()/flush_workqueue(system_*_wq) is inevitable. > > Three patterns of problems are: > > (a) Flushing from specific contexts (e.g. GFP_NOFS/GFP_NOIO) can cause deadlock > (circular locking dependency) problem. > > (b) Flushing with specific locks (e.g. module_mutex) held can cause deadlock > (circular locking dependency) problem. > > (c) Even if there is no possibility of deadlock, flushing with specific locks > held can cause khungtaskd to complain. > > An example of (a): > > ext4 filesystem called flush_scheduled_work(), which meant to wait for only > work item scheduled by ext4 filesystem, tried to also wait for work item > scheduled by 9p filesystem. > https://syzkaller.appspot.com/bug?extid=bde0f89deacca7c765b8 > > Fixed by reverting the problematic patch. > > An example of (b): > > It is GFP_KERNEL context when module's __exit function is called. But whether > flush_workqueue() is called from restricted context depends on what locks does > the module's __exit function hold. > > If request_module() is called from some work function using one of system-wide WQs, > and flush_workqueue() is called on that WQ from module's __exit function, the kernel > might deadlock on module_mutex lock. Making sure that flush_workqueue() is not called > on system-wide WQs is the safer choice. > > Commit 1b3ce51dde365296 ("Input: psmouse-smbus - avoid flush_scheduled_work() usage") > is for drivers/input/mouse/psmouse-smbus.c . > > An example of (c): > > ath9k driver calls schedule_work() via request_firmware_nowait(). > https://syzkaller.appspot.com/bug?id=78a242c8f1f4d15752c8ef4efc22974e2c52c833 > > ath6kl driver calls flush_scheduled_work() which needlessly waits for completion > of works scheduled by ath9k driver (i.e. loading firmware used by ath9k driver). > https://syzkaller.appspot.com/bug?id=10a1cba59c42d11e12f897644341156eac9bb7ee > > Commit 4b2b8e748467985c ("ath6kl: avoid flush_scheduled_work() usage") in linux-next.git > might be able to mitigate these problems. (Waiting for next merge window...) Okay, from 1b3ce51dde365296: "Flushing system-wide workqueues is dangerous and will be forbidden." Thank you, this exactly explains the motivation which is what I was after. I certainly agree there is a possibility for lock coupling via the shared wq so that is fine by me. >> On the i915 specifics, the caller in drivers/gpu/drm/i915/gt/selftest_execlists.c >> I am pretty sure can be removed. It is synchronized with the error capture side of >> things which is not required for the test to work. >> >> I can send a patch for that or you can, as you prefer? > > OK. Please send a patch for that, for that patch will go linux-next.git tree via > a tree for gpu/drm/i915 driver. Patch sent. If I am right the easiest solution was just to remove the flush. If I was wrong though I'll need to create a dedicated wq so we will see what our automated CI will say. Regards, Tvrtko ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Intel-gfx] How to convert drivers/gpu/drm/i915/ to use local workqueue? 2022-06-30 13:09 ` Tvrtko Ursulin @ 2022-06-30 13:59 ` Rodrigo Vivi -1 siblings, 0 replies; 22+ messages in thread From: Rodrigo Vivi @ 2022-06-30 13:59 UTC (permalink / raw) To: Tvrtko Ursulin; +Cc: Tetsuo Handa, Intel Graphics Development, DRI On Thu, Jun 30, 2022 at 02:09:02PM +0100, Tvrtko Ursulin wrote: > > On 30/06/2022 12:19, Tetsuo Handa wrote: > > On 2022/06/30 19:17, Tvrtko Ursulin wrote: > > > Could you give more context on reasoning here please? What is the difference > > > between using the system_wq and flushing it from a random context? Or concern > > > is about flushing from specific contexts? > > > > Excuse me, but I couldn't catch what you want. Thus, I show three patterns of > > problems with an example. > > > > Commit c4f135d643823a86 ("workqueue: Wrap flush_workqueue() using a macro") says: > > > > Tejun Heo commented that it makes no sense at all to call flush_workqueue() > > on the shared WQs as the caller has no idea what it's gonna end up waiting for. > > > > The "Think twice before calling this function! It's very easy to get into trouble > > if you don't take great care." warning message does not help avoiding problems. > > > > Let's change the direction to that developers had better use their local WQs > > if flush_scheduled_work()/flush_workqueue(system_*_wq) is inevitable. > > > > Three patterns of problems are: > > > > (a) Flushing from specific contexts (e.g. GFP_NOFS/GFP_NOIO) can cause deadlock > > (circular locking dependency) problem. > > > > (b) Flushing with specific locks (e.g. module_mutex) held can cause deadlock > > (circular locking dependency) problem. > > > > (c) Even if there is no possibility of deadlock, flushing with specific locks > > held can cause khungtaskd to complain. > > > > An example of (a): > > > > ext4 filesystem called flush_scheduled_work(), which meant to wait for only > > work item scheduled by ext4 filesystem, tried to also wait for work item > > scheduled by 9p filesystem. > > https://syzkaller.appspot.com/bug?extid=bde0f89deacca7c765b8 > > > > Fixed by reverting the problematic patch. > > > > An example of (b): > > > > It is GFP_KERNEL context when module's __exit function is called. But whether > > flush_workqueue() is called from restricted context depends on what locks does > > the module's __exit function hold. > > > > If request_module() is called from some work function using one of system-wide WQs, > > and flush_workqueue() is called on that WQ from module's __exit function, the kernel > > might deadlock on module_mutex lock. Making sure that flush_workqueue() is not called > > on system-wide WQs is the safer choice. > > > > Commit 1b3ce51dde365296 ("Input: psmouse-smbus - avoid flush_scheduled_work() usage") > > is for drivers/input/mouse/psmouse-smbus.c . > > > > An example of (c): > > > > ath9k driver calls schedule_work() via request_firmware_nowait(). > > https://syzkaller.appspot.com/bug?id=78a242c8f1f4d15752c8ef4efc22974e2c52c833 > > > > ath6kl driver calls flush_scheduled_work() which needlessly waits for completion > > of works scheduled by ath9k driver (i.e. loading firmware used by ath9k driver). > > https://syzkaller.appspot.com/bug?id=10a1cba59c42d11e12f897644341156eac9bb7ee > > > > Commit 4b2b8e748467985c ("ath6kl: avoid flush_scheduled_work() usage") in linux-next.git > > might be able to mitigate these problems. (Waiting for next merge window...) > > Okay, from 1b3ce51dde365296: > > "Flushing system-wide workqueues is dangerous and will be forbidden." > > Thank you, this exactly explains the motivation which is what I was after. I > certainly agree there is a possibility for lock coupling via the shared wq > so that is fine by me. > > > > On the i915 specifics, the caller in drivers/gpu/drm/i915/gt/selftest_execlists.c > > > I am pretty sure can be removed. It is synchronized with the error capture side of > > > things which is not required for the test to work. > > > > > > I can send a patch for that or you can, as you prefer? > > > > OK. Please send a patch for that, for that patch will go linux-next.git tree via > > a tree for gpu/drm/i915 driver. > > Patch sent. If I am right the easiest solution was just to remove the flush. > If I was wrong though I'll need to create a dedicated wq so we will see what > our automated CI will say. But besides of flush_scheduled_work() it looks like we also need to take care of the flush_workqueue() calls, no?! * i915_gem_drain_workqueue() * intel_ggtt.c: flush_workqueue(ggtt->vm.i915->wq); * i915_gem_pm.c: flush_workqueue(i915->wq); and the display ones for dev_priv->modeset_wq i915->flip_wq besides the flush_scheduled_work in intel_modeset_driver_remove_noirq > > Regards, > > Tvrtko ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: How to convert drivers/gpu/drm/i915/ to use local workqueue? @ 2022-06-30 13:59 ` Rodrigo Vivi 0 siblings, 0 replies; 22+ messages in thread From: Rodrigo Vivi @ 2022-06-30 13:59 UTC (permalink / raw) To: Tvrtko Ursulin; +Cc: Tetsuo Handa, Intel Graphics Development, DRI On Thu, Jun 30, 2022 at 02:09:02PM +0100, Tvrtko Ursulin wrote: > > On 30/06/2022 12:19, Tetsuo Handa wrote: > > On 2022/06/30 19:17, Tvrtko Ursulin wrote: > > > Could you give more context on reasoning here please? What is the difference > > > between using the system_wq and flushing it from a random context? Or concern > > > is about flushing from specific contexts? > > > > Excuse me, but I couldn't catch what you want. Thus, I show three patterns of > > problems with an example. > > > > Commit c4f135d643823a86 ("workqueue: Wrap flush_workqueue() using a macro") says: > > > > Tejun Heo commented that it makes no sense at all to call flush_workqueue() > > on the shared WQs as the caller has no idea what it's gonna end up waiting for. > > > > The "Think twice before calling this function! It's very easy to get into trouble > > if you don't take great care." warning message does not help avoiding problems. > > > > Let's change the direction to that developers had better use their local WQs > > if flush_scheduled_work()/flush_workqueue(system_*_wq) is inevitable. > > > > Three patterns of problems are: > > > > (a) Flushing from specific contexts (e.g. GFP_NOFS/GFP_NOIO) can cause deadlock > > (circular locking dependency) problem. > > > > (b) Flushing with specific locks (e.g. module_mutex) held can cause deadlock > > (circular locking dependency) problem. > > > > (c) Even if there is no possibility of deadlock, flushing with specific locks > > held can cause khungtaskd to complain. > > > > An example of (a): > > > > ext4 filesystem called flush_scheduled_work(), which meant to wait for only > > work item scheduled by ext4 filesystem, tried to also wait for work item > > scheduled by 9p filesystem. > > https://syzkaller.appspot.com/bug?extid=bde0f89deacca7c765b8 > > > > Fixed by reverting the problematic patch. > > > > An example of (b): > > > > It is GFP_KERNEL context when module's __exit function is called. But whether > > flush_workqueue() is called from restricted context depends on what locks does > > the module's __exit function hold. > > > > If request_module() is called from some work function using one of system-wide WQs, > > and flush_workqueue() is called on that WQ from module's __exit function, the kernel > > might deadlock on module_mutex lock. Making sure that flush_workqueue() is not called > > on system-wide WQs is the safer choice. > > > > Commit 1b3ce51dde365296 ("Input: psmouse-smbus - avoid flush_scheduled_work() usage") > > is for drivers/input/mouse/psmouse-smbus.c . > > > > An example of (c): > > > > ath9k driver calls schedule_work() via request_firmware_nowait(). > > https://syzkaller.appspot.com/bug?id=78a242c8f1f4d15752c8ef4efc22974e2c52c833 > > > > ath6kl driver calls flush_scheduled_work() which needlessly waits for completion > > of works scheduled by ath9k driver (i.e. loading firmware used by ath9k driver). > > https://syzkaller.appspot.com/bug?id=10a1cba59c42d11e12f897644341156eac9bb7ee > > > > Commit 4b2b8e748467985c ("ath6kl: avoid flush_scheduled_work() usage") in linux-next.git > > might be able to mitigate these problems. (Waiting for next merge window...) > > Okay, from 1b3ce51dde365296: > > "Flushing system-wide workqueues is dangerous and will be forbidden." > > Thank you, this exactly explains the motivation which is what I was after. I > certainly agree there is a possibility for lock coupling via the shared wq > so that is fine by me. > > > > On the i915 specifics, the caller in drivers/gpu/drm/i915/gt/selftest_execlists.c > > > I am pretty sure can be removed. It is synchronized with the error capture side of > > > things which is not required for the test to work. > > > > > > I can send a patch for that or you can, as you prefer? > > > > OK. Please send a patch for that, for that patch will go linux-next.git tree via > > a tree for gpu/drm/i915 driver. > > Patch sent. If I am right the easiest solution was just to remove the flush. > If I was wrong though I'll need to create a dedicated wq so we will see what > our automated CI will say. But besides of flush_scheduled_work() it looks like we also need to take care of the flush_workqueue() calls, no?! * i915_gem_drain_workqueue() * intel_ggtt.c: flush_workqueue(ggtt->vm.i915->wq); * i915_gem_pm.c: flush_workqueue(i915->wq); and the display ones for dev_priv->modeset_wq i915->flip_wq besides the flush_scheduled_work in intel_modeset_driver_remove_noirq > > Regards, > > Tvrtko ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Intel-gfx] How to convert drivers/gpu/drm/i915/ to use local workqueue? 2022-06-30 13:59 ` Rodrigo Vivi @ 2022-06-30 14:34 ` Jani Nikula -1 siblings, 0 replies; 22+ messages in thread From: Jani Nikula @ 2022-06-30 14:34 UTC (permalink / raw) To: Rodrigo Vivi, Tvrtko Ursulin Cc: Tetsuo Handa, Intel Graphics Development, DRI On Thu, 30 Jun 2022, Rodrigo Vivi <rodrigo.vivi@intel.com> wrote: > On Thu, Jun 30, 2022 at 02:09:02PM +0100, Tvrtko Ursulin wrote: >> >> On 30/06/2022 12:19, Tetsuo Handa wrote: >> > On 2022/06/30 19:17, Tvrtko Ursulin wrote: >> > > Could you give more context on reasoning here please? What is the difference >> > > between using the system_wq and flushing it from a random context? Or concern >> > > is about flushing from specific contexts? >> > >> > Excuse me, but I couldn't catch what you want. Thus, I show three patterns of >> > problems with an example. >> > >> > Commit c4f135d643823a86 ("workqueue: Wrap flush_workqueue() using a macro") says: >> > >> > Tejun Heo commented that it makes no sense at all to call flush_workqueue() >> > on the shared WQs as the caller has no idea what it's gonna end up waiting for. >> > >> > The "Think twice before calling this function! It's very easy to get into trouble >> > if you don't take great care." warning message does not help avoiding problems. >> > >> > Let's change the direction to that developers had better use their local WQs >> > if flush_scheduled_work()/flush_workqueue(system_*_wq) is inevitable. >> > >> > Three patterns of problems are: >> > >> > (a) Flushing from specific contexts (e.g. GFP_NOFS/GFP_NOIO) can cause deadlock >> > (circular locking dependency) problem. >> > >> > (b) Flushing with specific locks (e.g. module_mutex) held can cause deadlock >> > (circular locking dependency) problem. >> > >> > (c) Even if there is no possibility of deadlock, flushing with specific locks >> > held can cause khungtaskd to complain. >> > >> > An example of (a): >> > >> > ext4 filesystem called flush_scheduled_work(), which meant to wait for only >> > work item scheduled by ext4 filesystem, tried to also wait for work item >> > scheduled by 9p filesystem. >> > https://syzkaller.appspot.com/bug?extid=bde0f89deacca7c765b8 >> > >> > Fixed by reverting the problematic patch. >> > >> > An example of (b): >> > >> > It is GFP_KERNEL context when module's __exit function is called. But whether >> > flush_workqueue() is called from restricted context depends on what locks does >> > the module's __exit function hold. >> > >> > If request_module() is called from some work function using one of system-wide WQs, >> > and flush_workqueue() is called on that WQ from module's __exit function, the kernel >> > might deadlock on module_mutex lock. Making sure that flush_workqueue() is not called >> > on system-wide WQs is the safer choice. >> > >> > Commit 1b3ce51dde365296 ("Input: psmouse-smbus - avoid flush_scheduled_work() usage") >> > is for drivers/input/mouse/psmouse-smbus.c . >> > >> > An example of (c): >> > >> > ath9k driver calls schedule_work() via request_firmware_nowait(). >> > https://syzkaller.appspot.com/bug?id=78a242c8f1f4d15752c8ef4efc22974e2c52c833 >> > >> > ath6kl driver calls flush_scheduled_work() which needlessly waits for completion >> > of works scheduled by ath9k driver (i.e. loading firmware used by ath9k driver). >> > https://syzkaller.appspot.com/bug?id=10a1cba59c42d11e12f897644341156eac9bb7ee >> > >> > Commit 4b2b8e748467985c ("ath6kl: avoid flush_scheduled_work() usage") in linux-next.git >> > might be able to mitigate these problems. (Waiting for next merge window...) >> >> Okay, from 1b3ce51dde365296: >> >> "Flushing system-wide workqueues is dangerous and will be forbidden." >> >> Thank you, this exactly explains the motivation which is what I was after. I >> certainly agree there is a possibility for lock coupling via the shared wq >> so that is fine by me. >> >> > > On the i915 specifics, the caller in drivers/gpu/drm/i915/gt/selftest_execlists.c >> > > I am pretty sure can be removed. It is synchronized with the error capture side of >> > > things which is not required for the test to work. >> > > >> > > I can send a patch for that or you can, as you prefer? >> > >> > OK. Please send a patch for that, for that patch will go linux-next.git tree via >> > a tree for gpu/drm/i915 driver. >> >> Patch sent. If I am right the easiest solution was just to remove the flush. >> If I was wrong though I'll need to create a dedicated wq so we will see what >> our automated CI will say. > > But besides of flush_scheduled_work() it looks like > we also need to take care of the flush_workqueue() calls, no?! > > * i915_gem_drain_workqueue() > * intel_ggtt.c: flush_workqueue(ggtt->vm.i915->wq); > * i915_gem_pm.c: flush_workqueue(i915->wq); > > and the display ones for > dev_priv->modeset_wq > i915->flip_wq > > besides the flush_scheduled_work in intel_modeset_driver_remove_noirq I thought the problem was flushing the system-wide workqueues. The above calls flush our own. As to removing flush_scheduled_work() from intel_modeset_driver_remove_noirq(), I think we'll need to account for all the work and delayed work we've scheduled on the system workqueue, i.e. we need to cancel or flush each of them individually, as necessary. Some of them we do already, but some others, not really. For example we never cancel the fbc underrun work on the driver remove path AFAICT. And it's not even as simple as just adding the cancel_work_sync(&fbc->underrun_work) call in intel_fbc_cleanup(), because intel_fbc_cleanup() is called *after* intel_mode_config_cleanup(), i.e. the work function might get called after the crtc it accesses has been destroyed. So we're going to need to change the cleanup order too. Things have changed considerably since the flush was added in 1630fe754c83 ("drm/i915: Perform intel_enable_fbc() from a delayed task"). I suppose the alternative is to have a local i915 display workqueue, schedule all the random display works and delayed works on that, and then flush that wq instead of the system wq in intel_modeset_driver_remove_noirq(). IIUC, anyway. BR, Jani. -- Jani Nikula, Intel Open Source Graphics Center ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: How to convert drivers/gpu/drm/i915/ to use local workqueue? @ 2022-06-30 14:34 ` Jani Nikula 0 siblings, 0 replies; 22+ messages in thread From: Jani Nikula @ 2022-06-30 14:34 UTC (permalink / raw) To: Rodrigo Vivi, Tvrtko Ursulin Cc: Tetsuo Handa, Intel Graphics Development, DRI On Thu, 30 Jun 2022, Rodrigo Vivi <rodrigo.vivi@intel.com> wrote: > On Thu, Jun 30, 2022 at 02:09:02PM +0100, Tvrtko Ursulin wrote: >> >> On 30/06/2022 12:19, Tetsuo Handa wrote: >> > On 2022/06/30 19:17, Tvrtko Ursulin wrote: >> > > Could you give more context on reasoning here please? What is the difference >> > > between using the system_wq and flushing it from a random context? Or concern >> > > is about flushing from specific contexts? >> > >> > Excuse me, but I couldn't catch what you want. Thus, I show three patterns of >> > problems with an example. >> > >> > Commit c4f135d643823a86 ("workqueue: Wrap flush_workqueue() using a macro") says: >> > >> > Tejun Heo commented that it makes no sense at all to call flush_workqueue() >> > on the shared WQs as the caller has no idea what it's gonna end up waiting for. >> > >> > The "Think twice before calling this function! It's very easy to get into trouble >> > if you don't take great care." warning message does not help avoiding problems. >> > >> > Let's change the direction to that developers had better use their local WQs >> > if flush_scheduled_work()/flush_workqueue(system_*_wq) is inevitable. >> > >> > Three patterns of problems are: >> > >> > (a) Flushing from specific contexts (e.g. GFP_NOFS/GFP_NOIO) can cause deadlock >> > (circular locking dependency) problem. >> > >> > (b) Flushing with specific locks (e.g. module_mutex) held can cause deadlock >> > (circular locking dependency) problem. >> > >> > (c) Even if there is no possibility of deadlock, flushing with specific locks >> > held can cause khungtaskd to complain. >> > >> > An example of (a): >> > >> > ext4 filesystem called flush_scheduled_work(), which meant to wait for only >> > work item scheduled by ext4 filesystem, tried to also wait for work item >> > scheduled by 9p filesystem. >> > https://syzkaller.appspot.com/bug?extid=bde0f89deacca7c765b8 >> > >> > Fixed by reverting the problematic patch. >> > >> > An example of (b): >> > >> > It is GFP_KERNEL context when module's __exit function is called. But whether >> > flush_workqueue() is called from restricted context depends on what locks does >> > the module's __exit function hold. >> > >> > If request_module() is called from some work function using one of system-wide WQs, >> > and flush_workqueue() is called on that WQ from module's __exit function, the kernel >> > might deadlock on module_mutex lock. Making sure that flush_workqueue() is not called >> > on system-wide WQs is the safer choice. >> > >> > Commit 1b3ce51dde365296 ("Input: psmouse-smbus - avoid flush_scheduled_work() usage") >> > is for drivers/input/mouse/psmouse-smbus.c . >> > >> > An example of (c): >> > >> > ath9k driver calls schedule_work() via request_firmware_nowait(). >> > https://syzkaller.appspot.com/bug?id=78a242c8f1f4d15752c8ef4efc22974e2c52c833 >> > >> > ath6kl driver calls flush_scheduled_work() which needlessly waits for completion >> > of works scheduled by ath9k driver (i.e. loading firmware used by ath9k driver). >> > https://syzkaller.appspot.com/bug?id=10a1cba59c42d11e12f897644341156eac9bb7ee >> > >> > Commit 4b2b8e748467985c ("ath6kl: avoid flush_scheduled_work() usage") in linux-next.git >> > might be able to mitigate these problems. (Waiting for next merge window...) >> >> Okay, from 1b3ce51dde365296: >> >> "Flushing system-wide workqueues is dangerous and will be forbidden." >> >> Thank you, this exactly explains the motivation which is what I was after. I >> certainly agree there is a possibility for lock coupling via the shared wq >> so that is fine by me. >> >> > > On the i915 specifics, the caller in drivers/gpu/drm/i915/gt/selftest_execlists.c >> > > I am pretty sure can be removed. It is synchronized with the error capture side of >> > > things which is not required for the test to work. >> > > >> > > I can send a patch for that or you can, as you prefer? >> > >> > OK. Please send a patch for that, for that patch will go linux-next.git tree via >> > a tree for gpu/drm/i915 driver. >> >> Patch sent. If I am right the easiest solution was just to remove the flush. >> If I was wrong though I'll need to create a dedicated wq so we will see what >> our automated CI will say. > > But besides of flush_scheduled_work() it looks like > we also need to take care of the flush_workqueue() calls, no?! > > * i915_gem_drain_workqueue() > * intel_ggtt.c: flush_workqueue(ggtt->vm.i915->wq); > * i915_gem_pm.c: flush_workqueue(i915->wq); > > and the display ones for > dev_priv->modeset_wq > i915->flip_wq > > besides the flush_scheduled_work in intel_modeset_driver_remove_noirq I thought the problem was flushing the system-wide workqueues. The above calls flush our own. As to removing flush_scheduled_work() from intel_modeset_driver_remove_noirq(), I think we'll need to account for all the work and delayed work we've scheduled on the system workqueue, i.e. we need to cancel or flush each of them individually, as necessary. Some of them we do already, but some others, not really. For example we never cancel the fbc underrun work on the driver remove path AFAICT. And it's not even as simple as just adding the cancel_work_sync(&fbc->underrun_work) call in intel_fbc_cleanup(), because intel_fbc_cleanup() is called *after* intel_mode_config_cleanup(), i.e. the work function might get called after the crtc it accesses has been destroyed. So we're going to need to change the cleanup order too. Things have changed considerably since the flush was added in 1630fe754c83 ("drm/i915: Perform intel_enable_fbc() from a delayed task"). I suppose the alternative is to have a local i915 display workqueue, schedule all the random display works and delayed works on that, and then flush that wq instead of the system wq in intel_modeset_driver_remove_noirq(). IIUC, anyway. BR, Jani. -- Jani Nikula, Intel Open Source Graphics Center ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Intel-gfx] How to convert drivers/gpu/drm/i915/ to use local workqueue? 2022-06-30 14:34 ` Jani Nikula @ 2022-06-30 15:25 ` Rodrigo Vivi -1 siblings, 0 replies; 22+ messages in thread From: Rodrigo Vivi @ 2022-06-30 15:25 UTC (permalink / raw) To: Jani Nikula; +Cc: Tetsuo Handa, Intel Graphics Development, DRI On Thu, Jun 30, 2022 at 05:34:22PM +0300, Jani Nikula wrote: > On Thu, 30 Jun 2022, Rodrigo Vivi <rodrigo.vivi@intel.com> wrote: > > On Thu, Jun 30, 2022 at 02:09:02PM +0100, Tvrtko Ursulin wrote: > >> > >> On 30/06/2022 12:19, Tetsuo Handa wrote: > >> > On 2022/06/30 19:17, Tvrtko Ursulin wrote: > >> > > Could you give more context on reasoning here please? What is the difference > >> > > between using the system_wq and flushing it from a random context? Or concern > >> > > is about flushing from specific contexts? > >> > > >> > Excuse me, but I couldn't catch what you want. Thus, I show three patterns of > >> > problems with an example. > >> > > >> > Commit c4f135d643823a86 ("workqueue: Wrap flush_workqueue() using a macro") says: > >> > > >> > Tejun Heo commented that it makes no sense at all to call flush_workqueue() > >> > on the shared WQs as the caller has no idea what it's gonna end up waiting for. > >> > > >> > The "Think twice before calling this function! It's very easy to get into trouble > >> > if you don't take great care." warning message does not help avoiding problems. > >> > > >> > Let's change the direction to that developers had better use their local WQs > >> > if flush_scheduled_work()/flush_workqueue(system_*_wq) is inevitable. > >> > > >> > Three patterns of problems are: > >> > > >> > (a) Flushing from specific contexts (e.g. GFP_NOFS/GFP_NOIO) can cause deadlock > >> > (circular locking dependency) problem. > >> > > >> > (b) Flushing with specific locks (e.g. module_mutex) held can cause deadlock > >> > (circular locking dependency) problem. > >> > > >> > (c) Even if there is no possibility of deadlock, flushing with specific locks > >> > held can cause khungtaskd to complain. > >> > > >> > An example of (a): > >> > > >> > ext4 filesystem called flush_scheduled_work(), which meant to wait for only > >> > work item scheduled by ext4 filesystem, tried to also wait for work item > >> > scheduled by 9p filesystem. > >> > https://syzkaller.appspot.com/bug?extid=bde0f89deacca7c765b8 > >> > > >> > Fixed by reverting the problematic patch. > >> > > >> > An example of (b): > >> > > >> > It is GFP_KERNEL context when module's __exit function is called. But whether > >> > flush_workqueue() is called from restricted context depends on what locks does > >> > the module's __exit function hold. > >> > > >> > If request_module() is called from some work function using one of system-wide WQs, > >> > and flush_workqueue() is called on that WQ from module's __exit function, the kernel > >> > might deadlock on module_mutex lock. Making sure that flush_workqueue() is not called > >> > on system-wide WQs is the safer choice. > >> > > >> > Commit 1b3ce51dde365296 ("Input: psmouse-smbus - avoid flush_scheduled_work() usage") > >> > is for drivers/input/mouse/psmouse-smbus.c . > >> > > >> > An example of (c): > >> > > >> > ath9k driver calls schedule_work() via request_firmware_nowait(). > >> > https://syzkaller.appspot.com/bug?id=78a242c8f1f4d15752c8ef4efc22974e2c52c833 > >> > > >> > ath6kl driver calls flush_scheduled_work() which needlessly waits for completion > >> > of works scheduled by ath9k driver (i.e. loading firmware used by ath9k driver). > >> > https://syzkaller.appspot.com/bug?id=10a1cba59c42d11e12f897644341156eac9bb7ee > >> > > >> > Commit 4b2b8e748467985c ("ath6kl: avoid flush_scheduled_work() usage") in linux-next.git > >> > might be able to mitigate these problems. (Waiting for next merge window...) > >> > >> Okay, from 1b3ce51dde365296: > >> > >> "Flushing system-wide workqueues is dangerous and will be forbidden." > >> > >> Thank you, this exactly explains the motivation which is what I was after. I > >> certainly agree there is a possibility for lock coupling via the shared wq > >> so that is fine by me. > >> > >> > > On the i915 specifics, the caller in drivers/gpu/drm/i915/gt/selftest_execlists.c > >> > > I am pretty sure can be removed. It is synchronized with the error capture side of > >> > > things which is not required for the test to work. > >> > > > >> > > I can send a patch for that or you can, as you prefer? > >> > > >> > OK. Please send a patch for that, for that patch will go linux-next.git tree via > >> > a tree for gpu/drm/i915 driver. > >> > >> Patch sent. If I am right the easiest solution was just to remove the flush. > >> If I was wrong though I'll need to create a dedicated wq so we will see what > >> our automated CI will say. > > > > But besides of flush_scheduled_work() it looks like > > we also need to take care of the flush_workqueue() calls, no?! > > > > * i915_gem_drain_workqueue() > > * intel_ggtt.c: flush_workqueue(ggtt->vm.i915->wq); > > * i915_gem_pm.c: flush_workqueue(i915->wq); > > > > and the display ones for > > dev_priv->modeset_wq > > i915->flip_wq > > > > besides the flush_scheduled_work in intel_modeset_driver_remove_noirq > > I thought the problem was flushing the system-wide workqueues. The above > calls flush our own. oh, indeed, sorry for the noise. I had ignored the if for the new __warn_flushing... > > As to removing flush_scheduled_work() from >intel_modeset_driver_remove_noirq(), yeap, this is the only real one... > I think we'll need to account for > all the work and delayed work we've scheduled on the system workqueue, > i.e. we need to cancel or flush each of them individually, as > necessary. Some of them we do already, but some others, not really. > > For example we never cancel the fbc underrun work on the driver remove > path AFAICT. And it's not even as simple as just adding the > cancel_work_sync(&fbc->underrun_work) call in intel_fbc_cleanup(), > because intel_fbc_cleanup() is called *after* > intel_mode_config_cleanup(), i.e. the work function might get called > after the crtc it accesses has been destroyed. So we're going to need to > change the cleanup order too. > > Things have changed considerably since the flush was added in > 1630fe754c83 ("drm/i915: Perform intel_enable_fbc() from a delayed > task"). > > I suppose the alternative is to have a local i915 display workqueue, > schedule all the random display works and delayed works on that, and > then flush that wq instead of the system wq in > intel_modeset_driver_remove_noirq(). > > IIUC, anyway. > > > BR, > Jani. > > > -- > Jani Nikula, Intel Open Source Graphics Center ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: How to convert drivers/gpu/drm/i915/ to use local workqueue? @ 2022-06-30 15:25 ` Rodrigo Vivi 0 siblings, 0 replies; 22+ messages in thread From: Rodrigo Vivi @ 2022-06-30 15:25 UTC (permalink / raw) To: Jani Nikula; +Cc: Tetsuo Handa, Tvrtko Ursulin, Intel Graphics Development, DRI On Thu, Jun 30, 2022 at 05:34:22PM +0300, Jani Nikula wrote: > On Thu, 30 Jun 2022, Rodrigo Vivi <rodrigo.vivi@intel.com> wrote: > > On Thu, Jun 30, 2022 at 02:09:02PM +0100, Tvrtko Ursulin wrote: > >> > >> On 30/06/2022 12:19, Tetsuo Handa wrote: > >> > On 2022/06/30 19:17, Tvrtko Ursulin wrote: > >> > > Could you give more context on reasoning here please? What is the difference > >> > > between using the system_wq and flushing it from a random context? Or concern > >> > > is about flushing from specific contexts? > >> > > >> > Excuse me, but I couldn't catch what you want. Thus, I show three patterns of > >> > problems with an example. > >> > > >> > Commit c4f135d643823a86 ("workqueue: Wrap flush_workqueue() using a macro") says: > >> > > >> > Tejun Heo commented that it makes no sense at all to call flush_workqueue() > >> > on the shared WQs as the caller has no idea what it's gonna end up waiting for. > >> > > >> > The "Think twice before calling this function! It's very easy to get into trouble > >> > if you don't take great care." warning message does not help avoiding problems. > >> > > >> > Let's change the direction to that developers had better use their local WQs > >> > if flush_scheduled_work()/flush_workqueue(system_*_wq) is inevitable. > >> > > >> > Three patterns of problems are: > >> > > >> > (a) Flushing from specific contexts (e.g. GFP_NOFS/GFP_NOIO) can cause deadlock > >> > (circular locking dependency) problem. > >> > > >> > (b) Flushing with specific locks (e.g. module_mutex) held can cause deadlock > >> > (circular locking dependency) problem. > >> > > >> > (c) Even if there is no possibility of deadlock, flushing with specific locks > >> > held can cause khungtaskd to complain. > >> > > >> > An example of (a): > >> > > >> > ext4 filesystem called flush_scheduled_work(), which meant to wait for only > >> > work item scheduled by ext4 filesystem, tried to also wait for work item > >> > scheduled by 9p filesystem. > >> > https://syzkaller.appspot.com/bug?extid=bde0f89deacca7c765b8 > >> > > >> > Fixed by reverting the problematic patch. > >> > > >> > An example of (b): > >> > > >> > It is GFP_KERNEL context when module's __exit function is called. But whether > >> > flush_workqueue() is called from restricted context depends on what locks does > >> > the module's __exit function hold. > >> > > >> > If request_module() is called from some work function using one of system-wide WQs, > >> > and flush_workqueue() is called on that WQ from module's __exit function, the kernel > >> > might deadlock on module_mutex lock. Making sure that flush_workqueue() is not called > >> > on system-wide WQs is the safer choice. > >> > > >> > Commit 1b3ce51dde365296 ("Input: psmouse-smbus - avoid flush_scheduled_work() usage") > >> > is for drivers/input/mouse/psmouse-smbus.c . > >> > > >> > An example of (c): > >> > > >> > ath9k driver calls schedule_work() via request_firmware_nowait(). > >> > https://syzkaller.appspot.com/bug?id=78a242c8f1f4d15752c8ef4efc22974e2c52c833 > >> > > >> > ath6kl driver calls flush_scheduled_work() which needlessly waits for completion > >> > of works scheduled by ath9k driver (i.e. loading firmware used by ath9k driver). > >> > https://syzkaller.appspot.com/bug?id=10a1cba59c42d11e12f897644341156eac9bb7ee > >> > > >> > Commit 4b2b8e748467985c ("ath6kl: avoid flush_scheduled_work() usage") in linux-next.git > >> > might be able to mitigate these problems. (Waiting for next merge window...) > >> > >> Okay, from 1b3ce51dde365296: > >> > >> "Flushing system-wide workqueues is dangerous and will be forbidden." > >> > >> Thank you, this exactly explains the motivation which is what I was after. I > >> certainly agree there is a possibility for lock coupling via the shared wq > >> so that is fine by me. > >> > >> > > On the i915 specifics, the caller in drivers/gpu/drm/i915/gt/selftest_execlists.c > >> > > I am pretty sure can be removed. It is synchronized with the error capture side of > >> > > things which is not required for the test to work. > >> > > > >> > > I can send a patch for that or you can, as you prefer? > >> > > >> > OK. Please send a patch for that, for that patch will go linux-next.git tree via > >> > a tree for gpu/drm/i915 driver. > >> > >> Patch sent. If I am right the easiest solution was just to remove the flush. > >> If I was wrong though I'll need to create a dedicated wq so we will see what > >> our automated CI will say. > > > > But besides of flush_scheduled_work() it looks like > > we also need to take care of the flush_workqueue() calls, no?! > > > > * i915_gem_drain_workqueue() > > * intel_ggtt.c: flush_workqueue(ggtt->vm.i915->wq); > > * i915_gem_pm.c: flush_workqueue(i915->wq); > > > > and the display ones for > > dev_priv->modeset_wq > > i915->flip_wq > > > > besides the flush_scheduled_work in intel_modeset_driver_remove_noirq > > I thought the problem was flushing the system-wide workqueues. The above > calls flush our own. oh, indeed, sorry for the noise. I had ignored the if for the new __warn_flushing... > > As to removing flush_scheduled_work() from >intel_modeset_driver_remove_noirq(), yeap, this is the only real one... > I think we'll need to account for > all the work and delayed work we've scheduled on the system workqueue, > i.e. we need to cancel or flush each of them individually, as > necessary. Some of them we do already, but some others, not really. > > For example we never cancel the fbc underrun work on the driver remove > path AFAICT. And it's not even as simple as just adding the > cancel_work_sync(&fbc->underrun_work) call in intel_fbc_cleanup(), > because intel_fbc_cleanup() is called *after* > intel_mode_config_cleanup(), i.e. the work function might get called > after the crtc it accesses has been destroyed. So we're going to need to > change the cleanup order too. > > Things have changed considerably since the flush was added in > 1630fe754c83 ("drm/i915: Perform intel_enable_fbc() from a delayed > task"). > > I suppose the alternative is to have a local i915 display workqueue, > schedule all the random display works and delayed works on that, and > then flush that wq instead of the system wq in > intel_modeset_driver_remove_noirq(). > > IIUC, anyway. > > > BR, > Jani. > > > -- > Jani Nikula, Intel Open Source Graphics Center ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Intel-gfx] How to convert drivers/gpu/drm/i915/ to use local workqueue? 2022-06-30 13:09 ` Tvrtko Ursulin (?) (?) @ 2022-08-05 13:14 ` Tetsuo Handa 2022-08-09 16:22 ` Tvrtko Ursulin -1 siblings, 1 reply; 22+ messages in thread From: Tetsuo Handa @ 2022-08-05 13:14 UTC (permalink / raw) To: Tvrtko Ursulin, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi Cc: Intel Graphics Development On 2022/06/30 22:09, Tvrtko Ursulin wrote: >>> On the i915 specifics, the caller in drivers/gpu/drm/i915/gt/selftest_execlists.c >>> I am pretty sure can be removed. It is synchronized with the error capture side of >>> things which is not required for the test to work. >>> >>> I can send a patch for that or you can, as you prefer? >> >> OK. Please send a patch for that, for that patch will go linux-next.git tree via >> a tree for gpu/drm/i915 driver. > > Patch sent. If I am right the easiest solution was just to remove the flush. > If I was wrong though I'll need to create a dedicated wq so we will see what > our automated CI will say. How was the result? ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Intel-gfx] How to convert drivers/gpu/drm/i915/ to use local workqueue? 2022-08-05 13:14 ` [Intel-gfx] " Tetsuo Handa @ 2022-08-09 16:22 ` Tvrtko Ursulin 0 siblings, 0 replies; 22+ messages in thread From: Tvrtko Ursulin @ 2022-08-09 16:22 UTC (permalink / raw) To: Tetsuo Handa, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi Cc: Intel Graphics Development Hi, On 05/08/2022 14:14, Tetsuo Handa wrote: > On 2022/06/30 22:09, Tvrtko Ursulin wrote: >>>> On the i915 specifics, the caller in drivers/gpu/drm/i915/gt/selftest_execlists.c >>>> I am pretty sure can be removed. It is synchronized with the error capture side of >>>> things which is not required for the test to work. >>>> >>>> I can send a patch for that or you can, as you prefer? >>> >>> OK. Please send a patch for that, for that patch will go linux-next.git tree via >>> a tree for gpu/drm/i915 driver. >> >> Patch sent. If I am right the easiest solution was just to remove the flush. >> If I was wrong though I'll need to create a dedicated wq so we will see what >> our automated CI will say. > > How was the result? Results were good but I need someone from our side to review the patch before I can merge it (/me hinting to recipients on the thread). Regards, Tvrtko ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2022-08-09 16:23 UTC | newest] Thread overview: 22+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2022-06-10 14:57 [Intel-gfx] How to convert drivers/gpu/drm/i915/ to use local workqueue? Tetsuo Handa 2022-06-10 14:57 ` Tetsuo Handa 2022-06-30 4:30 ` [Intel-gfx] " Tetsuo Handa 2022-06-30 4:30 ` Tetsuo Handa 2022-06-30 7:46 ` [Intel-gfx] " Tvrtko Ursulin 2022-06-30 7:46 ` Tvrtko Ursulin 2022-06-30 8:06 ` [Intel-gfx] " Tetsuo Handa 2022-06-30 8:06 ` Tetsuo Handa 2022-06-30 10:17 ` [Intel-gfx] " Tvrtko Ursulin 2022-06-30 10:17 ` Tvrtko Ursulin 2022-06-30 11:19 ` [Intel-gfx] " Tetsuo Handa 2022-06-30 11:19 ` Tetsuo Handa 2022-06-30 13:09 ` [Intel-gfx] " Tvrtko Ursulin 2022-06-30 13:09 ` Tvrtko Ursulin 2022-06-30 13:59 ` [Intel-gfx] " Rodrigo Vivi 2022-06-30 13:59 ` Rodrigo Vivi 2022-06-30 14:34 ` [Intel-gfx] " Jani Nikula 2022-06-30 14:34 ` Jani Nikula 2022-06-30 15:25 ` [Intel-gfx] " Rodrigo Vivi 2022-06-30 15:25 ` Rodrigo Vivi 2022-08-05 13:14 ` [Intel-gfx] " Tetsuo Handa 2022-08-09 16:22 ` Tvrtko Ursulin
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.