* Sync timeout mechanisms - Request for coordination @ 2025-09-08 2:46 tuhaowen 2025-09-09 2:00 ` Saravana Kannan 0 siblings, 1 reply; 6+ messages in thread From: tuhaowen @ 2025-09-08 2:46 UTC (permalink / raw) To: wusamuel, saravanak Cc: rafael, len.brown, pavel, linux-pm, linux-kernel, huangbibo Hi Samuel and Saravana, I hope this email finds you well. I'm reaching out regarding the sync timeout mechanisms for suspend/hibernation that we've both been working on. Rafael from the upstream kernel has indicated that he would prefer us to coordinate our approaches rather than having two separate implementations. He mentioned your patch series "PM: Support aborting suspend during filesystem sync" and suggested we work together on a unified solution. I would like to discuss how we can merge our approaches. Below is a summary of my current implementation: **My approach - Time-based timeout mechanism:** - Introduces a configurable timeout for sync operations during both suspend and hibernation - Uses kthread with wait_for_completion_timeout() to implement timeout - Provides sysfs interface /sys/power/sleep_sync_timeout for runtime configuration - Default behavior unchanged (timeout disabled by default) - Addresses scenarios where sync is excessively slow without wakeup events You can see the detailed implementation and Rafael's feedback at: https://lore.kernel.org/linux-pm/CAJZ5v0jBRy=CvZiWQQaorvc-zT+kkaB6+S2TErGmkaPAGmHLOQ@mail.gmail.com/ **Key differences I see between our approaches:** 1. Your solution focuses on aborting sync when wakeup events occur 2. My solution addresses cases where there are no wakeup events but sync is excessively slow (e.g., slow/faulty storage) 3. Your approach uses workqueue + completion, mine uses kthread + timeout 4. Both aim to prevent indefinite hangs but cover different scenarios **Potential unified approach:** I believe both mechanisms could complement each other: - Event-driven abort (your approach) for responsive wakeup handling - Time-based timeout (my approach) for proactive protection against slow storage - Single, unified implementation in kernel/power/main.c Would you be interested in discussing how we can combine these approaches? I'm open to: 1. Merging the implementations into a single solution 2. Adopting your workqueue approach with added timeout functionality 3. Any other technical approach you think would work better Looking forward to your thoughts and collaboration. Best regards, Haowen Tu ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Sync timeout mechanisms - Request for coordination 2025-09-08 2:46 Sync timeout mechanisms - Request for coordination tuhaowen @ 2025-09-09 2:00 ` Saravana Kannan 2025-09-09 2:45 ` [PATCH v2] PM: Add configurable sync timeout for suspend and hibernation tuhaowen 0 siblings, 1 reply; 6+ messages in thread From: Saravana Kannan @ 2025-09-09 2:00 UTC (permalink / raw) To: tuhaowen Cc: wusamuel, rafael, len.brown, pavel, linux-pm, linux-kernel, huangbibo Hi Haowen, On Sun, Sep 7, 2025 at 7:47 PM tuhaowen <tuhaowen@uniontech.com> wrote: > > Hi Samuel and Saravana, > > I hope this email finds you well. I'm reaching out regarding the sync > timeout mechanisms for suspend/hibernation that we've both been working on. I hope you are well too and thanks for reaching out to us. > Rafael from the upstream kernel has indicated that he would prefer us to > coordinate our approaches rather than having two separate implementations. > He mentioned your patch series "PM: Support aborting suspend during > filesystem sync" and suggested we work together on a unified solution. > > I would like to discuss how we can merge our approaches. Below is a > summary of my current implementation: > > **My approach - Time-based timeout mechanism:** > - Introduces a configurable timeout for sync operations during both > suspend and hibernation > - Uses kthread with wait_for_completion_timeout() to implement timeout > - Provides sysfs interface /sys/power/sleep_sync_timeout for runtime > configuration > - Default behavior unchanged (timeout disabled by default) > - Addresses scenarios where sync is excessively slow without wakeup events This doesn't really fix our issue where we want to abort suspend only if we have to stay awake. If there's no need to abort the suspend (to deal with some user/wake source request), then it's okay to take the time to sync and then suspend. If you abort the suspend, it's just going to get attempted again and in fact waste power. I also don't understand how your patch fixes anything. If anything this makes things worse because the user might have expected their device to have hibernated only to come back hours later to see their battery dead. And even if the user is actively monitoring it, you still need to sync the file system fully before you eventually suspend. So, this doesn't really save any time or the need to sync. > > You can see the detailed implementation and Rafael's feedback at: > https://lore.kernel.org/linux-pm/CAJZ5v0jBRy=CvZiWQQaorvc-zT+kkaB6+S2TErGmkaPAGmHLOQ@mail.gmail.com/ > > **Key differences I see between our approaches:** > 1. Your solution focuses on aborting sync when wakeup events occur > 2. My solution addresses cases where there are no wakeup events but sync > is excessively slow (e.g., slow/faulty storage) > 3. Your approach uses workqueue + completion, mine uses kthread + timeout I don't think the workqueue vs kthread matters -- trivial implementation detail. The important point is when we want to abort. > 4. Both aim to prevent indefinite hangs but cover different scenarios I don't see the point of your change though. Why abort a suspend if there's no need to wake up? I think whatever use case issue you are hitting, it's more of an issue of not grabbing a wake source when it's needed. Can you explain the actual real world issue you are trying to fix? If it's the UI hanging and not responding to keypress/mouse move, then to me it looks like those should be marked as wake sources. > > **Potential unified approach:** > I believe both mechanisms could complement each other: > - Event-driven abort (your approach) for responsive wakeup handling > - Time-based timeout (my approach) for proactive protection against > slow storage > - Single, unified implementation in kernel/power/main.c > > Would you be interested in discussing how we can combine these approaches? > I'm open to: > 1. Merging the implementations into a single solution > 2. Adopting your workqueue approach with added timeout functionality > 3. Any other technical approach you think would work better I'm not convinced adding a timeout is the right solution. It just adds another point of failure and the need for a retry mechanism. However, if you are really set on proving the need for a timeout and implementing it, you can always add it as a separate patch after our patch lands. You can set up a timer to call pm_system_wakeup(). Or just grab a wake source after a time period. In fact, thinking more about this, you are generally having a problem with suspend taking too long to complete. Doesn't matter if it's due to file system sync or anything else. In which case, you should just use a timer to call pm_system_wakeup() and not fix just file system sync (which happens to be the current long pole). > Looking forward to your thoughts and collaboration. Hope my response helps. -Saravana ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v2] PM: Add configurable sync timeout for suspend and hibernation 2025-09-09 2:00 ` Saravana Kannan @ 2025-09-09 2:45 ` tuhaowen 2025-09-09 6:18 ` Saravana Kannan 0 siblings, 1 reply; 6+ messages in thread From: tuhaowen @ 2025-09-09 2:45 UTC (permalink / raw) To: saravanak Cc: huangbibo, len.brown, linux-kernel, linux-pm, pavel, rafael, tuhaowen, wusamuel Hi Saravana, Thank you for your detailed feedback and suggestions. I appreciate you taking the time to review my approach. Let me address your concerns and explain the real-world issues I'm trying to solve. > This doesn't really fix our issue where we want to abort suspend only > if we have to stay awake. If there's no need to abort the suspend (to > deal with some user/wake source request), then it's okay to take the > time to sync and then suspend. If you abort the suspend, it's just > going to get attempted again and in fact waste power. I understand your perspective, but I'm addressing a different class of problems. The key issue is user experience and system reliability when sync operations become indefinitely blocked. > I also don't understand how your patch fixes anything. If anything > this makes things worse because the user might have expected their > device to have hibernated only to come back hours later to see their > battery dead. And even if the user is actively monitoring it, you > still need to sync the file system fully before you eventually > suspend. So, this doesn't really save any time or the need to sync. This highlights exactly the problem I'm trying to solve. When a user initiates suspend/hibernation, the screen goes black immediately, giving the impression that the system has successfully entered sleep state. However, if ksys_sync() is blocked (which can happen in several scenarios I'll describe below), the system appears to be suspended but is actually still running at full power consumption in the background. The user has no way to know the system is stuck in sync - they see a black screen and assume suspension succeeded. This leads to exactly the scenario you mentioned: coming back hours later to find the battery dead, but with the added risk of data corruption since the sync never completed. > Can you explain the actual real world issue you are trying to fix? If > it's the UI hanging and not responding to keypress/mouse move, then to > me it looks like those should be marked as wake sources. Here are the specific real-world scenarios I've encountered: 1. **Device removal during file operations**: When copying large files to USB drives and then initiating suspend, if the USB device is removed during the sync process, some filesystems may not properly handle the device disappearance. The sync can become indefinitely blocked waiting for I/O operations on a device that no longer exists. 2. **Faulty storage devices**: Slow or failing storage devices can cause sync operations to hang for extended periods, making the system appear unresponsive. I've observed stack traces like this when the block device is removed during sync: ``` [<0>] __switch_to+0xd0/0x168 [<0>] iterate_supers+0x88/0x118 [<0>] ksys_sync+0x48/0xb8 [<0>] ksys_sync_helper+0x18/0xa0 [<0>] pm_suspend+0x260/0x3e8 ``` In this case, ksys_sync() is permanently blocked and will never complete, even though the user believes the system has suspended. My timeout approach provides several benefits: 1. **User feedback**: If I set a 1-minute timeout and sync doesn't complete, the system fails suspend gracefully and returns control to the user. This gives them clear indication that something went wrong, rather than leaving them with a black screen and unknown system state. 2. **Prevents false assumptions**: Users won't mistakenly believe their system is suspended when it's actually consuming full power. 3. **Allows recovery**: Users can investigate the issue, safely eject devices, or take other corrective actions. > However, if you are really set on proving the need for a timeout and > implementing it, you can always add it as a separate patch after our > patch lands. You can set up a timer to call pm_system_wakeup(). Or > just grab a wake source after a time period. I appreciate this suggestion, and I'm certainly willing to coordinate our approaches. However, I believe the sync timeout addresses a fundamentally different problem than wakeup-event-based abortion. Regarding your concern about sync completion after timeout, I want to clarify that in my updated implementation, I've removed kthread_stop() to ensure the sync operation continues in the background even after timeout. This means: 1. The suspend operation fails immediately with a timeout error, giving the user feedback 2. The sync operation continues running in the background to completion 3. Data integrity is preserved while providing responsive user experience This approach ensures that we get both the user experience benefits (immediate feedback) and data safety (background sync completion). I believe both our approaches serve different but complementary purposes: - Your approach: Handle cases where wakeup events should abort sync - My approach: Handle cases where sync becomes pathologically slow or blocked Would you be open to discussing how we might integrate both approaches? I'm happy to work on this as a follow-up patch series after your changes land, or we could explore a unified solution that handles both scenarios. > In fact, thinking more about this, you are generally having a problem > with suspend taking too long to complete. Doesn't matter if it's due > to file system sync or anything else. In which case, you should just > use a timer to call pm_system_wakeup() and not fix just file system > sync (which happens to be the current long pole). This is an interesting perspective. However, I believe filesystem sync deserves special attention because: 1. It's currently the most common source of indefinite hangs during suspend 2. The consequences of interrupted vs. timed-out sync are different (data integrity) 3. Sync has specific requirements for background completion that other suspend operations may not have I'd be very interested in your thoughts on this approach and whether we can find a path forward that addresses both our use cases. I really appreciate your patience in this discussion and look forward to continuing our collaboration on this important issue. Thanks again for taking the time to review and provide such detailed feedback. Best regards, Haowen Tu ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v2] PM: Add configurable sync timeout for suspend and hibernation 2025-09-09 2:45 ` [PATCH v2] PM: Add configurable sync timeout for suspend and hibernation tuhaowen @ 2025-09-09 6:18 ` Saravana Kannan 2025-09-09 7:13 ` Different approaches to sync timeout: Desktop vs Mobile use cases tuhaowen 0 siblings, 1 reply; 6+ messages in thread From: Saravana Kannan @ 2025-09-09 6:18 UTC (permalink / raw) To: tuhaowen Cc: huangbibo, len.brown, linux-kernel, linux-pm, pavel, rafael, wusamuel On Mon, Sep 8, 2025 at 7:46 PM tuhaowen <tuhaowen@uniontech.com> wrote: > > Hi Saravana, > > Thank you for your detailed feedback and suggestions. I appreciate you > taking the time to review my approach. Let me address your concerns and > explain the real-world issues I'm trying to solve. > > > This doesn't really fix our issue where we want to abort suspend only > > if we have to stay awake. If there's no need to abort the suspend (to > > deal with some user/wake source request), then it's okay to take the > > time to sync and then suspend. If you abort the suspend, it's just > > going to get attempted again and in fact waste power. > > I understand your perspective, but I'm addressing a different class of > problems. The key issue is user experience and system reliability when > sync operations become indefinitely blocked. > > > I also don't understand how your patch fixes anything. If anything > > this makes things worse because the user might have expected their > > device to have hibernated only to come back hours later to see their > > battery dead. And even if the user is actively monitoring it, you > > still need to sync the file system fully before you eventually > > suspend. So, this doesn't really save any time or the need to sync. > > This highlights exactly the problem I'm trying to solve. When a user > initiates suspend/hibernation, the screen goes black immediately, giving > the impression that the system has successfully entered sleep state. > However, if ksys_sync() is blocked (which can happen in several scenarios > I'll describe below), the system appears to be suspended but is actually > still running at full power consumption in the background. > > The user has no way to know the system is stuck in sync - they see a > black screen and assume suspension succeeded. This leads to exactly the > scenario you mentioned: coming back hours later to find the battery dead, > but with the added risk of data corruption since the sync never completed. > > > Can you explain the actual real world issue you are trying to fix? If > > it's the UI hanging and not responding to keypress/mouse move, then to > > me it looks like those should be marked as wake sources. > > Here are the specific real-world scenarios I've encountered: > > 1. **Device removal during file operations**: When copying large files > to USB drives and then initiating suspend, if the USB device is > removed during the sync process, some filesystems may not properly > handle the device disappearance. The sync can become indefinitely > blocked waiting for I/O operations on a device that no longer exists. > > 2. **Faulty storage devices**: Slow or failing storage devices can cause > sync operations to hang for extended periods, making the system appear > unresponsive. > > I've observed stack traces like this when the block device is removed during sync: > > ``` > [<0>] __switch_to+0xd0/0x168 > [<0>] iterate_supers+0x88/0x118 > [<0>] ksys_sync+0x48/0xb8 > [<0>] ksys_sync_helper+0x18/0xa0 > [<0>] pm_suspend+0x260/0x3e8 > ``` This seems like a fundamental sync issue you are trying to paper over/work around. I think the right fix should be in the ksys_sync() code. > In this case, ksys_sync() is permanently blocked and will never complete, > even though the user believes the system has suspended. > > My timeout approach provides several benefits: > > 1. **User feedback**: If I set a 1-minute timeout and sync doesn't > complete, the system fails suspend gracefully and returns control to > the user. This gives them clear indication that something went wrong, > rather than leaving them with a black screen and unknown system state. > > 2. **Prevents false assumptions**: Users won't mistakenly believe their > system is suspended when it's actually consuming full power. > > 3. **Allows recovery**: Users can investigate the issue, safely eject > devices, or take other corrective actions. If the user has closed the laptop, none of these will be noticed by the user. And the laptop will still consume power since it's not suspended and cause a dead battery hours later. > > However, if you are really set on proving the need for a timeout and > > implementing it, you can always add it as a separate patch after our > > patch lands. You can set up a timer to call pm_system_wakeup(). Or > > just grab a wake source after a time period. > > I appreciate this suggestion, and I'm certainly willing to coordinate our > approaches. However, I believe the sync timeout addresses a fundamentally > different problem than wakeup-event-based abortion. > > Regarding your concern about sync completion after timeout, I want to > clarify that in my updated implementation, I've removed kthread_stop() to > ensure the sync operation continues in the background even after timeout. I don't think the kthread_stop() is even correct to do. You are just randomly killing the sync thread where data structures and locks could be left in a bad state. > This means: > > 1. The suspend operation fails immediately with a timeout error, giving > the user feedback > 2. The sync operation continues running in the background to completion > 3. Data integrity is preserved while providing responsive user experience > > This approach ensures that we get both the user experience benefits > (immediate feedback) and data safety (background sync completion). > > I believe both our approaches serve different but complementary purposes: > - Your approach: Handle cases where wakeup events should abort sync > - My approach: Handle cases where sync becomes pathologically slow or > blocked I'm still not sold on the timeout being the right solution to your issue, but I'm not going to block the feature. > Would you be open to discussing how we might integrate both approaches? > I'm happy to work on this as a follow-up patch series after your changes > land, or we could explore a unified solution that handles both scenarios. I think a follow up patch is the best approach. We'll give you the APIs to abort as needed (timer in your case) and you can try to convince the community that the timeout is needed. > > > In fact, thinking more about this, you are generally having a problem > > with suspend taking too long to complete. Doesn't matter if it's due > > to file system sync or anything else. In which case, you should just > > use a timer to call pm_system_wakeup() and not fix just file system > > sync (which happens to be the current long pole). > > This is an interesting perspective. However, I believe filesystem sync > deserves special attention because: > > 1. It's currently the most common source of indefinite hangs during > suspend This is not a reason for special casing for the file system. > 2. The consequences of interrupted vs. timed-out sync are different > (data integrity) Not really. You do realize that even in our patch we let the sync continue in the background, right? > 3. Sync has specific requirements for background completion that other > suspend operations may not have Not sure what you mean. > I'd be very interested in your thoughts on this approach and whether we > can find a path forward that addresses both our use cases. I really > appreciate your patience in this discussion and look forward to > continuing our collaboration on this important issue. I don't want to block your effort, but I also can't support it. So a unified patch doesn't seem like the right way to go. As I mentioned above, I'd recommend building on top of what we land. > Thanks again for taking the time to review and provide such detailed > feedback. Same to you! Thanks, Saravana ^ permalink raw reply [flat|nested] 6+ messages in thread
* Different approaches to sync timeout: Desktop vs Mobile use cases 2025-09-09 6:18 ` Saravana Kannan @ 2025-09-09 7:13 ` tuhaowen 2025-09-09 8:51 ` Oliver Neukum 0 siblings, 1 reply; 6+ messages in thread From: tuhaowen @ 2025-09-09 7:13 UTC (permalink / raw) To: pavel, rafael Cc: saravanak, huangbibo, len.brown, linux-kernel, linux-pm, tuhaowen, wusamuel Hi Rafael and Linux PM maintainers, I hope this email finds you well. I'm writing to discuss the different approaches that have emerged regarding sync timeout mechanisms during suspend/hibernation, and to seek guidance on how to proceed. ## Background We (UnionTech) and the Google Android team (Saravana, Samuel) have been working on similar but fundamentally different solutions to address sync hangs during suspend operations. After discussions with Saravana, it has become clear that our approaches diverge significantly due to different use cases and environments. ## Our Approach - Desktop/PC Focus Our patch: https://lore.kernel.org/all/20250909065836.32534-1-tuhaowen@uniontech.com/ **Use Case**: Desktop/laptop systems where users expect immediate feedback **Problem**: When large file operations are in progress (e.g., copying files to USB drives) and users initiate suspend, ksys_sync() can hang indefinitely, especially when block devices are removed during the process. This results in: - Black screen with no user feedback - Users assuming the system has frozen - Need for forced power cycles - Poor user experience **Solution**: Configurable sync timeout (default disabled for compatibility) - If sync exceeds timeout, suspend fails gracefully with user feedback - Sync continues in background after system resumes - User can take corrective action (eject devices, etc.) - Prevents false impression of system freeze ## Google's Approach - Mobile/Android Focus Their patch: https://lore.kernel.org/all/CAGETcx8x5G75sQ9zVkxe+BpK7WsEk+LS6KDMd5BBR=xWPMtevg@mail.gmail.com/ **Use Case**: Mobile devices where power consumption is critical **Problem**: Unnecessary wake-ups during sync operations **Solution**: Abort sync only when wakeup events occur - Wait indefinitely for sync if no wakeup sources are active - Prioritize power efficiency over user responsiveness - Suitable for devices where black screen is expected behavior ## Key Differences 1. **User Expectations**: - Desktop: Users expect responsive feedback and ability to recover - Mobile: Users expect device to "just work" when they next unlock 2. **Hardware Context**: - Desktop: Removable storage, active file operations, power adapter available - Mobile: Integrated storage, controlled app lifecycle, battery dependent 3. **Problem Focus**: - Desktop: Prevent indefinite hangs and provide user feedback - Mobile: Prevent unnecessary power consumption 4. **Timeout Philosophy**: - Desktop: Proactive timeout to maintain responsiveness - Mobile: Reactive abort only when wakeup events require it ## Technical Considerations Both solutions address legitimate but different problems: - **Desktop scenario**: User copying large files, removes USB drive, initiates suspend -> ksys_sync() hangs indefinitely -> black screen -> user thinks system froze - **Mobile scenario**: Background sync during suspend -> unnecessary wake-ups -> battery drain ## Request for Guidance Given these different environments and requirements, we believe both approaches have merit for their respective use cases. However, we want to ensure we follow the kernel community's preferred approach for handling such divergent requirements. Would you prefer: 1. **Unified solution**: Combine both approaches in a single patchset (more complex but comprehensive) 2. **Separate solutions**: Allow both patches to coexist, targeting different use cases 3. **Staged approach**: Land one solution first, then build upon it (Saravana suggested this approach) 4. **Different direction**: An alternative approach we haven't considered ## Our Position We believe the desktop use case requires proactive timeout mechanisms because: - Users directly interact with the system and expect feedback - Removable storage creates unique failure scenarios - The cost of a failed suspend attempt is much lower than an unresponsive system - Data integrity is preserved through background sync continuation We're happy to collaborate on a unified solution, but we also recognize that desktop and mobile environments may legitimately require different approaches to the same underlying kernel mechanisms. Thank you for your time and guidance on this matter. We look forward to your feedback on how to best serve both desktop and mobile use cases within the kernel's power management subsystem. Best regards, Haowen Tu ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Different approaches to sync timeout: Desktop vs Mobile use cases 2025-09-09 7:13 ` Different approaches to sync timeout: Desktop vs Mobile use cases tuhaowen @ 2025-09-09 8:51 ` Oliver Neukum 0 siblings, 0 replies; 6+ messages in thread From: Oliver Neukum @ 2025-09-09 8:51 UTC (permalink / raw) To: tuhaowen, pavel, rafael Cc: saravanak, huangbibo, len.brown, linux-kernel, linux-pm, wusamuel Hi, On 09.09.25 09:13, tuhaowen wrote: > **Use Case**: Desktop/laptop systems where users expect immediate feedback > **Problem**: When large file operations are in progress (e.g., copying > files to USB drives) and users initiate suspend, ksys_sync() can hang It seems to me like both your approaches seek to mitigate the consequences of the issue rather than solve it. Your issue is that you have too many dirty blocks associated with a slow device. In other words, you should sync and need to limit the number of dirty blocks created after you've started syncing without freezing all of user space. Regards Oliver ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2025-09-09 8:51 UTC | newest] Thread overview: 6+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-09-08 2:46 Sync timeout mechanisms - Request for coordination tuhaowen 2025-09-09 2:00 ` Saravana Kannan 2025-09-09 2:45 ` [PATCH v2] PM: Add configurable sync timeout for suspend and hibernation tuhaowen 2025-09-09 6:18 ` Saravana Kannan 2025-09-09 7:13 ` Different approaches to sync timeout: Desktop vs Mobile use cases tuhaowen 2025-09-09 8:51 ` Oliver Neukum
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox