[PATCH] mm: avoid livelock on !__GFP

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] mm: avoid livelock on !__GFP_FS allocations
@ 2011-10-25  6:39 Colin Cross
  2011-10-25  7:40 ` Pekka Enberg
                   ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Colin Cross @ 2011-10-25  6:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Colin Cross, Andrew Morton, Mel Gorman, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, David Rientjes, linux-mm

Under the following conditions, __alloc_pages_slowpath can loop
forever:
gfp_mask & __GFP_WAIT is true
gfp_mask & __GFP_FS is false
reclaim and compaction make no progress
order <= PAGE_ALLOC_COSTLY_ORDER

These conditions happen very often during suspend and resume,
when pm_restrict_gfp_mask() effectively converts all GFP_KERNEL
allocations into __GFP_WAIT.

The oom killer is not run because gfp_mask & __GFP_FS is false,
but should_alloc_retry will always return true when order is less
than PAGE_ALLOC_COSTLY_ORDER.

Fix __alloc_pages_slowpath to skip retrying when oom killer is
not allowed by the GFP flags, the same way it would skip if the
oom killer was allowed but disabled.

Signed-off-by: Colin Cross <ccross@android.com>
---

An alternative patch would add a did_some_progress argument to
__alloc_pages_may_oom, and remove the checks in
__alloc_pages_slowpath that require knowledge of when
__alloc_pages_may_oom chooses to run out_of_memory. If
did_some_progress was still zero, it would goto nopage whether
or not __alloc_pages_may_oom was actually called.

 mm/page_alloc.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fef8dc3..dcd99b3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2193,6 +2193,10 @@ rebalance:
 			}

 			goto restart;
+		} else {
+			/* If we aren't going to try the OOM killer, give up */
+			if (!(gfp_mask & __GFP_NOFAIL))
+				goto nopage;
 		}
 	}

-- 
1.7.4.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-10-25  6:39 Colin Cross
@ 2011-10-25  7:40 ` Pekka Enberg
  2011-10-25  7:51   ` Colin Cross
  2011-10-25  9:09 ` Mel Gorman
  2011-10-25 22:10 ` David Rientjes
  2 siblings, 1 reply; 47+ messages in thread
From: Pekka Enberg @ 2011-10-25  7:40 UTC (permalink / raw)
  To: Colin Cross
  Cc: linux-kernel, Andrew Morton, Mel Gorman, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, David Rientjes, linux-mm

On Tue, Oct 25, 2011 at 9:39 AM, Colin Cross <ccross@android.com> wrote:
> Under the following conditions, __alloc_pages_slowpath can loop
> forever:
> gfp_mask & __GFP_WAIT is true
> gfp_mask & __GFP_FS is false
> reclaim and compaction make no progress
> order <= PAGE_ALLOC_COSTLY_ORDER
>
> These conditions happen very often during suspend and resume,
> when pm_restrict_gfp_mask() effectively converts all GFP_KERNEL
> allocations into __GFP_WAIT.

Why does it do that? Why don't we fix the gfp mask instead?

> The oom killer is not run because gfp_mask & __GFP_FS is false,
> but should_alloc_retry will always return true when order is less
> than PAGE_ALLOC_COSTLY_ORDER.
>
> Fix __alloc_pages_slowpath to skip retrying when oom killer is
> not allowed by the GFP flags, the same way it would skip if the
> oom killer was allowed but disabled.
>
> Signed-off-by: Colin Cross <ccross@android.com>
> ---
>
> An alternative patch would add a did_some_progress argument to
> __alloc_pages_may_oom, and remove the checks in
> __alloc_pages_slowpath that require knowledge of when
> __alloc_pages_may_oom chooses to run out_of_memory. If
> did_some_progress was still zero, it would goto nopage whether
> or not __alloc_pages_may_oom was actually called.
>
>  mm/page_alloc.c |    4 ++++
>  1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index fef8dc3..dcd99b3 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2193,6 +2193,10 @@ rebalance:
>                        }
>
>                        goto restart;
> +               } else {
> +                       /* If we aren't going to try the OOM killer, give up */
> +                       if (!(gfp_mask & __GFP_NOFAIL))
> +                               goto nopage;
>                }
>        }

I don't quite understand how __GFP_WAIT is involved here. Which path
is causing the infinite loop?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-10-25  7:40 ` Pekka Enberg
@ 2011-10-25  7:51   ` Colin Cross
  2011-10-25  8:08     ` Pekka Enberg
  2011-10-25 22:12     ` David Rientjes
  0 siblings, 2 replies; 47+ messages in thread
From: Colin Cross @ 2011-10-25  7:51 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: linux-kernel, Andrew Morton, Mel Gorman, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, David Rientjes, linux-mm

On Tue, Oct 25, 2011 at 12:40 AM, Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> On Tue, Oct 25, 2011 at 9:39 AM, Colin Cross <ccross@android.com> wrote:
>> Under the following conditions, __alloc_pages_slowpath can loop
>> forever:
>> gfp_mask & __GFP_WAIT is true
>> gfp_mask & __GFP_FS is false
>> reclaim and compaction make no progress
>> order <= PAGE_ALLOC_COSTLY_ORDER
>>
>> These conditions happen very often during suspend and resume,
>> when pm_restrict_gfp_mask() effectively converts all GFP_KERNEL
>> allocations into __GFP_WAIT.
>
> Why does it do that? Why don't we fix the gfp mask instead?
It disables __GFP_IO and __GFP_FS because the IO drivers may be suspended.

>> The oom killer is not run because gfp_mask & __GFP_FS is false,
>> but should_alloc_retry will always return true when order is less
>> than PAGE_ALLOC_COSTLY_ORDER.
>>
>> Fix __alloc_pages_slowpath to skip retrying when oom killer is
>> not allowed by the GFP flags, the same way it would skip if the
>> oom killer was allowed but disabled.
>>
>> Signed-off-by: Colin Cross <ccross@android.com>
>> ---
>>
>> An alternative patch would add a did_some_progress argument to
>> __alloc_pages_may_oom, and remove the checks in
>> __alloc_pages_slowpath that require knowledge of when
>> __alloc_pages_may_oom chooses to run out_of_memory. If
>> did_some_progress was still zero, it would goto nopage whether
>> or not __alloc_pages_may_oom was actually called.
>>
>>  mm/page_alloc.c |    4 ++++
>>  1 files changed, 4 insertions(+), 0 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index fef8dc3..dcd99b3 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -2193,6 +2193,10 @@ rebalance:
>>                        }
>>
>>                        goto restart;
>> +               } else {
>> +                       /* If we aren't going to try the OOM killer, give up */
>> +                       if (!(gfp_mask & __GFP_NOFAIL))
>> +                               goto nopage;
>>                }
>>        }
>
> I don't quite understand how __GFP_WAIT is involved here. Which path
> is causing the infinite loop?
GFP_KERNEL is __GFP_WAIT | __GFP_IO | __GFP_FS.  Once driver suspend
has started, gfp_allowed_mask is ~(__GFP_IO | GFP_FS), so any call to
__alloc_pages_nodemask(GFP_KERNEL, ...) gets masked to effectively
__alloc_pages_nodemask(__GFP_WAIT, ...).

The loop is in __alloc_pages_slowpath, from the rebalance label to
should_alloc_retry.  Under the conditions I listed in the commit
message, there is no path to the nopage label, because all the
relevant "goto nopage" lines that would normally allow a GFP_KERNEL
allocation to fail are inside a check for __GFP_FS.

Modifying the gfp_allowed_mask would not completely fix the issue, a
GFP_NOIO allocation can meet the conditions outside of suspend.
gfp_allowed_mask just makes the issue more likely, by converting
GFP_KERNEL into GFP_NOIO.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-10-25  7:51   ` Colin Cross
@ 2011-10-25  8:08     ` Pekka Enberg
  2011-10-25 22:12     ` David Rientjes
  1 sibling, 0 replies; 47+ messages in thread
From: Pekka Enberg @ 2011-10-25  8:08 UTC (permalink / raw)
  To: Colin Cross
  Cc: linux-kernel, Andrew Morton, Mel Gorman, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, David Rientjes, linux-mm

Hi Colin,

On Tue, Oct 25, 2011 at 9:39 AM, Colin Cross <ccross@android.com> wrote:
>>> Under the following conditions, __alloc_pages_slowpath can loop
>>> forever:
>>> gfp_mask & __GFP_WAIT is true
>>> gfp_mask & __GFP_FS is false
>>> reclaim and compaction make no progress
>>> order <= PAGE_ALLOC_COSTLY_ORDER
>>>
>>> These conditions happen very often during suspend and resume,
>>> when pm_restrict_gfp_mask() effectively converts all GFP_KERNEL
>>> allocations into __GFP_WAIT.

On Tue, Oct 25, 2011 at 12:40 AM, Pekka Enberg <penberg@cs.helsinki.fi> wrote:
>> Why does it do that? Why don't we fix the gfp mask instead?

On Tue, Oct 25, 2011 at 10:51 AM, Colin Cross <ccross@android.com> wrote:
> It disables __GFP_IO and __GFP_FS because the IO drivers may be suspended.

Sure but why doesn't it clear __GFP_WAIT too?

>>> The oom killer is not run because gfp_mask & __GFP_FS is false,
>>> but should_alloc_retry will always return true when order is less
>>> than PAGE_ALLOC_COSTLY_ORDER.
>>>
>>> Fix __alloc_pages_slowpath to skip retrying when oom killer is
>>> not allowed by the GFP flags, the same way it would skip if the
>>> oom killer was allowed but disabled.
>>>
>>> Signed-off-by: Colin Cross <ccross@android.com>
>>> ---
>>>
>>> An alternative patch would add a did_some_progress argument to
>>> __alloc_pages_may_oom, and remove the checks in
>>> __alloc_pages_slowpath that require knowledge of when
>>> __alloc_pages_may_oom chooses to run out_of_memory. If
>>> did_some_progress was still zero, it would goto nopage whether
>>> or not __alloc_pages_may_oom was actually called.
>>>
>>>  mm/page_alloc.c |    4 ++++
>>>  1 files changed, 4 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index fef8dc3..dcd99b3 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -2193,6 +2193,10 @@ rebalance:
>>>                        }
>>>
>>>                        goto restart;
>>> +               } else {
>>> +                       /* If we aren't going to try the OOM killer, give up */
>>> +                       if (!(gfp_mask & __GFP_NOFAIL))
>>> +                               goto nopage;
>>>                }
>>>        }
>>
>> I don't quite understand how __GFP_WAIT is involved here. Which path
>> is causing the infinite loop?
>
> GFP_KERNEL is __GFP_WAIT | __GFP_IO | __GFP_FS.  Once driver suspend
> has started, gfp_allowed_mask is ~(__GFP_IO | GFP_FS), so any call to
> __alloc_pages_nodemask(GFP_KERNEL, ...) gets masked to effectively
> __alloc_pages_nodemask(__GFP_WAIT, ...).
>
> The loop is in __alloc_pages_slowpath, from the rebalance label to
> should_alloc_retry.  Under the conditions I listed in the commit
> message, there is no path to the nopage label, because all the
> relevant "goto nopage" lines that would normally allow a GFP_KERNEL
> allocation to fail are inside a check for __GFP_FS.

Right. Please include that information in the changelog.

> Modifying the gfp_allowed_mask would not completely fix the issue, a
> GFP_NOIO allocation can meet the conditions outside of suspend.
> gfp_allowed_mask just makes the issue more likely, by converting
> GFP_KERNEL into GFP_NOIO.

Why would anyone want to combine __GFP_WAIT, __GFP_NOFAIL and
!__GFP_IO on purpose? What is it useful for?

As for your patch:

Acked-by: Pekka Enberg <penberg@kernel.org>

but I'd love to hear why we shouldn't also fix the suspend gfp mask to
clear __GFP_WAIT and add a WARN_ON_ONCE to your new code path.

                        Pekka

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-10-25  6:39 Colin Cross
  2011-10-25  7:40 ` Pekka Enberg
@ 2011-10-25  9:09 ` Mel Gorman
  2011-10-25  9:26   ` Colin Cross
                     ` (2 more replies)
  2011-10-25 22:10 ` David Rientjes
  2 siblings, 3 replies; 47+ messages in thread
From: Mel Gorman @ 2011-10-25  9:09 UTC (permalink / raw)
  To: Colin Cross
  Cc: linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki, Andrea Arcangeli,
	David Rientjes, linux-mm

On Mon, Oct 24, 2011 at 11:39:49PM -0700, Colin Cross wrote:
> Under the following conditions, __alloc_pages_slowpath can loop
> forever:
> gfp_mask & __GFP_WAIT is true
> gfp_mask & __GFP_FS is false
> reclaim and compaction make no progress
> order <= PAGE_ALLOC_COSTLY_ORDER
> 
> These conditions happen very often during suspend and resume,
> when pm_restrict_gfp_mask() effectively converts all GFP_KERNEL
> allocations into __GFP_WAIT.
b> 
> The oom killer is not run because gfp_mask & __GFP_FS is false,
> but should_alloc_retry will always return true when order is less
> than PAGE_ALLOC_COSTLY_ORDER.
> 
> Fix __alloc_pages_slowpath to skip retrying when oom killer is
> not allowed by the GFP flags, the same way it would skip if the
> oom killer was allowed but disabled.
> 
> Signed-off-by: Colin Cross <ccross@android.com>

Hi Colin,

Your patch functionally seems fine. I see the problem and we certainly
do not want to have the OOM killer firing during suspend. I would prefer
that the IO devices would not be suspended until reclaim was completed
but I imagine that would be a lot harder.

That said, it will be difficult to remember why checking __GFP_NOFAIL in
this case is necessary and someone might "optimitise" it away later. It
would be preferable if it was self-documenting. Maybe something like
this? (This is totally untested)

 mm/page_alloc.c |   22 ++++++++++++++++++++++
 1 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e8ecb6..ad8f376 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -127,6 +127,20 @@ void pm_restrict_gfp_mask(void)
 	saved_gfp_mask = gfp_allowed_mask;
 	gfp_allowed_mask &= ~GFP_IOFS;
 }
+
+static bool pm_suspending(void)
+{
+	if ((gfp_allowed_mask & GFP_IOFS) == GFP_IOFS)
+		return false;
+	return true;
+}
+
+#else
+
+static bool pm_suspending(void)
+{
+	return false;
+}
 #endif /* CONFIG_PM_SLEEP */
 
 #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
@@ -2207,6 +2221,14 @@ rebalance:
 
 			goto restart;
 		}
+
+		/*
+		 * Suspend converts GFP_KERNEL to __GFP_WAIT which can
+		 * prevent reclaim making forward progress without
+		 * invoking OOM. Bail if we are suspending
+		 */
+		if (pm_suspending())
+			goto nopage;
 	}
 
 	/* Check if we should retry the allocation */

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-10-25  9:09 ` Mel Gorman
@ 2011-10-25  9:26   ` Colin Cross
  2011-10-25 11:23     ` Mel Gorman
  2011-10-25 19:29   ` Colin Cross
  2011-10-25 22:18   ` David Rientjes
  2 siblings, 1 reply; 47+ messages in thread
From: Colin Cross @ 2011-10-25  9:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki, Andrea Arcangeli,
	David Rientjes, linux-mm

On Tue, Oct 25, 2011 at 2:09 AM, Mel Gorman <mgorman@suse.de> wrote:
> On Mon, Oct 24, 2011 at 11:39:49PM -0700, Colin Cross wrote:
>> Under the following conditions, __alloc_pages_slowpath can loop
>> forever:
>> gfp_mask & __GFP_WAIT is true
>> gfp_mask & __GFP_FS is false
>> reclaim and compaction make no progress
>> order <= PAGE_ALLOC_COSTLY_ORDER
>>
>> These conditions happen very often during suspend and resume,
>> when pm_restrict_gfp_mask() effectively converts all GFP_KERNEL
>> allocations into __GFP_WAIT.
> b>
>> The oom killer is not run because gfp_mask & __GFP_FS is false,
>> but should_alloc_retry will always return true when order is less
>> than PAGE_ALLOC_COSTLY_ORDER.
>>
>> Fix __alloc_pages_slowpath to skip retrying when oom killer is
>> not allowed by the GFP flags, the same way it would skip if the
>> oom killer was allowed but disabled.
>>
>> Signed-off-by: Colin Cross <ccross@android.com>
>
> Hi Colin,
>
> Your patch functionally seems fine. I see the problem and we certainly
> do not want to have the OOM killer firing during suspend. I would prefer
> that the IO devices would not be suspended until reclaim was completed
> but I imagine that would be a lot harder.
>
> That said, it will be difficult to remember why checking __GFP_NOFAIL in
> this case is necessary and someone might "optimitise" it away later. It
> would be preferable if it was self-documenting. Maybe something like
> this? (This is totally untested)

This issue is not limited to suspend, any GFP_NOIO allocation could
end up in the same loop.  Suspend is the most likely case, because it
effectively converts all GFP_KERNEL allocations into GFP_NOIO.

>  mm/page_alloc.c |   22 ++++++++++++++++++++++
>  1 files changed, 22 insertions(+), 0 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6e8ecb6..ad8f376 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -127,6 +127,20 @@ void pm_restrict_gfp_mask(void)
>        saved_gfp_mask = gfp_allowed_mask;
>        gfp_allowed_mask &= ~GFP_IOFS;
>  }
> +
> +static bool pm_suspending(void)
> +{
> +       if ((gfp_allowed_mask & GFP_IOFS) == GFP_IOFS)
> +               return false;
> +       return true;
> +}
> +
> +#else
> +
> +static bool pm_suspending(void)
> +{
> +       return false;
> +}
>  #endif /* CONFIG_PM_SLEEP */
>
>  #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
> @@ -2207,6 +2221,14 @@ rebalance:
>
>                        goto restart;
>                }
> +
> +               /*
> +                * Suspend converts GFP_KERNEL to __GFP_WAIT which can
> +                * prevent reclaim making forward progress without
> +                * invoking OOM. Bail if we are suspending
> +                */
> +               if (pm_suspending())
> +                       goto nopage;
>        }
>
>        /* Check if we should retry the allocation */
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-10-25  9:26   ` Colin Cross
@ 2011-10-25 11:23     ` Mel Gorman
  2011-10-25 17:08       ` Colin Cross
  2011-10-25 19:39       ` Pekka Enberg
  0 siblings, 2 replies; 47+ messages in thread
From: Mel Gorman @ 2011-10-25 11:23 UTC (permalink / raw)
  To: Colin Cross
  Cc: linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki, Andrea Arcangeli,
	David Rientjes, linux-mm

On Tue, Oct 25, 2011 at 02:26:56AM -0700, Colin Cross wrote:
> On Tue, Oct 25, 2011 at 2:09 AM, Mel Gorman <mgorman@suse.de> wrote:
> > On Mon, Oct 24, 2011 at 11:39:49PM -0700, Colin Cross wrote:
> >> Under the following conditions, __alloc_pages_slowpath can loop
> >> forever:
> >> gfp_mask & __GFP_WAIT is true
> >> gfp_mask & __GFP_FS is false
> >> reclaim and compaction make no progress
> >> order <= PAGE_ALLOC_COSTLY_ORDER
> >>
> >> These conditions happen very often during suspend and resume,
> >> when pm_restrict_gfp_mask() effectively converts all GFP_KERNEL
> >> allocations into __GFP_WAIT.
> > b>
> >> The oom killer is not run because gfp_mask & __GFP_FS is false,
> >> but should_alloc_retry will always return true when order is less
> >> than PAGE_ALLOC_COSTLY_ORDER.
> >>
> >> Fix __alloc_pages_slowpath to skip retrying when oom killer is
> >> not allowed by the GFP flags, the same way it would skip if the
> >> oom killer was allowed but disabled.
> >>
> >> Signed-off-by: Colin Cross <ccross@android.com>
> >
> > Hi Colin,
> >
> > Your patch functionally seems fine. I see the problem and we certainly
> > do not want to have the OOM killer firing during suspend. I would prefer
> > that the IO devices would not be suspended until reclaim was completed
> > but I imagine that would be a lot harder.
> >
> > That said, it will be difficult to remember why checking __GFP_NOFAIL in
> > this case is necessary and someone might "optimitise" it away later. It
> > would be preferable if it was self-documenting. Maybe something like
> > this? (This is totally untested)
> 
> This issue is not limited to suspend, any GFP_NOIO allocation could
> end up in the same loop.  Suspend is the most likely case, because it
> effectively converts all GFP_KERNEL allocations into GFP_NOIO.
> 

I see what you mean with GFP_NOIO but there is an important difference
between GFP_NOIO and suspend.  A GFP_NOIO low-order allocation currently
implies __GFP_NOFAIL as commented on in should_alloc_retry(). If no progress
is made, we call wait_iff_congested() and sleep for a bit. As the system
is running, kswapd and other process activity will proceed and eventually
reclaim enough pages for the GFP_NOIO allocation to succeed. In a running
system, GFP_NOIO can stall for a period of time but your patch will cause
the allocation to fail. While I expect callers return ENOMEM or handle
the situation properly with a wait-and-retry loop, there will be
operations that fail that used to succeed. This is why I'd prefer it was
a suspend-specific fix unless we know there is a case where a machine
livelocks due to a GFP_NOIO allocation looping forever and even then I'd
wonder why kswapd was not helping.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-10-25 11:23     ` Mel Gorman
@ 2011-10-25 17:08       ` Colin Cross
  2011-11-01 12:28         ` Mel Gorman
  2011-10-25 19:39       ` Pekka Enberg
  1 sibling, 1 reply; 47+ messages in thread
From: Colin Cross @ 2011-10-25 17:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki, Andrea Arcangeli,
	David Rientjes, linux-mm

On Tue, Oct 25, 2011 at 4:23 AM, Mel Gorman <mgorman@suse.de> wrote:
> On Tue, Oct 25, 2011 at 02:26:56AM -0700, Colin Cross wrote:
>> On Tue, Oct 25, 2011 at 2:09 AM, Mel Gorman <mgorman@suse.de> wrote:
>> > On Mon, Oct 24, 2011 at 11:39:49PM -0700, Colin Cross wrote:
>> >> Under the following conditions, __alloc_pages_slowpath can loop
>> >> forever:
>> >> gfp_mask & __GFP_WAIT is true
>> >> gfp_mask & __GFP_FS is false
>> >> reclaim and compaction make no progress
>> >> order <= PAGE_ALLOC_COSTLY_ORDER
>> >>
>> >> These conditions happen very often during suspend and resume,
>> >> when pm_restrict_gfp_mask() effectively converts all GFP_KERNEL
>> >> allocations into __GFP_WAIT.
>> > b>
>> >> The oom killer is not run because gfp_mask & __GFP_FS is false,
>> >> but should_alloc_retry will always return true when order is less
>> >> than PAGE_ALLOC_COSTLY_ORDER.
>> >>
>> >> Fix __alloc_pages_slowpath to skip retrying when oom killer is
>> >> not allowed by the GFP flags, the same way it would skip if the
>> >> oom killer was allowed but disabled.
>> >>
>> >> Signed-off-by: Colin Cross <ccross@android.com>
>> >
>> > Hi Colin,
>> >
>> > Your patch functionally seems fine. I see the problem and we certainly
>> > do not want to have the OOM killer firing during suspend. I would prefer
>> > that the IO devices would not be suspended until reclaim was completed
>> > but I imagine that would be a lot harder.
>> >
>> > That said, it will be difficult to remember why checking __GFP_NOFAIL in
>> > this case is necessary and someone might "optimitise" it away later. It
>> > would be preferable if it was self-documenting. Maybe something like
>> > this? (This is totally untested)
>>
>> This issue is not limited to suspend, any GFP_NOIO allocation could
>> end up in the same loop.  Suspend is the most likely case, because it
>> effectively converts all GFP_KERNEL allocations into GFP_NOIO.
>>
>
> I see what you mean with GFP_NOIO but there is an important difference
> between GFP_NOIO and suspend.  A GFP_NOIO low-order allocation currently
> implies __GFP_NOFAIL as commented on in should_alloc_retry(). If no progress
> is made, we call wait_iff_congested() and sleep for a bit. As the system
> is running, kswapd and other process activity will proceed and eventually
> reclaim enough pages for the GFP_NOIO allocation to succeed. In a running
> system, GFP_NOIO can stall for a period of time but your patch will cause
> the allocation to fail. While I expect callers return ENOMEM or handle
> the situation properly with a wait-and-retry loop, there will be
> operations that fail that used to succeed. This is why I'd prefer it was
> a suspend-specific fix unless we know there is a case where a machine
> livelocks due to a GFP_NOIO allocation looping forever and even then I'd
> wonder why kswapd was not helping.

OK, I see the change in behavior you are trying to avoid.  With your
patch GFP_NOIO allocations can still fail during suspend, is that OK?
I'm also worried about GFP_NOIO allocations looping forever when swap
is not enabled, but I've never seen it happen, and it would probably
recover eventually when another tried tried a GFP_KERNEL allocation
and oom killed something.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-10-25  9:09 ` Mel Gorman
  2011-10-25  9:26   ` Colin Cross
@ 2011-10-25 19:29   ` Colin Cross
  2011-10-25 22:18   ` David Rientjes
  2 siblings, 0 replies; 47+ messages in thread
From: Colin Cross @ 2011-10-25 19:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki, Andrea Arcangeli,
	David Rientjes, linux-mm

On Tue, Oct 25, 2011 at 2:09 AM, Mel Gorman <mgorman@suse.de> wrote:
> On Mon, Oct 24, 2011 at 11:39:49PM -0700, Colin Cross wrote:
>> Under the following conditions, __alloc_pages_slowpath can loop
>> forever:
>> gfp_mask & __GFP_WAIT is true
>> gfp_mask & __GFP_FS is false
>> reclaim and compaction make no progress
>> order <= PAGE_ALLOC_COSTLY_ORDER
>>
>> These conditions happen very often during suspend and resume,
>> when pm_restrict_gfp_mask() effectively converts all GFP_KERNEL
>> allocations into __GFP_WAIT.
> b>
>> The oom killer is not run because gfp_mask & __GFP_FS is false,
>> but should_alloc_retry will always return true when order is less
>> than PAGE_ALLOC_COSTLY_ORDER.
>>
>> Fix __alloc_pages_slowpath to skip retrying when oom killer is
>> not allowed by the GFP flags, the same way it would skip if the
>> oom killer was allowed but disabled.
>>
>> Signed-off-by: Colin Cross <ccross@android.com>
>
> Hi Colin,
>
> Your patch functionally seems fine. I see the problem and we certainly
> do not want to have the OOM killer firing during suspend. I would prefer
> that the IO devices would not be suspended until reclaim was completed
> but I imagine that would be a lot harder.
>
> That said, it will be difficult to remember why checking __GFP_NOFAIL in
> this case is necessary and someone might "optimitise" it away later. It
> would be preferable if it was self-documenting. Maybe something like
> this? (This is totally untested)
>
>  mm/page_alloc.c |   22 ++++++++++++++++++++++
>  1 files changed, 22 insertions(+), 0 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6e8ecb6..ad8f376 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -127,6 +127,20 @@ void pm_restrict_gfp_mask(void)
>        saved_gfp_mask = gfp_allowed_mask;
>        gfp_allowed_mask &= ~GFP_IOFS;
>  }
> +
> +static bool pm_suspending(void)
> +{
> +       if ((gfp_allowed_mask & GFP_IOFS) == GFP_IOFS)
> +               return false;
> +       return true;
> +}
> +
> +#else
> +
> +static bool pm_suspending(void)
> +{
> +       return false;
> +}
>  #endif /* CONFIG_PM_SLEEP */
>
>  #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
> @@ -2207,6 +2221,14 @@ rebalance:
>
>                        goto restart;
>                }
> +
> +               /*
> +                * Suspend converts GFP_KERNEL to __GFP_WAIT which can
> +                * prevent reclaim making forward progress without
> +                * invoking OOM. Bail if we are suspending
> +                */
> +               if (pm_suspending())
> +                       goto nopage;
>        }
>
>        /* Check if we should retry the allocation */
>

Your patch solves my immediate problem.
Tested-by: Colin Cross <ccross@android.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-10-25 11:23     ` Mel Gorman
  2011-10-25 17:08       ` Colin Cross
@ 2011-10-25 19:39       ` Pekka Enberg
  2011-11-01 12:29         ` Mel Gorman
  1 sibling, 1 reply; 47+ messages in thread
From: Pekka Enberg @ 2011-10-25 19:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Colin Cross, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, David Rientjes, linux-mm

Hi Mel,

On Tue, Oct 25, 2011 at 2:23 PM, Mel Gorman <mgorman@suse.de> wrote:
> I see what you mean with GFP_NOIO but there is an important difference
> between GFP_NOIO and suspend.  A GFP_NOIO low-order allocation currently
> implies __GFP_NOFAIL as commented on in should_alloc_retry(). If no progress
> is made, we call wait_iff_congested() and sleep for a bit. As the system
> is running, kswapd and other process activity will proceed and eventually
> reclaim enough pages for the GFP_NOIO allocation to succeed. In a running
> system, GFP_NOIO can stall for a period of time but your patch will cause
> the allocation to fail. While I expect callers return ENOMEM or handle
> the situation properly with a wait-and-retry loop, there will be
> operations that fail that used to succeed. This is why I'd prefer it was
> a suspend-specific fix unless we know there is a case where a machine
> livelocks due to a GFP_NOIO allocation looping forever and even then I'd
> wonder why kswapd was not helping.

I'm not that happy about your patch because it's going to the
direction where the page allocator is special-casing for suspension.
If you don't think it's a good idea to fix it for the general case
(i.e. Colin's patch), why don't we fix it up in a way that suspension
code passes sane GFP flags?

                        Pekka

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-10-25  6:39 Colin Cross
  2011-10-25  7:40 ` Pekka Enberg
  2011-10-25  9:09 ` Mel Gorman
@ 2011-10-25 22:10 ` David Rientjes
  2 siblings, 0 replies; 47+ messages in thread
From: David Rientjes @ 2011-10-25 22:10 UTC (permalink / raw)
  To: Colin Cross
  Cc: linux-kernel, Andrew Morton, Mel Gorman, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, linux-mm

On Mon, 24 Oct 2011, Colin Cross wrote:

> Under the following conditions, __alloc_pages_slowpath can loop
> forever:
> gfp_mask & __GFP_WAIT is true
> gfp_mask & __GFP_FS is false
> reclaim and compaction make no progress
> order <= PAGE_ALLOC_COSTLY_ORDER
> 

The oom killer is only called for __GFP_FS because we want to ensure that 
we don't inadvertently kill something if we didn't have a chance to at 
least make a good effort at direct reclaim.  There's a very high liklihood 
that direct reclaim would succeed with __GFP_FS, so we loop endlessly 
waiting for either kswapd to reclaim in the background even though it 
might not be able to because of filesystem locks or another allocation 
happens in a context that allows reclaim to succeed or oom killing.

For low-order allocations (those at or below PAGE_ALLOC_COSTLY_ORDER) 
where fragmentation isn't a huge issue, __GFP_WAIT && !__GFP_FS && 
!did_some_progress makes sense.

> These conditions happen very often during suspend and resume,
> when pm_restrict_gfp_mask() effectively converts all GFP_KERNEL
> allocations into __GFP_WAIT.
> 

This is the problem.  All allocations now have no chance of ever having 
direct reclaim succeed nor the oom killer called.  It seems like you would 
want pm_restrict_gfp_mask() to also include __GFP_NORETRY and ensure it 
can never be called for __GFP_NOFAIL.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index fef8dc3..dcd99b3 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2193,6 +2193,10 @@ rebalance:
>  			}
>  
>  			goto restart;
> +		} else {
> +			/* If we aren't going to try the OOM killer, give up */
> +			if (!(gfp_mask & __GFP_NOFAIL))
> +				goto nopage;
>  		}
>  	}
>  

Nack on this, it is going to cause many very verbose allocation failures 
(if !__GFP_NOWARN) when not using suspend because we're not in a context 
where we can do sensible reclaim or compaction and presently kswapd can 
either reclaim or another allocation will allow low-order amounts of 
memory to be reclaimed or the oom killer to free some memory.  It would 
introduce a regression into page allocation.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-10-25  7:51   ` Colin Cross
  2011-10-25  8:08     ` Pekka Enberg
@ 2011-10-25 22:12     ` David Rientjes
  1 sibling, 0 replies; 47+ messages in thread
From: David Rientjes @ 2011-10-25 22:12 UTC (permalink / raw)
  To: Colin Cross
  Cc: Pekka Enberg, linux-kernel, Andrew Morton, Mel Gorman,
	KAMEZAWA Hiroyuki, Andrea Arcangeli, linux-mm

On Tue, 25 Oct 2011, Colin Cross wrote:

> GFP_KERNEL is __GFP_WAIT | __GFP_IO | __GFP_FS.  Once driver suspend
> has started, gfp_allowed_mask is ~(__GFP_IO | GFP_FS), so any call to
> __alloc_pages_nodemask(GFP_KERNEL, ...) gets masked to effectively
> __alloc_pages_nodemask(__GFP_WAIT, ...).
> 

Just passing __GFP_WAIT is the problem that you're trying to address, 
though.  Why not include __GFP_NORETRY since you know the liklihood of 
allocation being successful on the second iteration is very slim since 
you're not in a context where you can force reclaim or oom killing?

> The loop is in __alloc_pages_slowpath, from the rebalance label to
> should_alloc_retry.

The loop is by design and is activated because you're just passing 
__GFP_WAIT in this context for no sensible reason.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-10-25  9:09 ` Mel Gorman
  2011-10-25  9:26   ` Colin Cross
  2011-10-25 19:29   ` Colin Cross
@ 2011-10-25 22:18   ` David Rientjes
  2011-10-26  1:46     ` Colin Cross
  2 siblings, 1 reply; 47+ messages in thread
From: David Rientjes @ 2011-10-25 22:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Colin Cross, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, linux-mm

On Tue, 25 Oct 2011, Mel Gorman wrote:

> That said, it will be difficult to remember why checking __GFP_NOFAIL in
> this case is necessary and someone might "optimitise" it away later. It
> would be preferable if it was self-documenting. Maybe something like
> this? (This is totally untested)
> 

__GFP_NOFAIL _should_ be optimized away in this case because all he's 
passing is __GFP_WAIT | __GFP_NOFAIL.  That doesn't make any sense unless 
all you want to do is livelock.

__GFP_NOFAIL doesn't mean the page allocator would infinitely loop in all 
conditions.  That's why GFP_ATOMIC | __GFP_NOFAIL actually fails, and I 
would argue that __GFP_WAIT | __GFP_NOFAIL should fail as well since it's 
the exact same condition except doesn't have access to the extra memory 
reserves.

Suspend needs to either set __GFP_NORETRY to avoid the livelock if it's 
going to disable all means of memory reclaiming or freeing in the page 
allocator.  Or, better yet, just make it GFP_NOWAIT.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-10-25 22:18   ` David Rientjes
@ 2011-10-26  1:46     ` Colin Cross
  2011-10-26  5:47       ` David Rientjes
  0 siblings, 1 reply; 47+ messages in thread
From: Colin Cross @ 2011-10-26  1:46 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mel Gorman, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, linux-mm

On Tue, Oct 25, 2011 at 3:18 PM, David Rientjes <rientjes@google.com> wrote:
> On Tue, 25 Oct 2011, Mel Gorman wrote:
>
>> That said, it will be difficult to remember why checking __GFP_NOFAIL in
>> this case is necessary and someone might "optimitise" it away later. It
>> would be preferable if it was self-documenting. Maybe something like
>> this? (This is totally untested)
>>
>
> __GFP_NOFAIL _should_ be optimized away in this case because all he's
> passing is __GFP_WAIT | __GFP_NOFAIL.  That doesn't make any sense unless
> all you want to do is livelock.

__GFP_NOFAIL is not set in the case that I care about.  If my change
is hit, no forward progress has been made, so I agree it should not
honor __GFP_NOFAIL.

> __GFP_NOFAIL doesn't mean the page allocator would infinitely loop in all
> conditions.  That's why GFP_ATOMIC | __GFP_NOFAIL actually fails, and I
> would argue that __GFP_WAIT | __GFP_NOFAIL should fail as well since it's
> the exact same condition except doesn't have access to the extra memory
> reserves.
>
> Suspend needs to either set __GFP_NORETRY to avoid the livelock if it's
> going to disable all means of memory reclaiming or freeing in the page
> allocator.  Or, better yet, just make it GFP_NOWAIT.
>

It would be nice to give compaction and the slab shrinker a chance to
recover a few pages, both methods will work fine in suspend.
GFP_NOWAIT will prevent them from ever running, and __GFP_NORETRY will
give up even if they are making progress but haven't recovered enough
pages.

Converting suspend to GFP_NOWAIT would simply be ~GFP_KERNEL instead
of ~GFP_IOFS in pm_restrict_gfp_mask().

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-10-26  1:46     ` Colin Cross
@ 2011-10-26  5:47       ` David Rientjes
  2011-10-26  6:12         ` David Rientjes
  0 siblings, 1 reply; 47+ messages in thread
From: David Rientjes @ 2011-10-26  5:47 UTC (permalink / raw)
  To: Colin Cross
  Cc: Mel Gorman, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, linux-mm

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2063 bytes --]

On Tue, 25 Oct 2011, Colin Cross wrote:

> > On Tue, 25 Oct 2011, Mel Gorman wrote:
> >
> >> That said, it will be difficult to remember why checking __GFP_NOFAIL in
> >> this case is necessary and someone might "optimitise" it away later. It
> >> would be preferable if it was self-documenting. Maybe something like
> >> this? (This is totally untested)
> >>
> >
> > __GFP_NOFAIL _should_ be optimized away in this case because all he's
> > passing is __GFP_WAIT | __GFP_NOFAIL.  That doesn't make any sense unless
> > all you want to do is livelock.
> 
> __GFP_NOFAIL is not set in the case that I care about.  If my change
> is hit, no forward progress has been made, so I agree it should not
> honor __GFP_NOFAIL.
> 

I was responding to Mel's comment, not your case.

> > __GFP_NOFAIL doesn't mean the page allocator would infinitely loop in all
> > conditions.  That's why GFP_ATOMIC | __GFP_NOFAIL actually fails, and I
> > would argue that __GFP_WAIT | __GFP_NOFAIL should fail as well since it's
> > the exact same condition except doesn't have access to the extra memory
> > reserves.
> >
> > Suspend needs to either set __GFP_NORETRY to avoid the livelock if it's
> > going to disable all means of memory reclaiming or freeing in the page
> > allocator.  Or, better yet, just make it GFP_NOWAIT.
> >
> 
> It would be nice to give compaction and the slab shrinker a chance to
> recover a few pages, both methods will work fine in suspend.

Ok, so __GFP_NORETRY it is.  Just make sure that when 
pm_restrict_gfp_mask() masks off __GFP_IO and __GFP_FS that it also sets 
__GFP_NORETRY even though the name of the function no longer seems 
appropriate anymore.

> GFP_NOWAIT will prevent them from ever running, and __GFP_NORETRY will
> give up even if they are making progress but haven't recovered enough
> pages.
> 

These are all order-3 or smaller allocations where fragmentation isn't a 
big issue.  If a call into direct compaction or reclaim fails to reclaim 
that small amount of contiguous memory, what makes you believe that a 
second call will?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-10-26  5:47       ` David Rientjes
@ 2011-10-26  6:12         ` David Rientjes
  2011-10-26  6:16           ` Colin Cross
  0 siblings, 1 reply; 47+ messages in thread
From: David Rientjes @ 2011-10-26  6:12 UTC (permalink / raw)
  To: Colin Cross
  Cc: Mel Gorman, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, linux-mm

On Tue, 25 Oct 2011, David Rientjes wrote:

> Ok, so __GFP_NORETRY it is.  Just make sure that when 
> pm_restrict_gfp_mask() masks off __GFP_IO and __GFP_FS that it also sets 
> __GFP_NORETRY even though the name of the function no longer seems 
> appropriate anymore.
> 

Or, rather, when pm_restrict_gfp_mask() clears __GFP_IO and __GFP_FS that 
it also has the same behavior as __GFP_NORETRY in should_alloc_retry() by 
setting a variable in file scope.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-10-26  6:12         ` David Rientjes
@ 2011-10-26  6:16           ` Colin Cross
  2011-10-26  6:24             ` David Rientjes
  0 siblings, 1 reply; 47+ messages in thread
From: Colin Cross @ 2011-10-26  6:16 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mel Gorman, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, linux-mm

On Tue, Oct 25, 2011 at 11:12 PM, David Rientjes <rientjes@google.com> wrote:
> On Tue, 25 Oct 2011, David Rientjes wrote:
>
>> Ok, so __GFP_NORETRY it is.  Just make sure that when
>> pm_restrict_gfp_mask() masks off __GFP_IO and __GFP_FS that it also sets
>> __GFP_NORETRY even though the name of the function no longer seems
>> appropriate anymore.
>>
>
> Or, rather, when pm_restrict_gfp_mask() clears __GFP_IO and __GFP_FS that
> it also has the same behavior as __GFP_NORETRY in should_alloc_retry() by
> setting a variable in file scope.
>

Why do you prefer that over adding a gfp_required_mask?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-10-26  6:16           ` Colin Cross
@ 2011-10-26  6:24             ` David Rientjes
  2011-10-26  6:26               ` Colin Cross
  0 siblings, 1 reply; 47+ messages in thread
From: David Rientjes @ 2011-10-26  6:24 UTC (permalink / raw)
  To: Colin Cross
  Cc: Mel Gorman, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, linux-mm

On Tue, 25 Oct 2011, Colin Cross wrote:

> > Or, rather, when pm_restrict_gfp_mask() clears __GFP_IO and __GFP_FS that
> > it also has the same behavior as __GFP_NORETRY in should_alloc_retry() by
> > setting a variable in file scope.
> >
> 
> Why do you prefer that over adding a gfp_required_mask?
> 

Because it avoids an unnecessary OR in the page and slab allocator 
fastpaths which are red hot :)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-10-26  6:24             ` David Rientjes
@ 2011-10-26  6:26               ` Colin Cross
  2011-10-26  6:33                 ` David Rientjes
  0 siblings, 1 reply; 47+ messages in thread
From: Colin Cross @ 2011-10-26  6:26 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mel Gorman, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, linux-mm

On Tue, Oct 25, 2011 at 11:24 PM, David Rientjes <rientjes@google.com> wrote:
> On Tue, 25 Oct 2011, Colin Cross wrote:
>
>> > Or, rather, when pm_restrict_gfp_mask() clears __GFP_IO and __GFP_FS that
>> > it also has the same behavior as __GFP_NORETRY in should_alloc_retry() by
>> > setting a variable in file scope.
>> >
>>
>> Why do you prefer that over adding a gfp_required_mask?
>>
>
> Because it avoids an unnecessary OR in the page and slab allocator
> fastpaths which are red hot :)
>

Makes sense.  What about this?  Official patch to follow.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fef8dc3..59cd4ff 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1786,6 +1786,13 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
                return 0;

        /*
+        * If PM has disabled I/O, OOM is disabled and reclaim is unlikely
+        * to make any progress.  To prevent a livelock, don't retry.
+        */
+       if (!(gfp_allowed_mask & __GFP_FS))
+               return 0;
+
+       /*
         * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
         * means __GFP_NOFAIL, but that may not be true in other
         * implementations.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-10-26  6:26               ` Colin Cross
@ 2011-10-26  6:33                 ` David Rientjes
  2011-10-26  6:36                   ` Colin Cross
  0 siblings, 1 reply; 47+ messages in thread
From: David Rientjes @ 2011-10-26  6:33 UTC (permalink / raw)
  To: Colin Cross
  Cc: Mel Gorman, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, linux-mm

On Tue, 25 Oct 2011, Colin Cross wrote:

> Makes sense.  What about this?  Official patch to follow.
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index fef8dc3..59cd4ff 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1786,6 +1786,13 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
>                 return 0;
> 
>         /*
> +        * If PM has disabled I/O, OOM is disabled and reclaim is unlikely
> +        * to make any progress.  To prevent a livelock, don't retry.
> +        */
> +       if (!(gfp_allowed_mask & __GFP_FS))
> +               return 0;
> +
> +       /*
>          * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
>          * means __GFP_NOFAIL, but that may not be true in other
>          * implementations.

Eek, this is precisely what we don't want and is functionally the same as 
what you initially proposed except it doesn't care about __GFP_NOFAIL.

You're trying to address a suspend issue where nothing on the system can 
logically make progress because __GFP_FS seriously restricts the ability 
of reclaim to do anything useful if it doesn't succeed the first time and 
kswapd isn't effective.  That's why I suggested a hook into 
pm_restrict_gfp_mask() to set a variable and then treat it exactly as 
__GFP_NORETRY in should_alloc_retry().

Consider if nobody is using suspend and they are allocating with GFP_NOFS.  
There's potentially a lot of candidates:

	$ grep -r GFP_NOFS * | wc -l
	1016

and now we've just introduced a regression where the allocation would 
eventually succeed because of either kswapd, a backing device that is no 
longer congested, or an allocation on another cpu in a context where 
direct reclaim can be more aggressive or the oom killer can at least free 
some memory.

So you definitely want to localize your change to only suspend and 
pm_restrict_gfp_mask() is a very easy way to do it.  So I'd suggest adding 
a static bool that can be tested in should_alloc_retry() and identify such 
situations and tag it as __read_mostly.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-10-26  6:33                 ` David Rientjes
@ 2011-10-26  6:36                   ` Colin Cross
  2011-10-26  6:51                     ` David Rientjes
  0 siblings, 1 reply; 47+ messages in thread
From: Colin Cross @ 2011-10-26  6:36 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mel Gorman, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, linux-mm

On Tue, Oct 25, 2011 at 11:33 PM, David Rientjes <rientjes@google.com> wrote:
> On Tue, 25 Oct 2011, Colin Cross wrote:
>
>> Makes sense.  What about this?  Official patch to follow.
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index fef8dc3..59cd4ff 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1786,6 +1786,13 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
>>                 return 0;
>>
>>         /*
>> +        * If PM has disabled I/O, OOM is disabled and reclaim is unlikely
>> +        * to make any progress.  To prevent a livelock, don't retry.
>> +        */
>> +       if (!(gfp_allowed_mask & __GFP_FS))
>> +               return 0;
>> +
>> +       /*
>>          * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
>>          * means __GFP_NOFAIL, but that may not be true in other
>>          * implementations.
>
> Eek, this is precisely what we don't want and is functionally the same as
> what you initially proposed except it doesn't care about __GFP_NOFAIL.

This is checking against gfp_allowed_mask, not gfp_mask.

> You're trying to address a suspend issue where nothing on the system can
> logically make progress because __GFP_FS seriously restricts the ability
> of reclaim to do anything useful if it doesn't succeed the first time and
> kswapd isn't effective.  That's why I suggested a hook into
> pm_restrict_gfp_mask() to set a variable and then treat it exactly as
> __GFP_NORETRY in should_alloc_retry().
>
> Consider if nobody is using suspend and they are allocating with GFP_NOFS.
> There's potentially a lot of candidates:
>
>        $ grep -r GFP_NOFS * | wc -l
>        1016
>
> and now we've just introduced a regression where the allocation would
> eventually succeed because of either kswapd, a backing device that is no
> longer congested, or an allocation on another cpu in a context where
> direct reclaim can be more aggressive or the oom killer can at least free
> some memory.
>
> So you definitely want to localize your change to only suspend and
> pm_restrict_gfp_mask() is a very easy way to do it.  So I'd suggest adding
> a static bool that can be tested in should_alloc_retry() and identify such
> situations and tag it as __read_mostly.
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-10-26  6:36                   ` Colin Cross
@ 2011-10-26  6:51                     ` David Rientjes
  2011-10-26  6:57                       ` Colin Cross
  0 siblings, 1 reply; 47+ messages in thread
From: David Rientjes @ 2011-10-26  6:51 UTC (permalink / raw)
  To: Colin Cross
  Cc: Mel Gorman, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, linux-mm

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1249 bytes --]

On Tue, 25 Oct 2011, Colin Cross wrote:

> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index fef8dc3..59cd4ff 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -1786,6 +1786,13 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> >>                 return 0;
> >>
> >>         /*
> >> +        * If PM has disabled I/O, OOM is disabled and reclaim is unlikely
> >> +        * to make any progress.  To prevent a livelock, don't retry.
> >> +        */
> >> +       if (!(gfp_allowed_mask & __GFP_FS))
> >> +               return 0;
> >> +
> >> +       /*
> >>          * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
> >>          * means __GFP_NOFAIL, but that may not be true in other
> >>          * implementations.
> >
> > Eek, this is precisely what we don't want and is functionally the same as
> > what you initially proposed except it doesn't care about __GFP_NOFAIL.
> 
> This is checking against gfp_allowed_mask, not gfp_mask.
> 

gfp_allowed_mask is initialized to GFP_BOOT_MASK to start so that __GFP_FS 
is never allowed before the slab allocator is completely initialized, so 
you've now implicitly made all early boot allocations to be __GFP_NORETRY 
even though they may not pass it.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-10-26  6:51                     ` David Rientjes
@ 2011-10-26  6:57                       ` Colin Cross
  2011-10-26  7:10                         ` David Rientjes
  0 siblings, 1 reply; 47+ messages in thread
From: Colin Cross @ 2011-10-26  6:57 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mel Gorman, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, linux-mm

On Tue, Oct 25, 2011 at 11:51 PM, David Rientjes <rientjes@google.com> wrote:
> On Tue, 25 Oct 2011, Colin Cross wrote:
>
>> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> >> index fef8dc3..59cd4ff 100644
>> >> --- a/mm/page_alloc.c
>> >> +++ b/mm/page_alloc.c
>> >> @@ -1786,6 +1786,13 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
>> >>                 return 0;
>> >>
>> >>         /*
>> >> +        * If PM has disabled I/O, OOM is disabled and reclaim is unlikely
>> >> +        * to make any progress.  To prevent a livelock, don't retry.
>> >> +        */
>> >> +       if (!(gfp_allowed_mask & __GFP_FS))
>> >> +               return 0;
>> >> +
>> >> +       /*
>> >>          * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
>> >>          * means __GFP_NOFAIL, but that may not be true in other
>> >>          * implementations.
>> >
>> > Eek, this is precisely what we don't want and is functionally the same as
>> > what you initially proposed except it doesn't care about __GFP_NOFAIL.
>>
>> This is checking against gfp_allowed_mask, not gfp_mask.
>>
>
> gfp_allowed_mask is initialized to GFP_BOOT_MASK to start so that __GFP_FS
> is never allowed before the slab allocator is completely initialized, so
> you've now implicitly made all early boot allocations to be __GFP_NORETRY
> even though they may not pass it.

Only before interrupts are enabled, and then isn't it vulnerable to
the same livelock?  Interrupts are off, single cpu, kswapd can't run.
If an allocation ever failed, which seems unlikely, why would retrying
help?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-10-26  6:57                       ` Colin Cross
@ 2011-10-26  7:10                         ` David Rientjes
  2011-10-26  7:22                           ` Colin Cross
  0 siblings, 1 reply; 47+ messages in thread
From: David Rientjes @ 2011-10-26  7:10 UTC (permalink / raw)
  To: Colin Cross
  Cc: Mel Gorman, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, linux-mm

On Tue, 25 Oct 2011, Colin Cross wrote:

> > gfp_allowed_mask is initialized to GFP_BOOT_MASK to start so that __GFP_FS
> > is never allowed before the slab allocator is completely initialized, so
> > you've now implicitly made all early boot allocations to be __GFP_NORETRY
> > even though they may not pass it.
> 
> Only before interrupts are enabled, and then isn't it vulnerable to
> the same livelock?  Interrupts are off, single cpu, kswapd can't run.
> If an allocation ever failed, which seems unlikely, why would retrying
> help?
> 

If you want to claim gfp_allowed_mask as a pm-only entity, then I see no 
problem with this approach.  However, if gfp_allowed_mask would be allowed 
to temporarily change after init for another purpose then it would make 
sense to retry because another allocation with __GFP_FS on another cpu or 
kswapd could start making progress could allow for future memory freeing.

The suggestion to add a hook directly into a pm-interface was so that we 
could isolate it only to suspend and, to me, is the most maintainable 
solution.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-10-26  7:10                         ` David Rientjes
@ 2011-10-26  7:22                           ` Colin Cross
  2011-11-01 12:36                             ` Mel Gorman
  0 siblings, 1 reply; 47+ messages in thread
From: Colin Cross @ 2011-10-26  7:22 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mel Gorman, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, linux-mm

On Wed, Oct 26, 2011 at 12:10 AM, David Rientjes <rientjes@google.com> wrote:
> On Tue, 25 Oct 2011, Colin Cross wrote:
>
>> > gfp_allowed_mask is initialized to GFP_BOOT_MASK to start so that __GFP_FS
>> > is never allowed before the slab allocator is completely initialized, so
>> > you've now implicitly made all early boot allocations to be __GFP_NORETRY
>> > even though they may not pass it.
>>
>> Only before interrupts are enabled, and then isn't it vulnerable to
>> the same livelock?  Interrupts are off, single cpu, kswapd can't run.
>> If an allocation ever failed, which seems unlikely, why would retrying
>> help?
>>
>
> If you want to claim gfp_allowed_mask as a pm-only entity, then I see no
> problem with this approach.  However, if gfp_allowed_mask would be allowed
> to temporarily change after init for another purpose then it would make
> sense to retry because another allocation with __GFP_FS on another cpu or
> kswapd could start making progress could allow for future memory freeing.
>
> The suggestion to add a hook directly into a pm-interface was so that we
> could isolate it only to suspend and, to me, is the most maintainable
> solution.
>

pm_restrict_gfp_mask seems to claim gfp_allowed_mask as owned by pm at runtime:
"gfp_allowed_mask also should only be modified with pm_mutex held,
unless the suspend/hibernate code is guaranteed not to run in parallel
with that modification"

I think we've wrapped around to Mel's original patch, which adds a
pm_suspending() helper that is implemented next to
pm_restrict_gfp_mask.  His patch puts the check inside
!did_some_progress instead of should_alloc_retry, which I prefer as it
at least keeps trying until reclaim isn't working.  Pekka was trying
to avoid adding pm-specific checks into the allocator, which is why I
stuck to the symptom (__GFP_FS is clear) rather than the cause (PM).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-10-25 17:08       ` Colin Cross
@ 2011-11-01 12:28         ` Mel Gorman
  0 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2011-11-01 12:28 UTC (permalink / raw)
  To: Colin Cross
  Cc: linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki, Andrea Arcangeli,
	David Rientjes, linux-mm

On Tue, Oct 25, 2011 at 10:08:58AM -0700, Colin Cross wrote:
> On Tue, Oct 25, 2011 at 4:23 AM, Mel Gorman <mgorman@suse.de> wrote:
> > On Tue, Oct 25, 2011 at 02:26:56AM -0700, Colin Cross wrote:
> >> On Tue, Oct 25, 2011 at 2:09 AM, Mel Gorman <mgorman@suse.de> wrote:
> >> > On Mon, Oct 24, 2011 at 11:39:49PM -0700, Colin Cross wrote:
> >> >> Under the following conditions, __alloc_pages_slowpath can loop
> >> >> forever:
> >> >> gfp_mask & __GFP_WAIT is true
> >> >> gfp_mask & __GFP_FS is false
> >> >> reclaim and compaction make no progress
> >> >> order <= PAGE_ALLOC_COSTLY_ORDER
> >> >>
> >> >> These conditions happen very often during suspend and resume,
> >> >> when pm_restrict_gfp_mask() effectively converts all GFP_KERNEL
> >> >> allocations into __GFP_WAIT.
> >> > b>
> >> >> The oom killer is not run because gfp_mask & __GFP_FS is false,
> >> >> but should_alloc_retry will always return true when order is less
> >> >> than PAGE_ALLOC_COSTLY_ORDER.
> >> >>
> >> >> Fix __alloc_pages_slowpath to skip retrying when oom killer is
> >> >> not allowed by the GFP flags, the same way it would skip if the
> >> >> oom killer was allowed but disabled.
> >> >>
> >> >> Signed-off-by: Colin Cross <ccross@android.com>
> >> >
> >> > Hi Colin,
> >> >
> >> > Your patch functionally seems fine. I see the problem and we certainly
> >> > do not want to have the OOM killer firing during suspend. I would prefer
> >> > that the IO devices would not be suspended until reclaim was completed
> >> > but I imagine that would be a lot harder.
> >> >
> >> > That said, it will be difficult to remember why checking __GFP_NOFAIL in
> >> > this case is necessary and someone might "optimitise" it away later. It
> >> > would be preferable if it was self-documenting. Maybe something like
> >> > this? (This is totally untested)
> >>
> >> This issue is not limited to suspend, any GFP_NOIO allocation could
> >> end up in the same loop.  Suspend is the most likely case, because it
> >> effectively converts all GFP_KERNEL allocations into GFP_NOIO.
> >>
> >
> > I see what you mean with GFP_NOIO but there is an important difference
> > between GFP_NOIO and suspend.  A GFP_NOIO low-order allocation currently
> > implies __GFP_NOFAIL as commented on in should_alloc_retry(). If no progress
> > is made, we call wait_iff_congested() and sleep for a bit. As the system
> > is running, kswapd and other process activity will proceed and eventually
> > reclaim enough pages for the GFP_NOIO allocation to succeed. In a running
> > system, GFP_NOIO can stall for a period of time but your patch will cause
> > the allocation to fail. While I expect callers return ENOMEM or handle
> > the situation properly with a wait-and-retry loop, there will be
> > operations that fail that used to succeed. This is why I'd prefer it was
> > a suspend-specific fix unless we know there is a case where a machine
> > livelocks due to a GFP_NOIO allocation looping forever and even then I'd
> > wonder why kswapd was not helping.
> 
> OK, I see the change in behavior you are trying to avoid.  With your
> patch GFP_NOIO allocations can still fail during suspend, is that OK?

What is the alternative? We are not making forward progress. This might
mean that the suspend operation fails but I'm not seeing a better
alternative.

> I'm also worried about GFP_NOIO allocations looping forever when swap
> is not enabled, but I've never seen it happen, and it would probably
> recover eventually when another tried tried a GFP_KERNEL allocation
> and oom killed something.

Or when dirty pages backed by the filesystem are cleaned and reclaimed.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-10-25 19:39       ` Pekka Enberg
@ 2011-11-01 12:29         ` Mel Gorman
  0 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2011-11-01 12:29 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Colin Cross, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, David Rientjes, linux-mm

On Tue, Oct 25, 2011 at 10:39:34PM +0300, Pekka Enberg wrote:
> Hi Mel,
> 
> On Tue, Oct 25, 2011 at 2:23 PM, Mel Gorman <mgorman@suse.de> wrote:
> > I see what you mean with GFP_NOIO but there is an important difference
> > between GFP_NOIO and suspend.  A GFP_NOIO low-order allocation currently
> > implies __GFP_NOFAIL as commented on in should_alloc_retry(). If no progress
> > is made, we call wait_iff_congested() and sleep for a bit. As the system
> > is running, kswapd and other process activity will proceed and eventually
> > reclaim enough pages for the GFP_NOIO allocation to succeed. In a running
> > system, GFP_NOIO can stall for a period of time but your patch will cause
> > the allocation to fail. While I expect callers return ENOMEM or handle
> > the situation properly with a wait-and-retry loop, there will be
> > operations that fail that used to succeed. This is why I'd prefer it was
> > a suspend-specific fix unless we know there is a case where a machine
> > livelocks due to a GFP_NOIO allocation looping forever and even then I'd
> > wonder why kswapd was not helping.
> 
> I'm not that happy about your patch because it's going to the
> direction where the page allocator is special-casing for suspension.

Suspend really is a special case. While I'd prefer to avoid special
casing it like this, I prefer it a *lot* more than failing GFP_NOIO
allocations that used to succeed.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-10-26  7:22                           ` Colin Cross
@ 2011-11-01 12:36                             ` Mel Gorman
  0 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2011-11-01 12:36 UTC (permalink / raw)
  To: Colin Cross
  Cc: David Rientjes, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, linux-mm

On Wed, Oct 26, 2011 at 12:22:14AM -0700, Colin Cross wrote:
> On Wed, Oct 26, 2011 at 12:10 AM, David Rientjes <rientjes@google.com> wrote:
> > On Tue, 25 Oct 2011, Colin Cross wrote:
> >
> >> > gfp_allowed_mask is initialized to GFP_BOOT_MASK to start so that __GFP_FS
> >> > is never allowed before the slab allocator is completely initialized, so
> >> > you've now implicitly made all early boot allocations to be __GFP_NORETRY
> >> > even though they may not pass it.
> >>
> >> Only before interrupts are enabled, and then isn't it vulnerable to
> >> the same livelock?  Interrupts are off, single cpu, kswapd can't run.
> >> If an allocation ever failed, which seems unlikely, why would retrying
> >> help?
> >>
> >
> > If you want to claim gfp_allowed_mask as a pm-only entity, then I see no
> > problem with this approach.  However, if gfp_allowed_mask would be allowed
> > to temporarily change after init for another purpose then it would make
> > sense to retry because another allocation with __GFP_FS on another cpu or
> > kswapd could start making progress could allow for future memory freeing.
> >
> > The suggestion to add a hook directly into a pm-interface was so that we
> > could isolate it only to suspend and, to me, is the most maintainable
> > solution.
> >
> 
> pm_restrict_gfp_mask seems to claim gfp_allowed_mask as owned by pm at runtime:
> "gfp_allowed_mask also should only be modified with pm_mutex held,
> unless the suspend/hibernate code is guaranteed not to run in parallel
> with that modification"
> 
> I think we've wrapped around to Mel's original patch, which adds a
> pm_suspending() helper that is implemented next to
> pm_restrict_gfp_mask.  His patch puts the check inside
> !did_some_progress instead of should_alloc_retry, which I prefer as it
> at least keeps trying until reclaim isn't working.  Pekka was trying
> to avoid adding pm-specific checks into the allocator, which is why I
> stuck to the symptom (__GFP_FS is clear) rather than the cause (PM).
> 

Right now, I'm still no seeing a problem with the pm_suspending() check
as it's made for a corner-case situation in a very slow path that is
self-documenting. This thread has died somewhat and there is still no
fix merged. Is someone cooking up a patch they would prefer as an
alternative? If not, I'm going to resubmit the fix based on
pm_suspending.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH] mm: avoid livelock on !__GFP_FS allocations
@ 2011-11-14 14:04 Mel Gorman
  2011-11-14 18:38 ` Andrea Arcangeli
                   ` (3 more replies)
  0 siblings, 4 replies; 47+ messages in thread
From: Mel Gorman @ 2011-11-14 14:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Colin Cross, Pekka Enberg, KAMEZAWA Hiroyuki, Andrea Arcangeli,
	David Rientjes, LKML, Linux-MM

This patch seems to have gotten lost in the cracks and the discussion
on alternatives that started here https://lkml.org/lkml/2011/10/25/24
petered out without any alternative patches being posted. Lacking
a viable alternative patch, I'm reposting this patch because AFAIK,
this bug still exists.

Colin Cross reported;

  Under the following conditions, __alloc_pages_slowpath can loop forever:
  gfp_mask & __GFP_WAIT is true
  gfp_mask & __GFP_FS is false
  reclaim and compaction make no progress
  order <= PAGE_ALLOC_COSTLY_ORDER

  These conditions happen very often during suspend and resume,
  when pm_restrict_gfp_mask() effectively converts all GFP_KERNEL
  allocations into __GFP_WAIT.

  The oom killer is not run because gfp_mask & __GFP_FS is false,
  but should_alloc_retry will always return true when order is less
  than PAGE_ALLOC_COSTLY_ORDER.

In his fix, he avoided retrying the allocation if reclaim made no
progress and __GFP_FS was not set. The problem is that this would
result in GFP_NOIO allocations failing that previously succeeded
which would be very unfortunate.

The big difference between GFP_NOIO and suspend converting GFP_KERNEL
to behave like GFP_NOIO is that normally flushers will be cleaning
pages and kswapd reclaims pages allowing GFP_NOIO to succeed after
a short delay. The same does not necessarily apply during suspend as
the storage device may be suspended.  Hence, this patch special cases
the suspend case to fail the page allocation if reclaim cannot make
progress. This might cause suspend to abort but that is better than
a livelock.

[mgorman@suse.de: Rework fix to be suspend specific]
Reported-and-tested-by: Colin Cross <ccross@android.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c |   22 ++++++++++++++++++++++
 1 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9dd443d..5402897 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -127,6 +127,20 @@ void pm_restrict_gfp_mask(void)
 	saved_gfp_mask = gfp_allowed_mask;
 	gfp_allowed_mask &= ~GFP_IOFS;
 }
+
+static bool pm_suspending(void)
+{
+	if ((gfp_allowed_mask & GFP_IOFS) == GFP_IOFS)
+		return false;
+	return true;
+}
+
+#else
+
+static bool pm_suspending(void)
+{
+	return false;
+}
 #endif /* CONFIG_PM_SLEEP */

 #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
@@ -2214,6 +2228,14 @@ rebalance:

 			goto restart;
 		}
+
+		/*
+		 * Suspend converts GFP_KERNEL to __GFP_WAIT which can
+		 * prevent reclaim making forward progress without
+		 * invoking OOM. Bail if we are suspending
+		 */
+		if (pm_suspending())
+			goto nopage;
 	}

 	/* Check if we should retry the allocation */

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-11-14 14:04 [PATCH] mm: avoid livelock on !__GFP_FS allocations Mel Gorman
@ 2011-11-14 18:38 ` Andrea Arcangeli
  2011-11-15 10:30   ` Mel Gorman
  2011-11-14 23:03 ` Andrew Morton
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 47+ messages in thread
From: Andrea Arcangeli @ 2011-11-14 18:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Colin Cross, Pekka Enberg, KAMEZAWA Hiroyuki,
	David Rientjes, LKML, Linux-MM

On Mon, Nov 14, 2011 at 02:04:21PM +0000, Mel Gorman wrote:
> In his fix, he avoided retrying the allocation if reclaim made no
> progress and __GFP_FS was not set. The problem is that this would
> result in GFP_NOIO allocations failing that previously succeeded
> which would be very unfortunate.

GFP_NOFS are made by filesystems/buffers to avoid locking up on fs/vfs
locking. Those also should be able to handle failure gracefully but
userland is more likely to get a -ENOMEM from these (for example
during direct-io) if those fs allocs fails. So clearly it sounds risky
to apply the modification quoted above and risk having any GFP_NOFS
fail. Said that I'm afraid we're not deadlock safe with current code
that cannot fail but there's no easy solution and no way to fix it in
the short term, and it's only a theoretical concern.

For !__GFP_FS allocations, __GFP_NOFAIL is the default for order <=
PAGE_ALLOC_COSTLY_ORDER and __GFP_NORETRY is the default for order >
PAGE_ALLOC_COSTLY_ORDER. This inconsistency is not so clean in my
view. Also for GFP_KERNEL/USER/__GFP_FS regular allocations the
__GFP_NOFAIL looks more like a __GFP_MAY_OOM.  But if we fix that and
we drop __GFP_NORETRY, and we set __GFP_NOFAIL within the
GFP_NOFS/NOIO #defines (to remove the magic PAGE_ALLOC_COSTLY_ORDER
check in should_alloc_retry) we may loop forever if somebody allocates
several mbytes of huge contiguous RAM with GFP_NOIO. So at least
there's a practical explanation for the current code.

Patch looks good to me (and safer) even if I don't like keeping
infinite loops from a purely theoretical standpoint.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-11-14 14:04 [PATCH] mm: avoid livelock on !__GFP_FS allocations Mel Gorman
  2011-11-14 18:38 ` Andrea Arcangeli
@ 2011-11-14 23:03 ` Andrew Morton
  2011-11-15 10:42   ` Mel Gorman
  2011-11-15 16:13 ` Minchan Kim
  2011-11-15 21:40 ` David Rientjes
  3 siblings, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2011-11-14 23:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Colin Cross, Pekka Enberg, KAMEZAWA Hiroyuki, Andrea Arcangeli,
	David Rientjes, LKML, Linux-MM

On Mon, 14 Nov 2011 14:04:21 +0000
Mel Gorman <mgorman@suse.de> wrote:

> This patch seems to have gotten lost in the cracks and the discussion
> on alternatives that started here https://lkml.org/lkml/2011/10/25/24
> petered out without any alternative patches being posted. Lacking
> a viable alternative patch, I'm reposting this patch because AFAIK,
> this bug still exists.
> 
> Colin Cross reported;
> 
>   Under the following conditions, __alloc_pages_slowpath can loop forever:
>   gfp_mask & __GFP_WAIT is true
>   gfp_mask & __GFP_FS is false
>   reclaim and compaction make no progress
>   order <= PAGE_ALLOC_COSTLY_ORDER
> 
>   These conditions happen very often during suspend and resume,
>   when pm_restrict_gfp_mask() effectively converts all GFP_KERNEL
>   allocations into __GFP_WAIT.
> 
>   The oom killer is not run because gfp_mask & __GFP_FS is false,
>   but should_alloc_retry will always return true when order is less
>   than PAGE_ALLOC_COSTLY_ORDER.
> 
> In his fix, he avoided retrying the allocation if reclaim made no
> progress and __GFP_FS was not set. The problem is that this would
> result in GFP_NOIO allocations failing that previously succeeded
> which would be very unfortunate.
> 
> The big difference between GFP_NOIO and suspend converting GFP_KERNEL
> to behave like GFP_NOIO is that normally flushers will be cleaning
> pages and kswapd reclaims pages allowing GFP_NOIO to succeed after
> a short delay. The same does not necessarily apply during suspend as
> the storage device may be suspended.  Hence, this patch special cases
> the suspend case to fail the page allocation if reclaim cannot make
> progress. This might cause suspend to abort but that is better than
> a livelock.

Fair enough.

> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9dd443d..5402897 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -127,6 +127,20 @@ void pm_restrict_gfp_mask(void)
>  	saved_gfp_mask = gfp_allowed_mask;
>  	gfp_allowed_mask &= ~GFP_IOFS;
>  }
> +
> +static bool pm_suspending(void)
> +{
> +	if ((gfp_allowed_mask & GFP_IOFS) == GFP_IOFS)
> +		return false;
> +	return true;
> +}

This doesn't seem a terribly reliable way of detecting that PM has
disabled the storage devices (which is what we really want to know
here: kswapd got crippled).

I guess it's safe for now, because PM is the only caller who alters
gfp_allowed_mask (I assume).  But an explicit storage_is_unavaliable
global which is set and cleared at exactly the correct time is clearer,
more direct and future-safer, no?

> +#else
> +
> +static bool pm_suspending(void)
> +{
> +	return false;
> +}
>  #endif /* CONFIG_PM_SLEEP */
>  
>  #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
> @@ -2214,6 +2228,14 @@ rebalance:
>  
>  			goto restart;
>  		}
> +
> +		/*
> +		 * Suspend converts GFP_KERNEL to __GFP_WAIT which can
> +		 * prevent reclaim making forward progress without
> +		 * invoking OOM. Bail if we are suspending
> +		 */
> +		if (pm_suspending())
> +			goto nopage;

The comment doesn't tell the whole story: it's important that kswapd
writeout was disabled?

--- a/mm/page_alloc.c~mm-avoid-livelock-on-__gfp_fs-allocations-fix
+++ a/mm/page_alloc.c
@@ -2263,9 +2263,10 @@ rebalance:
 		}
 
 		/*
-		 * Suspend converts GFP_KERNEL to __GFP_WAIT which can
-		 * prevent reclaim making forward progress without
-		 * invoking OOM. Bail if we are suspending
+		 * Suspend converts GFP_KERNEL to __GFP_WAIT which can prevent
+		 * reclaim making forward progress without invoking OOM.
+		 * Suspend also disables storage devices so kswapd cannot save
+		 * us.  Bail if we are suspending.
 		 */
 		if (pm_suspending())
 			goto nopage;
_

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-11-14 18:38 ` Andrea Arcangeli
@ 2011-11-15 10:30   ` Mel Gorman
  0 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2011-11-15 10:30 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Colin Cross, Pekka Enberg, KAMEZAWA Hiroyuki,
	David Rientjes, LKML, Linux-MM

On Mon, Nov 14, 2011 at 07:38:12PM +0100, Andrea Arcangeli wrote:
> On Mon, Nov 14, 2011 at 02:04:21PM +0000, Mel Gorman wrote:
> > In his fix, he avoided retrying the allocation if reclaim made no
> > progress and __GFP_FS was not set. The problem is that this would
> > result in GFP_NOIO allocations failing that previously succeeded
> > which would be very unfortunate.
> 
> GFP_NOFS are made by filesystems/buffers to avoid locking up on fs/vfs
> locking. Those also should be able to handle failure gracefully but
> userland is more likely to get a -ENOMEM from these (for example
> during direct-io) if those fs allocs fails.

I was also vaguely recalling Roland's talk at day 2 of kernel summit
(reported at http://lwn.net/Articles/464500/) where he talked about
error handling. One point he made was that some filesystems ran into
problems in the event of memory allocation failure. I didn't audit
if the block layer handles it better but one way or the other I did
not want to throw the the block or filesystem layers curve balls.

> So clearly it sounds risky
> to apply the modification quoted above and risk having any GFP_NOFS
> fail. Said that I'm afraid we're not deadlock safe with current code
> that cannot fail but there's no easy solution and no way to fix it in
> the short term, and it's only a theoretical concern.

It's still a valid concern. The expectation is that we are protected
from deadlocks a combination of mempools and the watermarks forcing
processes to stall in direct reclaim leaving a cushion of pages for
reclaim using PF_MEMALLOC to always make forward progress. Patches
that break how watermarks work tend to lead to deadlock.

> For !__GFP_FS allocations, __GFP_NOFAIL is the default for order <=
> PAGE_ALLOC_COSTLY_ORDER and __GFP_NORETRY is the default for order >
> PAGE_ALLOC_COSTLY_ORDER. This inconsistency is not so clean in my
> view.

Is your concern that the behaviour of the allocator changes quite
significantly for orders < PAGE_ALLOC_COSTLY_ORDER?

I agree with you that it would be nicer if there was a gradual scaling
back of how much work the allocator did that depended on order. To
date there has not been much pressure or motivation to implement it.

> Also for GFP_KERNEL/USER/__GFP_FS regular allocations the
> __GFP_NOFAIL looks more like a __GFP_MAY_OOM.  But if we fix that and
> we drop __GFP_NORETRY, and we set __GFP_NOFAIL within the
> GFP_NOFS/NOIO #defines (to remove the magic PAGE_ALLOC_COSTLY_ORDER
> check in should_alloc_retry) we may loop forever if somebody allocates
> several mbytes of huge contiguous RAM with GFP_NOIO. So at least
> there's a practical explanation for the current code.
> 

Yep.

> Patch looks good to me (and safer) even if I don't like keeping
> infinite loops from a purely theoretical standpoint.

>From a more practical point of view, I am generally more concerned
with abnormally large stalls from within the page allocator which
is what patches like "Do not stall in synchronous compaction for THP
allocations" address.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-11-14 23:03 ` Andrew Morton
@ 2011-11-15 10:42   ` Mel Gorman
  2011-11-15 15:43     ` Mel Gorman
  0 siblings, 1 reply; 47+ messages in thread
From: Mel Gorman @ 2011-11-15 10:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Colin Cross, Pekka Enberg, KAMEZAWA Hiroyuki, Andrea Arcangeli,
	David Rientjes, LKML, Linux-MM

On Mon, Nov 14, 2011 at 03:03:26PM -0800, Andrew Morton wrote:
> > <SNIP>
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 9dd443d..5402897 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -127,6 +127,20 @@ void pm_restrict_gfp_mask(void)
> >  	saved_gfp_mask = gfp_allowed_mask;
> >  	gfp_allowed_mask &= ~GFP_IOFS;
> >  }
> > +
> > +static bool pm_suspending(void)
> > +{
> > +	if ((gfp_allowed_mask & GFP_IOFS) == GFP_IOFS)
> > +		return false;
> > +	return true;
> > +}
> 
> This doesn't seem a terribly reliable way of detecting that PM has
> disabled the storage devices (which is what we really want to know
> here: kswapd got crippled).
> 

It only works because PM is the only caller that alters
gfp_allowed_mask at runtime after early boot completes. We also check
if suspend is in progress in mm/swapfile.c#try_to_free_swap() using
the gfp_allowed_mask.

> I guess it's safe for now, because PM is the only caller who alters
> gfp_allowed_mask (I assume). 

You assume correctly.

> But an explicit storage_is_unavaliable
> global which is set and cleared at exactly the correct time is clearer,
> more direct and future-safer, no?
> 

It feels overkill to allocate more global storage for it when
gfp_allowed_mask is already there but I could rename pm_suspending() to
pm_disabled_storage(), make try_to_free_swap() use the same helper but
leave the implementation the same. This would clarify the situation.

> > +#else
> > +
> > +static bool pm_suspending(void)
> > +{
> > +	return false;
> > +}
> >  #endif /* CONFIG_PM_SLEEP */
> >  
> >  #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
> > @@ -2214,6 +2228,14 @@ rebalance:
> >  
> >  			goto restart;
> >  		}
> > +
> > +		/*
> > +		 * Suspend converts GFP_KERNEL to __GFP_WAIT which can
> > +		 * prevent reclaim making forward progress without
> > +		 * invoking OOM. Bail if we are suspending
> > +		 */
> > +		if (pm_suspending())
> > +			goto nopage;
> 
> The comment doesn't tell the whole story: it's important that kswapd
> writeout was disabled?
> 

Writeout is disabled for flushers as well but your comment covers both
and clarifies the situation. Thanks

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-11-15 10:42   ` Mel Gorman
@ 2011-11-15 15:43     ` Mel Gorman
  0 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2011-11-15 15:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Colin Cross, Pekka Enberg, KAMEZAWA Hiroyuki, Andrea Arcangeli,
	David Rientjes, LKML, Linux-MM

> <SNIP>
> It feels overkill to allocate more global storage for it when
> gfp_allowed_mask is already there but I could rename pm_suspending() to
> pm_disabled_storage(), make try_to_free_swap() use the same helper but
> leave the implementation the same. This would clarify the situation.
> 

Something like this? It's on top of your comment update

==== CUT HERE ====
mm: Clarify usage of gfp_allowed_mask

This patch clarifies how gfp_allowed_mask is used and that during
run-time it is PM-specific.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/gfp.h |   16 ++++++++++++++++
 mm/page_alloc.c     |   11 ++---------
 mm/swapfile.c       |    6 +++---
 3 files changed, 21 insertions(+), 12 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 3a76faf..033f55f 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -367,9 +367,25 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
 void drain_all_pages(void);
 void drain_local_pages(void *dummy);
 
+/*
+ * gfp_allowed_mask is set to GFP_BOOT_MASK during early boot to restrict what
+ * GFP flags are used before interrupts are enabled. Once interrupts are
+ * enabled, it is set to __GFP_BITS_MASK while the system is running. During
+ * hibernation, it is used by PM to avoid I/O during memory allocation while
+ * devices are suspended.
+ */
 extern gfp_t gfp_allowed_mask;
 
 extern void pm_restrict_gfp_mask(void);
 extern void pm_restore_gfp_mask(void);
 
+#ifdef CONFIG_PM_SLEEP
+extern bool pm_suspended_storage(void);
+#else
+static inline bool pm_suspended_storage(void)
+{
+	return false;
+}
+#endif /* CONFIG_PM_SLEEP */
+
 #endif /* __LINUX_GFP_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2ee4040..4ace1e0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -128,19 +128,12 @@ void pm_restrict_gfp_mask(void)
 	gfp_allowed_mask &= ~GFP_IOFS;
 }
 
-static bool pm_suspending(void)
+bool pm_suspended_storage(void)
 {
 	if ((gfp_allowed_mask & GFP_IOFS) == GFP_IOFS)
 		return false;
 	return true;
 }
-
-#else
-
-static bool pm_suspending(void)
-{
-	return false;
-}
 #endif /* CONFIG_PM_SLEEP */
 
 #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
@@ -2235,7 +2228,7 @@ rebalance:
 		 * Suspend also disables storage devices so kswapd cannot save
 		 * us.  Bail if we are suspending.
 		 */
-		if (pm_suspending())
+		if (pm_suspended_storage())
 			goto nopage;
 	}
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index b1cd120..9520592 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -667,10 +667,10 @@ int try_to_free_swap(struct page *page)
 	 * original page might be freed under memory pressure, then
 	 * later read back in from swap, now with the wrong data.
 	 *
-	 * Hibernation clears bits from gfp_allowed_mask to prevent
-	 * memory reclaim from writing to disk, so check that here.
+	 * Hibration suspends storage while it is writing the image
+	 * to disk so check that here.
 	 */
-	if (!(gfp_allowed_mask & __GFP_IO))
+	if (pm_suspended_storage())
 		return 0;
 
 	delete_from_swap_cache(page);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-11-14 14:04 [PATCH] mm: avoid livelock on !__GFP_FS allocations Mel Gorman
  2011-11-14 18:38 ` Andrea Arcangeli
  2011-11-14 23:03 ` Andrew Morton
@ 2011-11-15 16:13 ` Minchan Kim
  2011-11-15 17:36   ` Mel Gorman
  2011-11-15 21:40 ` David Rientjes
  3 siblings, 1 reply; 47+ messages in thread
From: Minchan Kim @ 2011-11-15 16:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Colin Cross, Pekka Enberg, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, David Rientjes, LKML, Linux-MM

On Mon, Nov 14, 2011 at 11:04 PM, Mel Gorman <mgorman@suse.de> wrote:
> This patch seems to have gotten lost in the cracks and the discussion
> on alternatives that started here https://lkml.org/lkml/2011/10/25/24
> petered out without any alternative patches being posted. Lacking
> a viable alternative patch, I'm reposting this patch because AFAIK,
> this bug still exists.
>
> Colin Cross reported;
>
>  Under the following conditions, __alloc_pages_slowpath can loop forever:
>  gfp_mask & __GFP_WAIT is true
>  gfp_mask & __GFP_FS is false
>  reclaim and compaction make no progress
>  order <= PAGE_ALLOC_COSTLY_ORDER
>
>  These conditions happen very often during suspend and resume,
>  when pm_restrict_gfp_mask() effectively converts all GFP_KERNEL
>  allocations into __GFP_WAIT.
>
>  The oom killer is not run because gfp_mask & __GFP_FS is false,
>  but should_alloc_retry will always return true when order is less
>  than PAGE_ALLOC_COSTLY_ORDER.
>
> In his fix, he avoided retrying the allocation if reclaim made no
> progress and __GFP_FS was not set. The problem is that this would
> result in GFP_NOIO allocations failing that previously succeeded
> which would be very unfortunate.
>
> The big difference between GFP_NOIO and suspend converting GFP_KERNEL
> to behave like GFP_NOIO is that normally flushers will be cleaning
> pages and kswapd reclaims pages allowing GFP_NOIO to succeed after
> a short delay. The same does not necessarily apply during suspend as
> the storage device may be suspended.  Hence, this patch special cases
> the suspend case to fail the page allocation if reclaim cannot make
> progress. This might cause suspend to abort but that is better than
> a livelock.
>
> [mgorman@suse.de: Rework fix to be suspend specific]
> Reported-and-tested-by: Colin Cross <ccross@android.com>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  mm/page_alloc.c |   22 ++++++++++++++++++++++
>  1 files changed, 22 insertions(+), 0 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9dd443d..5402897 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -127,6 +127,20 @@ void pm_restrict_gfp_mask(void)
>        saved_gfp_mask = gfp_allowed_mask;
>        gfp_allowed_mask &= ~GFP_IOFS;
>  }
> +
> +static bool pm_suspending(void)
> +{
> +       if ((gfp_allowed_mask & GFP_IOFS) == GFP_IOFS)
> +               return false;
> +       return true;
> +}
> +
> +#else
> +
> +static bool pm_suspending(void)
> +{
> +       return false;
> +}
>  #endif /* CONFIG_PM_SLEEP */
>
>  #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
> @@ -2214,6 +2228,14 @@ rebalance:
>
>                        goto restart;
>                }
> +
> +               /*
> +                * Suspend converts GFP_KERNEL to __GFP_WAIT which can
> +                * prevent reclaim making forward progress without
> +                * invoking OOM. Bail if we are suspending
> +                */
> +               if (pm_suspending())
> +                       goto nopage;
>        }
>
>        /* Check if we should retry the allocation */
>


I don't have much time to look into this problem so I miss some things.
But the feeling I have a mind when I faced this problem is why we
should make another special case handling function.
Already we have such thing for hibernation - oom_killer_disabled in vm
Could we use it instead of making new branch for very special case?
Maybe It would be better to rename oom_killer_disabled with
pm_is_going or something.


-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-11-15 16:13 ` Minchan Kim
@ 2011-11-15 17:36   ` Mel Gorman
  2011-11-16  0:22     ` Minchan Kim
  0 siblings, 1 reply; 47+ messages in thread
From: Mel Gorman @ 2011-11-15 17:36 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Colin Cross, Pekka Enberg, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, David Rientjes, LKML, Linux-MM

On Wed, Nov 16, 2011 at 01:13:30AM +0900, Minchan Kim wrote:
> On Mon, Nov 14, 2011 at 11:04 PM, Mel Gorman <mgorman@suse.de> wrote:
> > This patch seems to have gotten lost in the cracks and the discussion
> > on alternatives that started here https://lkml.org/lkml/2011/10/25/24
> > petered out without any alternative patches being posted. Lacking
> > a viable alternative patch, I'm reposting this patch because AFAIK,
> > this bug still exists.
> >
> > Colin Cross reported;
> >
> >  Under the following conditions, __alloc_pages_slowpath can loop forever:
> >  gfp_mask & __GFP_WAIT is true
> >  gfp_mask & __GFP_FS is false
> >  reclaim and compaction make no progress
> >  order <= PAGE_ALLOC_COSTLY_ORDER
> >
> >  These conditions happen very often during suspend and resume,
> >  when pm_restrict_gfp_mask() effectively converts all GFP_KERNEL
> >  allocations into __GFP_WAIT.
> >
> >  The oom killer is not run because gfp_mask & __GFP_FS is false,
> >  but should_alloc_retry will always return true when order is less
> >  than PAGE_ALLOC_COSTLY_ORDER.
> >
> > In his fix, he avoided retrying the allocation if reclaim made no
> > progress and __GFP_FS was not set. The problem is that this would
> > result in GFP_NOIO allocations failing that previously succeeded
> > which would be very unfortunate.
> >
> > The big difference between GFP_NOIO and suspend converting GFP_KERNEL
> > to behave like GFP_NOIO is that normally flushers will be cleaning
> > pages and kswapd reclaims pages allowing GFP_NOIO to succeed after
> > a short delay. The same does not necessarily apply during suspend as
> > the storage device may be suspended.  Hence, this patch special cases
> > the suspend case to fail the page allocation if reclaim cannot make
> > progress. This might cause suspend to abort but that is better than
> > a livelock.
> >
> > [mgorman@suse.de: Rework fix to be suspend specific]
> > Reported-and-tested-by: Colin Cross <ccross@android.com>
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> >  mm/page_alloc.c |   22 ++++++++++++++++++++++
> >  1 files changed, 22 insertions(+), 0 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 9dd443d..5402897 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -127,6 +127,20 @@ void pm_restrict_gfp_mask(void)
> >        saved_gfp_mask = gfp_allowed_mask;
> >        gfp_allowed_mask &= ~GFP_IOFS;
> >  }
> > +
> > +static bool pm_suspending(void)
> > +{
> > +       if ((gfp_allowed_mask & GFP_IOFS) == GFP_IOFS)
> > +               return false;
> > +       return true;
> > +}
> > +
> > +#else
> > +
> > +static bool pm_suspending(void)
> > +{
> > +       return false;
> > +}
> >  #endif /* CONFIG_PM_SLEEP */
> >
> >  #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
> > @@ -2214,6 +2228,14 @@ rebalance:
> >
> >                        goto restart;
> >                }
> > +
> > +               /*
> > +                * Suspend converts GFP_KERNEL to __GFP_WAIT which can
> > +                * prevent reclaim making forward progress without
> > +                * invoking OOM. Bail if we are suspending
> > +                */
> > +               if (pm_suspending())
> > +                       goto nopage;
> >        }
> >
> >        /* Check if we should retry the allocation */
> >
> 
> I don't have much time to look into this problem so I miss some things.
> But the feeling I have a mind when I faced this problem is why we
> should make another special case handling function.
> Already we have such thing for hibernation - oom_killer_disabled in vm
> Could we use it instead of making new branch for very special case?

Fair question!

Suspend is a multi-stage process and the OOM killer is disabled at
a different time to the GFP flags being restricted. This is another
reason why renaming to pm_suspending to pm_suspended_storage is a
good idea (pm_suspending is misleading at best).

I am vague on all the steps hibernation takes but initially processes
are frozen and if they are successfully frozen then the OOM killer is
disabled. At this point, storage is still active so the GFP allowed
mask is the same. When preparing to write the image, kernel threads
are suspended so there is no new IO being initiated and then the GFP
mask is restricted to prevent any memory allocation trying to write
pages to storage. It then writes the image to disk.

So what we have now is

        if (!did_some_progress) {
                if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
                        if (oom_killer_disabled)
                                goto nopage;

Lets say we changed that to

        if (!did_some_progress) {
                if (oom_killer_disabled)
                        goto nopage;
                if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {

The impact would be that during the time between processes been frozen
and storage being suspended, GFP_NOIO allocations that used to call
wait_iff_congested and retry while kswapd does its thing will return
failure instead. These GFP_NOIO allocations that used to succeed will
now fail in rare cases during suspend and I don't think we want that.

Is this what you meant or had you something else in mind?

> Maybe It would be better to rename oom_killer_disabled with
> pm_is_going or something.
> 

I think renaming oom_killer_disabled to pm_oom_disabled would be
reasonable but it does not necessarily get us away from needing a
pm_suspended_storage() test.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-11-14 14:04 [PATCH] mm: avoid livelock on !__GFP_FS allocations Mel Gorman
                   ` (2 preceding siblings ...)
  2011-11-15 16:13 ` Minchan Kim
@ 2011-11-15 21:40 ` David Rientjes
  2011-11-16  9:52   ` Mel Gorman
  3 siblings, 1 reply; 47+ messages in thread
From: David Rientjes @ 2011-11-15 21:40 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Colin Cross, Pekka Enberg, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, LKML, Linux-MM

On Mon, 14 Nov 2011, Mel Gorman wrote:

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9dd443d..5402897 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -127,6 +127,20 @@ void pm_restrict_gfp_mask(void)
>  	saved_gfp_mask = gfp_allowed_mask;
>  	gfp_allowed_mask &= ~GFP_IOFS;
>  }
> +
> +static bool pm_suspending(void)
> +{
> +	if ((gfp_allowed_mask & GFP_IOFS) == GFP_IOFS)
> +		return false;
> +	return true;
> +}
> +
> +#else
> +
> +static bool pm_suspending(void)
> +{
> +	return false;
> +}
>  #endif /* CONFIG_PM_SLEEP */
>  
>  #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
> @@ -2214,6 +2228,14 @@ rebalance:
>  
>  			goto restart;
>  		}
> +
> +		/*
> +		 * Suspend converts GFP_KERNEL to __GFP_WAIT which can
> +		 * prevent reclaim making forward progress without
> +		 * invoking OOM. Bail if we are suspending
> +		 */
> +		if (pm_suspending())
> +			goto nopage;
>  	}
>  
>  	/* Check if we should retry the allocation */

This allows all __GFP_NOFAIL allocations to fail while 
pm_restrict_gfp_mask() is in effect, so I disagree with this unless it is 
moved into the should_alloc_retry() logic.  If you pass did_some_progress 
into that function and then moved the check for __GFP_NOFAIL right under 
the check for __GFP_NORETRY and checked for pm_suspending() there (and 
before the check for PAGE_ALLOC_COSTLY_ORDER) then it would allow the 
infinite loop for __GFP_NOFAIL which is required if __GFP_WAIT.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-11-15 17:36   ` Mel Gorman
@ 2011-11-16  0:22     ` Minchan Kim
  2011-11-16  0:28       ` Colin Cross
  0 siblings, 1 reply; 47+ messages in thread
From: Minchan Kim @ 2011-11-16  0:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Colin Cross, Pekka Enberg, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, David Rientjes, LKML, Linux-MM

On Tue, Nov 15, 2011 at 05:36:56PM +0000, Mel Gorman wrote:
> On Wed, Nov 16, 2011 at 01:13:30AM +0900, Minchan Kim wrote:
> > On Mon, Nov 14, 2011 at 11:04 PM, Mel Gorman <mgorman@suse.de> wrote:
> > > This patch seems to have gotten lost in the cracks and the discussion
> > > on alternatives that started here https://lkml.org/lkml/2011/10/25/24
> > > petered out without any alternative patches being posted. Lacking
> > > a viable alternative patch, I'm reposting this patch because AFAIK,
> > > this bug still exists.
> > >
> > > Colin Cross reported;
> > >
> > >  Under the following conditions, __alloc_pages_slowpath can loop forever:
> > >  gfp_mask & __GFP_WAIT is true
> > >  gfp_mask & __GFP_FS is false
> > >  reclaim and compaction make no progress
> > >  order <= PAGE_ALLOC_COSTLY_ORDER
> > >
> > >  These conditions happen very often during suspend and resume,
> > >  when pm_restrict_gfp_mask() effectively converts all GFP_KERNEL
> > >  allocations into __GFP_WAIT.
> > >
> > >  The oom killer is not run because gfp_mask & __GFP_FS is false,
> > >  but should_alloc_retry will always return true when order is less
> > >  than PAGE_ALLOC_COSTLY_ORDER.
> > >
> > > In his fix, he avoided retrying the allocation if reclaim made no
> > > progress and __GFP_FS was not set. The problem is that this would
> > > result in GFP_NOIO allocations failing that previously succeeded
> > > which would be very unfortunate.
> > >
> > > The big difference between GFP_NOIO and suspend converting GFP_KERNEL
> > > to behave like GFP_NOIO is that normally flushers will be cleaning
> > > pages and kswapd reclaims pages allowing GFP_NOIO to succeed after
> > > a short delay. The same does not necessarily apply during suspend as
> > > the storage device may be suspended.  Hence, this patch special cases
> > > the suspend case to fail the page allocation if reclaim cannot make
> > > progress. This might cause suspend to abort but that is better than
> > > a livelock.
> > >
> > > [mgorman@suse.de: Rework fix to be suspend specific]
> > > Reported-and-tested-by: Colin Cross <ccross@android.com>
> > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > ---
> > >  mm/page_alloc.c |   22 ++++++++++++++++++++++
> > >  1 files changed, 22 insertions(+), 0 deletions(-)
> > >
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 9dd443d..5402897 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -127,6 +127,20 @@ void pm_restrict_gfp_mask(void)
> > >        saved_gfp_mask = gfp_allowed_mask;
> > >        gfp_allowed_mask &= ~GFP_IOFS;
> > >  }
> > > +
> > > +static bool pm_suspending(void)
> > > +{
> > > +       if ((gfp_allowed_mask & GFP_IOFS) == GFP_IOFS)
> > > +               return false;
> > > +       return true;
> > > +}
> > > +
> > > +#else
> > > +
> > > +static bool pm_suspending(void)
> > > +{
> > > +       return false;
> > > +}
> > >  #endif /* CONFIG_PM_SLEEP */
> > >
> > >  #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
> > > @@ -2214,6 +2228,14 @@ rebalance:
> > >
> > >                        goto restart;
> > >                }
> > > +
> > > +               /*
> > > +                * Suspend converts GFP_KERNEL to __GFP_WAIT which can
> > > +                * prevent reclaim making forward progress without
> > > +                * invoking OOM. Bail if we are suspending
> > > +                */
> > > +               if (pm_suspending())
> > > +                       goto nopage;
> > >        }
> > >
> > >        /* Check if we should retry the allocation */
> > >
> > 
> > I don't have much time to look into this problem so I miss some things.
> > But the feeling I have a mind when I faced this problem is why we
> > should make another special case handling function.
> > Already we have such thing for hibernation - oom_killer_disabled in vm
> > Could we use it instead of making new branch for very special case?
> 
> Fair question!
> 
> Suspend is a multi-stage process and the OOM killer is disabled at
> a different time to the GFP flags being restricted. This is another
> reason why renaming to pm_suspending to pm_suspended_storage is a
> good idea (pm_suspending is misleading at best).
> 
> I am vague on all the steps hibernation takes but initially processes
> are frozen and if they are successfully frozen then the OOM killer is
> disabled. At this point, storage is still active so the GFP allowed
> mask is the same. When preparing to write the image, kernel threads
> are suspended so there is no new IO being initiated and then the GFP
> mask is restricted to prevent any memory allocation trying to write
> pages to storage. It then writes the image to disk.
> 
> So what we have now is
> 
>         if (!did_some_progress) {
>                 if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
>                         if (oom_killer_disabled)
>                                 goto nopage;
> 
> Lets say we changed that to
> 
>         if (!did_some_progress) {
>                 if (oom_killer_disabled)
>                         goto nopage;
>                 if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> 
> The impact would be that during the time between processes been frozen
> and storage being suspended, GFP_NOIO allocations that used to call
> wait_iff_congested and retry while kswapd does its thing will return
> failure instead. These GFP_NOIO allocations that used to succeed will
> now fail in rare cases during suspend and I don't think we want that.
> 
> Is this what you meant or had you something else in mind?
> 

You read my mind exactly!

I thought hibernation process is as follows,

freeze user processes
oom_disable
hibernate_preallocate_memory
freeze kernel processes(include kswapd)
pm_restrict_gfp_mask
swsusp_save

My guessing is hibernate_prealocate_memory should reserve all memory needed
for hibernation for reclaimaing pages of kswapd because kswapd just would be
stopped so during swsusp_save, page reclaim should not be occured.

But being see description of patch, my guess seems wrong.
Now the problem happens and it means page reclaim happens during swsusp_save.
Colin or someone could confirm this?

If so, could we reserve more memory when we preallocate hibernation memory
for avoiding page reclaim without kswapd?

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-11-16  0:22     ` Minchan Kim
@ 2011-11-16  0:28       ` Colin Cross
  2011-11-16  0:45         ` Minchan Kim
  0 siblings, 1 reply; 47+ messages in thread
From: Colin Cross @ 2011-11-16  0:28 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Mel Gorman, Andrew Morton, Pekka Enberg, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, David Rientjes, LKML, Linux-MM

On Tue, Nov 15, 2011 at 4:22 PM, Minchan Kim <minchan.kim@gmail.com> wrote:
> On Tue, Nov 15, 2011 at 05:36:56PM +0000, Mel Gorman wrote:
>> On Wed, Nov 16, 2011 at 01:13:30AM +0900, Minchan Kim wrote:
>> The impact would be that during the time between processes been frozen
>> and storage being suspended, GFP_NOIO allocations that used to call
>> wait_iff_congested and retry while kswapd does its thing will return
>> failure instead. These GFP_NOIO allocations that used to succeed will
>> now fail in rare cases during suspend and I don't think we want that.
>>
>> Is this what you meant or had you something else in mind?
>>
>
> You read my mind exactly!
>
> I thought hibernation process is as follows,
>
> freeze user processes
> oom_disable
> hibernate_preallocate_memory
> freeze kernel processes(include kswapd)
> pm_restrict_gfp_mask
> swsusp_save
>
> My guessing is hibernate_prealocate_memory should reserve all memory needed
> for hibernation for reclaimaing pages of kswapd because kswapd just would be
> stopped so during swsusp_save, page reclaim should not be occured.
>
> But being see description of patch, my guess seems wrong.
> Now the problem happens and it means page reclaim happens during swsusp_save.
> Colin or someone could confirm this?

The problem I see is during suspend, not hibernation.  The particular
allocation that usually causes the problem is the pgd_alloc for page
tables when re-enabling the 2nd cpu during resume, which is odd as
those same page tables were freed during suspend.  I guess an
unfreezable kernel thread allocated that memory between the free and
re-allocation.

> If so, could we reserve more memory when we preallocate hibernation memory
> for avoiding page reclaim without kswapd?
>
> --
> Kind regards,
> Minchan Kim
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-11-16  0:28       ` Colin Cross
@ 2011-11-16  0:45         ` Minchan Kim
  2011-11-16  7:10           ` Pekka Enberg
  0 siblings, 1 reply; 47+ messages in thread
From: Minchan Kim @ 2011-11-16  0:45 UTC (permalink / raw)
  To: Colin Cross
  Cc: Mel Gorman, Andrew Morton, Pekka Enberg, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, David Rientjes, LKML, Linux-MM

On Tue, Nov 15, 2011 at 04:28:36PM -0800, Colin Cross wrote:
> On Tue, Nov 15, 2011 at 4:22 PM, Minchan Kim <minchan.kim@gmail.com> wrote:
> > On Tue, Nov 15, 2011 at 05:36:56PM +0000, Mel Gorman wrote:
> >> On Wed, Nov 16, 2011 at 01:13:30AM +0900, Minchan Kim wrote:
> >> The impact would be that during the time between processes been frozen
> >> and storage being suspended, GFP_NOIO allocations that used to call
> >> wait_iff_congested and retry while kswapd does its thing will return
> >> failure instead. These GFP_NOIO allocations that used to succeed will
> >> now fail in rare cases during suspend and I don't think we want that.
> >>
> >> Is this what you meant or had you something else in mind?
> >>
> >
> > You read my mind exactly!
> >
> > I thought hibernation process is as follows,
> >
> > freeze user processes
> > oom_disable
> > hibernate_preallocate_memory
> > freeze kernel processes(include kswapd)
> > pm_restrict_gfp_mask
> > swsusp_save
> >
> > My guessing is hibernate_prealocate_memory should reserve all memory needed
> > for hibernation for reclaimaing pages of kswapd because kswapd just would be
> > stopped so during swsusp_save, page reclaim should not be occured.
> >
> > But being see description of patch, my guess seems wrong.
> > Now the problem happens and it means page reclaim happens during swsusp_save.
> > Colin or someone could confirm this?
> 
> The problem I see is during suspend, not hibernation.  The particular
> allocation that usually causes the problem is the pgd_alloc for page
> tables when re-enabling the 2nd cpu during resume, which is odd as
> those same page tables were freed during suspend.  I guess an
> unfreezable kernel thread allocated that memory between the free and
> re-allocation.

Then, How about this?

[barrios@barrios-laptop linux-2.6]$ git diff
diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
index fdd4263..01aa9b5 100644
--- a/kernel/power/suspend.c
+++ b/kernel/power/suspend.c
@@ -297,9 +297,11 @@ int enter_state(suspend_state_t state)
                goto Finish;
 
        pr_debug("PM: Entering %s sleep\n", pm_states[state]);
+       oom_killer_disable();
        pm_restrict_gfp_mask();
        error = suspend_devices_and_enter(state);
        pm_restore_gfp_mask();
+       oom_killer_enable();
 
  Finish:
        pr_debug("PM: Finishing wakeup.\n");
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e8ecb6..d8c31b7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2177,9 +2177,9 @@ rebalance:
         * running out of options and have to consider going OOM
         */
        if (!did_some_progress) {
-               if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
-                       if (oom_killer_disabled)
+               if (oom_killer_disabled)
                                goto nopage;
+               if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
                        page = __alloc_pages_may_oom(gfp_mask, order,
                                        zonelist, high_zoneidx,
                                        nodemask, preferred_zone,

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-11-16  0:45         ` Minchan Kim
@ 2011-11-16  7:10           ` Pekka Enberg
  2011-11-16 21:44             ` David Rientjes
  0 siblings, 1 reply; 47+ messages in thread
From: Pekka Enberg @ 2011-11-16  7:10 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Colin Cross, Mel Gorman, Andrew Morton, Pekka Enberg,
	KAMEZAWA Hiroyuki, Andrea Arcangeli, David Rientjes, LKML,
	Linux-MM

On Wed, 16 Nov 2011, Minchan Kim wrote:
> Then, How about this?
>
> [barrios@barrios-laptop linux-2.6]$ git diff
> diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
> index fdd4263..01aa9b5 100644
> --- a/kernel/power/suspend.c
> +++ b/kernel/power/suspend.c
> @@ -297,9 +297,11 @@ int enter_state(suspend_state_t state)
>                goto Finish;
>
>        pr_debug("PM: Entering %s sleep\n", pm_states[state]);
> +       oom_killer_disable();
>        pm_restrict_gfp_mask();
>        error = suspend_devices_and_enter(state);
>        pm_restore_gfp_mask();
> +       oom_killer_enable();
>
>  Finish:
>        pr_debug("PM: Finishing wakeup.\n");
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6e8ecb6..d8c31b7 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2177,9 +2177,9 @@ rebalance:
>         * running out of options and have to consider going OOM
>         */
>        if (!did_some_progress) {
> -               if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> -                       if (oom_killer_disabled)
> +               if (oom_killer_disabled)
>                                goto nopage;
> +               if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
>                        page = __alloc_pages_may_oom(gfp_mask, order,
>                                        zonelist, high_zoneidx,
>                                        nodemask, preferred_zone,
>

I'd prefer something like this. The whole 'gfp_allowed_flags' thing was 
designed to make GFP_KERNEL work during boot time where it's obviously 
safe to do that. I really don't think that's going to work suspend 
cleanly.

 			Pekka

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-11-15 21:40 ` David Rientjes
@ 2011-11-16  9:52   ` Mel Gorman
  2011-11-16 21:39     ` David Rientjes
  0 siblings, 1 reply; 47+ messages in thread
From: Mel Gorman @ 2011-11-16  9:52 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Colin Cross, Pekka Enberg, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, LKML, Linux-MM

On Tue, Nov 15, 2011 at 01:40:42PM -0800, David Rientjes wrote:
> > <SNIP>
> > @@ -2214,6 +2228,14 @@ rebalance:
> >  
> >  			goto restart;
> >  		}
> > +
> > +		/*
> > +		 * Suspend converts GFP_KERNEL to __GFP_WAIT which can
> > +		 * prevent reclaim making forward progress without
> > +		 * invoking OOM. Bail if we are suspending
> > +		 */
> > +		if (pm_suspending())
> > +			goto nopage;
> >  	}
> >  
> >  	/* Check if we should retry the allocation */
> 
> This allows all __GFP_NOFAIL allocations to fail while 
> pm_restrict_gfp_mask() is in effect, so I disagree with this unless it is 
> moved into the should_alloc_retry() logic.  If you pass did_some_progress 
> into that function and then moved the check for __GFP_NOFAIL right under 
> the check for __GFP_NORETRY and checked for pm_suspending() there (and 
> before the check for PAGE_ALLOC_COSTLY_ORDER) then it would allow the 
> infinite loop for __GFP_NOFAIL which is required if __GFP_WAIT.

Good point. I agree that it would be more consistent although
there is still the risk of infinite looping with __GFP_NOFAIL if
storage devices are disabled.

Colin reported elsewhere in this thread that "the particular allocation
that usually causes the problem is the pgd_alloc for page tables when
re-enabling the 2nd cpu during resume". On X86, those allocations are using
the flags

GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO

so they should not be trapped in an infinite loop due to __GFP_NOFAIL.
On ARM, they use GFP_KERNEL so should also be ok.

That said, this patch is no longer functionally equivalent to what he
tested so I had to remove the tested-by. Colin, can you retest with the
following patch please? If it gets stuck, it's interesting in itself
because it means we were previously failing __GFP_NOFAIL allocations
during suspend on paths that probably don't handle allocation failure
very well.

David, is this what you meant? This patch includes all the
documentation-related updates that were discussed in this thread as well
as updated the check in mm/swapfile.c for hibernation.

==== CUT HERE ====
mm: avoid livelock on !__GFP_FS allocations v2

Changelog since V1
  o Move PM check to should_alloc_retry (David Rientjes)
  o Add some additional documentation

Colin Cross reported;

  Under the following conditions, __alloc_pages_slowpath can loop forever:
  gfp_mask & __GFP_WAIT is true
  gfp_mask & __GFP_FS is false
  reclaim and compaction make no progress
  order <= PAGE_ALLOC_COSTLY_ORDER

  These conditions happen very often during suspend and resume,
  when pm_restrict_gfp_mask() effectively converts all GFP_KERNEL
  allocations into __GFP_WAIT.

  The oom killer is not run because gfp_mask & __GFP_FS is false,
  but should_alloc_retry will always return true when order is less
  than PAGE_ALLOC_COSTLY_ORDER.

In his fix, he avoided retrying the allocation if reclaim made no
progress and __GFP_FS was not set. The problem is that this would
result in GFP_NOIO allocations failing that previously succeeded
which would be very unfortunate.

The big difference between GFP_NOIO and suspend converting GFP_KERNEL
to behave like GFP_NOIO is that normally flushers will be cleaning
pages and kswapd reclaims pages allowing GFP_NOIO to succeed after
a short delay. The same does not necessarily apply during suspend as
the storage device may be suspended.

This patch special cases the suspend case to fail the page allocation
if reclaim cannot make progress and adds some documentation on how
gfp_allowed_mask is currently used. Failing allocations like this
may cause suspend to abort but that is better than a livelock.

[mgorman@suse.de: Rework fix to be suspend specific]
[rientjes@google.com: Move suspended device check to should_alloc_retry]
Reported-by: Colin Cross <ccross@android.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/gfp.h |   16 ++++++++++++++++
 mm/page_alloc.c     |   30 ++++++++++++++++++++++--------
 mm/swapfile.c       |    6 +++---
 3 files changed, 41 insertions(+), 11 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 3a76faf..033f55f 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -367,9 +367,25 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
 void drain_all_pages(void);
 void drain_local_pages(void *dummy);
 
+/*
+ * gfp_allowed_mask is set to GFP_BOOT_MASK during early boot to restrict what
+ * GFP flags are used before interrupts are enabled. Once interrupts are
+ * enabled, it is set to __GFP_BITS_MASK while the system is running. During
+ * hibernation, it is used by PM to avoid I/O during memory allocation while
+ * devices are suspended.
+ */
 extern gfp_t gfp_allowed_mask;
 
 extern void pm_restrict_gfp_mask(void);
 extern void pm_restore_gfp_mask(void);
 
+#ifdef CONFIG_PM_SLEEP
+extern bool pm_suspended_storage(void);
+#else
+static inline bool pm_suspended_storage(void)
+{
+	return false;
+}
+#endif /* CONFIG_PM_SLEEP */
+
 #endif /* __LINUX_GFP_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9dd443d..a72cbf9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -127,6 +127,13 @@ void pm_restrict_gfp_mask(void)
 	saved_gfp_mask = gfp_allowed_mask;
 	gfp_allowed_mask &= ~GFP_IOFS;
 }
+
+bool pm_suspended_storage(void)
+{
+	if ((gfp_allowed_mask & GFP_IOFS) == GFP_IOFS)
+		return false;
+	return true;
+}
 #endif /* CONFIG_PM_SLEEP */
 
 #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
@@ -1795,12 +1802,25 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
 
 static inline int
 should_alloc_retry(gfp_t gfp_mask, unsigned int order,
+				unsigned long did_some_progress,
 				unsigned long pages_reclaimed)
 {
 	/* Do not loop if specifically requested */
 	if (gfp_mask & __GFP_NORETRY)
 		return 0;
 
+	/* Always retry if specifically requested */
+	if (gfp_mask & __GFP_NOFAIL)
+		return 1;
+
+	/*
+	 * Suspend converts GFP_KERNEL to __GFP_WAIT which can prevent reclaim
+	 * making forward progress without invoking OOM. Suspend also disables
+	 * storage devices so kswapd will not help. Bail if we are suspending.
+	 */
+	if (!did_some_progress && pm_suspended_storage())
+		return 0;
+
 	/*
 	 * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
 	 * means __GFP_NOFAIL, but that may not be true in other
@@ -1819,13 +1839,6 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
 	if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
 		return 1;
 
-	/*
-	 * Don't let big-order allocations loop unless the caller
-	 * explicitly requests that.
-	 */
-	if (gfp_mask & __GFP_NOFAIL)
-		return 1;
-
 	return 0;
 }
 
@@ -2218,7 +2231,8 @@ rebalance:
 
 	/* Check if we should retry the allocation */
 	pages_reclaimed += did_some_progress;
-	if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
+	if (should_alloc_retry(gfp_mask, order, did_some_progress,
+						pages_reclaimed)) {
 		/* Wait for some write requests to complete then retry */
 		wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
 		goto rebalance;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index b1cd120..9520592 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -667,10 +667,10 @@ int try_to_free_swap(struct page *page)
 	 * original page might be freed under memory pressure, then
 	 * later read back in from swap, now with the wrong data.
 	 *
-	 * Hibernation clears bits from gfp_allowed_mask to prevent
-	 * memory reclaim from writing to disk, so check that here.
+	 * Hibration suspends storage while it is writing the image
+	 * to disk so check that here.
 	 */
-	if (!(gfp_allowed_mask & __GFP_IO))
+	if (pm_suspended_storage())
 		return 0;
 
 	delete_from_swap_cache(page);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-11-16  9:52   ` Mel Gorman
@ 2011-11-16 21:39     ` David Rientjes
  0 siblings, 0 replies; 47+ messages in thread
From: David Rientjes @ 2011-11-16 21:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Colin Cross, Pekka Enberg, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, LKML, Linux-MM

On Wed, 16 Nov 2011, Mel Gorman wrote:

> Good point. I agree that it would be more consistent although
> there is still the risk of infinite looping with __GFP_NOFAIL if
> storage devices are disabled.
> 

Yeah, that's always been possible even regardless of the state of storage 
devices.  If a task has access to memory reserves via TIF_MEMDIE, 
__alloc_pages_high_priority() will just loop indefinitely anyway for these 
allocations.  While users of __GFP_NOFAIL accept that it won't return NULL 
as long as they have __GFP_WAIT (which they all do), then they should also 
accept the fact that it may never return at all.

> Colin reported elsewhere in this thread that "the particular allocation
> that usually causes the problem is the pgd_alloc for page tables when
> re-enabling the 2nd cpu during resume". On X86, those allocations are using
> the flags
> 
> GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO
> 
> so they should not be trapped in an infinite loop due to __GFP_NOFAIL.
> On ARM, they use GFP_KERNEL so should also be ok.
> 

The __GFP_REPEAT is concerning because there's a high liklihood that 
!__GFP_FS as a result of suspend will never cause enough pages to be 
reclaimed so the necessary threshold will be reached to exit from its own 
self-induced infinite loop.  So if we go forward with failing allocations 
attempted without __GFP_IO and __GFP_FS that are !__GFP_NOFAIL, then we 
should also add that __GFP_REPEAT is a no-op without __GFP_IO or __GFP_FS.

> David, is this what you meant? This patch includes all the
> documentation-related updates that were discussed in this thread as well
> as updated the check in mm/swapfile.c for hibernation.
> 
> ==== CUT HERE ====
> mm: avoid livelock on !__GFP_FS allocations v2
> 
> Changelog since V1
>   o Move PM check to should_alloc_retry (David Rientjes)
>   o Add some additional documentation
> 
> Colin Cross reported;
> 
>   Under the following conditions, __alloc_pages_slowpath can loop forever:
>   gfp_mask & __GFP_WAIT is true
>   gfp_mask & __GFP_FS is false
>   reclaim and compaction make no progress
>   order <= PAGE_ALLOC_COSTLY_ORDER
> 
>   These conditions happen very often during suspend and resume,
>   when pm_restrict_gfp_mask() effectively converts all GFP_KERNEL
>   allocations into __GFP_WAIT.
> 
>   The oom killer is not run because gfp_mask & __GFP_FS is false,
>   but should_alloc_retry will always return true when order is less
>   than PAGE_ALLOC_COSTLY_ORDER.
> 
> In his fix, he avoided retrying the allocation if reclaim made no
> progress and __GFP_FS was not set. The problem is that this would
> result in GFP_NOIO allocations failing that previously succeeded
> which would be very unfortunate.
> 
> The big difference between GFP_NOIO and suspend converting GFP_KERNEL
> to behave like GFP_NOIO is that normally flushers will be cleaning
> pages and kswapd reclaims pages allowing GFP_NOIO to succeed after
> a short delay. The same does not necessarily apply during suspend as
> the storage device may be suspended.
> 
> This patch special cases the suspend case to fail the page allocation
> if reclaim cannot make progress and adds some documentation on how
> gfp_allowed_mask is currently used. Failing allocations like this
> may cause suspend to abort but that is better than a livelock.
> 
> [mgorman@suse.de: Rework fix to be suspend specific]
> [rientjes@google.com: Move suspended device check to should_alloc_retry]
> Reported-by: Colin Cross <ccross@android.com>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: David Rientjes <rientjes@google.com>

Thanks Mel!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-11-16  7:10           ` Pekka Enberg
@ 2011-11-16 21:44             ` David Rientjes
  2011-11-16 21:58               ` Rafael J. Wysocki
  2011-11-16 22:07               ` Minchan Kim
  0 siblings, 2 replies; 47+ messages in thread
From: David Rientjes @ 2011-11-16 21:44 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Minchan Kim, Colin Cross, Mel Gorman, Andrew Morton, Pekka Enberg,
	KAMEZAWA Hiroyuki, Andrea Arcangeli, LKML, Linux-MM,
	Rafael J. Wysocki

On Wed, 16 Nov 2011, Pekka Enberg wrote:

> > diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
> > index fdd4263..01aa9b5 100644
> > --- a/kernel/power/suspend.c
> > +++ b/kernel/power/suspend.c
> > @@ -297,9 +297,11 @@ int enter_state(suspend_state_t state)
> >                goto Finish;
> > 
> >        pr_debug("PM: Entering %s sleep\n", pm_states[state]);
> > +       oom_killer_disable();
> >        pm_restrict_gfp_mask();
> >        error = suspend_devices_and_enter(state);
> >        pm_restore_gfp_mask();
> > +       oom_killer_enable();
> > 
> >  Finish:
> >        pr_debug("PM: Finishing wakeup.\n");
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 6e8ecb6..d8c31b7 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2177,9 +2177,9 @@ rebalance:
> >         * running out of options and have to consider going OOM
> >         */
> >        if (!did_some_progress) {
> > -               if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> > -                       if (oom_killer_disabled)
> > +               if (oom_killer_disabled)
> >                                goto nopage;

You're allowing __GFP_NOFAIL allocations to fail.

> > +               if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> >                        page = __alloc_pages_may_oom(gfp_mask, order,
> >                                        zonelist, high_zoneidx,
> >                                        nodemask, preferred_zone,
> > 
> 
> I'd prefer something like this. The whole 'gfp_allowed_flags' thing was
> designed to make GFP_KERNEL work during boot time where it's obviously safe to
> do that. I really don't think that's going to work suspend cleanly.
> 

Adding Rafael to the cc.

This has been done since 2.6.34 and presumably has been working quite 
well.  I don't have a specific objection to gfp_allowed_flags to be used 
outside of boot since it seems plausible that there are system-level 
contexts that would need different behavior in the page allocator and this 
does it effectively without major surgery or a slower fastpath.  Suspend 
is using it just like boot does before irqs are enabled, so I don't have 
an objection to it.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-11-16 21:44             ` David Rientjes
@ 2011-11-16 21:58               ` Rafael J. Wysocki
  2011-11-16 22:07               ` Minchan Kim
  1 sibling, 0 replies; 47+ messages in thread
From: Rafael J. Wysocki @ 2011-11-16 21:58 UTC (permalink / raw)
  To: David Rientjes
  Cc: Pekka Enberg, Minchan Kim, Colin Cross, Mel Gorman, Andrew Morton,
	Pekka Enberg, KAMEZAWA Hiroyuki, Andrea Arcangeli, LKML, Linux-MM

On Wednesday, November 16, 2011, David Rientjes wrote:
> On Wed, 16 Nov 2011, Pekka Enberg wrote:
> 
> > > diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
> > > index fdd4263..01aa9b5 100644
> > > --- a/kernel/power/suspend.c
> > > +++ b/kernel/power/suspend.c
> > > @@ -297,9 +297,11 @@ int enter_state(suspend_state_t state)
> > >                goto Finish;
> > > 
> > >        pr_debug("PM: Entering %s sleep\n", pm_states[state]);
> > > +       oom_killer_disable();
> > >        pm_restrict_gfp_mask();
> > >        error = suspend_devices_and_enter(state);
> > >        pm_restore_gfp_mask();
> > > +       oom_killer_enable();
> > > 
> > >  Finish:
> > >        pr_debug("PM: Finishing wakeup.\n");
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 6e8ecb6..d8c31b7 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -2177,9 +2177,9 @@ rebalance:
> > >         * running out of options and have to consider going OOM
> > >         */
> > >        if (!did_some_progress) {
> > > -               if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> > > -                       if (oom_killer_disabled)
> > > +               if (oom_killer_disabled)
> > >                                goto nopage;
> 
> You're allowing __GFP_NOFAIL allocations to fail.
> 
> > > +               if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> > >                        page = __alloc_pages_may_oom(gfp_mask, order,
> > >                                        zonelist, high_zoneidx,
> > >                                        nodemask, preferred_zone,
> > > 
> > 
> > I'd prefer something like this. The whole 'gfp_allowed_flags' thing was
> > designed to make GFP_KERNEL work during boot time where it's obviously safe to
> > do that. I really don't think that's going to work suspend cleanly.
> > 
> 
> Adding Rafael to the cc.
> 
> This has been done since 2.6.34 and presumably has been working quite 
> well.

Yes, it has.

> I don't have a specific objection to gfp_allowed_flags to be used 
> outside of boot since it seems plausible that there are system-level 
> contexts that would need different behavior in the page allocator and this 
> does it effectively without major surgery or a slower fastpath.  Suspend 
> is using it just like boot does before irqs are enabled, so I don't have 
> an objection to it.

Good. :-)

Thanks,
Rafael

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-11-16 21:44             ` David Rientjes
  2011-11-16 21:58               ` Rafael J. Wysocki
@ 2011-11-16 22:07               ` Minchan Kim
  2011-11-16 22:48                 ` David Rientjes
  1 sibling, 1 reply; 47+ messages in thread
From: Minchan Kim @ 2011-11-16 22:07 UTC (permalink / raw)
  To: David Rientjes
  Cc: Pekka Enberg, Minchan Kim, Colin Cross, Mel Gorman, Andrew Morton,
	Pekka Enberg, KAMEZAWA Hiroyuki, Andrea Arcangeli, LKML, Linux-MM,
	Rafael J. Wysocki

On Thu, Nov 17, 2011 at 6:44 AM, David Rientjes <rientjes@google.com> wrote:
> On Wed, 16 Nov 2011, Pekka Enberg wrote:
>
>> > diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
>> > index fdd4263..01aa9b5 100644
>> > --- a/kernel/power/suspend.c
>> > +++ b/kernel/power/suspend.c
>> > @@ -297,9 +297,11 @@ int enter_state(suspend_state_t state)
>> >                goto Finish;
>> >
>> >        pr_debug("PM: Entering %s sleep\n", pm_states[state]);
>> > +       oom_killer_disable();
>> >        pm_restrict_gfp_mask();
>> >        error = suspend_devices_and_enter(state);
>> >        pm_restore_gfp_mask();
>> > +       oom_killer_enable();
>> >
>> >  Finish:
>> >        pr_debug("PM: Finishing wakeup.\n");
>> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> > index 6e8ecb6..d8c31b7 100644
>> > --- a/mm/page_alloc.c
>> > +++ b/mm/page_alloc.c
>> > @@ -2177,9 +2177,9 @@ rebalance:
>> >         * running out of options and have to consider going OOM
>> >         */
>> >        if (!did_some_progress) {
>> > -               if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
>> > -                       if (oom_killer_disabled)
>> > +               if (oom_killer_disabled)
>> >                                goto nopage;
>
> You're allowing __GFP_NOFAIL allocations to fail.
>
>> > +               if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
>> >                        page = __alloc_pages_may_oom(gfp_mask, order,
>> >                                        zonelist, high_zoneidx,
>> >                                        nodemask, preferred_zone,
>> >
>>
>> I'd prefer something like this. The whole 'gfp_allowed_flags' thing was
>> designed to make GFP_KERNEL work during boot time where it's obviously safe to
>> do that. I really don't think that's going to work suspend cleanly.
>>
>
> Adding Rafael to the cc.
>
> This has been done since 2.6.34 and presumably has been working quite
> well.  I don't have a specific objection to gfp_allowed_flags to be used
> outside of boot since it seems plausible that there are system-level
> contexts that would need different behavior in the page allocator and this
> does it effectively without major surgery or a slower fastpath.  Suspend
> is using it just like boot does before irqs are enabled, so I don't have
> an objection to it.
>

My point isn't using gfp_allowed_flags(maybe it's Pekka's concern) but
why adding new special case handling code like pm_suspended_storage.
I think we can handle the issue with oom_killer_disabled(but the naming is bad)

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations
  2011-11-16 22:07               ` Minchan Kim
@ 2011-11-16 22:48                 ` David Rientjes
  0 siblings, 0 replies; 47+ messages in thread
From: David Rientjes @ 2011-11-16 22:48 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Pekka Enberg, Minchan Kim, Colin Cross, Mel Gorman, Andrew Morton,
	Pekka Enberg, KAMEZAWA Hiroyuki, Andrea Arcangeli, LKML, Linux-MM,
	Rafael J. Wysocki

[-- Attachment #1: Type: TEXT/PLAIN, Size: 4065 bytes --]

On Thu, 17 Nov 2011, Minchan Kim wrote:

> >> > diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
> >> > index fdd4263..01aa9b5 100644
> >> > --- a/kernel/power/suspend.c
> >> > +++ b/kernel/power/suspend.c
> >> > @@ -297,9 +297,11 @@ int enter_state(suspend_state_t state)
> >> > A  A  A  A  A  A  A  A goto Finish;
> >> >
> >> > A  A  A  A pr_debug("PM: Entering %s sleep\n", pm_states[state]);
> >> > + A  A  A  oom_killer_disable();
> >> > A  A  A  A pm_restrict_gfp_mask();
> >> > A  A  A  A error = suspend_devices_and_enter(state);
> >> > A  A  A  A pm_restore_gfp_mask();
> >> > + A  A  A  oom_killer_enable();
> >> >
> >> > A Finish:
> >> > A  A  A  A pr_debug("PM: Finishing wakeup.\n");
> >> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> > index 6e8ecb6..d8c31b7 100644
> >> > --- a/mm/page_alloc.c
> >> > +++ b/mm/page_alloc.c
> >> > @@ -2177,9 +2177,9 @@ rebalance:
> >> > A  A  A  A  * running out of options and have to consider going OOM
> >> > A  A  A  A  */
> >> > A  A  A  A if (!did_some_progress) {
> >> > - A  A  A  A  A  A  A  if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> >> > - A  A  A  A  A  A  A  A  A  A  A  if (oom_killer_disabled)
> >> > + A  A  A  A  A  A  A  if (oom_killer_disabled)
> >> > A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A goto nopage;
> >
> > You're allowing __GFP_NOFAIL allocations to fail.
> >
> >> > + A  A  A  A  A  A  A  if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> >> > A  A  A  A  A  A  A  A  A  A  A  A page = __alloc_pages_may_oom(gfp_mask, order,
> >> > A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A zonelist, high_zoneidx,
> >> > A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A nodemask, preferred_zone,
> >> >
> >>
> >> I'd prefer something like this. The whole 'gfp_allowed_flags' thing was
> >> designed to make GFP_KERNEL work during boot time where it's obviously safe to
> >> do that. I really don't think that's going to work suspend cleanly.
> >>
> >
> > Adding Rafael to the cc.
> >
> > This has been done since 2.6.34 and presumably has been working quite
> > well. A I don't have a specific objection to gfp_allowed_flags to be used
> > outside of boot since it seems plausible that there are system-level
> > contexts that would need different behavior in the page allocator and this
> > does it effectively without major surgery or a slower fastpath. A Suspend
> > is using it just like boot does before irqs are enabled, so I don't have
> > an objection to it.
> >
> 
> My point isn't using gfp_allowed_flags(maybe it's Pekka's concern) but
> why adding new special case handling code like pm_suspended_storage.
> I think we can handle the issue with oom_killer_disabled(but the naming is bad)
> 

Ignore the name of the function that Mel is introducing, it's only related 
to suspend because that's the only thing that (currently) alters 
gfp_allowed_mask after boot.  If something else were to clear __GFP_FS and 
__GFP_IO in the future, we'd simply need to rename the function.  I was 
going to ask for a comment specifically about that, but I think it's 
proximity to the function that allows gfp_allowed_mask to be altered is 
sufficient.

We'd really like to avoid the loop if __GFP_FS and __GFP_IO are not set.  
If oom_killer_disabled is set anytime that gfp_allowed_mask does not allow 
them for _all_ page allocations, then we could certainly replace the new 
pm_suspended_storage() check in should_alloc_retry() with a check on 
oom_killer_disabled.

I'll let you and Rafael work out whether that can be done or not, I just 
see pm_restrict_gfp_mask() being called in kernel/power/hibernate.c and 
kernel/power/suspend.c whereas oom_killer_disable() is only called in 
kernel/power/process.c.  oom_killer_disable() would be unnecessary if 
__GFP_FS were already cleared, so it would require some changes in the 
suspend code.

I like Mel's patch because it's easily maintainable and only depends on 
the state of reclaim and the gfp flags being passed in, which is the 
direct cause of the infinite loop.

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2011-11-16 22:48 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-11-14 14:04 [PATCH] mm: avoid livelock on !__GFP_FS allocations Mel Gorman
2011-11-14 18:38 ` Andrea Arcangeli
2011-11-15 10:30   ` Mel Gorman
2011-11-14 23:03 ` Andrew Morton
2011-11-15 10:42   ` Mel Gorman
2011-11-15 15:43     ` Mel Gorman
2011-11-15 16:13 ` Minchan Kim
2011-11-15 17:36   ` Mel Gorman
2011-11-16  0:22     ` Minchan Kim
2011-11-16  0:28       ` Colin Cross
2011-11-16  0:45         ` Minchan Kim
2011-11-16  7:10           ` Pekka Enberg
2011-11-16 21:44             ` David Rientjes
2011-11-16 21:58               ` Rafael J. Wysocki
2011-11-16 22:07               ` Minchan Kim
2011-11-16 22:48                 ` David Rientjes
2011-11-15 21:40 ` David Rientjes
2011-11-16  9:52   ` Mel Gorman
2011-11-16 21:39     ` David Rientjes
  -- strict thread matches above, loose matches on Subject: below --
2011-10-25  6:39 Colin Cross
2011-10-25  7:40 ` Pekka Enberg
2011-10-25  7:51   ` Colin Cross
2011-10-25  8:08     ` Pekka Enberg
2011-10-25 22:12     ` David Rientjes
2011-10-25  9:09 ` Mel Gorman
2011-10-25  9:26   ` Colin Cross
2011-10-25 11:23     ` Mel Gorman
2011-10-25 17:08       ` Colin Cross
2011-11-01 12:28         ` Mel Gorman
2011-10-25 19:39       ` Pekka Enberg
2011-11-01 12:29         ` Mel Gorman
2011-10-25 19:29   ` Colin Cross
2011-10-25 22:18   ` David Rientjes
2011-10-26  1:46     ` Colin Cross
2011-10-26  5:47       ` David Rientjes
2011-10-26  6:12         ` David Rientjes
2011-10-26  6:16           ` Colin Cross
2011-10-26  6:24             ` David Rientjes
2011-10-26  6:26               ` Colin Cross
2011-10-26  6:33                 ` David Rientjes
2011-10-26  6:36                   ` Colin Cross
2011-10-26  6:51                     ` David Rientjes
2011-10-26  6:57                       ` Colin Cross
2011-10-26  7:10                         ` David Rientjes
2011-10-26  7:22                           ` Colin Cross
2011-11-01 12:36                             ` Mel Gorman
2011-10-25 22:10 ` David Rientjes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).