* Exhausting memory makes the system unresponsive but doesn't invoke OOM killer
@ 2015-12-23 14:31 Marcin Szewczyk
2015-12-23 16:32 ` Johannes Weiner
0 siblings, 1 reply; 3+ messages in thread
From: Marcin Szewczyk @ 2015-12-23 14:31 UTC (permalink / raw)
To: linux-mm
Hi,
In 2010 I noticed that viewing many GIFs in a row using gpicview renders
my Linux unresponsive. The problem still exists. There is very little
I can do in such a situation. Rarely after some minutes the OOM killer
kicks in and saves the day. Nevertheless, usually I end up using
Alt+SysRq+B.
What happens is that gpicview exhausts all available memory in such
a pattern that userspace becomes unresponsive. My application
(`crash.c`) allocates memory in a very similar way using GDK to
replicate the problem.
I keep the updated description of the problem and the source code here:
https://github.com/wodny/crasher
I originally posted to linux-kernel:
http://marc.info/?t=145070009500007&r=1&w=2
but got no response.
I'm using:
3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u6 (2015-11-09) x86_64 GNU/Linux
## Symptoms
The unresponsiveness goes with high CPU load and a lot of IO (read)
operations on the root file system and its block device.
If I start the application from a text terminal (TTY) I can switch
between them, but I gain nothing because shells in other terminals are
unresponsive. Additionally I cannot perform any new logins (`fork`
fails). If I stay at the same terminal I can kill the process almost
immediately using a keystroke (e.g. ^C or ^\). So apparently the kernel
doesn't go into a deadlock.
When running the application under Xorg I cannot switch from X to a text
terminal. This is probably because Xorg uses VT_PROCESS to control
terminal switching: the system is so busy that Xorg doesn't get
scheduled, so it has no chance to acknowledge the switch request.
Using SysRq+Alt+R doesn't help.
## OOM killer not triggered
At first I thought that the OOM killer simply needed a long time to find
and kill the process. But further experiments using just text terminals
showed that the real problem is that the kernel doesn't notice it should
use the OOM killer to kill the naughty application. The experiment:
1. switch to a text terminal, e.g. TTY2,
2. run the application (`make test`) and stay at the TTY,
3. wait until the system becomes unresponsive,
4. wait a lot...
5. either the kernel finally starts suspecting something and the OOM
   killer terminates the application or you just press ^C or ^\ and
   the system comes back almost immediately.
Killing the application with a signal sent via the TTY leaves no trace
in dmesg that anything bad happened. The only symptom is that for
a moment the system behaves as it does after dropping caches.
## IO activity
`top` (or `htop`) and `iostat` are very useful in approximating the time
left to the magic moment.
I suppose that in such a situation the OS starts to oscillate between
freeing memory, cleaning caches and buffers, and loading some new data
(see `iostat` logs).
I can observe the most impressive effects on my physical machine
(`logs/ph-*`). On a VM (`logs/vm-*`) usually the OOM killer kills the
process after a short time (5-120 seconds).
Logs from the VM have been gathered by piping `top` and `iostat` output
to `netcat`. Logs from the physical machine have been gathered using
primitive scripts and an Android phone connected over wifi.
Notice how the hang prevents scripts from reporting at stable 2-second
intervals.
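One way to quantify that irregularity is to compare the actual gap between two consecutive samples with the intended 2-second interval. The following is only a sketch of that arithmetic; the function names `elapsed_ms` and `sample_delay_ms` are hypothetical and are not taken from the scripts in the repository.

```c
#include <assert.h>
#include <time.h>

/* Milliseconds elapsed between two timestamps (end is assumed to be
 * later than start, e.g. both taken from a monotonic clock). */
long elapsed_ms(const struct timespec *start, const struct timespec *end)
{
    assert(start != NULL && end != NULL);
    return (long)(end->tv_sec - start->tv_sec) * 1000L
         + (end->tv_nsec - start->tv_nsec) / 1000000L;
}

/* How late a sample arrived relative to the expected interval.
 * Positive values mean the logger was delayed (hypothetical helper). */
long sample_delay_ms(const struct timespec *prev,
                     const struct timespec *now,
                     long expected_interval_ms)
{
    return elapsed_ms(prev, now) - expected_interval_ms;
}
```

A logging loop could take a timestamp with `clock_gettime(CLOCK_MONOTONIC, ...)` before each report and print `sample_delay_ms(&prev, &now, 2000)` next to the `top` or `iostat` line, so that stalls show up as large positive delays.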
## Factors
Possible factors differentiating cases of recovering in seconds from
recoveries after minutes (or never):
- another memory-consuming process running (e.g. Firefox),
- physical machine vs a VM (see dmesg logs),
- chipset and associated kernel functions (see dmesg logs) but see the
remark on the `i915` module,
- I'm ncursed.
Things that seem irrelevant (after testing):
- running the application in Xorg or a TTY,
- LUKS encryption of the root filesystem,
- `vm.oom_kill_allocating_task` setting,
- increasing `vm.admin_reserve_kbytes`,
- using swap space,
- disabling the `i915` module: with it disabled there are no i915-specific
  functions in dmesg traces, but the system still blocks the same way,
- running the application with `nice`.
Any suggestions?
--
Marcin Szewczyk http://wodny.org
mailto:Marcin.Szewczyk@wodny.borg <- remove b / usuń b
xmpp:wodny@ubuntu.pl xmpp:wodny@jabster.pl
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: Exhausting memory makes the system unresponsive but doesn't invoke OOM killer
2015-12-23 14:31 Exhausting memory makes the system unresponsive but doesn't invoke OOM killer Marcin Szewczyk
@ 2015-12-23 16:32 ` Johannes Weiner
2015-12-23 20:31 ` Marcin Szewczyk
0 siblings, 1 reply; 3+ messages in thread
From: Johannes Weiner @ 2015-12-23 16:32 UTC (permalink / raw)
To: Marcin Szewczyk; +Cc: linux-mm
Hi Marcin,
On Wed, Dec 23, 2015 at 03:31:09PM +0100, Marcin Szewczyk wrote:
> Hi,
>
> In 2010 I noticed that viewing many GIFs in a row using gpicview renders
> my Linux unresponsive. The problem still exists. There is very little
> I can do in such a situation. Rarely after some minutes the OOM killer
> kicks in and saves the day. Nevertheless, usually I end up using
> Alt+SysRq+B.
Have you tried kicking the OOM killer manually with sysrq+f?
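The same request can also be made from software by writing the command character to `/proc/sysrq-trigger` as root. A minimal sketch, assuming a process that is already running when the machine starts thrashing (starting a new one may not be possible in that state); the function name `sysrq_trigger` is hypothetical:

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Write a single SysRq command character to a trigger file.
 * Returns 0 on success, -1 on failure (errno is set by open/write). */
int sysrq_trigger(const char *path, char command)
{
    int fd;
    ssize_t written;

    assert(path != NULL);
    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    written = write(fd, &command, 1);
    close(fd);
    return written == 1 ? 0 : -1;
}
```

Calling `sysrq_trigger("/proc/sysrq-trigger", 'f')` with root privileges asks the kernel to run the OOM killer once, like the keyboard combination does.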
> What happens is that gpicview exhausts all available memory in such
> a pattern that userspace becomes unresponsive. My application
> (`crash.c`) allocates memory in a very similar way using GDK to
> replicate the problem.
>
> I keep the updated description of the problem and the source code here:
> https://github.com/wodny/crasher
>
> I originally posted to linux-kernel:
> http://marc.info/?t=145070009500007&r=1&w=2
> but got no response.
>
> I'm using:
> 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u6 (2015-11-09) x86_64 GNU/Linux
>
> ## Symptoms
>
> The unresponsiveness goes with high CPU load and a lot of IO (read)
> operations on the root file system and its block device.
There is a semi-known issue of heavily thrashing page cache. Your
crash program sucks up most memory and leaves very little for the
executables and libraries to be cached, which results in multiple
threads experiencing cache misses in their executable code, followed
by fighting over the few remaining page cache slots, which are not
enough to meet the demand at any given point in time. These threads
then end up spending a lot of time a) searching for reusable cache
slots and taking away slots that were only recently populated by
another thread, possibly before that other thread has returned to
userspace, and then b) waiting for disk to repopulate the cache slot
which will be stolen by another thread soon, possibly before this
thread had a chance to return to userspace as well.
That being said, there is no real solution to thrashing page cache as
of today. We have most infrastructure in place to detect it, but it
isn't hooked up to the OOM killer yet. The only answer until then is
to try to keep free+buffer+cache at no less than 10-15% of overall memory.
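As a rough illustration of that check, the percentage can be computed from `/proc/meminfo`. This is only a sketch: mapping "free+buffer+cache" to the `MemFree`, `Buffers` and `Cached` fields is an assumption, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Look up one "Key:   value kB" field in /proc/meminfo-style text.
 * Returns the value in kB, or -1 if the key is not present. */
long meminfo_field_kb(const char *text, const char *key)
{
    size_t key_len;
    const char *line = text;

    assert(text != NULL && key != NULL);
    key_len = strlen(key);
    while (line && *line) {
        /* Match only at the start of a line, so "Cached" does not
         * accidentally match "SwapCached". */
        if (strncmp(line, key, key_len) == 0 && line[key_len] == ':') {
            long value;
            if (sscanf(line + key_len + 1, "%ld", &value) == 1)
                return value;
            return -1;
        }
        line = strchr(line, '\n');
        if (line)
            line++;            /* step past the newline */
    }
    return -1;
}

/* Percentage of MemTotal that is free or held in buffers/page cache.
 * Returns -1.0 if any required field is missing. */
double reclaimable_percent(const char *text)
{
    long total = meminfo_field_kb(text, "MemTotal");
    long free_kb = meminfo_field_kb(text, "MemFree");
    long buffers = meminfo_field_kb(text, "Buffers");
    long cached = meminfo_field_kb(text, "Cached");

    if (total <= 0 || free_kb < 0 || buffers < 0 || cached < 0)
        return -1.0;
    return 100.0 * (double)(free_kb + buffers + cached) / (double)total;
}

/* Convenience wrapper: read the file (normally /proc/meminfo) and
 * compute the percentage. Returns -1.0 if the file cannot be read. */
double reclaimable_percent_from_file(const char *path)
{
    char buf[8192];
    size_t n;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1.0;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return reclaimable_percent(buf);
}
```

A watchdog that is already running could poll `reclaimable_percent_from_file("/proc/meminfo")` and react when the result drops below the 10-15% range.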
Since you can reproduce it easily, is there any chance you could grab
backtraces (sysrq+t) of the tasks while the machine is in that state?
That should confirm that most tasks are either waiting for IO or are
inside page reclaim.
* Re: Exhausting memory makes the system unresponsive but doesn't invoke OOM killer
2015-12-23 16:32 ` Johannes Weiner
@ 2015-12-23 20:31 ` Marcin Szewczyk
0 siblings, 0 replies; 3+ messages in thread
From: Marcin Szewczyk @ 2015-12-23 20:31 UTC (permalink / raw)
To: Johannes Weiner; +Cc: linux-mm
On Wed, Dec 23, 2015 at 11:32:21AM -0500, Johannes Weiner wrote:
> Hi Marcin,
Hi,
> On Wed, Dec 23, 2015 at 03:31:09PM +0100, Marcin Szewczyk wrote:
> > In 2010 I noticed that viewing many GIFs in a row using gpicview renders
> > my Linux unresponsive. The problem still exists. There is very little
> > I can do in such a situation. Rarely after some minutes the OOM killer
> > kicks in and saves the day. Nevertheless, usually I end up using
> > Alt+SysRq+B.
>
> Have you tried kicking the OOM killer manually with sysrq+f?
I completely forgot about that option. It works both at TTY and under
Xorg. Thank you very much.
> > The unresponsiveness goes with high CPU load and a lot of IO (read)
> > operations on the root file system and its block device.
>
> There is a semi-known issue of heavily thrashing page cache. Your
> crash program sucks up most memory and leaves very little for the
> executables and libraries to be cached, which results in multiple
> threads experiencing cache misses in their executable code, followed
> by fighting over the few remaining page cache slots, which are not
> enough to meet the demand at any given point in time. [...]
Thank you for the explanation.
> That being said, there is no real solution to thrashing page cache as
> of today. We have most infrastructure in place to detect it, but it
> isn't hooked up to the OOM killer yet. The only answer until then is
> to try to keep free+buffer+cache at no less than 10-15% of overall memory.
OK. Is there a good source of information I could subscribe to so I
don't miss the moment when the integration code enters the kernel? Do
you think LWN would mention it or should I just follow "oom" messages on
linux-kernel and linux-mm?
> Since you can reproduce it easily, is there any chance you could grab
> backtraces (sysrq+t) of the tasks while the machine is in that state?
> That should confirm that most tasks are either waiting for IO or are
> inside page reclaim.
I've updated the repository. I will later add this thread to the README.
Dump is available here:
https://github.com/wodny/crasher/blob/master/logs/kern.log
I didn't want to post 200kB to everybody so I didn't attach it to this
email.
--
Marcin Szewczyk http://wodny.org
mailto:Marcin.Szewczyk@wodny.borg <- remove b / usuń b
xmpp:wodny@ubuntu.pl xmpp:wodny@jabster.pl