Date: Sun, 14 Feb 2016 17:02:38 -0800
From: "Paul E. McKenney"
Reply-To: paulmck@linux.vnet.ibm.com
To: Jeff Layton
Cc: Eryu Guan, linux-fsdevel@vger.kernel.org, jack@suse.com,
	akpm@linux-foundation.org, torvalds@linux-foundation.org
Subject: Re: [BUG] inotify_add_watch/inotify_rm_watch loops trigger oom
Message-ID: <20160215010238.GT6719@linux.vnet.ibm.com>
References: <20160214083543.GC29898@eguan.usersys.redhat.com> <20160214093931.40da9afa@tlielax.poochiereds.net>
In-Reply-To: <20160214093931.40da9afa@tlielax.poochiereds.net>

On Sun, Feb 14, 2016 at 09:39:31AM -0500, Jeff Layton wrote:
> On Sun, 14 Feb 2016 16:35:43 +0800
> Eryu Guan wrote:
> 
> > Hi,
> > 
> > Starting from v4.5-rc1, running inotify_add_watch/inotify_rm_watch in
> > a loop can trigger OOM and the system becomes unusable.  The v4.4
> > kernel is fine with the same stress test.
> > 
> > Reverting c510eff6beba ("fsnotify: destroy marks with call_srcu instead
> > of dedicated thread") on top of v4.5-rc3 passed the same test, so it
> > seems that this patch introduced some kind of memory leak?
> > 
> > On v4.5-rc[1-3] the test program triggers OOM within 10 minutes on my
> > test VM with 8G of memory.  After reverting the commit in question,
> > the same VM survived more than 1 hour of the stress test.
> > 
> > ./inotify
> > 
> > I attached the test program and the OOM console log.  If more
> > information is needed, please let me know.
> > 
> > Thanks,
> > Eryu
> 
> Thanks Eryu, I think I see what the problem is.  This reproducer is
> creating and deleting marks very rapidly.  But the SRCU code has this:
> 
> #define SRCU_CALLBACK_BATCH	10
> #define SRCU_INTERVAL		1
> 
> So, process_srcu will only process 10 entries at a time, and only once
> per jiffy.  The upshot there is that the reproducer can create entries
> _much_ faster than they can be cleaned up now that we're using
> call_srcu in this codepath.  If you kill the program before the OOM
> killer kicks in, they all eventually get cleaned up, but it does take
> a while (minutes).
> 
> I clearly didn't educate myself enough as to the limitations of
> call_srcu before converting this code over to use it (and I missed
> Paul's subtle hints in that regard).  We may need to revert that patch
> before v4.5 ships, but I'd like to ponder it for a few days and see
> whether there is some way to batch them up so that they get reaped more
> efficiently without requiring the dedicated thread.
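
For reference, the reproducer presumably boils down to something like the
rough sketch below.  This is reconstructed from the description above, not
Eryu's attached program, and the watched path is just an example:

/*
 * Rough sketch of the add/rm loop described above -- not the attached
 * test program.  Each add/rm pair queues a mark destruction through
 * call_srcu() with the commit in question applied.
 */
#include <stdio.h>
#include <sys/inotify.h>

int main(void)
{
	int fd = inotify_init();

	if (fd < 0) {
		perror("inotify_init");
		return 1;
	}

	for (;;) {
		/* "/tmp" is only an example target. */
		int wd = inotify_add_watch(fd, "/tmp", IN_ALL_EVENTS);

		if (wd < 0) {
			perror("inotify_add_watch");
			break;
		}
		inotify_rm_watch(fd, wd);
	}
	return 0;
}

At 10 callbacks per SRCU_INTERVAL jiffy, the callbacks get retired at
something like 1,000-10,000 per second depending on HZ, and a tight
syscall loop like the one above can queue marks far faster than that,
so the backlog just grows until the OOM killer fires.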
One thought would be to add an "emergency mode" to SRCU similar to that
already in RCU.  Something to the effect that if the current list of
callbacks is going to take more than a second to drain at the configured
per-jiffy rate, just process them without waiting.  Would that help in
this case, or am I missing something about the reproducer?  Untested
sketch of the idea below the sig.

							Thanx, Paul
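
------------------------------------------------------------------------

This is a userspace toy model of the check, just to make the rate
arithmetic concrete: srcu_backlog_is_emergency() is a made-up name and
HZ=250 is only an example, so don't mistake it for actual SRCU code.

#include <stdbool.h>
#include <stdio.h>

#define SRCU_CALLBACK_BATCH	10	/* callbacks retired per pass */
#define SRCU_INTERVAL		1	/* jiffies between passes */
#define HZ			250	/* example tick rate, assumption */

/* Hypothetical check: would the backlog take more than a second to drain? */
static bool srcu_backlog_is_emergency(unsigned long pending)
{
	unsigned long drained_per_second =
		(HZ / SRCU_INTERVAL) * SRCU_CALLBACK_BATCH;

	return pending > drained_per_second;
}

int main(void)
{
	unsigned long pending;

	for (pending = 1000; pending <= 1000000; pending *= 10)
		printf("%8lu pending -> %s\n", pending,
		       srcu_backlog_is_emergency(pending) ?
		       "drain without waiting" : "normal batched processing");
	return 0;
}

In the real thing the check would presumably live in process_srcu() and
simply skip the per-batch delay, but that part is hand-waving at this
point.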