All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jens Axboe <jens.axboe@oracle.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	penberg@cs.helsinki.fi, arjan@infradead.org,
	linux-kernel@vger.kernel.org, cl@linux-foundation.org,
	npiggin@suse.de
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist
Date: Thu, 25 Jun 2009 22:08:33 +0200	[thread overview]
Message-ID: <20090625200833.GD31415@kernel.dk> (raw)
In-Reply-To: <20090625195538.GC31415@kernel.dk>

On Thu, Jun 25 2009, Jens Axboe wrote:
> On Wed, Jun 24 2009, Andrew Morton wrote:
> > On Wed, 24 Jun 2009 13:40:11 -0700 (PDT)
> > Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > 
> > > 
> > > 
> > > On Wed, 24 Jun 2009, Linus Torvalds wrote:
> > > > On Wed, 24 Jun 2009, Andrew Morton wrote:
> > > > > 
> > > > > If the caller gets oom-killed, the allocation attempt fails.  Callers need
> > > > > to handle that.
> > > > 
> > > > I actually disagree. I think we should just admit that we can always free 
> > > > up enough space to get a few pages, in order to then oom-kill things.
> > > 
> > > Btw, if you want to change the WARN_ON() to warn when you're in the 
> > > "allocate in order to free memory" recursive case, then I'd have no issues 
> > > with that.
> > > 
> > > In fact, in that case it probably shouldn't even be conditional on the 
> > > order. 
> > > 
> > > So a
> > > 
> > > 	WARN_ON_ONCE((p->flags & PF_MEMALLOC) && (gfpmask & __GFP_NOFAIL));
> > > 
> > > actually makes tons of sense. 
> > 
> > I suspect that warning will trigger.
> > 
> > alloc_pages
> > -> ...
> >   -> pageout
> >     -> ...
> >       -> get_request
> >         -> blk_alloc_request
> >           -> elv_set_request
> >             -> cfq_set_request
> >               -> cfq_get_queue
> >                 -> cfq_find_alloc_queue
> >                   -> kmem_cache_alloc_node(__GFP_NOFAIL)
> >                     -> Jens
> > 
> > How much this can happen in practice I don't know, but it looks bad.
> 
> Yeah it sucks, but I don't think it's that bad to fixup.
> 
> The request allocation can fail, if we just return NULL in
> cfq_find_alloc_queue() and let that error propagate back up to
> get_request_wait(), it would simply io_schedule() and wait for a request
> to be freed. The only issue here is that if we have no requests going
> for this queue already, we would be stuck since there's noone to wake us
> up eventually. So if we did this, we'd have to make the io_schedule()
> dependent on whether there are allocations out already. Use global
> congestion wait in that case, or just io_schedule_timeout() for
> retrying.
> 
> The other option is to retry in cfq_find_alloc_queue() without the
> NOFAIL and do the congestion wait logic in there.
> 
> Yet another option would be to have a dummy queue that is allocated when
> the queue io scheduler is initialized. If cfq_find_alloc_queue() fails,
> just punt the IO to that dummy queue. That would allow progress even
> under extreme failure conditions.
> 
> With all that said, the likely hood of ever hitting this path is about
> 0%. Those failures are the ones that really suck when it's hit in the
> field eventually, though :-)

Something like the below implements option 3. Totally untested, I'll
throw it through some memfail injections and blktrace before I fully
trust it. And split it into 2 patches, the first doing the
cfq_init_cfqq() stuff and the second adding oom_cfqq and using it there
too.

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 833ec18..3b075a8 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -71,6 +71,51 @@ struct cfq_rb_root {
 #define CFQ_RB_ROOT	(struct cfq_rb_root) { RB_ROOT, NULL, }
 
 /*
+ * Per process-grouping structure
+ */
+struct cfq_queue {
+	/* reference count */
+	atomic_t ref;
+	/* various state flags, see below */
+	unsigned int flags;
+	/* parent cfq_data */
+	struct cfq_data *cfqd;
+	/* service_tree member */
+	struct rb_node rb_node;
+	/* service_tree key */
+	unsigned long rb_key;
+	/* prio tree member */
+	struct rb_node p_node;
+	/* prio tree root we belong to, if any */
+	struct rb_root *p_root;
+	/* sorted list of pending requests */
+	struct rb_root sort_list;
+	/* if fifo isn't expired, next request to serve */
+	struct request *next_rq;
+	/* requests queued in sort_list */
+	int queued[2];
+	/* currently allocated requests */
+	int allocated[2];
+	/* fifo list of requests in sort_list */
+	struct list_head fifo;
+
+	unsigned long slice_end;
+	long slice_resid;
+	unsigned int slice_dispatch;
+
+	/* pending metadata requests */
+	int meta_pending;
+	/* number of requests that are on the dispatch list or inside driver */
+	int dispatched;
+
+	/* io prio of this group */
+	unsigned short ioprio, org_ioprio;
+	unsigned short ioprio_class, org_ioprio_class;
+
+	pid_t pid;
+};
+
+/*
  * Per block device queue structure
  */
 struct cfq_data {
@@ -135,51 +180,11 @@ struct cfq_data {
 	unsigned int cfq_slice_idle;
 
 	struct list_head cic_list;
-};
-
-/*
- * Per process-grouping structure
- */
-struct cfq_queue {
-	/* reference count */
-	atomic_t ref;
-	/* various state flags, see below */
-	unsigned int flags;
-	/* parent cfq_data */
-	struct cfq_data *cfqd;
-	/* service_tree member */
-	struct rb_node rb_node;
-	/* service_tree key */
-	unsigned long rb_key;
-	/* prio tree member */
-	struct rb_node p_node;
-	/* prio tree root we belong to, if any */
-	struct rb_root *p_root;
-	/* sorted list of pending requests */
-	struct rb_root sort_list;
-	/* if fifo isn't expired, next request to serve */
-	struct request *next_rq;
-	/* requests queued in sort_list */
-	int queued[2];
-	/* currently allocated requests */
-	int allocated[2];
-	/* fifo list of requests in sort_list */
-	struct list_head fifo;
-
-	unsigned long slice_end;
-	long slice_resid;
-	unsigned int slice_dispatch;
 
-	/* pending metadata requests */
-	int meta_pending;
-	/* number of requests that are on the dispatch list or inside driver */
-	int dispatched;
-
-	/* io prio of this group */
-	unsigned short ioprio, org_ioprio;
-	unsigned short ioprio_class, org_ioprio_class;
-
-	pid_t pid;
+	/*
+	 * Fallback dummy cfqq for extrem OOM conditions
+	 */
+	struct cfq_queue oom_cfqq;
 };
 
 enum cfqq_state_flags {
@@ -1641,6 +1646,26 @@ static void cfq_ioc_set_ioprio(struct io_context *ioc)
 	ioc->ioprio_changed = 0;
 }
 
+static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
+			  pid_t pid, int is_sync)
+{
+	RB_CLEAR_NODE(&cfqq->rb_node);
+	RB_CLEAR_NODE(&cfqq->p_node);
+	INIT_LIST_HEAD(&cfqq->fifo);
+
+	atomic_set(&cfqq->ref, 0);
+	cfqq->cfqd = cfqd;
+
+	cfq_mark_cfqq_prio_changed(cfqq);
+
+	if (is_sync) {
+		if (!cfq_class_idle(cfqq))
+			cfq_mark_cfqq_idle_window(cfqq);
+		cfq_mark_cfqq_sync(cfqq);
+	}
+	cfqq->pid = pid;
+}
+
 static struct cfq_queue *
 cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
 		     struct io_context *ioc, gfp_t gfp_mask)
@@ -1666,10 +1691,11 @@ retry:
 			 */
 			spin_unlock_irq(cfqd->queue->queue_lock);
 			new_cfqq = kmem_cache_alloc_node(cfq_pool,
-					gfp_mask | __GFP_NOFAIL | __GFP_ZERO,
+					gfp_mask | __GFP_ZERO,
 					cfqd->queue->node);
 			spin_lock_irq(cfqd->queue->queue_lock);
-			goto retry;
+			if (new_cfqq)
+				goto retry;
 		} else {
 			cfqq = kmem_cache_alloc_node(cfq_pool,
 					gfp_mask | __GFP_ZERO,
@@ -1678,23 +1704,8 @@ retry:
 				goto out;
 		}
 
-		RB_CLEAR_NODE(&cfqq->rb_node);
-		RB_CLEAR_NODE(&cfqq->p_node);
-		INIT_LIST_HEAD(&cfqq->fifo);
-
-		atomic_set(&cfqq->ref, 0);
-		cfqq->cfqd = cfqd;
-
-		cfq_mark_cfqq_prio_changed(cfqq);
-
+		cfq_init_cfqq(cfqd, cfqq, current->pid, is_sync);
 		cfq_init_prio_data(cfqq, ioc);
-
-		if (is_sync) {
-			if (!cfq_class_idle(cfqq))
-				cfq_mark_cfqq_idle_window(cfqq);
-			cfq_mark_cfqq_sync(cfqq);
-		}
-		cfqq->pid = current->pid;
 		cfq_log_cfqq(cfqd, cfqq, "alloced");
 	}
 
@@ -1702,6 +1713,8 @@ retry:
 		kmem_cache_free(cfq_pool, new_cfqq);
 
 out:
+	if (!cfqq)
+		cfqq = &cfqd->oom_cfqq;
 	WARN_ON((gfp_mask & __GFP_WAIT) && !cfqq);
 	return cfqq;
 }
@@ -1735,11 +1748,8 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
 		cfqq = *async_cfqq;
 	}
 
-	if (!cfqq) {
+	if (!cfqq)
 		cfqq = cfq_find_alloc_queue(cfqd, is_sync, ioc, gfp_mask);
-		if (!cfqq)
-			return NULL;
-	}
 
 	/*
 	 * pin the queue now that it's allocated, scheduler exit will prune it
@@ -2465,6 +2475,11 @@ static void *cfq_init_queue(struct request_queue *q)
 	for (i = 0; i < CFQ_PRIO_LISTS; i++)
 		cfqd->prio_trees[i] = RB_ROOT;
 
+	/*
+	 * Our fallback cfqq if cfq_find_alloc_queue() runs into OOM issues
+	 */
+	cfq_init_cfqq(cfqd, &cfqd->oom_cfqq, 1, 0);
+
 	INIT_LIST_HEAD(&cfqd->cic_list);
 
 	cfqd->queue = q;

-- 
Jens Axboe


  reply	other threads:[~2009-06-25 20:08 UTC|newest]

Thread overview: 81+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-06-24 15:07 upcoming kerneloops.org item: get_page_from_freelist Arjan van de Ven
2009-06-24 16:46 ` Andrew Morton
2009-06-24 16:52   ` Linus Torvalds
2009-06-24 16:55   ` Pekka Enberg
2009-06-24 16:56     ` Pekka Enberg
2009-06-24 17:00       ` Pekka Enberg
2009-06-24 17:55     ` Andrew Morton
2009-06-24 17:53       ` Pekka Enberg
2009-06-24 18:30         ` Andrew Morton
2009-06-24 18:42           ` Linus Torvalds
2009-06-24 18:44             ` Pekka Enberg
2009-06-24 18:50               ` Linus Torvalds
2009-06-24 19:12                 ` Pekka J Enberg
2009-06-24 19:21                   ` Linus Torvalds
2009-06-24 19:06             ` Andrew Morton
2009-06-24 19:16               ` Linus Torvalds
2009-06-24 19:36                 ` Andrew Morton
2009-06-24 19:46                   ` Linus Torvalds
2009-06-24 19:47                     ` Linus Torvalds
2009-06-24 20:01                     ` Andrew Morton
2009-06-24 20:13                       ` Linus Torvalds
2009-06-24 20:40                         ` Linus Torvalds
2009-06-24 22:07                           ` Andrew Morton
2009-06-25  4:05                             ` Nick Piggin
2009-06-25 13:25                             ` Theodore Tso
2009-06-25 18:51                               ` David Rientjes
2009-06-25 19:38                                 ` Theodore Tso
2009-06-25 19:44                                   ` Theodore Tso
2009-06-25 19:55                                     ` Andrew Morton
2009-06-25 20:11                                     ` Linus Torvalds
2009-06-25 20:22                                       ` Linus Torvalds
2009-06-25 20:36                                         ` David Rientjes
2009-06-25 20:51                                           ` Linus Torvalds
2009-06-25 22:25                                             ` David Rientjes
2009-06-26  8:51                                         ` Nick Piggin
2009-06-25 20:18                                     ` David Rientjes
2009-06-25 20:37                                       ` Theodore Tso
2009-06-25 21:05                                         ` Joel Becker
2009-06-25 21:05                                           ` Joel Becker
2009-06-25 21:26                                         ` Andreas Dilger
2009-06-25 21:26                                           ` Andreas Dilger
2009-06-25 22:05                                           ` Theodore Tso
2009-06-25 22:11                                             ` Eric Sandeen
2009-06-25 22:11                                               ` Eric Sandeen
2009-06-26  1:11                                               ` Theodore Tso
2009-06-26  5:16                                                 ` Pekka J Enberg
2009-06-26  8:56                                                   ` Nick Piggin
2009-06-26  8:58                                                     ` Pekka Enberg
2009-06-26  9:07                                                       ` Nick Piggin
2009-06-29 21:06                                                       ` Christoph Lameter
2009-06-30  7:59                                                         ` Nick Piggin
2009-06-26 14:41                                                   ` Eric Sandeen
2009-06-29 21:15                                                     ` Christoph Lameter
2009-06-29 21:20                                                       ` Eric Sandeen
2009-06-29 22:35                                                         ` Christoph Lameter
2009-06-25 19:55                             ` Jens Axboe
2009-06-25 20:08                               ` Jens Axboe [this message]
2009-06-24 21:56                         ` Andrew Morton
2009-06-25  4:14                           ` Nick Piggin
2009-06-25  8:21                           ` David Rientjes
2009-06-29 15:30                           ` Mel Gorman
2009-06-29 19:20                             ` Andrew Morton
2009-06-30 11:00                               ` Mel Gorman
2009-06-30 19:35                                 ` David Rientjes
2009-06-30 20:32                                   ` Mel Gorman
2009-06-30 20:51                                     ` David Rientjes
2009-07-01 10:22                                       ` Mel Gorman
2009-06-29 23:35                             ` David Rientjes
2009-06-30  7:47                               ` Nick Piggin
2009-06-30  8:13                                 ` David Rientjes
2009-06-30  8:24                                   ` Nick Piggin
2009-06-30  8:41                                     ` David Rientjes
2009-06-30  9:09                                       ` Nick Piggin
2009-06-30 19:47                                         ` David Rientjes
2009-06-30  6:27                           ` Pavel Machek
2009-06-28 10:16                     ` Pavel Machek
2009-06-28 18:01                       ` Linus Torvalds
2009-06-28 18:27                         ` Arjan van de Ven
2009-06-28 18:36                           ` Linus Torvalds
2009-06-30  7:35                         ` Pavel Machek
2009-06-24 18:43           ` Pekka Enberg

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090625200833.GD31415@kernel.dk \
    --to=jens.axboe@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=arjan@infradead.org \
    --cc=cl@linux-foundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=npiggin@suse.de \
    --cc=penberg@cs.helsinki.fi \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.