Flexible I/O Tester development
From: Jens Axboe <jens.axboe@oracle.com>
To: Shawn Lewis <shawnlewis@google.com>
Cc: fio <fio@vger.kernel.org>
Subject: Re: BUG: option runtime not working during a particular failure mode.
Date: Wed, 8 Oct 2008 13:17:54 +0200
Message-ID: <20081008111754.GO19428@kernel.dk>
In-Reply-To: <20081008110106.GN19428@kernel.dk>

On Wed, Oct 08 2008, Jens Axboe wrote:
> On Tue, Oct 07 2008, Shawn Lewis wrote:
> > On Tue, Oct 7, 2008 at 2:30 AM, Jens Axboe <jens.axboe@oracle.com> wrote:
> > > On Mon, Oct 06 2008, Shawn Lewis wrote:
> > >> Hi,
> > >>
> > >> I have a random read load in which fio hung on a machine. It is
> > >> time_based with runtime=60. A few of the disks in question experienced
> > >> errors at the same time so I would expect fio to fail or stop after
> > >> 60 seconds.
> > >>
> > >> I haven't tried to debug this in depth yet. Jens, I thought an answer
> > >> might jump out at you. If not, I'll take a look.
> > >>
> > >> Full disclosure: I modified the config and the strace output to show
> > >> fewer disks than were actually being accessed.
> > >>
> > >> Here is the config file:
> > >> [sda-randomaccess]
> > >> filename=/export/sda3/datafile.tmp
> > >> rw=randread
> > >> bs=64k
> > >> ioengine=sync
> > >> time_based=1
> > >> runtime=3600
> > >> bwavgtime=5000
> > >> direct=1
> > >> thread=1
> > >>
> > >> [sdb-randomaccess]
> > >> filename=/export/sdb3/datafile.tmp
> > >> rw=randread
> > >> bs=64k
> > >> ioengine=sync
> > >> time_based=1
> > >> runtime=3600
> > >> bwavgtime=5000
> > >> direct=1
> > >> thread=1
> > >>
> > >> [sdc-randomaccess]
> > >> filename=/export/sdc3/datafile.tmp
> > >> rw=randread
> > >> bs=64k
> > >> ioengine=sync
> > >> time_based=1
> > >> runtime=3600
> > >> bwavgtime=5000
> > >> direct=1
> > >> thread=1
> > >>
> > >> [sdd-randomaccess]
> > >> filename=/export/sdd3/datafile.tmp
> > >> rw=randread
> > >> bs=64k
> > >> ioengine=sync
> > >> time_based=1
> > >> runtime=3600
> > >> bwavgtime=5000
> > >> direct=1
> > >> thread=1
> > >>
> > >>
> > >> We get some hints from strace. It looks like we're just doing the
> > >> sig_alrm loop. But why aren't we hitting runtime? Are the other threads
> > >> stopped already for some reason?
> > >>
> > >> static void sig_alrm(int sig)
> > >> {
> > >>         if (threads) {
> > >>                 update_io_ticks();
> > >>                 print_thread_status();
> > >>                 status_timer_arm();
> > >>         }
> > >> }
> > >
> > > Good question. What are the other threads doing, have you poked around
> > > to see what they are up to? You mention IO errors, so are some of the
> > > threads stuck in error handling or did they all exit? If they did exit,
> > > did they exit nicely or did they get killed by the kernel?
> > There was only one thread running as far as I could tell. I don't
> > think the kernel killed them.
> 
> Unless you have an oops in the kernel, they were likely not killed. What
> version of fio is this? Fio used to have problems detecting missing
> threads/processes, but it should be solid now.
> 
> I haven't been able to (artificially) reproduce this case. I guess, for
> safety, I could add a forced exit in the sigalrm loop if we exceeded
> the runtime (plus a little margin).

Something like this should force fio to quit when it detects such a
situation. I guess it would be hard for you to reproduce this situation,
correct?


diff --git a/fio.c b/fio.c
index 773f106..1ba227e 100644
--- a/fio.c
+++ b/fio.c
@@ -71,7 +71,7 @@ void td_set_runstate(struct thread_data *td, int runstate)
 	td->runstate = runstate;
 }
 
-static void terminate_threads(int group_id)
+static void terminate_threads(int group_id, int force)
 {
 	struct thread_data *td;
 	int i;
@@ -88,7 +88,7 @@ static void terminate_threads(int group_id)
 			/*
 			 * if the thread is running, just let it exit
 			 */
-			if (td->runstate < TD_RUNNING)
+			if (td->runstate < TD_RUNNING || force)
 				kill(td->pid, SIGQUIT);
 			else {
 				struct ioengine_ops *ops = td->io_ops;
@@ -107,8 +107,32 @@ static void status_timer_arm(void)
 	setitimer(ITIMER_REAL, &itimer, NULL);
 }
 
+static int overtime(void)
+{
+	unsigned long elapsed;
+
+	if (!def_thread.o.timeout)
+		return 0;
+
+	/*
+	 * Allow for 10 seconds of slack, this is supposed to be a last
+	 * measure to make fio exit
+	 */
+	elapsed = mtime_since_genesis() / 1000;
+	if (elapsed > def_thread.o.timeout + 10) {
+		log_err("fio: runtime seems way exceeded, yet fio is still "
+			"alive. Forcing quit.\n");
+		terminate_threads(TERMINATE_ALL, 1);
+		return 1;
+	}
+
+	return 0;
+}
+
 static void sig_alrm(int fio_unused sig)
 {
+	if (overtime())
+		return;
 	if (threads) {
 		update_io_ticks();
 		print_thread_status();
@@ -121,7 +145,7 @@ static void sig_int(int sig)
 	if (threads) {
 		printf("\nfio: terminating on signal %d\n", sig);
 		fflush(stdout);
-		terminate_threads(TERMINATE_ALL);
+		terminate_threads(TERMINATE_ALL, 1);
 	}
 }
 
@@ -132,7 +156,7 @@ static void sig_ill(int fio_unused sig)
 
 	log_err("fio: illegal instruction. your cpu does not support "
 		"the sse4.2 instruction for crc32c\n");
-	terminate_threads(TERMINATE_ALL);
+	terminate_threads(TERMINATE_ALL, 1);
 	exit(4);
 }
 
@@ -631,7 +655,7 @@ sync_done:
 
 			if (check_min_rate(td, &comp_time)) {
 				if (exitall_on_terminate)
-					terminate_threads(td->groupid);
+					terminate_threads(td->groupid, 0);
 				td_verror(td, EIO, "check_min_rate");
 				break;
 			}
@@ -1053,7 +1077,7 @@ static void *thread_main(void *data)
 	}
 
 	if (exitall_on_terminate)
-		terminate_threads(td->groupid);
+		terminate_threads(td->groupid, 0);
 
 err:
 	if (td->error)
@@ -1189,7 +1213,7 @@ reaped:
 	}
 
 	if (*nr_running == cputhreads && !pending && realthreads)
-		terminate_threads(TERMINATE_ALL);
+		terminate_threads(TERMINATE_ALL, 0);
 }
 
 /*
diff --git a/fio.h b/fio.h
index 63cd83f..3ec0591 100644
--- a/fio.h
+++ b/fio.h
@@ -705,6 +705,7 @@ extern unsigned long done_secs;
 extern char *job_section;
 
 extern struct thread_data *threads;
+extern struct thread_data def_thread;
 
 #define td_read(td)		((td)->o.td_ddir & TD_DDIR_READ)
 #define td_write(td)		((td)->o.td_ddir & TD_DDIR_WRITE)
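
For anyone reading along, the overtime check in the diff above can be
sketched in isolation. This is only a stand-alone illustration, not fio
code: runtime_exceeded(), watchdog_tick() and OVERTIME_SLACK_SECS are
hypothetical names made up for the example. The only things taken from
the patch are the 10 seconds of slack and the rule that a timeout of 0
disables the check.

```c
#include <stdio.h>

/* Hypothetical constant, mirroring the 10 seconds of slack in the patch. */
#define OVERTIME_SLACK_SECS 10UL

/*
 * Return 1 if elapsed_secs is past timeout_secs plus the slack, 0 otherwise.
 * A timeout of 0 means no runtime was configured, so the check never fires.
 */
static int runtime_exceeded(unsigned long elapsed_secs,
			    unsigned long timeout_secs)
{
	if (!timeout_secs)
		return 0;

	return elapsed_secs > timeout_secs + OVERTIME_SLACK_SECS;
}

/*
 * Hypothetical stand-in for the body of a periodic timer handler.
 * Returns 1 when the caller should force termination, 0 when it should
 * carry on with its normal status reporting.
 */
static int watchdog_tick(unsigned long elapsed_msecs,
			 unsigned long timeout_secs)
{
	if (runtime_exceeded(elapsed_msecs / 1000, timeout_secs)) {
		fprintf(stderr, "runtime exceeded, forcing quit\n");
		return 1;
	}

	return 0;
}
```

With runtime=60, the check stays quiet through 70 seconds of elapsed
time and first fires at 71 seconds, so a job that stops on its own near
the deadline is not affected.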

-- 
Jens Axboe

