public inbox for linux-kernel@vger.kernel.org
From: Greg KH <gregkh@suse.de>
To: linux-kernel@vger.kernel.org, stable@kernel.org, jejb@kernel.org
Cc: Justin Forbes <jmforbes@linuxtx.org>,
	Zwane Mwaikambo <zwane@arm.linux.org.uk>,
	"Theodore Ts'o" <tytso@mit.edu>,
	Randy Dunlap <rdunlap@xenotime.net>,
	Dave Jones <davej@redhat.com>,
	Chuck Wolber <chuckw@quantumlinux.com>,
	Chris Wedgwood <reviews@ml.cw.f00f.org>,
	Michael Krufky <mkrufky@linuxtv.org>,
	Chuck Ebbert <cebbert@redhat.com>,
	Domenico Andreoli <cavokz@gmail.com>, Willy Tarreau <w@1wt.eu>,
	Rodrigo Rubira Branco <rbranco@la.checkpoint.com>,
	Jake Edge <jake@lwn.net>,
	torvalds@linux-foundation.org, akpm@linux-foundation.org,
	alan@lxorguk.ukuu.org.uk, Nick Piggin <npiggin@suse.de>,
	Jens Axboe <jens.axboe@oracle.com>
Subject: [patch 05/47] block: Fix the starving writes bug in the anticipatory IO scheduler
Date: Tue, 22 Jul 2008 16:14:31 -0700
Message-ID: <20080722231431.GF8282@suse.de>
In-Reply-To: <20080722231342.GA8282@suse.de>

[-- Attachment #1: block-fix-the-starving-writes-bug-in-the-anticipatory-io-scheduler.patch --]
[-- Type: text/plain, Size: 3866 bytes --]

2.6.25-stable review patch.  If anyone has any objections, please let us
know.

------------------
From: Divyesh Shah <dpshah@google.com>

commit d585d0b9d73ed999cc7b8cf3cac4a5b01abb544e upstream

The AS scheduler alternates between issuing read and write batches. It
switches batches only after all requests from the previous batch have
completed.

When switching to a write batch while a read request is still in flight,
it waits for that request to complete and signals its intention to
switch by setting ad->changed_batch and the new direction. However, it
does not update the batch expiry time (ad->current_batch_expires) for
the new write batch, as it does when there are no pending requests from
the previous batch.
On completion of the read request, it sees that we were waiting for the
switch, schedules work for kblockd right away, and resets the
ad->changed_batch flag.
Now when kblockd enters dispatch_request, where it is expected to pick
up a write request, it instead ends the write batch because the batch
expiry time was not updated and still holds the expiry timestamp of the
previous batch.

This results in write starvation in every case where there is an
intention to switch to a write batch but a previous read request is
still in flight: the batch gets reverted to a read batch right away.

This also holds true in the reverse case (switching from a write batch
to a read batch with an in-flight write request).
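
The sequence above can be sketched as a small user-space model. This is
only an illustration under stated assumptions, not the kernel code: the
struct as_model type, the model_* helpers, the "now" counter standing in
for jiffies, and the tick values are all hypothetical, and the expiry
check is reduced to the timestamp comparison alone (the real check has
further conditions). Only the field names are borrowed from the patch.

```c
#include <stdbool.h>

/* Hypothetical model of the batch-switch state; not kernel code. */
enum { MODEL_READ = 0, MODEL_WRITE = 1 };

struct as_model {
	unsigned long now;                   /* stands in for jiffies */
	unsigned long batch_expire[2];       /* batch length per direction */
	unsigned long current_batch_expires; /* deadline of current batch */
	int batch_data_dir;
	int changed_batch;
	int nr_dispatched;
};

/* Wrap-safe "a is after b", the same idea as the kernel's time_after(). */
static bool model_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Simplified expiry check: only the timestamp comparison is modelled. */
static bool model_batch_expired(const struct as_model *ad)
{
	return model_time_after(ad->now, ad->current_batch_expires);
}

/*
 * Ask for a direction switch while a request is still in flight: the
 * direction and changed_batch are set, the deadline is left untouched.
 */
static void model_request_switch(struct as_model *ad, int dir)
{
	ad->batch_data_dir = dir;
	ad->changed_batch = 1;
}

/* Completion of the last in-flight request; with_fix applies the patch. */
static void model_completed_request(struct as_model *ad, bool with_fix)
{
	if (ad->changed_batch && ad->nr_dispatched == 1) {
		if (with_fix)
			ad->current_batch_expires = ad->now +
				ad->batch_expire[ad->batch_data_dir];
		ad->changed_batch = 0;
	}
	ad->nr_dispatched--;
}

/* Returns true if the new write batch is already expired at dispatch. */
static bool model_write_batch_starved(bool with_fix)
{
	struct as_model ad = {
		.now = 1000,
		.batch_expire = { 125, 31 },  /* arbitrary example values */
		.current_batch_expires = 900, /* read deadline has passed */
		.batch_data_dir = MODEL_READ,
		.nr_dispatched = 1,           /* one read still in flight */
	};

	model_request_switch(&ad, MODEL_WRITE);
	ad.now += 5;                          /* the read completes later */
	model_completed_request(&ad, with_fix);
	return model_batch_expired(&ad);
}
```

In this model, model_write_batch_starved(false) reports the write batch
as expired before it dispatches anything, because the deadline still
belongs to the old read batch; model_write_batch_starved(true) does not,
because the deadline is refreshed when the in-flight request completes.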

I've checked that this bug exists on 2.6.11, 2.6.18, 2.6.24 and
linux-2.6-block git HEAD. I've tested the fix on x86 platforms with
SCSI drives where the driver asks for the next request while a current
request is in-flight.

This patch is based off linux-2.6-block git HEAD.

Bug reproduction:
A simple scenario which reproduces this bug is:
- dd if=/dev/hda3 of=/dev/null &
- lilo
   lilo takes forever to complete.

This can also be reproduced fairly easily with the earlier dd and
another test program doing msync().

The example test program below should print out a message after every
iteration, but it simply hangs forever. With this bugfix it makes
forward progress.

====
Example test program using msync()
(thanks to suleiman AT google DOT com)

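#define _GNU_SOURCE             /* for O_NOATIME */
#include <sys/mman.h>
#include <sys/stat.h>
#include <err.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>

/* Read the x86 time-stamp counter (works on both i386 and x86-64). */
static inline uint64_t
rdtsc(void)
{
         uint32_t lo, hi;

         __asm __volatile("rdtsc" : "=a" (lo), "=d" (hi));
         return (((uint64_t)hi << 32) | lo);
}

int
main(int argc, char **argv)
{
         struct stat st;
         uint64_t e, s, t;
         char *p;
         long i;
         int fd;

         if (argc < 2) {
                 printf("Usage: %s <file>\n", argv[0]);
                 return (1);
         }

         if ((fd = open(argv[1], O_RDWR | O_NOATIME)) < 0)
                 err(1, "open");

         if (fstat(fd, &st) < 0)
                 err(1, "fstat");

         p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                  MAP_SHARED, fd, 0);
         if (p == MAP_FAILED)
                 err(1, "mmap");

         t = 0;
         for (i = 0; i < 1000; i++) {
                 /* Dirty the first page and force it to disk. */
                 *p = 0;
                 msync(p, 4096, MS_SYNC);
                 /* Time the next store to the same page. */
                 s = rdtsc();
                 *p = 0;
                 __asm __volatile(""::: "memory");
                 e = rdtsc();
                 /* Any extra argument enables per-iteration output. */
                 if (argc > 2)
                         printf("%ld: %llu cycles %jd %jd\n", i,
                                (unsigned long long)(e - s),
                                (intmax_t)s, (intmax_t)e);
                 t += e - s;
         }
         printf("average time: %llu cycles\n",
                (unsigned long long)(t / 1000));
         return (0);
}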

Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 block/as-iosched.c |    2 ++
 1 file changed, 2 insertions(+)

--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -831,6 +831,8 @@ static void as_completed_request(struct 
 	}
 
 	if (ad->changed_batch && ad->nr_dispatched == 1) {
+		ad->current_batch_expires = jiffies +
+					ad->batch_expire[ad->batch_data_dir];
 		kblockd_schedule_work(&ad->antic_work);
 		ad->changed_batch = 0;
 

-- 
