From mboxrd@z Thu Jan 1 00:00:00 1970 From: Kiyoshi Ueda Subject: Re: [PATCH 3/5] dm: relax ordering of bio-based flush implementation Date: Fri, 03 Sep 2010 15:04:36 +0900 Message-ID: <4C808FF4.4010901@ct.jp.nec.com> References: <1283162296-13650-1-git-send-email-tj@kernel.org> <1283162296-13650-4-git-send-email-tj@kernel.org> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-2022-JP Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <1283162296-13650-4-git-send-email-tj@kernel.org> Sender: linux-raid-owner@vger.kernel.org To: Tejun Heo Cc: jaxboe@fusionio.com, snitzer@redhat.com, j-nomura@ce.jp.nec.com, jamie@shareable.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-raid@vger.kernel.org, hch@lst.de List-Id: linux-raid.ids Hi Tejun, On 08/30/2010 06:58 PM +0900, Tejun Heo wrote: > Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA doesn't mandate any ordering > against other bio's. This patch relaxes ordering around flushes. ... > * When dec_pending() detects that a flush has completed, it checks > whether the original bio has data. If so, the bio is queued to the > deferred list w/ REQ_FLUSH cleared; otherwise, it's completed. ... > @@ -529,16 +523,10 @@ static void end_io_acct(struct dm_io *io) > */ > static void queue_io(struct mapped_device *md, struct bio *bio) > { > - down_write(&md->io_lock); > - > spin_lock_irq(&md->deferred_lock); > bio_list_add(&md->deferred, bio); > spin_unlock_irq(&md->deferred_lock); > - > - if (!test_and_set_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags)) > - queue_work(md->wq, &md->work); > - > - up_write(&md->io_lock); > + queue_work(md->wq, &md->work); ... > @@ -638,26 +624,22 @@ static void dec_pending(struct dm_io *io, int error) ... > - } else { > - end_io_acct(io); > - free_io(md, io); > - > - if (io_error != DM_ENDIO_REQUEUE) { > - trace_block_bio_complete(md->queue, bio); > - > - bio_endio(bio, io_error); > - } > + bio->bi_rw &= ~REQ_FLUSH; > + queue_io(md, bio); dec_pending() is a function which is called during I/O completion where the caller may be disabling interrupts. So if you use queue_io() inside dec_pending(), the spin_lock must be taken/released with irqsave/irqrestore like the patch below. BTW, lockdep detects the issue and a warning like below is displayed. It may break the underlying drivers. ================================= [ INFO: inconsistent lock state ] 2.6.36-rc2+ #2 --------------------------------- inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. kworker/0:1/0 [HC0[0]:SC1[1]:HE1:SE0] takes: (&(&q->__queue_lock)->rlock){?.-...}, at: [] blk_end_bidi_request+0x44/0x80 {IN-HARDIRQ-W} state was registered at: [] __lock_acquire+0x8c6/0xb30 [] lock_acquire+0xa0/0x120 [] _raw_spin_lock_irqsave+0x4e/0x70 [] ata_qc_schedule_eh+0x5a/0xa0 [libata] [] ata_qc_complete+0x147/0x1f0 [libata] [] ata_hsm_qc_complete+0xc2/0x140 [libata] [] ata_sff_hsm_move+0x1d5/0x700 [libata] [] __ata_sff_port_intr+0xb3/0x100 [libata] [] ata_bmdma_port_intr+0x3f/0x120 [libata] [] ata_bmdma_interrupt+0x195/0x1e0 [libata] [] handle_IRQ_event+0x54/0x170 [] handle_edge_irq+0xc8/0x170 [] handle_irq+0x4b/0xa0 [] do_IRQ+0x6f/0xf0 [] ret_from_intr+0x0/0x16 [] _raw_spin_unlock+0x23/0x40 [] sys_dup3+0x122/0x1a0 [] sys_dup2+0x23/0xb0 [] system_call_fastpath+0x16/0x1b irq event stamp: 14660913 hardirqs last enabled at (14660912): [] _raw_spin_unlock_irqrestore+0x65/0x80 hardirqs last disabled at (14660913): [] _raw_spin_lock_irqsave+0x2e/0x70 softirqs last enabled at (14660874): [] __do_softirq+0x14e/0x210 softirqs last disabled at (14660879): [] call_softirq+0x1c/0x50 other info that might help us debug this: 1 lock held by kworker/0:1/0: #0: (&(&q->__queue_lock)->rlock){?.-...}, at: [] blk_end_bidi_request+0x44/0x80 stack backtrace: Pid: 0, comm: kworker/0:1 Not tainted 2.6.36-rc2+ #2 Call Trace: [] print_usage_bug+0x1a6/0x1f0 [] mark_lock+0x661/0x690 [] ? check_usage_backwards+0x0/0xf0 [] mark_held_locks+0x60/0x80 [] ? _raw_spin_unlock_irq+0x30/0x40 [] trace_hardirqs_on_caller+0x83/0x1a0 [] trace_hardirqs_on+0xd/0x10 [] _raw_spin_unlock_irq+0x30/0x40 [] ? queue_io+0x2e/0x90 [dm_mod] [] queue_io+0x57/0x90 [dm_mod] [] dec_pending+0x22a/0x320 [dm_mod] [] ? dec_pending+0x55/0x320 [dm_mod] [] clone_endio+0xad/0xc0 [dm_mod] [] bio_endio+0x1d/0x40 [] req_bio_endio+0x81/0xf0 [] blk_update_request+0x23d/0x460 [] ? blk_update_request+0x116/0x460 [] blk_update_bidi_request+0x27/0x80 [] __blk_end_bidi_request+0x20/0x50 [] __blk_end_request_all+0x1f/0x40 [] blk_flush_complete_seq+0x140/0x1a0 [] pre_flush_end_io+0x39/0x50 [] blk_finish_request+0x85/0x290 [] blk_end_bidi_request+0x52/0x80 [] blk_end_request_all+0x1f/0x40 [] dm_softirq_done+0xad/0x120 [dm_mod] [] blk_done_softirq+0x86/0xa0 [] __do_softirq+0xd6/0x210 [] call_softirq+0x1c/0x50 [] do_softirq+0x95/0xd0 [] irq_exit+0x4d/0x60 [] do_IRQ+0x78/0xf0 [] ret_from_intr+0x0/0x16 [] ? mwait_idle+0x79/0xe0 [] ? mwait_idle+0x70/0xe0 [] cpu_idle+0x66/0xe0 [] ? start_secondary+0x181/0x1f0 [] start_secondary+0x18f/0x1f0 Thanks, Kiyoshi Ueda Now queue_io() is called from dec_pending(), which may be called with interrupts disabled. So queue_io() must not enable interrupts unconditionally and must save/restore the current interrupts status. Signed-off-by: Kiyoshi Ueda Signed-off-by: Jun'ichi Nomura --- drivers/md/dm.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) Index: misc/drivers/md/dm.c =================================================================== --- misc.orig/drivers/md/dm.c +++ misc/drivers/md/dm.c @@ -512,9 +512,11 @@ static void end_io_acct(struct dm_io *io */ static void queue_io(struct mapped_device *md, struct bio *bio) { - spin_lock_irq(&md->deferred_lock); + unsigned long flags; + + spin_lock_irqsave(&md->deferred_lock, flags); bio_list_add(&md->deferred, bio); - spin_unlock_irq(&md->deferred_lock); + spin_unlock_irqrestore(&md->deferred_lock, flags); queue_work(md->wq, &md->work); }