All of lore.kernel.org
 help / color / mirror / Atom feed
From: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
To: Alasdair Kergon <agk@redhat.com>
Cc: device-mapper development <dm-devel@redhat.com>
Subject: [PATCH 0/5] request-based dm-multipath (v3)
Date: Tue, 02 Jun 2009 15:57:56 +0900	[thread overview]
Message-ID: <4A24CD74.80708@ct.jp.nec.com> (raw)

Hi,

This patch-set is the updated version of request-based dm-multipath.
Alasdair, please consider to include this for 2.6.31.

The patches are based on the combination of the following trees and patches:
  - Jens' block/for-2.6.31 (c143dc903d7c0b15f5052e00b2c7de33a8b4299c)
  - Alasdair's linux-next patches on June 2nd (UTC).
  - Mike Snitzer's dm-topology patch:
    http://marc.info/?l=dm-devel&m=124345842831206&w=2
    (Actually, no functional dependencies on this patch.)

This patch-set doesn't have barrier support yet.

Some basic function/performance testings are done with NEC iStorage
(active-active multipath), and no problem was found.
Also heavy stress testing with a lot of path up/down and table
switching is done to check panic, stall and memory leak don't occur.
Please review and apply if no problem.

NOTE:
Recently, __sector and __data_len of struct request are commented
as 'internal, NEVER access directly' because they reflect the latest
status of request processing and change as BIOs are added/completed.
However, since request-based dm needs to copy the latest status
of the original request to clone, those fields are accessed directly
in this patch-set.
Further block layer change will be needed to clean up this situation.
I'm going to discuss about it separately in linux-kernel.


Summary of the patch-set
========================
  1/5: dm core: add core functions for request-based dm
  2/5: dm core: enable request-based dm
  3/5: dm core: don't set QUEUE_ORDERED_DRAIN for request-based dm
  4/5: dm core: disable interrupt when taking map_lock
  5/5: dm-mpath: convert to request-based

 drivers/md/dm-ioctl.c         |   13 
 drivers/md/dm-mpath.c         |  193 +++++---
 drivers/md/dm-table.c         |  130 +++++
 drivers/md/dm.c               |  947 ++++++++++++++++++++++++++++++++++++++++--
 drivers/md/dm.h               |   27 +
 include/linux/device-mapper.h |    9 
 6 files changed, 1218 insertions(+), 101 deletions(-)


CHANGES
=======
v3:
  - Rebased to
      * Jens' block/for-2.6.31 (c143dc903d7c0b15f5052e00b2c7de33a8b4299c)
      * Alasdair's linux-next patches on June 2nd (UTC)
      * Mike Snitzer's dm-topology patch:
        http://marc.info/?l=dm-devel&m=124345842831206&w=2
        (Actually, no functional dependencies on this patch.)

  - Removed the patch which rejects I/Os violating new queue limits,
    since Hannes pointed out that the patch seems to cause unexpected
    -EIO in no-path situation.
      * This unexpected -EIO problem is still under investigation.
      * The patch is for dangerous table swapping like shrinking
        the queue limits.  Generally, such table swapping shouldn't
        be done.  So removing the patch as of now shouldn't cause
        a problem.

  - Merged the integrity patch into the core function patch (PATCH 1)

  - Fixed types of variables (PATCH 1, PATCH 2)
      * Changed to use 'unsigned' instead of 'int' for 'i' in
        dm_table_any_busy_target() and dm_table_set_type(),
        since 'i' is always positive value and compared with
        't->num_targets', which is 'unsigned'.
      * Changed to use 'unsigned' instead of 'int' for 'bio_based'
        and 'request_based' in dm_table_set_type(), since they are
        always positive values.

  - Moved table type from struct io_restrictions to struct dm_table
    (PATCH 2)
      * In v2, table type was stored in struct io_restrictions as
        'no_request_stacking'.  But Mike Snitzer's dm-topology patch
        replaced the dm-specific io_restrictions with generic I/O
        topology framework (queue_limits).
      * Since table type is dm-specific, it cannot be a part of
        queue_limits.  So 'type' member is added to struct dm_table.

  - Dropped cast in multipath_busy(), since it's from void (PATCH 5)


v2: (http://marc.info/?l=dm-devel&m=124056068208687&w=2)
  - Rebased to 2.6.30-rc3

  - Fixed memory leak when bio_integrity_clone() fails (PATCH 2)
      * Added missing bio_free() when bio_integrity_clone() fails.

  - Added a patch not to set QUEUE_ORDERED_DRAIN for request-based dm
    (PATCH 4)
      * This version of request-based dm doesn't have barrier support
        yet.  So set QUEUE_ORDERED_DRAIN only for bio-based dm.
        Since the device type is decided at the first table binding
        time, the flag set is deferred until then.


v1: (http://marc.info/?l=dm-devel&m=123736590931733&w=2)
  - Rebased to 2.6.29-rc8

  - Fixed oops on table swapping because of broken suspend (PATCH 1)
    http://marc.info/?l=dm-devel&m=123667308218868&w=2
      * dm_suspend() recognized that suspend completed while some
        clones were still in flight.  So swapping/freeing table
        was possible against the in-use table, and it caused oops
      * The problem was a race condition on in-flight checking:
        new I/O could become in-flight while suspend code found
        no I/O was in-flight and went on
      * Fixed it by protecting the in-flight counting and queue
        suspension with queue lock.  Also using q->in_flight
        instead of md->pending since it is straightforward to
        understand the meaning of the in-flight clones

  - Fixed NULL pointer reference when allocation for clone bio fails
    (From Hannes Reinecke) (PATCH 1)
    http://marc.info/?l=dm-devel&m=123666946315335&w=2
      * free_bio_clone() used clone->end_io_data to get md
      * But clone->end_io_data can be NULL when free_bio_clone()
        is called from error handling path in clone_request_bios()
      * Fixed it by passing md directly to free_bio_clone()
      * Removed the work around code to check bio->bi_private:
        http://marc.info/?l=dm-devel&m=123666946315335&w=2

  - Fixed unnecessary -EIO on table swapping by removing
    the map-existence check in dm_make_request() (PATCH 1)
    http://marc.info/?l=dm-devel&m=123322573410050&w=2
      * The check was there to reject BIOs until a table is loaded
        and the device and the queue are initialized correctly
      * But the check is not needed because BIOs are rejected
        in __generic_make_request() by device size checking
        until the device and the queue are initialized

  - Fixed freeing of in-use md (PATCH 1)
      * Upper layer can close the device just after all BIOs are
        completed by blk_update_request() or blk_end_request(),
        even if the request is still in the process of completion.
        It creates a small window where dm_put() can free the md
        which is needed for the process of completion
      * Fixed it by incrementing md->holders while I/O is in-flight

  - Fixed memory leak (missing unmap) on failure of dispatching
    a mapped clone (PATCH 1)
      * dm_kill_request() was used in dm_dispatch_request() when
        the dispatching fails
      * But dm_kill_requst() is for not-mapped clone, so target's
        rq_end_io() function is not called. This caused memory leak.
      * Fixed it by creating dm_complete_request() for mapped clone
        and calling it in dm_dispatch_request()

  - Changed exported function names to prevent misuse like above
    (PATCH 1, PATCH 3, PATCH 5)
      * dm_kill_request()    => dm_kill_unmapped_request()
      * dm_requeue_request() => dm_requeue_unmapped_request()

  - Removed the REQ_QUIET flag in dm_kill_unmapped_request(), since
    "I/O error" message is not printed for failed I/O (PATCH 1)

  - Removed 'inline' for simple static functions, since compiler
    will do inline anyway (PATCH 1)

  - Removed if-request-based check in dm_prep_fn(), since the device
    is always request-based when dm_prep_fn() is called (PATCH 1)
      * The check was for the case where the table is switched from
        request-based to bio-based.  But currently no plan to support
        such switching

  - Fixed the problem that switching device type is unexpectedly
    available on table swapping (PATCH 2)
      * table_load() checks md->map for the current table type
        to prevent the device type switching
      * But table_load() can run in parallel with dm_swap_table(),
        which has a small no-map window.
        So dm_init_md_mempool() in table_load() could switch
        the mempool type for an in-use device in this window
      * Fixed it by placing the pre-allocated mempools in table
        and binding them on resume.  The resume is rejected if
        the device is in-use with different type

  - Fixed the problem that BIO is submitted to a request-based device
    in which some settings may not be done yet (PATCH 2)
      * The QUEUE_FLAG_STACKABLE flag was set in the middle of
        dm_table_set_restrictions()
      * But the flag needs to be set after all queue settings are
        done and they become visible to other CPUs, because:
          o BIOs can be submitted during dm_table_set_restrictions()
          o Once the QUEUE_FLAG_STACKABLE flag is set in the queue,
            BIOs are passed to request-based dm and eventually to
            __make_request() where the queue settings are used
      * Fixed it by re-ordering the flag set and putting a barrier
        before the flag set.

  - Fixed lockdep warnings (From Christof Schmitt) (PATCH 4)
    http://marc.info/?l=dm-devel&m=123501258326935&w=2
      * lockdep warns some self-deadlock possibilities since:
          o map_lock is taken without disabling irq
          o request-based dm takes map_lock after taking
            queue_lock with disabling irq in dm_request_fn()
          o so logically a deadlock may happen:
              write_lock(map_lock)
              <interrupt>
              spin_lock_irqsave(queue_lock)
              dm_request_fn()
                  => dm_get_table()
                      => read_lock(map_lock)
      * dm_request_fn() is never called in interrupt context, so
        such self-deadlocks don't happen actually
      * But fixed the unneeded warnings by disabling irq when
        taking map_lock


Summary of the design and request-based dm-multipath are below.
(No changes since the previous post.)

BACKGROUND
==========
Currently, device-mapper (dm) is implemented as a stacking block device
at bio level.  This bio-based implementation has an issue below
on dm-multipath.

  Because hook for I/O mapping is above block layer __make_request(),
  contiguous bios can be mapped to different underlying devices
  and these bios aren't merged into a request.
  Dynamic load balancing could happen this situation, though
  it has not been implemented yet.
  Therefore, I/O mapping after bio merging is needed for better
  dynamic load balancing.

The basic idea to resolve the issue is to move multipathing layer
down below the I/O scheduler, and it was proposed from Mike Christie
as the block layer (request-based) multipath:
  http://marc.info/?l=linux-scsi&m=115520444515914&w=2

Mike's patch added new block layer device for multipath and didn't
have dm interface.  So I modified his patch to be used from dm.
It is request-based dm-multipath.


DESIGN OVERVIEW
===============
While currently dm and md stacks block devices at bio level,
request-based dm stacks at request level and submits/completes
struct request instead of struct bio.


Overview of the request-based dm patch:
  - Mapping is done in a unit of struct request, instead of struct bio
  - Hook for I/O mapping is at q->request_fn() after merging and
    sorting by I/O scheduler, instead of q->make_request_fn().
  - Hook for I/O completion is at bio->bi_end_io() and rq->end_io(),
    instead of only bio->bi_end_io()
                  bio-based (current)     request-based (this patch)
      ------------------------------------------------------------------
      submission  q->make_request_fn()    q->request_fn()
      completion  bio->bi_end_io()        bio->bi_end_io(), rq->end_io()
  - Whether the dm device is bio-based or request-based is determined
    at table loading time
  - Keep user interface same (table/message/status/ioctl)
  - Any bio-based devices (like dm/md) can be stacked on request-based
    dm device.
    Request-based dm device *cannot* be stacked on any bio-based device.


Expected benefit:
  - better load balancing


Additional explanations:

Why does request-based dm use bio->bi_end_io(), too?
Because:
  - dm needs to keep not only the request but also bios of the request,
    if dm target drivers want to retry or do something on the request.
    For example, dm-multipath has to check errors and retry with other
    paths if necessary before returning the I/O result to the upper layer.

  - But rq->end_io() is called at the very late stage of completion
    handling where all bios in the request have been completed and
    the I/O results are already visible to the upper layer.
So request-based dm hooks bio->bi_end_io() and doesn't complete the bio
in error cases, and gives over the error handling to rq->end_io() hook.


Thanks,
Kiyoshi Ueda

             reply	other threads:[~2009-06-02  6:57 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-06-02  6:57 Kiyoshi Ueda [this message]
2009-06-02  6:59 ` [PATCH 1/5] dm core: add core functions for request-based dm Kiyoshi Ueda
2009-06-02  7:01 ` [PATCH 2/5] dm core: enable " Kiyoshi Ueda
2009-06-02  7:01 ` [PATCH 3/5] dm core: don't set QUEUE_ORDERED_DRAIN for " Kiyoshi Ueda
2009-06-02  7:02 ` [PATCH 4/5] dm core: disable interrupt when taking map_lock Kiyoshi Ueda
2009-06-02  7:03 ` [PATCH 5/5] dm-mpath: convert to request-based Kiyoshi Ueda
2009-08-27 17:54   ` Alasdair G Kergon
2009-08-28  5:00     ` Kiyoshi Ueda
2009-08-28 13:36       ` Mike Snitzer
2009-08-29 18:23         ` Mike Snitzer
2009-11-12 10:08       ` reinstate bio-based dm-multipath? (Was: dm-mpath: convert to request-based) Kiyoshi Ueda

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4A24CD74.80708@ct.jp.nec.com \
    --to=k-ueda@ct.jp.nec.com \
    --cc=agk@redhat.com \
    --cc=dm-devel@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.