public inbox for linux-block@vger.kernel.org
 help / color / mirror / Atom feed
* [GIT PULL] Block changes for 4.18-rc
@ 2018-06-04  0:42 Jens Axboe
  2018-06-04 15:51 ` Linus Torvalds
  0 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2018-06-04  0:42 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-block@vger.kernel.org

Hi Linus,

This is the main pull request for the 4.18 merge window. This pull
request contains:

- Series from Christoph, cleaning up how we pass around gfp_t and
  blk_mq_req_flags_t.

- Series from Christoph, preparing us to defer scheduler attach.

- Series from Christoph, cleaing up drivers handling of bounce buffers

- Series from Christoph/Bart/Keith, fixing timeout handling corner
  cases.

- Series of bcache fixes, by way of Coly.

- Series from Kent, with both some prep work for bcachefs and some
  block layer optimizations.

- Series from Kent, converting users of bio_sets to using embedded
  structs.

- Patches from Paolo/Davide/Filippo, with fixes for the BFQ io
  scheduler.

- Set of lightnvm fixes and improvements. By way of Matias,
  contributions from Hans and Javier.

- Series from me, adding discard throttling to blk-wbt.

- Series from me/Omar/Ming on sbitmap blk-mq-tag handling.

- Removal of the sparc jsflash block driver, acked by DaveM.

- Kyber scheduler improvement from Jianchao, making it more
  friendly wrt merging.

- Conversion of symbolic proc permissions to octal, from Joe
  Perches. Previously the block parts were a mix of both.

- Series of nbd fixes from Josef and Kevin Vigor.

- Series from Omar, unifying how we handle the various kinds of
  timestamps that the block core and utility code uses.

- Three NVMe pull requests from Keith and Christoph, bringing
  AEN to feature completeness, file backed namespaces, cq/sq
  lock split, and various fixes.

- Various little fixes and improvements all over the map.


  git://git.kernel.dk/linux-block.git tags/for-4.18/block-20180603


----------------------------------------------------------------
Andy Shevchenko (3):
      bcache: Move couple of string arrays to sysfs.c
      bcache: Move couple of functions to sysfs.c
      bcache: Replace bch_read_string_list() by __sysfs_match_string()

Anna-Maria Gleixner (1):
      block: Remove redundant WARN_ON()

Bart Van Assche (1):
      blkdev_report_zones_ioctl(): Use vmalloc() to allocate large buffers

Chaitanya Kulkarni (4):
      nvmet: make a few error messages more generic
      nvmet: remove duplicate NULL initialization for req->ns
      nvmet: add simple file backed ns support
      nvmet-loop: use nr_phys_segments when map rq to sgl

Chengguang Xu (1):
      blk-throttle: return proper bool type to caller instead of 0/1

Christoph Hellwig (49):
      mtip32xx: don't use block layer bounce buffers
      DAC960: don't use block layer bounce buffers
      memstick: don't call blk_queue_bounce_limit
      mtd_blkdevs: handle highmem pages
      aoe: handle highmem pages
      jsflash: handle highmem pages
      ps3disk: handle highmem pages
      memstick: remove unused variables
      scsi/osd: remove the gfp argument to osd_start_request
      block: fix __get_request documentation
      block: sanitize blk_get_request calling conventions
      block: pass an explicit gfp_t to get_request
      block: use GFP_NOIO instead of __GFP_DIRECT_RECLAIM
      block: consistently use GFP_NOIO instead of __GFP_NORECLAIM
      nvme: mark the result argument to nvme_complete_async_event volatile
      nvme-pci: simplify nvme_cqe_valid
      libata: remove ata_scsi_timed_out
      block: rename BLK_EH_NOT_HANDLED to BLK_EH_DONE
      nvme: return BLK_EH_DONE from ->timeout
      nbd: complete requests from ->timeout
      mtip32xx: complete requests from ->timeout
      null_blk: complete requests from ->timeout
      scsi_transport_fc: complete requests from ->timeout
      mmc: complete requests from ->timeout
      libiscsi: don't try to bypass SCSI EH
      block: remove BLK_EH_HANDLED
      block: document the blk_eh_timer_return values
      blk-mq: simplify blk_mq_rq_timed_out
      block: unexport check_disk_size_change
      block: don't print a message when the device went away
      block: remove parent device reference from struct bsg_class_device
      nvme-pci: simplify __nvme_submit_cmd
      nvme-loop: add support for multiple ports
      blk-mq: only iterate over inflight requests in blk_mq_tagset_busy_iter
      nvme-fabrics: allow internal passthrough command on deleting controllers
      nvme.h: untangle AEN notice definitions
      nvme.h: add the changed namespace list log
      nvmet: add a new nvmet_zero_sgl helper
      nvmet: split log page implementation
      nvmet: implement the changed namespaces log
      nvmet: add AEN configuration support
      nvmet: mask pending AENs
      nvme: mark nvme_queue_scan static
      nvme: use the changed namespaces list log to clear ns data changed AENs
      block: move initialization of elevator-related fields to blk_alloc_queue_node
      block: unexport elevator_init/exit
      block: remove the always unused name argument to elevator_init
      block: move sysfs_lock into elevator_init
      block: split the blk-mq case from elevator_init

Christophe JAILLET (1):
      mtip32xx: Fix an error handling path in 'mtip_pci_probe()'

Coly Li (1):
      bcache: stop bcache device when backing device is offline

Dan Melnic (1):
      block/ndb: add WQ_UNBOUND to the knbd-recv workqueue

Davide Sapienza (2):
      block, bfq: increase weight-raising duration for interactive apps
      block, bfq: prevent soft_rt_next_start from being stuck at infinity

Filippo Muzzini (2):
      block, bfq: remove wrong lock in bfq_requests_merged
      block, bfq: remove the removal of 'next' rq in bfq_requests_merged

Hannes Reinecke (6):
      nvme-fabrics: centralize discovery controller defaults
      nvme-fabrics: allow duplicate connections to the discovery controller
      nvme: fix KASAN warning when parsing host nqn
      nvme: fixup memory leak in nvme_init_identify()
      nvme.h: add AEN configuration symbols
      nvme: submit AEN event configuration on startup

Hans Holmberg (5):
      lightnvm: pblk: rework write error recovery path
      lightnvm: pblk: garbage collect lines with failed writes
      lightnvm: pblk: fix smeta write error path
      lightnvm: pblk: only try to recover lines with written smeta
      lightnvm: pblk: kick writer on new flush points

Igor Konopko (2):
      lightnvm: proper error handling for pblk_bio_add_pages
      lightnvm: fix partial read error path

Ivan Bornyakov (1):
      nvme: host: core: fix precedence of ternary operator

James Smart (4):
      nvme-fc: remove setting DNR on exception conditions
      nvme-fabrics: remove unnecessary controller subnqn validation
      nvmet-fc: increase LS buffer count per fc port
      nvme: allow duplicate controller if prior controller being deleted

Javier González (13):
      lightnvm: pblk: fail gracefully on line alloc. failure
      lightnvm: pblk: recheck for bad lines at runtime
      lightnvm: pblk: check read lba on gc path
      lightnvm: pblk: improve error msg on corrupted LBAs
      lightnvm: pblk: warn in case of corrupted write buffer
      lightnvm: pblk: return NVM_ error on failed submission
      lightnvm: pblk: remove unnecessary indirection
      lightnvm: pblk: remove unnecessary argument
      lightnvm: pblk: check for chunk size before allocating it
      lightnvm: pass flag on graceful teardown to targets
      lightnvm: pblk: remove dead function
      lightnvm: pblk: remove unnecessary bio_get/put
      lightnvm: pblk: take bitmap alloc. out of critical section

Jens Axboe (26):
      nvme: only reconfigure discard if necessary
      block: break discard submissions into the user defined size
      blk-wbt: account any writing command as a write
      blk-wbt: pass in enum wbt_flags to get_rq_wait()
      blk-wbt: throttle discards like background writes
      blk-mq: don't call into depth limiting for reserved tags
      bfq-iosched: don't worry about reserved tags in limit_depth
      bfq: calculate shallow depths at init time
      bfq-iosched: remove unused variable
      bfq-iosched: update shallow depth to smallest one used
      kyber-iosched: update shallow depth when setting up hardware queue
      sbitmap: fix race in wait batch accounting
      Remove jsflash driver
      nvme-pci: remove cq check after submission
      nvme-pci: move ->cq_vector == -1 check outside of ->q_lock
      nvme-pci: handle completions outside of the queue lock
      nvme-pci: split the nvme queue lock into submission and completion locks
      nvme-pci: drop IRQ disabling on submission queue lock
      Merge branch 'nvme-4.18' of git://git.infradead.org/nvme into for-4.18/block
      nvme-pci: fix race between poll and IRQ completions
      block: move ->timeout request member
      Merge branch 'nvme-4.18-2' of git://git.infradead.org/nvme into for-4.18/block
      blk-mq: abstract out blk-mq-sched rq list iteration bio merge helper
      block: fixup bioset_integrity_create() call
      Merge branch 'nvme-4.18' of git://git.infradead.org/nvme into for-4.18/block
      block: don't use blocking queue entered for recursive bio submits

Jianchao Wang (3):
      nvme-pci: set nvmeq->cq_vector after alloc cq/sq
      nvme-rdma: stop admin queue before freeing it
      block: kyber: make kyber more friendly with merging

Joe Perches (1):
      block drivers/block: Use octal not symbolic permissions

Johannes Thumshirn (3):
      nvme: fc: provide a descriptive error
      nvme: change order of qid and cmdid in completion trace
      nvme: fix lockdep warning in nvme_mpath_clear_current_path

Josef Bacik (8):
      block: fix MAINTAINERS email for nbd
      nbd: fix nbd device deletion
      nbd: update size when connected
      nbd: use bd_set_size when updating disk size
      nbd: clear_sock on netlink disconnect
      nbd: fix how we set bd_invalidated
      nbd: call nbd_bdev_reset instead of bd_set_size on disconnect
      nbd: set discard granularity properly

Keith Busch (7):
      nvme/pci: Use async_schedule for initial reset work
      nvme/pci: Hold controller reference during async probe
      nvme/pci: Sync controller reset for AER slot_reset
      nvme-pci: Fix AER reset handling
      blk-mq: Fix timeout and state order
      blk-mq: Remove generation seqeunce
      nvme-pci: Rate limit the nvme timeout warnings

Kent Overstreet (23):
      mempool: Add mempool_init()/mempool_exit()
      block: Convert bio_set to mempool_init()
      block: Add bioset_init()/bioset_exit()
      block: Use bioset_init() for fs_bio_set
      block: Add bio_copy_data_iter(), zero_fill_bio_iter()
      block: Split out bio_list_copy_data()
      block: Add missing flush_dcache_page() call
      block: Add warning for bi_next not NULL in bio_endio()
      block: Export bio check/set pages_dirty
      block: Add sysfs entry for fua support
      block: convert bounce, q->bio_split to bioset_init()/mempool_init()
      drbd: convert to bioset_init()/mempool_init()
      pktcdvd: convert to bioset_init()/mempool_init()
      lightnvm: convert to bioset_init()/mempool_init()
      bcache: convert to bioset_init()/mempool_init()
      md: convert to bioset_init()/mempool_init()
      dm: convert to bioset_init()/mempool_init()
      target: convert to bioset_init()/mempool_init()
      fs: convert block_dev.c to bioset_init()
      btrfs: convert to bioset_init()/mempool_init()
      xfs: convert to bioset_init()/mempool_init()
      block: Drop bioset_create()
      dm-crypt: fix warning in shutdown path

Kevin Vigor (1):
      nbd: clear DISCONNECT_REQUESTED flag once disconnection occurs.

Liu Bo (2):
      null_blk: add blocking description and remove lightnvm
      blk-throttle: fix potential NULL pointer dereference in throtl_select_dispatch

Marcin Dziegielewski (1):
      lightnvm: pblk: add possibility to set write buffer size manually

Micah Parrish (1):
      NVMe: Add Quirk Delay before CHK RDY for Seagate Nytro Flash Storage

Ming Lei (2):
      blk-mq: avoid starving tag allocation after allocating process migrates
      blk-mq: update nr_requests when switching to 'none' scheduler

Omar Sandoval (9):
      block: move some wbt helpers to blk-wbt.c
      block: pass struct request instead of struct blk_issue_stat to wbt
      block: replace bio->bi_issue_stat with bio-specific type
      block: get rid of struct blk_issue_stat
      block: use ktime_get_ns() instead of sched_clock() for cfq and bfq
      block: move blk_stat_add() to __blk_mq_end_request()
      block: consolidate struct request timestamp fields
      sbitmap: fix missed wakeups caused by sbitmap_queue_get_shallow()
      sbitmap: warn if using smaller shallow depth than was setup

Paolo Valente (4):
      block, bfq: postpone rq preparation to insert or merge
      block, bfq: remove wrong check in bfq_requests_merged
      block, bfq: add description of weight-raising heuristics
      block, bfq: remove slow-system class

Sebastian Andrzej Siewior (1):
      block: don't disable interrupts during kmap_atomic()

SeongJae Park (1):
      brd: Mark as non-rotational

Tejun Heo (1):
      bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue

Tetsuo Handa (1):
      loop: remember whether sysfs_create_group() was done

Thomas Gleixner (1):
      block: Shorten interrupt disabled regions

Wei Xu (1):
      nvme: lightnvm: add granby support

Wei Yongjun (2):
      nvmet: fix a typo in nvmet_file_ns_enable()
      nvmet: fix error return code in nvmet_file_ns_enable()

huhai (3):
      blk-mq: remove redundant insert case in blk_mq_make_request()
      blk-mq: clear hctx->dispatch_from when mappings change
      blk-mq: remove wrong 'unlikely' check

 Documentation/block/null_blk.txt                |   9 +-
 Documentation/scsi/scsi_eh.txt                  |  15 +-
 MAINTAINERS                                     |   2 +-
 arch/sparc/include/uapi/asm/jsflash.h           |  40 --
 block/bfq-cgroup.c                              |  40 +-
 block/bfq-iosched.c                             | 478 +++++++++--------
 block/bfq-iosched.h                             |  30 +-
 block/bio-integrity.c                           |  29 +-
 block/bio.c                                     | 189 ++++---
 block/blk-core.c                                | 120 ++---
 block/blk-integrity.c                           |  12 +-
 block/blk-lib.c                                 |  12 +-
 block/blk-merge.c                               |  29 +-
 block/blk-mq-debugfs.c                          |   1 -
 block/blk-mq-sched.c                            |  46 +-
 block/blk-mq-sched.h                            |   2 -
 block/blk-mq-sysfs.c                            |   6 +-
 block/blk-mq-tag.c                              |  14 +-
 block/blk-mq.c                                  | 340 ++++--------
 block/blk-mq.h                                  |  42 +-
 block/blk-stat.c                                |  10 +-
 block/blk-stat.h                                |  45 +-
 block/blk-sysfs.c                               |  80 +--
 block/blk-throttle.c                            |  35 +-
 block/blk-timeout.c                             |   6 +-
 block/blk-wbt.c                                 | 129 +++--
 block/blk-wbt.h                                 |  55 +-
 block/blk-zoned.c                               |   8 +-
 block/blk.h                                     |   5 +-
 block/bounce.c                                  |  52 +-
 block/bsg-lib.c                                 |   6 +-
 block/bsg.c                                     |  44 +-
 block/cfq-iosched.c                             |  66 ++-
 block/deadline-iosched.c                        |   3 +-
 block/elevator.c                                | 101 ++--
 block/genhd.c                                   |  37 +-
 block/kyber-iosched.c                           | 199 +++++--
 block/mq-deadline.c                             |   3 +-
 block/partition-generic.c                       |  26 +-
 block/scsi_ioctl.c                              |  10 +-
 drivers/ata/libata-eh.c                         |  51 --
 drivers/block/DAC960.c                          |  11 +-
 drivers/block/DAC960.h                          |   1 -
 drivers/block/aoe/aoeblk.c                      |  11 +-
 drivers/block/aoe/aoecmd.c                      |   3 +-
 drivers/block/brd.c                             |  10 +-
 drivers/block/drbd/drbd_bitmap.c                |   5 +-
 drivers/block/drbd/drbd_debugfs.c               |  20 +-
 drivers/block/drbd/drbd_int.h                   |  10 +-
 drivers/block/drbd/drbd_main.c                  |  73 +--
 drivers/block/drbd/drbd_receiver.c              |   6 +-
 drivers/block/drbd/drbd_req.c                   |   4 +-
 drivers/block/drbd/drbd_req.h                   |   2 +-
 drivers/block/floppy.c                          |   2 +-
 drivers/block/loop.c                            |  17 +-
 drivers/block/loop.h                            |   1 +
 drivers/block/mtip32xx/mtip32xx.c               |  29 +-
 drivers/block/nbd.c                             |  77 ++-
 drivers/block/null_blk.c                        |  36 +-
 drivers/block/paride/pd.c                       |   2 +-
 drivers/block/pktcdvd.c                         |  60 +--
 drivers/block/ps3disk.c                         |   2 -
 drivers/block/rbd.c                             |  44 +-
 drivers/block/rsxx/core.c                       |   6 +-
 drivers/block/sx8.c                             |   2 +-
 drivers/block/virtio_blk.c                      |   8 +-
 drivers/block/xen-blkback/blkback.c             |   2 +-
 drivers/block/xen-blkback/xenbus.c              |   4 +-
 drivers/block/xen-blkfront.c                    |   7 +-
 drivers/cdrom/cdrom.c                           |   2 +-
 drivers/ide/ide-atapi.c                         |   2 +-
 drivers/ide/ide-cd.c                            |   2 +-
 drivers/ide/ide-cd_ioctl.c                      |   2 +-
 drivers/ide/ide-devsets.c                       |   2 +-
 drivers/ide/ide-disk.c                          |   2 +-
 drivers/ide/ide-ioctls.c                        |   4 +-
 drivers/ide/ide-park.c                          |   4 +-
 drivers/ide/ide-pm.c                            |   5 +-
 drivers/ide/ide-tape.c                          |   4 +-
 drivers/ide/ide-taskfile.c                      |   4 +-
 drivers/lightnvm/core.c                         |  10 +-
 drivers/lightnvm/pblk-cache.c                   |  10 +-
 drivers/lightnvm/pblk-core.c                    | 233 ++++++---
 drivers/lightnvm/pblk-gc.c                      | 112 ++--
 drivers/lightnvm/pblk-init.c                    | 172 ++++---
 drivers/lightnvm/pblk-map.c                     |  33 +-
 drivers/lightnvm/pblk-rb.c                      |  48 +-
 drivers/lightnvm/pblk-read.c                    | 146 +++---
 drivers/lightnvm/pblk-recovery.c                | 121 +----
 drivers/lightnvm/pblk-rl.c                      |  29 +-
 drivers/lightnvm/pblk-sysfs.c                   |  15 +-
 drivers/lightnvm/pblk-write.c                   | 269 ++++++----
 drivers/lightnvm/pblk.h                         |  58 ++-
 drivers/md/bcache/bcache.h                      |  14 +-
 drivers/md/bcache/bset.c                        |  13 +-
 drivers/md/bcache/bset.h                        |   2 +-
 drivers/md/bcache/btree.c                       |   4 +-
 drivers/md/bcache/io.c                          |   4 +-
 drivers/md/bcache/request.c                     |  18 +-
 drivers/md/bcache/super.c                       | 109 ++--
 drivers/md/bcache/sysfs.c                       |  51 +-
 drivers/md/bcache/util.c                        |  35 --
 drivers/md/bcache/util.h                        |   5 -
 drivers/md/dm-bio-prison-v1.c                   |  13 +-
 drivers/md/dm-bio-prison-v2.c                   |  13 +-
 drivers/md/dm-cache-target.c                    |  25 +-
 drivers/md/dm-core.h                            |   4 +-
 drivers/md/dm-crypt.c                           |  58 +--
 drivers/md/dm-integrity.c                       |  15 +-
 drivers/md/dm-io.c                              |  29 +-
 drivers/md/dm-kcopyd.c                          |  22 +-
 drivers/md/dm-log-userspace-base.c              |  19 +-
 drivers/md/dm-mpath.c                           |   3 +-
 drivers/md/dm-region-hash.c                     |  23 +-
 drivers/md/dm-rq.c                              |   4 +-
 drivers/md/dm-snap.c                            |  17 +-
 drivers/md/dm-thin.c                            |  32 +-
 drivers/md/dm-verity-fec.c                      |  55 +-
 drivers/md/dm-verity-fec.h                      |   8 +-
 drivers/md/dm-zoned-target.c                    |  13 +-
 drivers/md/dm.c                                 |  55 +-
 drivers/md/md-faulty.c                          |   2 +-
 drivers/md/md-linear.c                          |   2 +-
 drivers/md/md-multipath.c                       |  17 +-
 drivers/md/md-multipath.h                       |   2 +-
 drivers/md/md.c                                 |  61 +--
 drivers/md/md.h                                 |   4 +-
 drivers/md/raid0.c                              |   5 +-
 drivers/md/raid1.c                              |  76 ++-
 drivers/md/raid1.h                              |   6 +-
 drivers/md/raid10.c                             |  60 ++-
 drivers/md/raid10.h                             |   6 +-
 drivers/md/raid5-cache.c                        |  43 +-
 drivers/md/raid5-ppl.c                          |  42 +-
 drivers/md/raid5.c                              |  12 +-
 drivers/md/raid5.h                              |   2 +-
 drivers/memstick/core/ms_block.c                |   6 -
 drivers/memstick/core/mspro_block.c             |   6 -
 drivers/message/fusion/mptsas.c                 |   2 +-
 drivers/mmc/core/block.c                        |  12 +-
 drivers/mmc/core/queue.c                        |   5 +-
 drivers/mtd/mtd_blkdevs.c                       |  20 +-
 drivers/nvme/host/core.c                        | 157 ++++--
 drivers/nvme/host/fabrics.c                     | 100 ++--
 drivers/nvme/host/fabrics.h                     |   4 +-
 drivers/nvme/host/fc.c                          |  15 +-
 drivers/nvme/host/nvme.h                        |   9 +-
 drivers/nvme/host/pci.c                         | 258 ++++++----
 drivers/nvme/host/rdma.c                        |  12 +-
 drivers/nvme/host/trace.h                       |   4 +-
 drivers/nvme/target/Makefile                    |   4 +-
 drivers/nvme/target/admin-cmd.c                 | 143 ++---
 drivers/nvme/target/core.c                      | 128 +++--
 drivers/nvme/target/discovery.c                 |   2 -
 drivers/nvme/target/fabrics-cmd.c               |   4 -
 drivers/nvme/target/fc.c                        |   2 +-
 drivers/nvme/target/{io-cmd.c => io-cmd-bdev.c} |  77 +--
 drivers/nvme/target/io-cmd-file.c               | 304 +++++++++++
 drivers/nvme/target/loop.c                      |  52 +-
 drivers/nvme/target/nvmet.h                     |  51 +-
 drivers/s390/block/dasd.c                       |   6 +-
 drivers/sbus/char/Kconfig                       |   7 -
 drivers/sbus/char/Makefile                      |   1 -
 drivers/sbus/char/jsflash.c                     | 658 ------------------------
 drivers/scsi/gdth.c                             |   2 +-
 drivers/scsi/libiscsi.c                         |   6 +-
 drivers/scsi/megaraid/megaraid_sas_base.c       |   2 +-
 drivers/scsi/mvumi.c                            |   2 +-
 drivers/scsi/osd/osd_initiator.c                |  24 +-
 drivers/scsi/osst.c                             |   2 +-
 drivers/scsi/qla4xxx/ql4_os.c                   |   2 +-
 drivers/scsi/scsi_error.c                       |  10 +-
 drivers/scsi/scsi_lib.c                         |   4 +-
 drivers/scsi/scsi_transport_fc.c                |  16 +-
 drivers/scsi/scsi_transport_iscsi.c             |   2 +-
 drivers/scsi/scsi_transport_sas.c               |  19 +-
 drivers/scsi/scsi_transport_srp.c               |   4 +-
 drivers/scsi/sg.c                               |   2 +-
 drivers/scsi/st.c                               |   2 +-
 drivers/scsi/ufs/ufshcd.c                       |   6 +-
 drivers/target/target_core_iblock.c             |  16 +-
 drivers/target/target_core_iblock.h             |   2 +-
 drivers/target/target_core_pscsi.c              |   3 +-
 fs/block_dev.c                                  |  24 +-
 fs/btrfs/extent_io.c                            |  25 +-
 fs/direct-io.c                                  |   4 +-
 fs/exofs/ore.c                                  |  10 +-
 fs/exofs/super.c                                |   2 +-
 fs/nfsd/blocklayout.c                           |   2 +-
 fs/xfs/xfs_aops.c                               |   2 +-
 fs/xfs/xfs_aops.h                               |   2 +-
 fs/xfs/xfs_super.c                              |  11 +-
 include/linux/bio.h                             |  42 +-
 include/linux/blk-mq.h                          |   3 +-
 include/linux/blk_types.h                       |  51 +-
 include/linux/blkdev.h                          | 110 ++--
 include/linux/bsg-lib.h                         |   3 +-
 include/linux/bsg.h                             |   6 +-
 include/linux/elevator.h                        |   2 -
 include/linux/fs.h                              |   2 +-
 include/linux/libata.h                          |   2 -
 include/linux/lightnvm.h                        |   2 +-
 include/linux/mempool.h                         |  34 ++
 include/linux/nvme.h                            |  16 +-
 include/linux/pktcdvd.h                         |   2 +-
 include/linux/sbitmap.h                         |  36 ++
 include/scsi/osd_initiator.h                    |   6 +-
 include/scsi/scsi_host.h                        |   2 +-
 kernel/power/swap.c                             |  14 +-
 lib/sbitmap.c                                   | 113 ++--
 mm/backing-dev.c                                |  18 +-
 mm/mempool.c                                    | 108 +++-
 212 files changed, 4020 insertions(+), 3983 deletions(-)
 delete mode 100644 arch/sparc/include/uapi/asm/jsflash.h
 rename drivers/nvme/target/{io-cmd.c => io-cmd-bdev.c} (76%)
 create mode 100644 drivers/nvme/target/io-cmd-file.c
 delete mode 100644 drivers/sbus/char/jsflash.c

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [GIT PULL] Block changes for 4.18-rc
  2018-06-04  0:42 [GIT PULL] Block changes for 4.18-rc Jens Axboe
@ 2018-06-04 15:51 ` Linus Torvalds
  2018-06-04 15:54   ` Jens Axboe
  0 siblings, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2018-06-04 15:51 UTC (permalink / raw)
  To: Jens Axboe, Kent Overstreet; +Cc: linux-block

On Sun, Jun 3, 2018 at 5:42 PM Jens Axboe <axboe@kernel.dk> wrote:
>
>  drivers/md/md.c                                 |  61 +--
>  drivers/md/md.h                                 |   4 +-

So I've pulled this, but I get a new warning:

  drivers/md/md.c: In function =E2=80=98mddev_put=E2=80=99:
  drivers/md/md.c:543:1: warning: the frame size of 2112 bytes is
larger than 2048 bytes [-Wframe-larger-than=3D]

which seems to be due to commit afeee514ce7f ("md: convert to
bioset_init()/mempool_init()").

As of that commit, mddev_put() now allocates *two* "struct bio_set"
structures on the stack, and those really aren't small. Yes, part of
it is that I do my test-builds with all that debug stuff, but those
structures have several embedded mempool_t members, and they really
are pretty sizable.

And I _really_ do not think it's acceptable to have that kind of stack
usage in there. I'm not sure you can trigger mddev_put() in the IO
paths, but even without that I don't think it's acceptable.

Also, the code simply looks like complete and utter garbage. It does

                bs =3D mddev->bio_set;
                sync_bs =3D mddev->sync_set;

to _copy_ those structures, and then does bioset_exit() on them. WTF?

Why the hell doesn't it just do biset_exit() on the originals instead,
before freeing the mddev?

I've pulled this, but this needs to be fixed. That is just broken
garbage right now. No way in hell is it acceptable to waste 2kB of
stack space on something idiotic like this.

                 Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [GIT PULL] Block changes for 4.18-rc
  2018-06-04 15:51 ` Linus Torvalds
@ 2018-06-04 15:54   ` Jens Axboe
  2018-06-04 16:24     ` Linus Torvalds
  2018-06-04 21:16     ` NeilBrown
  0 siblings, 2 replies; 17+ messages in thread
From: Jens Axboe @ 2018-06-04 15:54 UTC (permalink / raw)
  To: Linus Torvalds, Kent Overstreet; +Cc: linux-block, NeilBrown, Arnd Bergmann

On 6/4/18 9:51 AM, Linus Torvalds wrote:
> On Sun, Jun 3, 2018 at 5:42 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>>  drivers/md/md.c                                 |  61 +--
>>  drivers/md/md.h                                 |   4 +-
> 
> So I've pulled this, but I get a new warning:
> 
>   drivers/md/md.c: In function ‘mddev_put’:
>   drivers/md/md.c:543:1: warning: the frame size of 2112 bytes is
> larger than 2048 bytes [-Wframe-larger-than=]
> 
> which seems to be due to commit afeee514ce7f ("md: convert to
> bioset_init()/mempool_init()").
> 
> As of that commit, mddev_put() now allocates *two* "struct bio_set"
> structures on the stack, and those really aren't small. Yes, part of
> it is that I do my test-builds with all that debug stuff, but those
> structures have several embedded mempool_t members, and they really
> are pretty sizable.
> 
> And I _really_ do not think it's acceptable to have that kind of stack
> usage in there. I'm not sure you can trigger mddev_put() in the IO
> paths, but even without that I don't think it's acceptable.
> 
> Also, the code simply looks like complete and utter garbage. It does
> 
>                 bs = mddev->bio_set;
>                 sync_bs = mddev->sync_set;
> 
> to _copy_ those structures, and then does bioset_exit() on them. WTF?
> 
> Why the hell doesn't it just do biset_exit() on the originals instead,
> before freeing the mddev?
> 
> I've pulled this, but this needs to be fixed. That is just broken
> garbage right now. No way in hell is it acceptable to waste 2kB of
> stack space on something idiotic like this.

Works is already underway to get this fixed up. I agree that the
excessive stack usage isn't great, but the code in that function
is pretty miserable, as you say. That's not new with the patch,
the conversion just carried that forward.

CC'ing Neil to get his input on how best to clean that up, I'd
be much more comfortable with that logic improved rather than
just catering to the stack usage issue. Also include Arnd, as
he had a test patch for it.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [GIT PULL] Block changes for 4.18-rc
  2018-06-04 15:54   ` Jens Axboe
@ 2018-06-04 16:24     ` Linus Torvalds
  2018-06-04 18:20       ` Tejun Heo
  2018-06-04 21:16     ` NeilBrown
  1 sibling, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2018-06-04 16:24 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: Kent Overstreet, linux-block, NeilBrown, Arnd Bergmann

On Mon, Jun 4, 2018 at 8:54 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 6/4/18 9:51 AM, Linus Torvalds wrote:
> >
> > Why the hell doesn't it just do bioset_exit() on the originals instead,
> > before freeing the mddev?
>
> CC'ing Neil to get his input on how best to clean that up, I'd
> be much more comfortable with that logic improved rather than
> just catering to the stack usage issue. Also include Arnd, as
> he had a test patch for it.

Also adding Tejun, because the only reason for that odd sequencing
seems to be that

 - we want to queue the deletion work *inside* the spinlock, because
it actually has synchronization between the workqueue and the
'all_mddevs_lock' spinlock.

 - we want to bioset_exit() the fields *outside* the spinlock.

So what it does now is to copy those big 'struct bio_set' fields
because they might go away from under it.

Tejun, the code in question is mddev_put() in drivers/md/md.c, and it
basically does

                INIT_WORK(&mddev->del_work, mddev_delayed_delete);
                queue_work(md_misc_wq, &mddev->del_work);

inside a spinlock, but then wants to do some stuff *outside* the
spinlock before that mddev_delayed_delete() thing actually gets
called.

Is there some way to "half-queue" the work - enough that a
flush_workqueue() will wait for it, but still guaranteeing that it
won't actually run until released?

IOW, what I think that code really wants to do is perhaps something like

        /* under  &all_mddevs_lock spinlock */
        .. remove from all lists so that it can't be found ..

        .. but add it to the md_misc_wq so that people will wait for it ..

                INIT_WORK_LOCKED(&mddev->del_work, mddev_delayed_delete);
                queue_work(md_misc_wq, &mddev->del_work);
        ...

        spin_unlock(&all_mddevs_lock);

        /* mddev is still guaranteed live */
        bioset_exit(&mddev->bio_set);
        bioset_exit(&mddev->sync_set);

        /* NOW we can release it */
        if (queued)
                unlock_work(&mddev->del_work);
        else
                kfree(mddev);

or something like that?

The above is *ENTIRELY* hand-wavy garbage - don't take it seriously
per se, consider it just pseudo-code for documentation reasons, not
serious in any other way.

             Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [GIT PULL] Block changes for 4.18-rc
  2018-06-04 16:24     ` Linus Torvalds
@ 2018-06-04 18:20       ` Tejun Heo
  2018-06-04 18:22         ` Linus Torvalds
  0 siblings, 1 reply; 17+ messages in thread
From: Tejun Heo @ 2018-06-04 18:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Kent Overstreet, linux-block, NeilBrown,
	Arnd Bergmann

Hey, Linus.

On Mon, Jun 04, 2018 at 09:24:28AM -0700, Linus Torvalds wrote:
> Tejun, the code in question is mddev_put() in drivers/md/md.c, and it
> basically does
> 
>                 INIT_WORK(&mddev->del_work, mddev_delayed_delete);
>                 queue_work(md_misc_wq, &mddev->del_work);
> 
> inside a spinlock, but then wants to do some stuff *outside* the
> spinlock before that mddev_delayed_delete() thing actually gets
> called.
> 
> Is there some way to "half-queue" the work - enough that a
> flush_workqueue() will wait for it, but still guaranteeing that it
> won't actually run until released?

Such interface doesn't exist, yet anyway.  Workqueue does have the
concept of delayed queueing to handle max concurrency limits and we
can probably piggyback on that to create an interface which queues and
delays the work item and another interface to undelay it.

That said, it feels like a super-niche feature for a weird use case.

> IOW, what I think that code really wants to do is perhaps something like
> 
>         /* under  &all_mddevs_lock spinlock */
>         .. remove from all lists so that it can't be found ..
> 
>         .. but add it to the md_misc_wq so that people will wait for it ..
> 
>                 INIT_WORK_LOCKED(&mddev->del_work, mddev_delayed_delete);
>                 queue_work(md_misc_wq, &mddev->del_work);
>         ...
> 
>         spin_unlock(&all_mddevs_lock);
> 
>         /* mddev is still guaranteed live */
>         bioset_exit(&mddev->bio_set);
>         bioset_exit(&mddev->sync_set);
> 
>         /* NOW we can release it */
>         if (queued)
>                 unlock_work(&mddev->del_work);
>         else
>                 kfree(mddev);
>              Linus

Looking at the code, the fundamental problem seems to be that it's
weaving different parts of sync and async paths.  I don't understand
why it'd punt the destructin of mddev but destroy biosets
synchronously.  Can't it do sth like the following?

	lock;

	if (need to be async) {
		/* async destruction path */
		INIT_WORK(...);
		queue_work(...);	/* the work does bioset_exits */
		unlock;
		return;
	}

	/* sync destructino path */
	do whatever needs to be done under lock;
	unlock;
	bioset_exit(...);
	bioset_exit(...);
	kfree(mddev);

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [GIT PULL] Block changes for 4.18-rc
  2018-06-04 18:20       ` Tejun Heo
@ 2018-06-04 18:22         ` Linus Torvalds
  2018-06-04 18:25           ` Jens Axboe
  0 siblings, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2018-06-04 18:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Kent Overstreet, linux-block, NeilBrown,
	Arnd Bergmann

On Mon, Jun 4, 2018 at 11:20 AM Tejun Heo <tj@kernel.org> wrote:
>
>
> Looking at the code, the fundamental problem seems to be that it's
> weaving different parts of sync and async paths.  I don't understand
> why it'd punt the destructin of mddev but destroy biosets
> synchronously.  Can't it do sth like the following?

Yeah, that looks like the right thing to do.

Jens/Kent?

             Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [GIT PULL] Block changes for 4.18-rc
  2018-06-04 18:22         ` Linus Torvalds
@ 2018-06-04 18:25           ` Jens Axboe
  2018-06-04 19:04             ` Kent Overstreet
  0 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2018-06-04 18:25 UTC (permalink / raw)
  To: Linus Torvalds, Tejun Heo
  Cc: Kent Overstreet, linux-block, NeilBrown, Arnd Bergmann

On 6/4/18 12:22 PM, Linus Torvalds wrote:
> On Mon, Jun 4, 2018 at 11:20 AM Tejun Heo <tj@kernel.org> wrote:
>>
>>
>> Looking at the code, the fundamental problem seems to be that it's
>> weaving different parts of sync and async paths.  I don't understand
>> why it'd punt the destructin of mddev but destroy biosets
>> synchronously.  Can't it do sth like the following?
> 
> Yeah, that looks like the right thing to do.
> 
> Jens/Kent?

I agree with Tejun, we already discussed this offline.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [GIT PULL] Block changes for 4.18-rc
  2018-06-04 18:25           ` Jens Axboe
@ 2018-06-04 19:04             ` Kent Overstreet
  2018-06-05  0:42               ` Linus Torvalds
  0 siblings, 1 reply; 17+ messages in thread
From: Kent Overstreet @ 2018-06-04 19:04 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Tejun Heo, linux-block, NeilBrown, Arnd Bergmann

On Mon, Jun 04, 2018 at 12:25:54PM -0600, Jens Axboe wrote:
> On 6/4/18 12:22 PM, Linus Torvalds wrote:
> > On Mon, Jun 4, 2018 at 11:20 AM Tejun Heo <tj@kernel.org> wrote:
> >>
> >>
> >> Looking at the code, the fundamental problem seems to be that it's
> >> weaving different parts of sync and async paths.  I don't understand
> >> why it'd punt the destructin of mddev but destroy biosets
> >> synchronously.  Can't it do sth like the following?
> > 
> > Yeah, that looks like the right thing to do.
> > 
> > Jens/Kent?
> 
> I agree with Tejun, we already discussed this offline.

I don't see any good reason for the exit path to have two or three variations,
depending on how you count. My preference would be for a patch that does this:


diff --git a/drivers/md/md.c b/drivers/md/md.c
index fc692b7128..f18d783785 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -510,36 +510,24 @@ static void mddev_delayed_delete(struct work_struct *ws);
 
 static void mddev_put(struct mddev *mddev)
 {
-	struct bio_set bs, sync_bs;
-
-	memset(&bs, 0, sizeof(bs));
-	memset(&sync_bs, 0, sizeof(sync_bs));
-
 	if (!atomic_dec_and_lock(&mddev->active, &all_mddevs_lock))
 		return;
+
 	if (!mddev->raid_disks && list_empty(&mddev->disks) &&
 	    mddev->ctime == 0 && !mddev->hold_active) {
 		/* Array is not configured at all, and not held active,
 		 * so destroy it */
 		list_del_init(&mddev->all_mddevs);
-		bs = mddev->bio_set;
-		sync_bs = mddev->sync_set;
-		memset(&mddev->bio_set, 0, sizeof(mddev->bio_set));
-		memset(&mddev->sync_set, 0, sizeof(mddev->sync_set));
-		if (mddev->gendisk) {
-			/* We did a probe so need to clean up.  Call
-			 * queue_work inside the spinlock so that
-			 * flush_workqueue() after mddev_find will
-			 * succeed in waiting for the work to be done.
-			 */
-			INIT_WORK(&mddev->del_work, mddev_delayed_delete);
-			queue_work(md_misc_wq, &mddev->del_work);
-		} else
-			kfree(mddev);
+
+		/* We did a probe so need to clean up.  Call
+		 * queue_work inside the spinlock so that
+		 * flush_workqueue() after mddev_find will
+		 * succeed in waiting for the work to be done.
+		 */
+		INIT_WORK(&mddev->del_work, mddev_delayed_delete);
+		queue_work(md_misc_wq, &mddev->del_work);
 	}
 	spin_unlock(&all_mddevs_lock);
-	bioset_exit(&bs);
-	bioset_exit(&sync_bs);
 }
 
 static void md_safemode_timeout(struct timer_list *t);

However, that's not correct as is because mddev_delayed_put() calls
kobject_put(), and the kobject isn't initialized when the mddev is first
allocated, it's initialized when the gendisk is allocated... that isn't hard to
fix but that's getting into real refactoring that I'll need to put actual work
into testing.

However, I have looked at where mddev_put() is called from and I think the stack
usage is fine for now - unless I've missed something it's not called from e.g.
under generic_make_request() anywhere, it's just called from sysfs code and
ioctls and things like that, where the stack is going to be pretty much empty.

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [GIT PULL] Block changes for 4.18-rc
  2018-06-04 15:54   ` Jens Axboe
  2018-06-04 16:24     ` Linus Torvalds
@ 2018-06-04 21:16     ` NeilBrown
  2018-06-04 21:21       ` Kent Overstreet
  1 sibling, 1 reply; 17+ messages in thread
From: NeilBrown @ 2018-06-04 21:16 UTC (permalink / raw)
  To: Jens Axboe, Linus Torvalds, Kent Overstreet
  Cc: linux-block, Arnd Bergmann, Shaohua Li

[-- Attachment #1: Type: text/plain, Size: 3067 bytes --]

On Mon, Jun 04 2018, Jens Axboe wrote:

> On 6/4/18 9:51 AM, Linus Torvalds wrote:
>> On Sun, Jun 3, 2018 at 5:42 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>>  drivers/md/md.c                                 |  61 +--
>>>  drivers/md/md.h                                 |   4 +-
>> 
>> So I've pulled this, but I get a new warning:
>> 
>>   drivers/md/md.c: In function ‘mddev_put’:
>>   drivers/md/md.c:543:1: warning: the frame size of 2112 bytes is
>> larger than 2048 bytes [-Wframe-larger-than=]
>> 
>> which seems to be due to commit afeee514ce7f ("md: convert to
>> bioset_init()/mempool_init()").
>> 
>> As of that commit, mddev_put() now allocates *two* "struct bio_set"
>> structures on the stack, and those really aren't small. Yes, part of
>> it is that I do my test-builds with all that debug stuff, but those
>> structures have several embedded mempool_t members, and they really
>> are pretty sizable.
>> 
>> And I _really_ do not think it's acceptable to have that kind of stack
>> usage in there. I'm not sure you can trigger mddev_put() in the IO
>> paths, but even without that I don't think it's acceptable.
>> 
>> Also, the code simply looks like complete and utter garbage. It does
>> 
>>                 bs = mddev->bio_set;
>>                 sync_bs = mddev->sync_set;
>> 
>> to _copy_ those structures, and then does bioset_exit() on them. WTF?
>> 
>> Why the hell doesn't it just do biset_exit() on the originals instead,
>> before freeing the mddev?
>> 
>> I've pulled this, but this needs to be fixed. That is just broken
>> garbage right now. No way in hell is it acceptable to waste 2kB of
>> stack space on something idiotic like this.
>
> Works is already underway to get this fixed up. I agree that the
> excessive stack usage isn't great, but the code in that function
> is pretty miserable, as you say. That's not new with the patch,
> the conversion just carried that forward.
>
> CC'ing Neil to get his input on how best to clean that up, I'd
> be much more comfortable with that logic improved rather than
> just catering to the stack usage issue. Also include Arnd, as
> he had a test patch for it.

Cc:ing Shaohua as well, as he is the current maintainer, and authored
  
  Commit: 7184ef8bab0c ("MD: fix sleep in atomic")

which seems most relevant.
The justification given in that patch is that bioset_free() will take a
mutex.
This must have been in destory_workqueue() when destroying
bs->rescue_workqueue.
md biosets don't have a work queue (BIOSET_NEED_RESCUER isn't set),
so these calls will no longer take a mutex.  So this commit
can now be reverted.  I think that will clean up this code suitably.

I really should get back to BIOSET_NEED_RESCUER and see if I can discard
it completely.
Kent - the only remaining user is bcache, and the main reason I haven't
removed it is that I cannot follow the code (or at least, I couldn't
last time I tried).  Are you able to see if it is still needed?

Thanks,
NeilBrown

>
> -- 
> Jens Axboe

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [GIT PULL] Block changes for 4.18-rc
  2018-06-04 21:16     ` NeilBrown
@ 2018-06-04 21:21       ` Kent Overstreet
  2018-06-04 22:52         ` NeilBrown
  0 siblings, 1 reply; 17+ messages in thread
From: Kent Overstreet @ 2018-06-04 21:21 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jens Axboe, Linus Torvalds, linux-block, Arnd Bergmann,
	Shaohua Li

On Tue, Jun 05, 2018 at 07:16:51AM +1000, NeilBrown wrote:
> I really should get back to BIOSET_NEED_RESCUER and see if I can discard
> it completely.
> Kent - the only remaining user is bcache, and the main reason I haven't
> removed it is that I cannot follow the code (or at least, I couldn't
> last time I tried).  Are you able to see if it is still needed?

Oh man, rescueurs make my head hurt. I remember you changed something so that
they weren't needed most of the time, but I can't remember what you did - can
you remind me what that was and whatever else you can remember offhand?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [GIT PULL] Block changes for 4.18-rc
  2018-06-04 21:21       ` Kent Overstreet
@ 2018-06-04 22:52         ` NeilBrown
  2018-06-05  0:34           ` Kent Overstreet
  0 siblings, 1 reply; 17+ messages in thread
From: NeilBrown @ 2018-06-04 22:52 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Jens Axboe, Linus Torvalds, linux-block, Arnd Bergmann,
	Shaohua Li

[-- Attachment #1: Type: text/plain, Size: 1936 bytes --]

On Mon, Jun 04 2018, Kent Overstreet wrote:

> On Tue, Jun 05, 2018 at 07:16:51AM +1000, NeilBrown wrote:
>> I really should get back to BIOSET_NEED_RESCUER and see if I can discard
>> it completely.
>> Kent - the only remaining user is bcache, and the main reason I haven't
>> removed it is that I cannot follow the code (or at least, I couldn't
>> last time I tried).  Are you able to see if it is still needed?
>
> Oh man, rescueurs make my head hurt. I remember you changed something so that
> they weren't needed most of the time, but I can't remember what you did - can
> you remind me what that was and whatever else you can remember offhand?

(and closures make my head hurt :-)

There were two closely related changes.
One is make sure that generic_make_request processed bios strictly
in a depth-first order.  The other is to ensure that all (stacking)
block device drivers never block in their make_request_fn waiting
for something that have themselves submitted.

Together these mean that a thread in generic_make_request will never
block waiting for a bio that might be queued in current->bio_list.
The bio chosen by generic_make_request will be a leaf and won't depend
on anything else in the list, and the handler for that request won't
block after submitting a lower-level request.

The  second property is the important one for drivers and the one we
need to ensure that bcache follows.
A common bad pattern is something like:

 while (too_big(bio)) {
    split = bio_spilt(bio,...))
    process(split)
 }
 process(bio)

That needs to be more like
  if (too_big(bio)) {
     split = bio_split(bio,...);
     generic_make_request(bio);
     bio = split;
  }
  process(bio)

so that generic_make_request() gets to do that while-loop
and can control the order in which bios are handled.

I cannot convince myself that bcache doesn't have something similar to
that while loop (structured with closures).

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [GIT PULL] Block changes for 4.18-rc
  2018-06-04 22:52         ` NeilBrown
@ 2018-06-05  0:34           ` Kent Overstreet
  0 siblings, 0 replies; 17+ messages in thread
From: Kent Overstreet @ 2018-06-05  0:34 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jens Axboe, Linus Torvalds, linux-block, Arnd Bergmann,
	Shaohua Li

On Tue, Jun 05, 2018 at 08:52:32AM +1000, NeilBrown wrote:
> On Mon, Jun 04 2018, Kent Overstreet wrote:
> 
> > On Tue, Jun 05, 2018 at 07:16:51AM +1000, NeilBrown wrote:
> >> I really should get back to BIOSET_NEED_RESCUER and see if I can discard
> >> it completely.
> >> Kent - the only remaining user is bcache, and the main reason I haven't
> >> removed it is that I cannot follow the code (or at least, I couldn't
> >> last time I tried).  Are you able to see if it is still needed?
> >
> > Oh man, rescueurs make my head hurt. I remember you changed something so that
> > they weren't needed most of the time, but I can't remember what you did - can
> > you remind me what that was and whatever else you can remember offhand?
> 
> (and closures make my head hurt :-)

Fair enough :)

> There were two closely related changes.
> One is make sure that generic_make_request processed bios strictly
> in a depth-first order.  The other is to ensure that all (stacking)
> block device drivers never block in their make_request_fn waiting
> for something that have themselves submitted.
> 
> Together these mean that a thread in generic_make_request will never
> block waiting for a bio that might be queued in current->bio_list.
> The bio chosen by generic_make_request will be a leaf and won't depend
> on anything else in the list, and the handler for that request won't
> block after submitting a lower-level request.
> 
> The  second property is the important one for drivers and the one we
> need to ensure that bcache follows.
> A common bad pattern is something like:
> 
>  while (too_big(bio)) {
>     split = bio_spilt(bio,...))
>     process(split)
>  }
>  process(bio)
> 
> That needs to be more like
>   if (too_big(bio)) {
>      split = bio_split(bio,...);
>      generic_make_request(bio);
>      bio = split;
>   }
>   process(bio)
> 
> so that generic_make_request() gets to do that while-loop
> and can control the order in which bios are handled.
> 
> I cannot convince myself that bcache doesn't have something similar to
> that while loop (structured with closures).

Oh yeah, it's all coming back to me...

so, the problem isn't really closures, it's that we don't know where we have to
split until after we've already allocated all our state for the request and done
almost all our work for the request - also, the state we allocate is per
_incoming_ bio, not per split we submit.

So for the read side, it looks like
 - allocate a struct search (man, so much of the naming in bcache is crap, it's
   painful going back from bcachefs)
 - traverse the btree to the start of this request
 - iterate over extents, splitting and submitting for each extent the original
   bio overlaps with

so, converting it to your 2nd pattern would mean changing that code so that when
we walk the btree and find an extent we overlap with, if the extent doesn't
cover the whole bio, we split, submit the split, resubmit the original bio - and
then return and _restart the entire btree traversal_, for each split. ouch.

the write path is different, but the problem is roughly the same in that we
don't know where we're going to need to split until we're deep inside the write
path, allocating space for where the write will go - we split if the bucket
we're currently writing to doesn't have enough space. But again, we don't find
that out until well after allocating data structures for that write, taking
locks for space allocation, etc. etc...

So the approach of just resubmitting the rest of the bio just flat out won't
work, we've already allocated our data structures and they're tied to that
bio... it would require massive changes and deep surgery on how the entire
control flow works...

Also unconditionally doing this for every bio split (bailing out and restarting
the btree traversal/write path) would be pretty crappy performance wise, e.g.
when doing big sequential reads of data that was written all with random 4k
writes...

but...

the problem we're trying to avoid is just blocking while allocating memory,
while we've got bios blocked on current->bio_list, right?

So another approach would be to checking when we're allocating a bio split if
that allocation might deadlock - if so, do it with GFP_NOWAIT, and if it fails -
just punt the read or write request to workqueue.

this is something the code can actually do the way it's structured now... making
use of closures :)

cache_lookup() calls bch_btree_map_keys() to walk the extents the request
covers, and restarts itself if cache_lookup_fn() returns -EAGAIN - this is
because the btree lookup might require reading in btree nodes, so we can't block
on a btree node read while running under generic_make_request().

So this (untested!) patch should be sufficient for the read side:

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index ae67f5fa80..9aed34d050 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -538,6 +538,14 @@ static int cache_lookup_fn(struct btree_op *op, struct btree *b, struct bkey *k)
 	if (!KEY_SIZE(k))
 		return MAP_CONTINUE;
 
+	n = bio_next_split(bio, min_t(uint64_t, INT_MAX,
+				      KEY_OFFSET(k) - bio->bi_iter.bi_sector),
+		!current->bio_list || bio_list_empty(current->bio_list)
+		? GFP_NOIO : GFP_NOWAIT,
+		&s->d->bio_split);
+	if (!n)
+		return -EAGAIN;
+
 	/* XXX: figure out best pointer - for multiple cache devices */
 	ptr = 0;
 
@@ -546,10 +554,6 @@ static int cache_lookup_fn(struct btree_op *op, struct btree *b, struct bkey *k)
 	if (KEY_DIRTY(k))
 		s->read_dirty_data = true;
 
-	n = bio_next_split(bio, min_t(uint64_t, INT_MAX,
-				      KEY_OFFSET(k) - bio->bi_iter.bi_sector),
-			   GFP_NOIO, &s->d->bio_split);
-
 	bio_key = &container_of(n, struct bbio, bio)->key;
 	bch_bkey_copy_single_ptr(bio_key, k, ptr);
 

It's a bit more complicated for the write path - right now, we split after
allocating space (bch_data_insert_start() -> bch_alloc_sectors()). To be able to
punt to workqueue if the split fails, bch_alloc_sectors() needs to be broken up
into bch_alloc_sectors_start() to pick and lock a write point and allocate a
bucket if necessary, so that bch_data_insert_start() can find out how much space
there is available and attempt the split before actually consuming that space.

That's how the sector allocator already works in bcachefs though, so it's a
straightforward change...

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [GIT PULL] Block changes for 4.18-rc
  2018-06-04 19:04             ` Kent Overstreet
@ 2018-06-05  0:42               ` Linus Torvalds
  2018-06-05  0:44                 ` Linus Torvalds
  2018-06-05  0:56                 ` Kent Overstreet
  0 siblings, 2 replies; 17+ messages in thread
From: Linus Torvalds @ 2018-06-05  0:42 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Jens Axboe, Tejun Heo, linux-block, NeilBrown, Arnd Bergmann

[-- Attachment #1: Type: text/plain, Size: 629 bytes --]

On Mon, Jun 4, 2018 at 12:04 PM Kent Overstreet
<kent.overstreet@gmail.com> wrote:
>
> However, that's not correct as is because mddev_delayed_put() calls
> kobject_put(), and the kobject isn't initialized when the mddev is first
> allocated, it's initialized when the gendisk is allocated... that isn't hard to
> fix but that's getting into real refactoring that I'll need to put actual work
> into testing.

Well, it also removes the bioset_exit() calls entirely.

How about just the attached?

It simply does it as two different cases, and adds the bioset_exit()
calls to mddev_delayed_delete().

Hmm?

                 Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 1870 bytes --]

 drivers/md/md.c | 23 ++++++++++-------------
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index fc692b7128bb..6a2494065ab2 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -510,11 +510,6 @@ static void mddev_delayed_delete(struct work_struct *ws);
 
 static void mddev_put(struct mddev *mddev)
 {
-	struct bio_set bs, sync_bs;
-
-	memset(&bs, 0, sizeof(bs));
-	memset(&sync_bs, 0, sizeof(sync_bs));
-
 	if (!atomic_dec_and_lock(&mddev->active, &all_mddevs_lock))
 		return;
 	if (!mddev->raid_disks && list_empty(&mddev->disks) &&
@@ -522,10 +517,6 @@ static void mddev_put(struct mddev *mddev)
 		/* Array is not configured at all, and not held active,
 		 * so destroy it */
 		list_del_init(&mddev->all_mddevs);
-		bs = mddev->bio_set;
-		sync_bs = mddev->sync_set;
-		memset(&mddev->bio_set, 0, sizeof(mddev->bio_set));
-		memset(&mddev->sync_set, 0, sizeof(mddev->sync_set));
 		if (mddev->gendisk) {
 			/* We did a probe so need to clean up.  Call
 			 * queue_work inside the spinlock so that
@@ -534,12 +525,15 @@ static void mddev_put(struct mddev *mddev)
 			 */
 			INIT_WORK(&mddev->del_work, mddev_delayed_delete);
 			queue_work(md_misc_wq, &mddev->del_work);
-		} else
-			kfree(mddev);
+			spin_unlock(&all_mddevs_lock);
+			return;
+		}
 	}
 	spin_unlock(&all_mddevs_lock);
-	bioset_exit(&bs);
-	bioset_exit(&sync_bs);
+
+	bioset_exit(&mddev->bio_set);
+	bioset_exit(&mddev->sync_set);
+	kfree(mddev);
 }
 
 static void md_safemode_timeout(struct timer_list *t);
@@ -5234,6 +5228,9 @@ static void mddev_delayed_delete(struct work_struct *ws)
 {
 	struct mddev *mddev = container_of(ws, struct mddev, del_work);
 
+	bioset_exit(&mddev->bio_set);
+	bioset_exit(&mddev->sync_set);
+
 	sysfs_remove_group(&mddev->kobj, &md_bitmap_group);
 	kobject_del(&mddev->kobj);
 	kobject_put(&mddev->kobj);

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [GIT PULL] Block changes for 4.18-rc
  2018-06-05  0:42               ` Linus Torvalds
@ 2018-06-05  0:44                 ` Linus Torvalds
  2018-06-05  0:56                 ` Kent Overstreet
  1 sibling, 0 replies; 17+ messages in thread
From: Linus Torvalds @ 2018-06-05  0:44 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Jens Axboe, Tejun Heo, linux-block, NeilBrown, Arnd Bergmann

On Mon, Jun 4, 2018 at 5:42 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> How about just the attached?

Note: it probably goes without saying that the patch was entirely
untested, but it does build, and it does get rid of the insane stack
use.

                Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [GIT PULL] Block changes for 4.18-rc
  2018-06-05  0:42               ` Linus Torvalds
  2018-06-05  0:44                 ` Linus Torvalds
@ 2018-06-05  0:56                 ` Kent Overstreet
  2018-06-05  1:00                   ` Linus Torvalds
  2018-06-07  1:45                   ` Jens Axboe
  1 sibling, 2 replies; 17+ messages in thread
From: Kent Overstreet @ 2018-06-05  0:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Tejun Heo, linux-block, NeilBrown, Arnd Bergmann

On Mon, Jun 04, 2018 at 05:42:04PM -0700, Linus Torvalds wrote:
> On Mon, Jun 4, 2018 at 12:04 PM Kent Overstreet
> <kent.overstreet@gmail.com> wrote:
> >
> > However, that's not correct as is because mddev_delayed_put() calls
> > kobject_put(), and the kobject isn't initialized when the mddev is first
> > allocated, it's initialized when the gendisk is allocated... that isn't hard to
> > fix but that's getting into real refactoring that I'll need to put actual work
> > into testing.
> 
> Well, it also removes the bioset_exit() calls entirely.

Yeah, I realized that when I went back to finish that patch
> 
> How about just the attached?
> 
> It simply does it as two different cases, and adds the bioset_exit()
> calls to mddev_delayed_delete().

Oh right, just taking advantage of the fact that just the queue_work() needs to
be under the spinlock, not the actual free in the other case.

I like your patch for a less invasive version, but I did finish and test my
version, which deletes more code :)

I've already gone to the trouble of coming up with a VM smoketest, so I can test
yours too... I don't really have a strong opinion on which patch should go in.

 drivers/md/md.c | 53 ++++++++++++++++++-----------------------------------
 1 file changed, 18 insertions(+), 35 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index fc692b7128..22203eba1e 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -84,6 +84,8 @@ static void autostart_arrays(int part);
 static LIST_HEAD(pers_list);
 static DEFINE_SPINLOCK(pers_lock);
 
+static struct kobj_type md_ktype;
+
 struct md_cluster_operations *md_cluster_ops;
 EXPORT_SYMBOL(md_cluster_ops);
 struct module *md_cluster_mod;
@@ -510,11 +512,6 @@ static void mddev_delayed_delete(struct work_struct *ws);
 
 static void mddev_put(struct mddev *mddev)
 {
-	struct bio_set bs, sync_bs;
-
-	memset(&bs, 0, sizeof(bs));
-	memset(&sync_bs, 0, sizeof(sync_bs));
-
 	if (!atomic_dec_and_lock(&mddev->active, &all_mddevs_lock))
 		return;
 	if (!mddev->raid_disks && list_empty(&mddev->disks) &&
@@ -522,30 +519,23 @@ static void mddev_put(struct mddev *mddev)
 		/* Array is not configured at all, and not held active,
 		 * so destroy it */
 		list_del_init(&mddev->all_mddevs);
-		bs = mddev->bio_set;
-		sync_bs = mddev->sync_set;
-		memset(&mddev->bio_set, 0, sizeof(mddev->bio_set));
-		memset(&mddev->sync_set, 0, sizeof(mddev->sync_set));
-		if (mddev->gendisk) {
-			/* We did a probe so need to clean up.  Call
-			 * queue_work inside the spinlock so that
-			 * flush_workqueue() after mddev_find will
-			 * succeed in waiting for the work to be done.
-			 */
-			INIT_WORK(&mddev->del_work, mddev_delayed_delete);
-			queue_work(md_misc_wq, &mddev->del_work);
-		} else
-			kfree(mddev);
+
+		/*
+		 * Call queue_work inside the spinlock so that
+		 * flush_workqueue() after mddev_find will succeed in waiting
+		 * for the work to be done.
+		 */
+		INIT_WORK(&mddev->del_work, mddev_delayed_delete);
+		queue_work(md_misc_wq, &mddev->del_work);
 	}
 	spin_unlock(&all_mddevs_lock);
-	bioset_exit(&bs);
-	bioset_exit(&sync_bs);
 }
 
 static void md_safemode_timeout(struct timer_list *t);
 
 void mddev_init(struct mddev *mddev)
 {
+	kobject_init(&mddev->kobj, &md_ktype);
 	mutex_init(&mddev->open_mutex);
 	mutex_init(&mddev->reconfig_mutex);
 	mutex_init(&mddev->bitmap_info.mutex);
@@ -5215,6 +5205,8 @@ static void md_free(struct kobject *ko)
 		put_disk(mddev->gendisk);
 	percpu_ref_exit(&mddev->writes_pending);
 
+	bioset_exit(&mddev->bio_set);
+	bioset_exit(&mddev->sync_set);
 	kfree(mddev);
 }
 
@@ -5348,8 +5340,7 @@ static int md_alloc(dev_t dev, char *name)
 	mutex_lock(&mddev->open_mutex);
 	add_disk(disk);
 
-	error = kobject_init_and_add(&mddev->kobj, &md_ktype,
-				     &disk_to_dev(disk)->kobj, "%s", "md");
+	error = kobject_add(&mddev->kobj, &disk_to_dev(disk)->kobj, "%s", "md");
 	if (error) {
 		/* This isn't possible, but as kobject_init_and_add is marked
 		 * __must_check, we must do something with the result
@@ -5506,7 +5497,7 @@ int md_run(struct mddev *mddev)
 	if (!bioset_initialized(&mddev->sync_set)) {
 		err = bioset_init(&mddev->sync_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
 		if (err)
-			goto abort;
+			return err;
 	}
 
 	spin_lock(&pers_lock);
@@ -5519,8 +5510,7 @@ int md_run(struct mddev *mddev)
 		else
 			pr_warn("md: personality for level %s is not loaded!\n",
 				mddev->clevel);
-		err = -EINVAL;
-		goto abort;
+		return -EINVAL;
 	}
 	spin_unlock(&pers_lock);
 	if (mddev->level != pers->level) {
@@ -5533,8 +5523,7 @@ int md_run(struct mddev *mddev)
 	    pers->start_reshape == NULL) {
 		/* This personality cannot handle reshaping... */
 		module_put(pers->owner);
-		err = -EINVAL;
-		goto abort;
+		return -EINVAL;
 	}
 
 	if (pers->sync_request) {
@@ -5603,7 +5592,7 @@ int md_run(struct mddev *mddev)
 		mddev->private = NULL;
 		module_put(pers->owner);
 		bitmap_destroy(mddev);
-		goto abort;
+		return err;
 	}
 	if (mddev->queue) {
 		bool nonrot = true;
@@ -5665,12 +5654,6 @@ int md_run(struct mddev *mddev)
 	sysfs_notify_dirent_safe(mddev->sysfs_action);
 	sysfs_notify(&mddev->kobj, NULL, "degraded");
 	return 0;
-
-abort:
-	bioset_exit(&mddev->bio_set);
-	bioset_exit(&mddev->sync_set);
-
-	return err;
 }
 EXPORT_SYMBOL_GPL(md_run);
 

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [GIT PULL] Block changes for 4.18-rc
  2018-06-05  0:56                 ` Kent Overstreet
@ 2018-06-05  1:00                   ` Linus Torvalds
  2018-06-07  1:45                   ` Jens Axboe
  1 sibling, 0 replies; 17+ messages in thread
From: Linus Torvalds @ 2018-06-05  1:00 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Jens Axboe, Tejun Heo, linux-block, NeilBrown, Arnd Bergmann

On Mon, Jun 4, 2018 at 5:56 PM Kent Overstreet
<kent.overstreet@gmail.com> wrote:
>
> I like your patch for a less invasive version, but I did finish and test my
> version, which deletes more code :)

I certainly won't object to that.

Your patch looks fine, and looks like the right thing in the long run anyway.

Plus, it's apparently even tested. What a concept.

Regardless, I'll sleep better tonight that this will all be fixed one
way or the other.

Thanks,

                Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [GIT PULL] Block changes for 4.18-rc
  2018-06-05  0:56                 ` Kent Overstreet
  2018-06-05  1:00                   ` Linus Torvalds
@ 2018-06-07  1:45                   ` Jens Axboe
  1 sibling, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2018-06-07  1:45 UTC (permalink / raw)
  To: Kent Overstreet, Linus Torvalds
  Cc: Tejun Heo, linux-block, NeilBrown, Arnd Bergmann

On 6/4/18 6:56 PM, Kent Overstreet wrote:
> On Mon, Jun 04, 2018 at 05:42:04PM -0700, Linus Torvalds wrote:
>> On Mon, Jun 4, 2018 at 12:04 PM Kent Overstreet
>> <kent.overstreet@gmail.com> wrote:
>>>
>>> However, that's not correct as is because mddev_delayed_put() calls
>>> kobject_put(), and the kobject isn't initialized when the mddev is first
>>> allocated, it's initialized when the gendisk is allocated... that isn't hard to
>>> fix but that's getting into real refactoring that I'll need to put actual work
>>> into testing.
>>
>> Well, it also removes the bioset_exit() calls entirely.
> 
> Yeah, I realized that when I went back to finish that patch
>>
>> How about just the attached?
>>
>> It simply does it as two different cases, and adds the bioset_exit()
>> calls to mddev_delayed_delete().
> 
> Oh right, just taking advantage of the fact that just the queue_work() needs to
> be under the spinlock, not the actual free in the other case.
> 
> I like your patch for a less invasive version, but I did finish and test my
> version, which deletes more code :)
> 
> I've already gone to the trouble of coming up with a VM smoketest, so I can test
> yours too... I don't really have a strong opinion on which patch should go in.

Kent, care to submit a proper version? We should get this in.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2018-06-07  1:45 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-06-04  0:42 [GIT PULL] Block changes for 4.18-rc Jens Axboe
2018-06-04 15:51 ` Linus Torvalds
2018-06-04 15:54   ` Jens Axboe
2018-06-04 16:24     ` Linus Torvalds
2018-06-04 18:20       ` Tejun Heo
2018-06-04 18:22         ` Linus Torvalds
2018-06-04 18:25           ` Jens Axboe
2018-06-04 19:04             ` Kent Overstreet
2018-06-05  0:42               ` Linus Torvalds
2018-06-05  0:44                 ` Linus Torvalds
2018-06-05  0:56                 ` Kent Overstreet
2018-06-05  1:00                   ` Linus Torvalds
2018-06-07  1:45                   ` Jens Axboe
2018-06-04 21:16     ` NeilBrown
2018-06-04 21:21       ` Kent Overstreet
2018-06-04 22:52         ` NeilBrown
2018-06-05  0:34           ` Kent Overstreet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox