* [GIT PULL] Block changes for 4.18-rc
@ 2018-06-04 0:42 Jens Axboe
2018-06-04 15:51 ` Linus Torvalds
0 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2018-06-04 0:42 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-block@vger.kernel.org
Hi Linus,
This is the main pull request for the 4.18 merge window. This pull
request contains:
- Series from Christoph, cleaning up how we pass around gfp_t and
blk_mq_req_flags_t.
- Series from Christoph, preparing us to defer scheduler attach.
- Series from Christoph, cleaing up drivers handling of bounce buffers
- Series from Christoph/Bart/Keith, fixing timeout handling corner
cases.
- Series of bcache fixes, by way of Coly.
- Series from Kent, with both some prep work for bcachefs and some
block layer optimizations.
- Series from Kent, converting users of bio_sets to using embedded
structs.
- Patches from Paolo/Davide/Filippo, with fixes for the BFQ io
scheduler.
- Set of lightnvm fixes and improvements. By way of Matias,
contributions from Hans and Javier.
- Series from me, adding discard throttling to blk-wbt.
- Series from me/Omar/Ming on sbitmap blk-mq-tag handling.
- Removal of the sparc jsflash block driver, acked by DaveM.
- Kyber scheduler improvement from Jianchao, making it more
friendly wrt merging.
- Conversion of symbolic proc permissions to octal, from Joe
Perches. Previously the block parts were a mix of both.
- Series of nbd fixes from Josef and Kevin Vigor.
- Series from Omar, unifying how we handle the various kinds of
timestamps that the block core and utility code uses.
- Three NVMe pull requests from Keith and Christoph, bringing
AEN to feature completeness, file backed namespaces, cq/sq
lock split, and various fixes.
- Various little fixes and improvements all over the map.
git://git.kernel.dk/linux-block.git tags/for-4.18/block-20180603
----------------------------------------------------------------
Andy Shevchenko (3):
bcache: Move couple of string arrays to sysfs.c
bcache: Move couple of functions to sysfs.c
bcache: Replace bch_read_string_list() by __sysfs_match_string()
Anna-Maria Gleixner (1):
block: Remove redundant WARN_ON()
Bart Van Assche (1):
blkdev_report_zones_ioctl(): Use vmalloc() to allocate large buffers
Chaitanya Kulkarni (4):
nvmet: make a few error messages more generic
nvmet: remove duplicate NULL initialization for req->ns
nvmet: add simple file backed ns support
nvmet-loop: use nr_phys_segments when map rq to sgl
Chengguang Xu (1):
blk-throttle: return proper bool type to caller instead of 0/1
Christoph Hellwig (49):
mtip32xx: don't use block layer bounce buffers
DAC960: don't use block layer bounce buffers
memstick: don't call blk_queue_bounce_limit
mtd_blkdevs: handle highmem pages
aoe: handle highmem pages
jsflash: handle highmem pages
ps3disk: handle highmem pages
memstick: remove unused variables
scsi/osd: remove the gfp argument to osd_start_request
block: fix __get_request documentation
block: sanitize blk_get_request calling conventions
block: pass an explicit gfp_t to get_request
block: use GFP_NOIO instead of __GFP_DIRECT_RECLAIM
block: consistently use GFP_NOIO instead of __GFP_NORECLAIM
nvme: mark the result argument to nvme_complete_async_event volatile
nvme-pci: simplify nvme_cqe_valid
libata: remove ata_scsi_timed_out
block: rename BLK_EH_NOT_HANDLED to BLK_EH_DONE
nvme: return BLK_EH_DONE from ->timeout
nbd: complete requests from ->timeout
mtip32xx: complete requests from ->timeout
null_blk: complete requests from ->timeout
scsi_transport_fc: complete requests from ->timeout
mmc: complete requests from ->timeout
libiscsi: don't try to bypass SCSI EH
block: remove BLK_EH_HANDLED
block: document the blk_eh_timer_return values
blk-mq: simplify blk_mq_rq_timed_out
block: unexport check_disk_size_change
block: don't print a message when the device went away
block: remove parent device reference from struct bsg_class_device
nvme-pci: simplify __nvme_submit_cmd
nvme-loop: add support for multiple ports
blk-mq: only iterate over inflight requests in blk_mq_tagset_busy_iter
nvme-fabrics: allow internal passthrough command on deleting controllers
nvme.h: untangle AEN notice definitions
nvme.h: add the changed namespace list log
nvmet: add a new nvmet_zero_sgl helper
nvmet: split log page implementation
nvmet: implement the changed namespaces log
nvmet: add AEN configuration support
nvmet: mask pending AENs
nvme: mark nvme_queue_scan static
nvme: use the changed namespaces list log to clear ns data changed AENs
block: move initialization of elevator-related fields to blk_alloc_queue_node
block: unexport elevator_init/exit
block: remove the always unused name argument to elevator_init
block: move sysfs_lock into elevator_init
block: split the blk-mq case from elevator_init
Christophe JAILLET (1):
mtip32xx: Fix an error handling path in 'mtip_pci_probe()'
Coly Li (1):
bcache: stop bcache device when backing device is offline
Dan Melnic (1):
block/ndb: add WQ_UNBOUND to the knbd-recv workqueue
Davide Sapienza (2):
block, bfq: increase weight-raising duration for interactive apps
block, bfq: prevent soft_rt_next_start from being stuck at infinity
Filippo Muzzini (2):
block, bfq: remove wrong lock in bfq_requests_merged
block, bfq: remove the removal of 'next' rq in bfq_requests_merged
Hannes Reinecke (6):
nvme-fabrics: centralize discovery controller defaults
nvme-fabrics: allow duplicate connections to the discovery controller
nvme: fix KASAN warning when parsing host nqn
nvme: fixup memory leak in nvme_init_identify()
nvme.h: add AEN configuration symbols
nvme: submit AEN event configuration on startup
Hans Holmberg (5):
lightnvm: pblk: rework write error recovery path
lightnvm: pblk: garbage collect lines with failed writes
lightnvm: pblk: fix smeta write error path
lightnvm: pblk: only try to recover lines with written smeta
lightnvm: pblk: kick writer on new flush points
Igor Konopko (2):
lightnvm: proper error handling for pblk_bio_add_pages
lightnvm: fix partial read error path
Ivan Bornyakov (1):
nvme: host: core: fix precedence of ternary operator
James Smart (4):
nvme-fc: remove setting DNR on exception conditions
nvme-fabrics: remove unnecessary controller subnqn validation
nvmet-fc: increase LS buffer count per fc port
nvme: allow duplicate controller if prior controller being deleted
Javier González (13):
lightnvm: pblk: fail gracefully on line alloc. failure
lightnvm: pblk: recheck for bad lines at runtime
lightnvm: pblk: check read lba on gc path
lightnvm: pblk: improve error msg on corrupted LBAs
lightnvm: pblk: warn in case of corrupted write buffer
lightnvm: pblk: return NVM_ error on failed submission
lightnvm: pblk: remove unnecessary indirection
lightnvm: pblk: remove unnecessary argument
lightnvm: pblk: check for chunk size before allocating it
lightnvm: pass flag on graceful teardown to targets
lightnvm: pblk: remove dead function
lightnvm: pblk: remove unnecessary bio_get/put
lightnvm: pblk: take bitmap alloc. out of critical section
Jens Axboe (26):
nvme: only reconfigure discard if necessary
block: break discard submissions into the user defined size
blk-wbt: account any writing command as a write
blk-wbt: pass in enum wbt_flags to get_rq_wait()
blk-wbt: throttle discards like background writes
blk-mq: don't call into depth limiting for reserved tags
bfq-iosched: don't worry about reserved tags in limit_depth
bfq: calculate shallow depths at init time
bfq-iosched: remove unused variable
bfq-iosched: update shallow depth to smallest one used
kyber-iosched: update shallow depth when setting up hardware queue
sbitmap: fix race in wait batch accounting
Remove jsflash driver
nvme-pci: remove cq check after submission
nvme-pci: move ->cq_vector == -1 check outside of ->q_lock
nvme-pci: handle completions outside of the queue lock
nvme-pci: split the nvme queue lock into submission and completion locks
nvme-pci: drop IRQ disabling on submission queue lock
Merge branch 'nvme-4.18' of git://git.infradead.org/nvme into for-4.18/block
nvme-pci: fix race between poll and IRQ completions
block: move ->timeout request member
Merge branch 'nvme-4.18-2' of git://git.infradead.org/nvme into for-4.18/block
blk-mq: abstract out blk-mq-sched rq list iteration bio merge helper
block: fixup bioset_integrity_create() call
Merge branch 'nvme-4.18' of git://git.infradead.org/nvme into for-4.18/block
block: don't use blocking queue entered for recursive bio submits
Jianchao Wang (3):
nvme-pci: set nvmeq->cq_vector after alloc cq/sq
nvme-rdma: stop admin queue before freeing it
block: kyber: make kyber more friendly with merging
Joe Perches (1):
block drivers/block: Use octal not symbolic permissions
Johannes Thumshirn (3):
nvme: fc: provide a descriptive error
nvme: change order of qid and cmdid in completion trace
nvme: fix lockdep warning in nvme_mpath_clear_current_path
Josef Bacik (8):
block: fix MAINTAINERS email for nbd
nbd: fix nbd device deletion
nbd: update size when connected
nbd: use bd_set_size when updating disk size
nbd: clear_sock on netlink disconnect
nbd: fix how we set bd_invalidated
nbd: call nbd_bdev_reset instead of bd_set_size on disconnect
nbd: set discard granularity properly
Keith Busch (7):
nvme/pci: Use async_schedule for initial reset work
nvme/pci: Hold controller reference during async probe
nvme/pci: Sync controller reset for AER slot_reset
nvme-pci: Fix AER reset handling
blk-mq: Fix timeout and state order
blk-mq: Remove generation seqeunce
nvme-pci: Rate limit the nvme timeout warnings
Kent Overstreet (23):
mempool: Add mempool_init()/mempool_exit()
block: Convert bio_set to mempool_init()
block: Add bioset_init()/bioset_exit()
block: Use bioset_init() for fs_bio_set
block: Add bio_copy_data_iter(), zero_fill_bio_iter()
block: Split out bio_list_copy_data()
block: Add missing flush_dcache_page() call
block: Add warning for bi_next not NULL in bio_endio()
block: Export bio check/set pages_dirty
block: Add sysfs entry for fua support
block: convert bounce, q->bio_split to bioset_init()/mempool_init()
drbd: convert to bioset_init()/mempool_init()
pktcdvd: convert to bioset_init()/mempool_init()
lightnvm: convert to bioset_init()/mempool_init()
bcache: convert to bioset_init()/mempool_init()
md: convert to bioset_init()/mempool_init()
dm: convert to bioset_init()/mempool_init()
target: convert to bioset_init()/mempool_init()
fs: convert block_dev.c to bioset_init()
btrfs: convert to bioset_init()/mempool_init()
xfs: convert to bioset_init()/mempool_init()
block: Drop bioset_create()
dm-crypt: fix warning in shutdown path
Kevin Vigor (1):
nbd: clear DISCONNECT_REQUESTED flag once disconnection occurs.
Liu Bo (2):
null_blk: add blocking description and remove lightnvm
blk-throttle: fix potential NULL pointer dereference in throtl_select_dispatch
Marcin Dziegielewski (1):
lightnvm: pblk: add possibility to set write buffer size manually
Micah Parrish (1):
NVMe: Add Quirk Delay before CHK RDY for Seagate Nytro Flash Storage
Ming Lei (2):
blk-mq: avoid starving tag allocation after allocating process migrates
blk-mq: update nr_requests when switching to 'none' scheduler
Omar Sandoval (9):
block: move some wbt helpers to blk-wbt.c
block: pass struct request instead of struct blk_issue_stat to wbt
block: replace bio->bi_issue_stat with bio-specific type
block: get rid of struct blk_issue_stat
block: use ktime_get_ns() instead of sched_clock() for cfq and bfq
block: move blk_stat_add() to __blk_mq_end_request()
block: consolidate struct request timestamp fields
sbitmap: fix missed wakeups caused by sbitmap_queue_get_shallow()
sbitmap: warn if using smaller shallow depth than was setup
Paolo Valente (4):
block, bfq: postpone rq preparation to insert or merge
block, bfq: remove wrong check in bfq_requests_merged
block, bfq: add description of weight-raising heuristics
block, bfq: remove slow-system class
Sebastian Andrzej Siewior (1):
block: don't disable interrupts during kmap_atomic()
SeongJae Park (1):
brd: Mark as non-rotational
Tejun Heo (1):
bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue
Tetsuo Handa (1):
loop: remember whether sysfs_create_group() was done
Thomas Gleixner (1):
block: Shorten interrupt disabled regions
Wei Xu (1):
nvme: lightnvm: add granby support
Wei Yongjun (2):
nvmet: fix a typo in nvmet_file_ns_enable()
nvmet: fix error return code in nvmet_file_ns_enable()
huhai (3):
blk-mq: remove redundant insert case in blk_mq_make_request()
blk-mq: clear hctx->dispatch_from when mappings change
blk-mq: remove wrong 'unlikely' check
Documentation/block/null_blk.txt | 9 +-
Documentation/scsi/scsi_eh.txt | 15 +-
MAINTAINERS | 2 +-
arch/sparc/include/uapi/asm/jsflash.h | 40 --
block/bfq-cgroup.c | 40 +-
block/bfq-iosched.c | 478 +++++++++--------
block/bfq-iosched.h | 30 +-
block/bio-integrity.c | 29 +-
block/bio.c | 189 ++++---
block/blk-core.c | 120 ++---
block/blk-integrity.c | 12 +-
block/blk-lib.c | 12 +-
block/blk-merge.c | 29 +-
block/blk-mq-debugfs.c | 1 -
block/blk-mq-sched.c | 46 +-
block/blk-mq-sched.h | 2 -
block/blk-mq-sysfs.c | 6 +-
block/blk-mq-tag.c | 14 +-
block/blk-mq.c | 340 ++++--------
block/blk-mq.h | 42 +-
block/blk-stat.c | 10 +-
block/blk-stat.h | 45 +-
block/blk-sysfs.c | 80 +--
block/blk-throttle.c | 35 +-
block/blk-timeout.c | 6 +-
block/blk-wbt.c | 129 +++--
block/blk-wbt.h | 55 +-
block/blk-zoned.c | 8 +-
block/blk.h | 5 +-
block/bounce.c | 52 +-
block/bsg-lib.c | 6 +-
block/bsg.c | 44 +-
block/cfq-iosched.c | 66 ++-
block/deadline-iosched.c | 3 +-
block/elevator.c | 101 ++--
block/genhd.c | 37 +-
block/kyber-iosched.c | 199 +++++--
block/mq-deadline.c | 3 +-
block/partition-generic.c | 26 +-
block/scsi_ioctl.c | 10 +-
drivers/ata/libata-eh.c | 51 --
drivers/block/DAC960.c | 11 +-
drivers/block/DAC960.h | 1 -
drivers/block/aoe/aoeblk.c | 11 +-
drivers/block/aoe/aoecmd.c | 3 +-
drivers/block/brd.c | 10 +-
drivers/block/drbd/drbd_bitmap.c | 5 +-
drivers/block/drbd/drbd_debugfs.c | 20 +-
drivers/block/drbd/drbd_int.h | 10 +-
drivers/block/drbd/drbd_main.c | 73 +--
drivers/block/drbd/drbd_receiver.c | 6 +-
drivers/block/drbd/drbd_req.c | 4 +-
drivers/block/drbd/drbd_req.h | 2 +-
drivers/block/floppy.c | 2 +-
drivers/block/loop.c | 17 +-
drivers/block/loop.h | 1 +
drivers/block/mtip32xx/mtip32xx.c | 29 +-
drivers/block/nbd.c | 77 ++-
drivers/block/null_blk.c | 36 +-
drivers/block/paride/pd.c | 2 +-
drivers/block/pktcdvd.c | 60 +--
drivers/block/ps3disk.c | 2 -
drivers/block/rbd.c | 44 +-
drivers/block/rsxx/core.c | 6 +-
drivers/block/sx8.c | 2 +-
drivers/block/virtio_blk.c | 8 +-
drivers/block/xen-blkback/blkback.c | 2 +-
drivers/block/xen-blkback/xenbus.c | 4 +-
drivers/block/xen-blkfront.c | 7 +-
drivers/cdrom/cdrom.c | 2 +-
drivers/ide/ide-atapi.c | 2 +-
drivers/ide/ide-cd.c | 2 +-
drivers/ide/ide-cd_ioctl.c | 2 +-
drivers/ide/ide-devsets.c | 2 +-
drivers/ide/ide-disk.c | 2 +-
drivers/ide/ide-ioctls.c | 4 +-
drivers/ide/ide-park.c | 4 +-
drivers/ide/ide-pm.c | 5 +-
drivers/ide/ide-tape.c | 4 +-
drivers/ide/ide-taskfile.c | 4 +-
drivers/lightnvm/core.c | 10 +-
drivers/lightnvm/pblk-cache.c | 10 +-
drivers/lightnvm/pblk-core.c | 233 ++++++---
drivers/lightnvm/pblk-gc.c | 112 ++--
drivers/lightnvm/pblk-init.c | 172 ++++---
drivers/lightnvm/pblk-map.c | 33 +-
drivers/lightnvm/pblk-rb.c | 48 +-
drivers/lightnvm/pblk-read.c | 146 +++---
drivers/lightnvm/pblk-recovery.c | 121 +----
drivers/lightnvm/pblk-rl.c | 29 +-
drivers/lightnvm/pblk-sysfs.c | 15 +-
drivers/lightnvm/pblk-write.c | 269 ++++++----
drivers/lightnvm/pblk.h | 58 ++-
drivers/md/bcache/bcache.h | 14 +-
drivers/md/bcache/bset.c | 13 +-
drivers/md/bcache/bset.h | 2 +-
drivers/md/bcache/btree.c | 4 +-
drivers/md/bcache/io.c | 4 +-
drivers/md/bcache/request.c | 18 +-
drivers/md/bcache/super.c | 109 ++--
drivers/md/bcache/sysfs.c | 51 +-
drivers/md/bcache/util.c | 35 --
drivers/md/bcache/util.h | 5 -
drivers/md/dm-bio-prison-v1.c | 13 +-
drivers/md/dm-bio-prison-v2.c | 13 +-
drivers/md/dm-cache-target.c | 25 +-
drivers/md/dm-core.h | 4 +-
drivers/md/dm-crypt.c | 58 +--
drivers/md/dm-integrity.c | 15 +-
drivers/md/dm-io.c | 29 +-
drivers/md/dm-kcopyd.c | 22 +-
drivers/md/dm-log-userspace-base.c | 19 +-
drivers/md/dm-mpath.c | 3 +-
drivers/md/dm-region-hash.c | 23 +-
drivers/md/dm-rq.c | 4 +-
drivers/md/dm-snap.c | 17 +-
drivers/md/dm-thin.c | 32 +-
drivers/md/dm-verity-fec.c | 55 +-
drivers/md/dm-verity-fec.h | 8 +-
drivers/md/dm-zoned-target.c | 13 +-
drivers/md/dm.c | 55 +-
drivers/md/md-faulty.c | 2 +-
drivers/md/md-linear.c | 2 +-
drivers/md/md-multipath.c | 17 +-
drivers/md/md-multipath.h | 2 +-
drivers/md/md.c | 61 +--
drivers/md/md.h | 4 +-
drivers/md/raid0.c | 5 +-
drivers/md/raid1.c | 76 ++-
drivers/md/raid1.h | 6 +-
drivers/md/raid10.c | 60 ++-
drivers/md/raid10.h | 6 +-
drivers/md/raid5-cache.c | 43 +-
drivers/md/raid5-ppl.c | 42 +-
drivers/md/raid5.c | 12 +-
drivers/md/raid5.h | 2 +-
drivers/memstick/core/ms_block.c | 6 -
drivers/memstick/core/mspro_block.c | 6 -
drivers/message/fusion/mptsas.c | 2 +-
drivers/mmc/core/block.c | 12 +-
drivers/mmc/core/queue.c | 5 +-
drivers/mtd/mtd_blkdevs.c | 20 +-
drivers/nvme/host/core.c | 157 ++++--
drivers/nvme/host/fabrics.c | 100 ++--
drivers/nvme/host/fabrics.h | 4 +-
drivers/nvme/host/fc.c | 15 +-
drivers/nvme/host/nvme.h | 9 +-
drivers/nvme/host/pci.c | 258 ++++++----
drivers/nvme/host/rdma.c | 12 +-
drivers/nvme/host/trace.h | 4 +-
drivers/nvme/target/Makefile | 4 +-
drivers/nvme/target/admin-cmd.c | 143 ++---
drivers/nvme/target/core.c | 128 +++--
drivers/nvme/target/discovery.c | 2 -
drivers/nvme/target/fabrics-cmd.c | 4 -
drivers/nvme/target/fc.c | 2 +-
drivers/nvme/target/{io-cmd.c => io-cmd-bdev.c} | 77 +--
drivers/nvme/target/io-cmd-file.c | 304 +++++++++++
drivers/nvme/target/loop.c | 52 +-
drivers/nvme/target/nvmet.h | 51 +-
drivers/s390/block/dasd.c | 6 +-
drivers/sbus/char/Kconfig | 7 -
drivers/sbus/char/Makefile | 1 -
drivers/sbus/char/jsflash.c | 658 ------------------------
drivers/scsi/gdth.c | 2 +-
drivers/scsi/libiscsi.c | 6 +-
drivers/scsi/megaraid/megaraid_sas_base.c | 2 +-
drivers/scsi/mvumi.c | 2 +-
drivers/scsi/osd/osd_initiator.c | 24 +-
drivers/scsi/osst.c | 2 +-
drivers/scsi/qla4xxx/ql4_os.c | 2 +-
drivers/scsi/scsi_error.c | 10 +-
drivers/scsi/scsi_lib.c | 4 +-
drivers/scsi/scsi_transport_fc.c | 16 +-
drivers/scsi/scsi_transport_iscsi.c | 2 +-
drivers/scsi/scsi_transport_sas.c | 19 +-
drivers/scsi/scsi_transport_srp.c | 4 +-
drivers/scsi/sg.c | 2 +-
drivers/scsi/st.c | 2 +-
drivers/scsi/ufs/ufshcd.c | 6 +-
drivers/target/target_core_iblock.c | 16 +-
drivers/target/target_core_iblock.h | 2 +-
drivers/target/target_core_pscsi.c | 3 +-
fs/block_dev.c | 24 +-
fs/btrfs/extent_io.c | 25 +-
fs/direct-io.c | 4 +-
fs/exofs/ore.c | 10 +-
fs/exofs/super.c | 2 +-
fs/nfsd/blocklayout.c | 2 +-
fs/xfs/xfs_aops.c | 2 +-
fs/xfs/xfs_aops.h | 2 +-
fs/xfs/xfs_super.c | 11 +-
include/linux/bio.h | 42 +-
include/linux/blk-mq.h | 3 +-
include/linux/blk_types.h | 51 +-
include/linux/blkdev.h | 110 ++--
include/linux/bsg-lib.h | 3 +-
include/linux/bsg.h | 6 +-
include/linux/elevator.h | 2 -
include/linux/fs.h | 2 +-
include/linux/libata.h | 2 -
include/linux/lightnvm.h | 2 +-
include/linux/mempool.h | 34 ++
include/linux/nvme.h | 16 +-
include/linux/pktcdvd.h | 2 +-
include/linux/sbitmap.h | 36 ++
include/scsi/osd_initiator.h | 6 +-
include/scsi/scsi_host.h | 2 +-
kernel/power/swap.c | 14 +-
lib/sbitmap.c | 113 ++--
mm/backing-dev.c | 18 +-
mm/mempool.c | 108 +++-
212 files changed, 4020 insertions(+), 3983 deletions(-)
delete mode 100644 arch/sparc/include/uapi/asm/jsflash.h
rename drivers/nvme/target/{io-cmd.c => io-cmd-bdev.c} (76%)
create mode 100644 drivers/nvme/target/io-cmd-file.c
delete mode 100644 drivers/sbus/char/jsflash.c
--
Jens Axboe
^ permalink raw reply [flat|nested] 17+ messages in thread* Re: [GIT PULL] Block changes for 4.18-rc 2018-06-04 0:42 [GIT PULL] Block changes for 4.18-rc Jens Axboe @ 2018-06-04 15:51 ` Linus Torvalds 2018-06-04 15:54 ` Jens Axboe 0 siblings, 1 reply; 17+ messages in thread From: Linus Torvalds @ 2018-06-04 15:51 UTC (permalink / raw) To: Jens Axboe, Kent Overstreet; +Cc: linux-block On Sun, Jun 3, 2018 at 5:42 PM Jens Axboe <axboe@kernel.dk> wrote: > > drivers/md/md.c | 61 +-- > drivers/md/md.h | 4 +- So I've pulled this, but I get a new warning: drivers/md/md.c: In function =E2=80=98mddev_put=E2=80=99: drivers/md/md.c:543:1: warning: the frame size of 2112 bytes is larger than 2048 bytes [-Wframe-larger-than=3D] which seems to be due to commit afeee514ce7f ("md: convert to bioset_init()/mempool_init()"). As of that commit, mddev_put() now allocates *two* "struct bio_set" structures on the stack, and those really aren't small. Yes, part of it is that I do my test-builds with all that debug stuff, but those structures have several embedded mempool_t members, and they really are pretty sizable. And I _really_ do not think it's acceptable to have that kind of stack usage in there. I'm not sure you can trigger mddev_put() in the IO paths, but even without that I don't think it's acceptable. Also, the code simply looks like complete and utter garbage. It does bs =3D mddev->bio_set; sync_bs =3D mddev->sync_set; to _copy_ those structures, and then does bioset_exit() on them. WTF? Why the hell doesn't it just do biset_exit() on the originals instead, before freeing the mddev? I've pulled this, but this needs to be fixed. That is just broken garbage right now. No way in hell is it acceptable to waste 2kB of stack space on something idiotic like this. Linus ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [GIT PULL] Block changes for 4.18-rc 2018-06-04 15:51 ` Linus Torvalds @ 2018-06-04 15:54 ` Jens Axboe 2018-06-04 16:24 ` Linus Torvalds 2018-06-04 21:16 ` NeilBrown 0 siblings, 2 replies; 17+ messages in thread From: Jens Axboe @ 2018-06-04 15:54 UTC (permalink / raw) To: Linus Torvalds, Kent Overstreet; +Cc: linux-block, NeilBrown, Arnd Bergmann On 6/4/18 9:51 AM, Linus Torvalds wrote: > On Sun, Jun 3, 2018 at 5:42 PM Jens Axboe <axboe@kernel.dk> wrote: >> >> drivers/md/md.c | 61 +-- >> drivers/md/md.h | 4 +- > > So I've pulled this, but I get a new warning: > > drivers/md/md.c: In function ‘mddev_put’: > drivers/md/md.c:543:1: warning: the frame size of 2112 bytes is > larger than 2048 bytes [-Wframe-larger-than=] > > which seems to be due to commit afeee514ce7f ("md: convert to > bioset_init()/mempool_init()"). > > As of that commit, mddev_put() now allocates *two* "struct bio_set" > structures on the stack, and those really aren't small. Yes, part of > it is that I do my test-builds with all that debug stuff, but those > structures have several embedded mempool_t members, and they really > are pretty sizable. > > And I _really_ do not think it's acceptable to have that kind of stack > usage in there. I'm not sure you can trigger mddev_put() in the IO > paths, but even without that I don't think it's acceptable. > > Also, the code simply looks like complete and utter garbage. It does > > bs = mddev->bio_set; > sync_bs = mddev->sync_set; > > to _copy_ those structures, and then does bioset_exit() on them. WTF? > > Why the hell doesn't it just do biset_exit() on the originals instead, > before freeing the mddev? > > I've pulled this, but this needs to be fixed. That is just broken > garbage right now. No way in hell is it acceptable to waste 2kB of > stack space on something idiotic like this. Works is already underway to get this fixed up. I agree that the excessive stack usage isn't great, but the code in that function is pretty miserable, as you say. That's not new with the patch, the conversion just carried that forward. CC'ing Neil to get his input on how best to clean that up, I'd be much more comfortable with that logic improved rather than just catering to the stack usage issue. Also include Arnd, as he had a test patch for it. -- Jens Axboe ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [GIT PULL] Block changes for 4.18-rc 2018-06-04 15:54 ` Jens Axboe @ 2018-06-04 16:24 ` Linus Torvalds 2018-06-04 18:20 ` Tejun Heo 2018-06-04 21:16 ` NeilBrown 1 sibling, 1 reply; 17+ messages in thread From: Linus Torvalds @ 2018-06-04 16:24 UTC (permalink / raw) To: Jens Axboe, Tejun Heo Cc: Kent Overstreet, linux-block, NeilBrown, Arnd Bergmann On Mon, Jun 4, 2018 at 8:54 AM Jens Axboe <axboe@kernel.dk> wrote: > > On 6/4/18 9:51 AM, Linus Torvalds wrote: > > > > Why the hell doesn't it just do bioset_exit() on the originals instead, > > before freeing the mddev? > > CC'ing Neil to get his input on how best to clean that up, I'd > be much more comfortable with that logic improved rather than > just catering to the stack usage issue. Also include Arnd, as > he had a test patch for it. Also adding Tejun, because the only reason for that odd sequencing seems to be that - we want to queue the deletion work *inside* the spinlock, because it actually has synchronization between the workqueue and the 'all_mddevs_lock' spinlock. - we want to bioset_exit() the fields *outside* the spinlock. So what it does now is to copy those big 'struct bio_set' fields because they might go away from under it. Tejun, the code in question is mddev_put() in drivers/md/md.c, and it basically does INIT_WORK(&mddev->del_work, mddev_delayed_delete); queue_work(md_misc_wq, &mddev->del_work); inside a spinlock, but then wants to do some stuff *outside* the spinlock before that mddev_delayed_delete() thing actually gets called. Is there some way to "half-queue" the work - enough that a flush_workqueue() will wait for it, but still guaranteeing that it won't actually run until released? IOW, what I think that code really wants to do is perhaps something like /* under &all_mddevs_lock spinlock */ .. remove from all lists so that it can't be found .. .. but add it to the md_misc_wq so that people will wait for it .. INIT_WORK_LOCKED(&mddev->del_work, mddev_delayed_delete); queue_work(md_misc_wq, &mddev->del_work); ... spin_unlock(&all_mddevs_lock); /* mddev is still guaranteed live */ bioset_exit(&mddev->bio_set); bioset_exit(&mddev->sync_set); /* NOW we can release it */ if (queued) unlock_work(&mddev->del_work); else kfree(mddev); or something like that? The above is *ENTIRELY* hand-wavy garbage - don't take it seriously per se, consider it just pseudo-code for documentation reasons, not serious in any other way. Linus ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [GIT PULL] Block changes for 4.18-rc 2018-06-04 16:24 ` Linus Torvalds @ 2018-06-04 18:20 ` Tejun Heo 2018-06-04 18:22 ` Linus Torvalds 0 siblings, 1 reply; 17+ messages in thread From: Tejun Heo @ 2018-06-04 18:20 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Kent Overstreet, linux-block, NeilBrown, Arnd Bergmann Hey, Linus. On Mon, Jun 04, 2018 at 09:24:28AM -0700, Linus Torvalds wrote: > Tejun, the code in question is mddev_put() in drivers/md/md.c, and it > basically does > > INIT_WORK(&mddev->del_work, mddev_delayed_delete); > queue_work(md_misc_wq, &mddev->del_work); > > inside a spinlock, but then wants to do some stuff *outside* the > spinlock before that mddev_delayed_delete() thing actually gets > called. > > Is there some way to "half-queue" the work - enough that a > flush_workqueue() will wait for it, but still guaranteeing that it > won't actually run until released? Such interface doesn't exist, yet anyway. Workqueue does have the concept of delayed queueing to handle max concurrency limits and we can probably piggyback on that to create an interface which queues and delays the work item and another interface to undelay it. That said, it feels like a super-niche feature for a weird use case. > IOW, what I think that code really wants to do is perhaps something like > > /* under &all_mddevs_lock spinlock */ > .. remove from all lists so that it can't be found .. > > .. but add it to the md_misc_wq so that people will wait for it .. > > INIT_WORK_LOCKED(&mddev->del_work, mddev_delayed_delete); > queue_work(md_misc_wq, &mddev->del_work); > ... > > spin_unlock(&all_mddevs_lock); > > /* mddev is still guaranteed live */ > bioset_exit(&mddev->bio_set); > bioset_exit(&mddev->sync_set); > > /* NOW we can release it */ > if (queued) > unlock_work(&mddev->del_work); > else > kfree(mddev); > Linus Looking at the code, the fundamental problem seems to be that it's weaving different parts of sync and async paths. I don't understand why it'd punt the destructin of mddev but destroy biosets synchronously. Can't it do sth like the following? lock; if (need to be async) { /* async destruction path */ INIT_WORK(...); queue_work(...); /* the work does bioset_exits */ unlock; return; } /* sync destructino path */ do whatever needs to be done under lock; unlock; bioset_exit(...); bioset_exit(...); kfree(mddev); Thanks. -- tejun ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [GIT PULL] Block changes for 4.18-rc 2018-06-04 18:20 ` Tejun Heo @ 2018-06-04 18:22 ` Linus Torvalds 2018-06-04 18:25 ` Jens Axboe 0 siblings, 1 reply; 17+ messages in thread From: Linus Torvalds @ 2018-06-04 18:22 UTC (permalink / raw) To: Tejun Heo Cc: Jens Axboe, Kent Overstreet, linux-block, NeilBrown, Arnd Bergmann On Mon, Jun 4, 2018 at 11:20 AM Tejun Heo <tj@kernel.org> wrote: > > > Looking at the code, the fundamental problem seems to be that it's > weaving different parts of sync and async paths. I don't understand > why it'd punt the destructin of mddev but destroy biosets > synchronously. Can't it do sth like the following? Yeah, that looks like the right thing to do. Jens/Kent? Linus ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [GIT PULL] Block changes for 4.18-rc 2018-06-04 18:22 ` Linus Torvalds @ 2018-06-04 18:25 ` Jens Axboe 2018-06-04 19:04 ` Kent Overstreet 0 siblings, 1 reply; 17+ messages in thread From: Jens Axboe @ 2018-06-04 18:25 UTC (permalink / raw) To: Linus Torvalds, Tejun Heo Cc: Kent Overstreet, linux-block, NeilBrown, Arnd Bergmann On 6/4/18 12:22 PM, Linus Torvalds wrote: > On Mon, Jun 4, 2018 at 11:20 AM Tejun Heo <tj@kernel.org> wrote: >> >> >> Looking at the code, the fundamental problem seems to be that it's >> weaving different parts of sync and async paths. I don't understand >> why it'd punt the destructin of mddev but destroy biosets >> synchronously. Can't it do sth like the following? > > Yeah, that looks like the right thing to do. > > Jens/Kent? I agree with Tejun, we already discussed this offline. -- Jens Axboe ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [GIT PULL] Block changes for 4.18-rc 2018-06-04 18:25 ` Jens Axboe @ 2018-06-04 19:04 ` Kent Overstreet 2018-06-05 0:42 ` Linus Torvalds 0 siblings, 1 reply; 17+ messages in thread From: Kent Overstreet @ 2018-06-04 19:04 UTC (permalink / raw) To: Jens Axboe Cc: Linus Torvalds, Tejun Heo, linux-block, NeilBrown, Arnd Bergmann On Mon, Jun 04, 2018 at 12:25:54PM -0600, Jens Axboe wrote: > On 6/4/18 12:22 PM, Linus Torvalds wrote: > > On Mon, Jun 4, 2018 at 11:20 AM Tejun Heo <tj@kernel.org> wrote: > >> > >> > >> Looking at the code, the fundamental problem seems to be that it's > >> weaving different parts of sync and async paths. I don't understand > >> why it'd punt the destructin of mddev but destroy biosets > >> synchronously. Can't it do sth like the following? > > > > Yeah, that looks like the right thing to do. > > > > Jens/Kent? > > I agree with Tejun, we already discussed this offline. I don't see any good reason for the exit path to have two or three variations, depending on how you count. My preference would be for a patch that does this: diff --git a/drivers/md/md.c b/drivers/md/md.c index fc692b7128..f18d783785 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -510,36 +510,24 @@ static void mddev_delayed_delete(struct work_struct *ws); static void mddev_put(struct mddev *mddev) { - struct bio_set bs, sync_bs; - - memset(&bs, 0, sizeof(bs)); - memset(&sync_bs, 0, sizeof(sync_bs)); - if (!atomic_dec_and_lock(&mddev->active, &all_mddevs_lock)) return; + if (!mddev->raid_disks && list_empty(&mddev->disks) && mddev->ctime == 0 && !mddev->hold_active) { /* Array is not configured at all, and not held active, * so destroy it */ list_del_init(&mddev->all_mddevs); - bs = mddev->bio_set; - sync_bs = mddev->sync_set; - memset(&mddev->bio_set, 0, sizeof(mddev->bio_set)); - memset(&mddev->sync_set, 0, sizeof(mddev->sync_set)); - if (mddev->gendisk) { - /* We did a probe so need to clean up. Call - * queue_work inside the spinlock so that - * flush_workqueue() after mddev_find will - * succeed in waiting for the work to be done. - */ - INIT_WORK(&mddev->del_work, mddev_delayed_delete); - queue_work(md_misc_wq, &mddev->del_work); - } else - kfree(mddev); + + /* We did a probe so need to clean up. Call + * queue_work inside the spinlock so that + * flush_workqueue() after mddev_find will + * succeed in waiting for the work to be done. + */ + INIT_WORK(&mddev->del_work, mddev_delayed_delete); + queue_work(md_misc_wq, &mddev->del_work); } spin_unlock(&all_mddevs_lock); - bioset_exit(&bs); - bioset_exit(&sync_bs); } static void md_safemode_timeout(struct timer_list *t); However, that's not correct as is because mddev_delayed_put() calls kobject_put(), and the kobject isn't initialized when the mddev is first allocated, it's initialized when the gendisk is allocated... that isn't hard to fix but that's getting into real refactoring that I'll need to put actual work into testing. However, I have looked at where mddev_put() is called from and I think the stack usage is fine for now - unless I've missed something it's not called from e.g. under generic_make_request() anywhere, it's just called from sysfs code and ioctls and things like that, where the stack is going to be pretty much empty. ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [GIT PULL] Block changes for 4.18-rc 2018-06-04 19:04 ` Kent Overstreet @ 2018-06-05 0:42 ` Linus Torvalds 2018-06-05 0:44 ` Linus Torvalds 2018-06-05 0:56 ` Kent Overstreet 0 siblings, 2 replies; 17+ messages in thread From: Linus Torvalds @ 2018-06-05 0:42 UTC (permalink / raw) To: Kent Overstreet Cc: Jens Axboe, Tejun Heo, linux-block, NeilBrown, Arnd Bergmann [-- Attachment #1: Type: text/plain, Size: 629 bytes --] On Mon, Jun 4, 2018 at 12:04 PM Kent Overstreet <kent.overstreet@gmail.com> wrote: > > However, that's not correct as is because mddev_delayed_put() calls > kobject_put(), and the kobject isn't initialized when the mddev is first > allocated, it's initialized when the gendisk is allocated... that isn't hard to > fix but that's getting into real refactoring that I'll need to put actual work > into testing. Well, it also removes the bioset_exit() calls entirely. How about just the attached? It simply does it as two different cases, and adds the bioset_exit() calls to mddev_delayed_delete(). Hmm? Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 1870 bytes --] drivers/md/md.c | 23 ++++++++++------------- 1 file changed, 10 insertions(+), 13 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index fc692b7128bb..6a2494065ab2 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -510,11 +510,6 @@ static void mddev_delayed_delete(struct work_struct *ws); static void mddev_put(struct mddev *mddev) { - struct bio_set bs, sync_bs; - - memset(&bs, 0, sizeof(bs)); - memset(&sync_bs, 0, sizeof(sync_bs)); - if (!atomic_dec_and_lock(&mddev->active, &all_mddevs_lock)) return; if (!mddev->raid_disks && list_empty(&mddev->disks) && @@ -522,10 +517,6 @@ static void mddev_put(struct mddev *mddev) /* Array is not configured at all, and not held active, * so destroy it */ list_del_init(&mddev->all_mddevs); - bs = mddev->bio_set; - sync_bs = mddev->sync_set; - memset(&mddev->bio_set, 0, sizeof(mddev->bio_set)); - memset(&mddev->sync_set, 0, sizeof(mddev->sync_set)); if (mddev->gendisk) { /* We did a probe so need to clean up. Call * queue_work inside the spinlock so that @@ -534,12 +525,15 @@ static void mddev_put(struct mddev *mddev) */ INIT_WORK(&mddev->del_work, mddev_delayed_delete); queue_work(md_misc_wq, &mddev->del_work); - } else - kfree(mddev); + spin_unlock(&all_mddevs_lock); + return; + } } spin_unlock(&all_mddevs_lock); - bioset_exit(&bs); - bioset_exit(&sync_bs); + + bioset_exit(&mddev->bio_set); + bioset_exit(&mddev->sync_set); + kfree(mddev); } static void md_safemode_timeout(struct timer_list *t); @@ -5234,6 +5228,9 @@ static void mddev_delayed_delete(struct work_struct *ws) { struct mddev *mddev = container_of(ws, struct mddev, del_work); + bioset_exit(&mddev->bio_set); + bioset_exit(&mddev->sync_set); + sysfs_remove_group(&mddev->kobj, &md_bitmap_group); kobject_del(&mddev->kobj); kobject_put(&mddev->kobj); ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [GIT PULL] Block changes for 4.18-rc 2018-06-05 0:42 ` Linus Torvalds @ 2018-06-05 0:44 ` Linus Torvalds 2018-06-05 0:56 ` Kent Overstreet 1 sibling, 0 replies; 17+ messages in thread From: Linus Torvalds @ 2018-06-05 0:44 UTC (permalink / raw) To: Kent Overstreet Cc: Jens Axboe, Tejun Heo, linux-block, NeilBrown, Arnd Bergmann On Mon, Jun 4, 2018 at 5:42 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > How about just the attached? Note: it probably goes without saying that the patch was entirely untested, but it does build, and it does get rid of the insane stack use. Linus ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [GIT PULL] Block changes for 4.18-rc 2018-06-05 0:42 ` Linus Torvalds 2018-06-05 0:44 ` Linus Torvalds @ 2018-06-05 0:56 ` Kent Overstreet 2018-06-05 1:00 ` Linus Torvalds 2018-06-07 1:45 ` Jens Axboe 1 sibling, 2 replies; 17+ messages in thread From: Kent Overstreet @ 2018-06-05 0:56 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Tejun Heo, linux-block, NeilBrown, Arnd Bergmann On Mon, Jun 04, 2018 at 05:42:04PM -0700, Linus Torvalds wrote: > On Mon, Jun 4, 2018 at 12:04 PM Kent Overstreet > <kent.overstreet@gmail.com> wrote: > > > > However, that's not correct as is because mddev_delayed_put() calls > > kobject_put(), and the kobject isn't initialized when the mddev is first > > allocated, it's initialized when the gendisk is allocated... that isn't hard to > > fix but that's getting into real refactoring that I'll need to put actual work > > into testing. > > Well, it also removes the bioset_exit() calls entirely. Yeah, I realized that when I went back to finish that patch > > How about just the attached? > > It simply does it as two different cases, and adds the bioset_exit() > calls to mddev_delayed_delete(). Oh right, just taking advantage of the fact that just the queue_work() needs to be under the spinlock, not the actual free in the other case. I like your patch for a less invasive version, but I did finish and test my version, which deletes more code :) I've already gone to the trouble of coming up with a VM smoketest, so I can test yours too... I don't really have a strong opinion on which patch should go in. drivers/md/md.c | 53 ++++++++++++++++++----------------------------------- 1 file changed, 18 insertions(+), 35 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index fc692b7128..22203eba1e 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -84,6 +84,8 @@ static void autostart_arrays(int part); static LIST_HEAD(pers_list); static DEFINE_SPINLOCK(pers_lock); +static struct kobj_type md_ktype; + struct md_cluster_operations *md_cluster_ops; EXPORT_SYMBOL(md_cluster_ops); struct module *md_cluster_mod; @@ -510,11 +512,6 @@ static void mddev_delayed_delete(struct work_struct *ws); static void mddev_put(struct mddev *mddev) { - struct bio_set bs, sync_bs; - - memset(&bs, 0, sizeof(bs)); - memset(&sync_bs, 0, sizeof(sync_bs)); - if (!atomic_dec_and_lock(&mddev->active, &all_mddevs_lock)) return; if (!mddev->raid_disks && list_empty(&mddev->disks) && @@ -522,30 +519,23 @@ static void mddev_put(struct mddev *mddev) /* Array is not configured at all, and not held active, * so destroy it */ list_del_init(&mddev->all_mddevs); - bs = mddev->bio_set; - sync_bs = mddev->sync_set; - memset(&mddev->bio_set, 0, sizeof(mddev->bio_set)); - memset(&mddev->sync_set, 0, sizeof(mddev->sync_set)); - if (mddev->gendisk) { - /* We did a probe so need to clean up. Call - * queue_work inside the spinlock so that - * flush_workqueue() after mddev_find will - * succeed in waiting for the work to be done. - */ - INIT_WORK(&mddev->del_work, mddev_delayed_delete); - queue_work(md_misc_wq, &mddev->del_work); - } else - kfree(mddev); + + /* + * Call queue_work inside the spinlock so that + * flush_workqueue() after mddev_find will succeed in waiting + * for the work to be done. + */ + INIT_WORK(&mddev->del_work, mddev_delayed_delete); + queue_work(md_misc_wq, &mddev->del_work); } spin_unlock(&all_mddevs_lock); - bioset_exit(&bs); - bioset_exit(&sync_bs); } static void md_safemode_timeout(struct timer_list *t); void mddev_init(struct mddev *mddev) { + kobject_init(&mddev->kobj, &md_ktype); mutex_init(&mddev->open_mutex); mutex_init(&mddev->reconfig_mutex); mutex_init(&mddev->bitmap_info.mutex); @@ -5215,6 +5205,8 @@ static void md_free(struct kobject *ko) put_disk(mddev->gendisk); percpu_ref_exit(&mddev->writes_pending); + bioset_exit(&mddev->bio_set); + bioset_exit(&mddev->sync_set); kfree(mddev); } @@ -5348,8 +5340,7 @@ static int md_alloc(dev_t dev, char *name) mutex_lock(&mddev->open_mutex); add_disk(disk); - error = kobject_init_and_add(&mddev->kobj, &md_ktype, - &disk_to_dev(disk)->kobj, "%s", "md"); + error = kobject_add(&mddev->kobj, &disk_to_dev(disk)->kobj, "%s", "md"); if (error) { /* This isn't possible, but as kobject_init_and_add is marked * __must_check, we must do something with the result @@ -5506,7 +5497,7 @@ int md_run(struct mddev *mddev) if (!bioset_initialized(&mddev->sync_set)) { err = bioset_init(&mddev->sync_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS); if (err) - goto abort; + return err; } spin_lock(&pers_lock); @@ -5519,8 +5510,7 @@ int md_run(struct mddev *mddev) else pr_warn("md: personality for level %s is not loaded!\n", mddev->clevel); - err = -EINVAL; - goto abort; + return -EINVAL; } spin_unlock(&pers_lock); if (mddev->level != pers->level) { @@ -5533,8 +5523,7 @@ int md_run(struct mddev *mddev) pers->start_reshape == NULL) { /* This personality cannot handle reshaping... */ module_put(pers->owner); - err = -EINVAL; - goto abort; + return -EINVAL; } if (pers->sync_request) { @@ -5603,7 +5592,7 @@ int md_run(struct mddev *mddev) mddev->private = NULL; module_put(pers->owner); bitmap_destroy(mddev); - goto abort; + return err; } if (mddev->queue) { bool nonrot = true; @@ -5665,12 +5654,6 @@ int md_run(struct mddev *mddev) sysfs_notify_dirent_safe(mddev->sysfs_action); sysfs_notify(&mddev->kobj, NULL, "degraded"); return 0; - -abort: - bioset_exit(&mddev->bio_set); - bioset_exit(&mddev->sync_set); - - return err; } EXPORT_SYMBOL_GPL(md_run); ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [GIT PULL] Block changes for 4.18-rc 2018-06-05 0:56 ` Kent Overstreet @ 2018-06-05 1:00 ` Linus Torvalds 2018-06-07 1:45 ` Jens Axboe 1 sibling, 0 replies; 17+ messages in thread From: Linus Torvalds @ 2018-06-05 1:00 UTC (permalink / raw) To: Kent Overstreet Cc: Jens Axboe, Tejun Heo, linux-block, NeilBrown, Arnd Bergmann On Mon, Jun 4, 2018 at 5:56 PM Kent Overstreet <kent.overstreet@gmail.com> wrote: > > I like your patch for a less invasive version, but I did finish and test my > version, which deletes more code :) I certainly won't object to that. Your patch looks fine, and looks like the right thing in the long run anyway. Plus, it's apparently even tested. What a concept. Regardless, I'll sleep better tonight that this will all be fixed one way or the other. Thanks, Linus ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [GIT PULL] Block changes for 4.18-rc 2018-06-05 0:56 ` Kent Overstreet 2018-06-05 1:00 ` Linus Torvalds @ 2018-06-07 1:45 ` Jens Axboe 1 sibling, 0 replies; 17+ messages in thread From: Jens Axboe @ 2018-06-07 1:45 UTC (permalink / raw) To: Kent Overstreet, Linus Torvalds Cc: Tejun Heo, linux-block, NeilBrown, Arnd Bergmann On 6/4/18 6:56 PM, Kent Overstreet wrote: > On Mon, Jun 04, 2018 at 05:42:04PM -0700, Linus Torvalds wrote: >> On Mon, Jun 4, 2018 at 12:04 PM Kent Overstreet >> <kent.overstreet@gmail.com> wrote: >>> >>> However, that's not correct as is because mddev_delayed_put() calls >>> kobject_put(), and the kobject isn't initialized when the mddev is first >>> allocated, it's initialized when the gendisk is allocated... that isn't hard to >>> fix but that's getting into real refactoring that I'll need to put actual work >>> into testing. >> >> Well, it also removes the bioset_exit() calls entirely. > > Yeah, I realized that when I went back to finish that patch >> >> How about just the attached? >> >> It simply does it as two different cases, and adds the bioset_exit() >> calls to mddev_delayed_delete(). > > Oh right, just taking advantage of the fact that just the queue_work() needs to > be under the spinlock, not the actual free in the other case. > > I like your patch for a less invasive version, but I did finish and test my > version, which deletes more code :) > > I've already gone to the trouble of coming up with a VM smoketest, so I can test > yours too... I don't really have a strong opinion on which patch should go in. Kent, care to submit a proper version? We should get this in. -- Jens Axboe ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [GIT PULL] Block changes for 4.18-rc 2018-06-04 15:54 ` Jens Axboe 2018-06-04 16:24 ` Linus Torvalds @ 2018-06-04 21:16 ` NeilBrown 2018-06-04 21:21 ` Kent Overstreet 1 sibling, 1 reply; 17+ messages in thread From: NeilBrown @ 2018-06-04 21:16 UTC (permalink / raw) To: Jens Axboe, Linus Torvalds, Kent Overstreet Cc: linux-block, Arnd Bergmann, Shaohua Li [-- Attachment #1: Type: text/plain, Size: 3067 bytes --] On Mon, Jun 04 2018, Jens Axboe wrote: > On 6/4/18 9:51 AM, Linus Torvalds wrote: >> On Sun, Jun 3, 2018 at 5:42 PM Jens Axboe <axboe@kernel.dk> wrote: >>> >>> drivers/md/md.c | 61 +-- >>> drivers/md/md.h | 4 +- >> >> So I've pulled this, but I get a new warning: >> >> drivers/md/md.c: In function ‘mddev_put’: >> drivers/md/md.c:543:1: warning: the frame size of 2112 bytes is >> larger than 2048 bytes [-Wframe-larger-than=] >> >> which seems to be due to commit afeee514ce7f ("md: convert to >> bioset_init()/mempool_init()"). >> >> As of that commit, mddev_put() now allocates *two* "struct bio_set" >> structures on the stack, and those really aren't small. Yes, part of >> it is that I do my test-builds with all that debug stuff, but those >> structures have several embedded mempool_t members, and they really >> are pretty sizable. >> >> And I _really_ do not think it's acceptable to have that kind of stack >> usage in there. I'm not sure you can trigger mddev_put() in the IO >> paths, but even without that I don't think it's acceptable. >> >> Also, the code simply looks like complete and utter garbage. It does >> >> bs = mddev->bio_set; >> sync_bs = mddev->sync_set; >> >> to _copy_ those structures, and then does bioset_exit() on them. WTF? >> >> Why the hell doesn't it just do biset_exit() on the originals instead, >> before freeing the mddev? >> >> I've pulled this, but this needs to be fixed. That is just broken >> garbage right now. No way in hell is it acceptable to waste 2kB of >> stack space on something idiotic like this. > > Works is already underway to get this fixed up. I agree that the > excessive stack usage isn't great, but the code in that function > is pretty miserable, as you say. That's not new with the patch, > the conversion just carried that forward. > > CC'ing Neil to get his input on how best to clean that up, I'd > be much more comfortable with that logic improved rather than > just catering to the stack usage issue. Also include Arnd, as > he had a test patch for it. Cc:ing Shaohua as well, as he is the current maintainer, and authored Commit: 7184ef8bab0c ("MD: fix sleep in atomic") which seems most relevant. The justification given in that patch is that bioset_free() will take a mutex. This must have been in destory_workqueue() when destroying bs->rescue_workqueue. md biosets don't have a work queue (BIOSET_NEED_RESCUER isn't set), so these calls will no longer take a mutex. So this commit can now be reverted. I think that will clean up this code suitably. I really should get back to BIOSET_NEED_RESCUER and see if I can discard it completely. Kent - the only remaining user is bcache, and the main reason I haven't removed it is that I cannot follow the code (or at least, I couldn't last time I tried). Are you able to see if it is still needed? Thanks, NeilBrown > > -- > Jens Axboe [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [GIT PULL] Block changes for 4.18-rc 2018-06-04 21:16 ` NeilBrown @ 2018-06-04 21:21 ` Kent Overstreet 2018-06-04 22:52 ` NeilBrown 0 siblings, 1 reply; 17+ messages in thread From: Kent Overstreet @ 2018-06-04 21:21 UTC (permalink / raw) To: NeilBrown Cc: Jens Axboe, Linus Torvalds, linux-block, Arnd Bergmann, Shaohua Li On Tue, Jun 05, 2018 at 07:16:51AM +1000, NeilBrown wrote: > I really should get back to BIOSET_NEED_RESCUER and see if I can discard > it completely. > Kent - the only remaining user is bcache, and the main reason I haven't > removed it is that I cannot follow the code (or at least, I couldn't > last time I tried). Are you able to see if it is still needed? Oh man, rescueurs make my head hurt. I remember you changed something so that they weren't needed most of the time, but I can't remember what you did - can you remind me what that was and whatever else you can remember offhand? ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [GIT PULL] Block changes for 4.18-rc 2018-06-04 21:21 ` Kent Overstreet @ 2018-06-04 22:52 ` NeilBrown 2018-06-05 0:34 ` Kent Overstreet 0 siblings, 1 reply; 17+ messages in thread From: NeilBrown @ 2018-06-04 22:52 UTC (permalink / raw) To: Kent Overstreet Cc: Jens Axboe, Linus Torvalds, linux-block, Arnd Bergmann, Shaohua Li [-- Attachment #1: Type: text/plain, Size: 1936 bytes --] On Mon, Jun 04 2018, Kent Overstreet wrote: > On Tue, Jun 05, 2018 at 07:16:51AM +1000, NeilBrown wrote: >> I really should get back to BIOSET_NEED_RESCUER and see if I can discard >> it completely. >> Kent - the only remaining user is bcache, and the main reason I haven't >> removed it is that I cannot follow the code (or at least, I couldn't >> last time I tried). Are you able to see if it is still needed? > > Oh man, rescueurs make my head hurt. I remember you changed something so that > they weren't needed most of the time, but I can't remember what you did - can > you remind me what that was and whatever else you can remember offhand? (and closures make my head hurt :-) There were two closely related changes. One is make sure that generic_make_request processed bios strictly in a depth-first order. The other is to ensure that all (stacking) block device drivers never block in their make_request_fn waiting for something that have themselves submitted. Together these mean that a thread in generic_make_request will never block waiting for a bio that might be queued in current->bio_list. The bio chosen by generic_make_request will be a leaf and won't depend on anything else in the list, and the handler for that request won't block after submitting a lower-level request. The second property is the important one for drivers and the one we need to ensure that bcache follows. A common bad pattern is something like: while (too_big(bio)) { split = bio_spilt(bio,...)) process(split) } process(bio) That needs to be more like if (too_big(bio)) { split = bio_split(bio,...); generic_make_request(bio); bio = split; } process(bio) so that generic_make_request() gets to do that while-loop and can control the order in which bios are handled. I cannot convince myself that bcache doesn't have something similar to that while loop (structured with closures). Thanks, NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [GIT PULL] Block changes for 4.18-rc 2018-06-04 22:52 ` NeilBrown @ 2018-06-05 0:34 ` Kent Overstreet 0 siblings, 0 replies; 17+ messages in thread From: Kent Overstreet @ 2018-06-05 0:34 UTC (permalink / raw) To: NeilBrown Cc: Jens Axboe, Linus Torvalds, linux-block, Arnd Bergmann, Shaohua Li On Tue, Jun 05, 2018 at 08:52:32AM +1000, NeilBrown wrote: > On Mon, Jun 04 2018, Kent Overstreet wrote: > > > On Tue, Jun 05, 2018 at 07:16:51AM +1000, NeilBrown wrote: > >> I really should get back to BIOSET_NEED_RESCUER and see if I can discard > >> it completely. > >> Kent - the only remaining user is bcache, and the main reason I haven't > >> removed it is that I cannot follow the code (or at least, I couldn't > >> last time I tried). Are you able to see if it is still needed? > > > > Oh man, rescueurs make my head hurt. I remember you changed something so that > > they weren't needed most of the time, but I can't remember what you did - can > > you remind me what that was and whatever else you can remember offhand? > > (and closures make my head hurt :-) Fair enough :) > There were two closely related changes. > One is make sure that generic_make_request processed bios strictly > in a depth-first order. The other is to ensure that all (stacking) > block device drivers never block in their make_request_fn waiting > for something that have themselves submitted. > > Together these mean that a thread in generic_make_request will never > block waiting for a bio that might be queued in current->bio_list. > The bio chosen by generic_make_request will be a leaf and won't depend > on anything else in the list, and the handler for that request won't > block after submitting a lower-level request. > > The second property is the important one for drivers and the one we > need to ensure that bcache follows. > A common bad pattern is something like: > > while (too_big(bio)) { > split = bio_spilt(bio,...)) > process(split) > } > process(bio) > > That needs to be more like > if (too_big(bio)) { > split = bio_split(bio,...); > generic_make_request(bio); > bio = split; > } > process(bio) > > so that generic_make_request() gets to do that while-loop > and can control the order in which bios are handled. > > I cannot convince myself that bcache doesn't have something similar to > that while loop (structured with closures). Oh yeah, it's all coming back to me... so, the problem isn't really closures, it's that we don't know where we have to split until after we've already allocated all our state for the request and done almost all our work for the request - also, the state we allocate is per _incoming_ bio, not per split we submit. So for the read side, it looks like - allocate a struct search (man, so much of the naming in bcache is crap, it's painful going back from bcachefs) - traverse the btree to the start of this request - iterate over extents, splitting and submitting for each extent the original bio overlaps with so, converting it to your 2nd pattern would mean changing that code so that when we walk the btree and find an extent we overlap with, if the extent doesn't cover the whole bio, we split, submit the split, resubmit the original bio - and then return and _restart the entire btree traversal_, for each split. ouch. the write path is different, but the problem is roughly the same in that we don't know where we're going to need to split until we're deep inside the write path, allocating space for where the write will go - we split if the bucket we're currently writing to doesn't have enough space. But again, we don't find that out until well after allocating data structures for that write, taking locks for space allocation, etc. etc... So the approach of just resubmitting the rest of the bio just flat out won't work, we've already allocated our data structures and they're tied to that bio... it would require massive changes and deep surgery on how the entire control flow works... Also unconditionally doing this for every bio split (bailing out and restarting the btree traversal/write path) would be pretty crappy performance wise, e.g. when doing big sequential reads of data that was written all with random 4k writes... but... the problem we're trying to avoid is just blocking while allocating memory, while we've got bios blocked on current->bio_list, right? So another approach would be to checking when we're allocating a bio split if that allocation might deadlock - if so, do it with GFP_NOWAIT, and if it fails - just punt the read or write request to workqueue. this is something the code can actually do the way it's structured now... making use of closures :) cache_lookup() calls bch_btree_map_keys() to walk the extents the request covers, and restarts itself if cache_lookup_fn() returns -EAGAIN - this is because the btree lookup might require reading in btree nodes, so we can't block on a btree node read while running under generic_make_request(). So this (untested!) patch should be sufficient for the read side: diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index ae67f5fa80..9aed34d050 100644 --- a/drivers/md/bcache/request.c +++ b/drivers/md/bcache/request.c @@ -538,6 +538,14 @@ static int cache_lookup_fn(struct btree_op *op, struct btree *b, struct bkey *k) if (!KEY_SIZE(k)) return MAP_CONTINUE; + n = bio_next_split(bio, min_t(uint64_t, INT_MAX, + KEY_OFFSET(k) - bio->bi_iter.bi_sector), + !current->bio_list || bio_list_empty(current->bio_list) + ? GFP_NOIO : GFP_NOWAIT, + &s->d->bio_split); + if (!n) + return -EAGAIN; + /* XXX: figure out best pointer - for multiple cache devices */ ptr = 0; @@ -546,10 +554,6 @@ static int cache_lookup_fn(struct btree_op *op, struct btree *b, struct bkey *k) if (KEY_DIRTY(k)) s->read_dirty_data = true; - n = bio_next_split(bio, min_t(uint64_t, INT_MAX, - KEY_OFFSET(k) - bio->bi_iter.bi_sector), - GFP_NOIO, &s->d->bio_split); - bio_key = &container_of(n, struct bbio, bio)->key; bch_bkey_copy_single_ptr(bio_key, k, ptr); It's a bit more complicated for the write path - right now, we split after allocating space (bch_data_insert_start() -> bch_alloc_sectors()). To be able to punt to workqueue if the split fails, bch_alloc_sectors() needs to be broken up into bch_alloc_sectors_start() to pick and lock a write point and allocate a bucket if necessary, so that bch_data_insert_start() can find out how much space there is available and attempt the split before actually consuming that space. That's how the sector allocator already works in bcachefs though, so it's a straightforward change... ^ permalink raw reply related [flat|nested] 17+ messages in thread
end of thread, other threads:[~2018-06-07 1:45 UTC | newest] Thread overview: 17+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2018-06-04 0:42 [GIT PULL] Block changes for 4.18-rc Jens Axboe 2018-06-04 15:51 ` Linus Torvalds 2018-06-04 15:54 ` Jens Axboe 2018-06-04 16:24 ` Linus Torvalds 2018-06-04 18:20 ` Tejun Heo 2018-06-04 18:22 ` Linus Torvalds 2018-06-04 18:25 ` Jens Axboe 2018-06-04 19:04 ` Kent Overstreet 2018-06-05 0:42 ` Linus Torvalds 2018-06-05 0:44 ` Linus Torvalds 2018-06-05 0:56 ` Kent Overstreet 2018-06-05 1:00 ` Linus Torvalds 2018-06-07 1:45 ` Jens Axboe 2018-06-04 21:16 ` NeilBrown 2018-06-04 21:21 ` Kent Overstreet 2018-06-04 22:52 ` NeilBrown 2018-06-05 0:34 ` Kent Overstreet
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox