Linux block layer
 help / color / mirror / Atom feed
* Re: [PATCH v2 06/13] md: raid1: don't use bio's vec table to manage resync pages
From: Shaohua Li @ 2017-03-03 17:38 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <CACVXFVPQx0rQqQS0PG2MoYh+pGvVQb_H_+fMCGbWeVDM3JNcnQ@mail.gmail.com>

On Fri, Mar 03, 2017 at 10:11:31AM +0800, Ming Lei wrote:
> On Fri, Mar 3, 2017 at 1:48 AM, Shaohua Li <shli@kernel.org> wrote:
> > On Thu, Mar 02, 2017 at 10:25:10AM +0800, Ming Lei wrote:
> >> Hi Shaohua,
> >>
> >> On Wed, Mar 1, 2017 at 7:37 AM, Shaohua Li <shli@kernel.org> wrote:
> >> > On Tue, Feb 28, 2017 at 11:41:36PM +0800, Ming Lei wrote:
> >> >> Now we allocate one page array for managing resync pages, instead
> >> >> of using bio's vec table to do that, and the old way is very hacky
> >> >> and won't work any more if multipage bvec is enabled.
> >> >>
> >> >> The introduced cost is that we need to allocate (128 + 16) * raid_disks
> >> >> bytes per r1_bio, and it is fine because the inflight r1_bio for
> >> >> resync shouldn't be much, as pointed by Shaohua.
> >> >>
> >> >> Also the bio_reset() in raid1_sync_request() is removed because
> >> >> all bios are freshly new now and not necessary to reset any more.
> >> >>
> >> >> This patch can be thought as a cleanup too
> >> >>
> >> >> Suggested-by: Shaohua Li <shli@kernel.org>
> >> >> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
> >> >> ---
> >> >>  drivers/md/raid1.c | 83 ++++++++++++++++++++++++++++++++++--------------------
> >> >>  1 file changed, 53 insertions(+), 30 deletions(-)
> >> >>
> >> >> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> >> >> index c442b4657e2f..900144f39630 100644
> >> >> --- a/drivers/md/raid1.c
> >> >> +++ b/drivers/md/raid1.c
> >> >> @@ -77,6 +77,16 @@ static void lower_barrier(struct r1conf *conf, sector_t sector_nr);
> >> >>  #define raid1_log(md, fmt, args...)                          \
> >> >>       do { if ((md)->queue) blk_add_trace_msg((md)->queue, "raid1 " fmt, ##args); } while (0)
> >> >>
> >> >> +static inline struct resync_pages *get_resync_pages(struct bio *bio)
> >> >> +{
> >> >> +     return bio->bi_private;
> >> >> +}
> >> >> +
> >> >> +static inline struct r1bio *get_resync_r1bio(struct bio *bio)
> >> >> +{
> >> >> +     return get_resync_pages(bio)->raid_bio;
> >> >> +}
> >> >
> >> > This is a weird between bio, r1bio and the resync_pages. I'd like the pages are
> >>
> >> It is only a bit weird inside allocating and freeing r1bio, once all
> >> are allocated, you
> >> can see everthing is clean and simple:
> >>
> >>     - r1bio includes lots of bioes,
> >>     - and one bio is attached by one resync_pages via .bi_private
> >
> > I don't how complex to let r1bio pointer to the pages, but that's the nartual
> > way. r1bio owns the pages, not the pages own r1bio, so we should let r1bio
> > points to the pages. The bio.bi_private still points to r1bio.
> 
> Actually it is bio which owns the pages for doing its own I/O, and the only
> thing related with r10bio is that bios may share these pages, but using
> page refcount trick will make the relation quite implicit.
>
> The only reason to allocate all resync_pages together is for sake of efficiency,
> and just for avoiding to allocate one resync_pages one time for each bio.
> 
> We have to make .bi_private point to resync_pages(per bio), otherwise we
> can't fetch pages into one bio at all, thinking about where to store the index
> for each bio's pre-allocated pages, and it has to be per bio.

So the reason is we can't find the corresponding pages of the bio if bi_private
points to r1bio, right? Got it. We don't have many choices in this way. Ok, I
don't insist. Please add some comments in the get_resync_r1bio to describe how
the data structure is organized.

Thanks,
Shaohua

^ permalink raw reply

* Re: WARN_ON() when booting with nvme loopback over zram
From: Jens Axboe @ 2017-03-03 16:20 UTC (permalink / raw)
  To: Johannes Thumshirn, Linux Block Layer Mailinglist,
	Linux NVMe Mailinglist
In-Reply-To: <20170303161826.GA26324@linux-x5ow.site>

On 03/03/2017 09:18 AM, Johannes Thumshirn wrote:
> Hi,
> 
> I get the following WARN_ON when trying to establish a nvmf loopback device
> backed by zram.
> 
> My topmost commit is c82be9d2244aacea9851c86f4fb74694c99cd874

It's fixed in my for-linus, pull request went out to Linus yesterday.
So hopefully master should be fine with nvmf very shortly.

-- 
Jens Axboe

^ permalink raw reply

* WARN_ON() when booting with nvme loopback over zram
From: Johannes Thumshirn @ 2017-03-03 16:18 UTC (permalink / raw)
  To: Linux Block Layer Mailinglist, Linux NVMe Mailinglist

Hi,

I get the following WARN_ON when trying to establish a nvmf loopback device
backed by zram.

My topmost commit is c82be9d2244aacea9851c86f4fb74694c99cd874


+ nvmet_cfs=3D/sys/kernel/config/nvmet/
+ nvmet_subsystem=3Dnvmf-test
+ mkdir -p /sys/kernel/config/nvmet//subsystems/nvmf-test
+ echo 1
+ mkdir /sys/kernel/config/nvmet//subsystems/nvmf-test/namespaces/1
[    6.163905] nvmet: adding nsid 1 to subsystem nvmf-test
+ echo -n /dev/zram1
+ echo -n 1
+ mkdir /sys/kernel/config/nvmet//ports/1
+ echo loop
+ ln -s /sys/kernel/config/nvmet//subsystems/nvmf-test /sys/kernel/config/n=
vmet//ports/1/subsystems/nvmf-test
+ echo transport=3Dloop,nqn=3Dnvmf-test
[    6.175710] ------------[ cut here ]------------
[    6.176181] WARNING: CPU: 1 PID: 207 at block/blk-mq-tag.c:114 blk_mq_ge=
t_tag+0x4ec/0x580
[    6.176922] Modules linked in: nvme_loop nvmet nvme_fabrics nvme_core
[    6.177523] CPU: 1 PID: 207 Comm: sh Not tainted 4.10.0+ #413
[    6.178048] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS =
rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[    6.179089] Call Trace:
[    6.179331]  dump_stack+0x4d/0x68
[    6.179656]  __warn+0x107/0x130
[    6.179950]  warn_slowpath_null+0x18/0x20
[    6.180326]  blk_mq_get_tag+0x4ec/0x580
[    6.180679]  ? __blk_mq_tag_idle+0x40/0x40
[    6.181069]  ? save_stack_trace+0x16/0x20
[    6.181442]  ? wake_atomic_t_function+0x90/0x90
[    6.181512]  ? __vfs_write+0xd1/0x330
[    6.181512]  ? vfs_write+0x10e/0x270
[    6.181512]  ? SyS_write+0xa5/0x130
[    6.181512]  ? entry_SYSCALL_64_fastpath+0x13/0x94
[    6.181512]  ? blk_queue_enter+0x1ac/0x280
[    6.181512]  __blk_mq_alloc_request+0x1c/0x1b0
[    6.181512]  blk_mq_sched_get_request+0x333/0x420
[    6.181512]  ? __list_add_valid+0x2e/0xd0
[    6.181512]  blk_mq_alloc_request+0xb9/0x120
[    6.181512]  ? __blk_mq_alloc_request+0x1b0/0x1b0
[    6.181512]  ? kmemleak_disable+0x70/0x70
[    6.181512]  nvme_alloc_request+0x98/0xb0 [nvme_core]
[    6.181512]  __nvme_submit_sync_cmd+0x2c/0x110 [nvme_core]
[    6.181512]  nvmf_connect_admin_queue+0x216/0x2b0 [nvme_fabrics]
[    6.181512]  ? nvmf_log_connect_error.isra.0+0x130/0x130 [nvme_fabrics]
[    6.181512]  ? blk_mq_sched_init+0x2e/0x40
[    6.181512]  ? blk_mq_init_allocated_queue+0x791/0x7c0
[    6.181512]  nvme_loop_configure_admin_queue+0x168/0x270 [nvme_loop]
[    6.181512]  nvme_loop_create_ctrl+0x23e/0x8f8 [nvme_loop]
[    6.181512]  ? __delete_object+0x59/0xa0
[    6.181512]  ? delete_object_full+0x18/0x20
[    6.181512]  nvmf_dev_write+0xbb1/0xd77 [nvme_fabrics]
[    6.181512]  ? nvmf_check_required_opts.isra.2+0xa0/0xa0 [nvme_fabrics]
[    6.181512]  ? kasan_slab_free+0x12f/0x180
[    6.181512]  ? save_stack_trace+0x16/0x20
[    6.181512]  ? kasan_slab_free+0xae/0x180
[    6.181512]  ? kmem_cache_free+0x84/0x150
[    6.181512]  ? putname+0x7b/0x80
[    6.181512]  ? do_sys_open+0x23f/0x290
[    6.181512]  ? SyS_open+0x19/0x20
[    6.181512]  ? entry_SYSCALL_64_fastpath+0x13/0x94
[    6.181512]  ? __save_stack_trace+0x7e/0xd0
[    6.181512]  __vfs_write+0xd1/0x330
[    6.181512]  ? restore_nameidata+0x7a/0xa0
[    6.181512]  ? __vfs_read+0x320/0x320
[    6.181512]  ? ptep_set_access_flags+0x2b/0x50
[    6.181512]  ? __handle_mm_fault+0xc9d/0x14e0
[    6.181512]  ? vm_insert_page+0x320/0x320
[    6.181512]  ? locks_remove_posix+0x38/0x70
[    6.181512]  vfs_write+0x10e/0x270
[    6.181512]  SyS_write+0xa5/0x130
[    6.181512]  ? SyS_read+0x130/0x130
[    6.181512]  entry_SYSCALL_64_fastpath+0x13/0x94
[    6.181512] RIP: 0033:0x7fd6d0de8560
[    6.181512] RSP: 002b:00007ffd284d2278 EFLAGS: 00000246 ORIG_RAX: 000000=
0000000001
[    6.181512] RAX: ffffffffffffffda RBX: 000000000000001c RCX: 00007fd6d0d=
e8560
[    6.181512] RDX: 000000000000001d RSI: 00007fd6d194a000 RDI: 00000000000=
00001
[    6.181512] RBP: 00007fd6d10a9280 R08: 000000000000000a R09: 00007fd6d19=
4c700
[    6.181512] R10: 0000000001094b50 R11: 0000000000000246 R12: 00000000010=
94b50
[    6.181512] R13: 000000000000001c R14: 0000000000000000 R15: 00007ffd284=
d2228
[    6.203886] ---[ end trace 96b98033c328af9c ]---
[    6.204314] nvme nvme0: Connect command failed, error wo/DNR bit: -16395
sh: echo: write error: Resource temporarily unavailable
+ _fatal
+ echo 1
+ e[    6.226208] sysrq: SysRq : Power Off
cho o
+ sleep 2
[    7.880759] ACPI: Preparing to enter system sleep state S5
[    7.881327] reboot: Power down

Is this known or shall I re-test with Jens' latest tree?

Nice weekend,
	Joahnnes

-- =

Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 N=FCrnberg
GF: Felix Imend=F6rffer, Jane Smithard, Graham Norton
HRB 21284 (AG N=FCrnberg)
Key fingerprint =3D EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply

* Re: [PATCH 0/4] blk-mq: cleanup on all kinds of kobjects
From: Jens Axboe @ 2017-03-03 15:34 UTC (permalink / raw)
  To: Ming Lei, linux-kernel, linux-block, Christoph Hellwig,
	Omar Sandoval
In-Reply-To: <1487758442-5855-1-git-send-email-tom.leiming@gmail.com>

On 02/22/2017 03:13 AM, Ming Lei wrote:
> This patchset cleans up on kojects of request_queue.mq_kobj,
> sw queue's kobject and hw queue's kobject.
> 
> The 1st patch initialized kobject of request_queue and sw queue
> in blk_mq_init_allocated_queue(), so we can avoid to initialize
> these kobjects several times, and this patch fixes one kerne
> warning reported from Omar Sandoval.
> 
> The 2nd patch makes lifetime consitent between mq request queue/sw
> queue and their kobjects.
> 
> The 3rd patch makes lifetime consitent between hw queue and their
> kobjects.
> 
> The last patch is a followup of 3rd patch.

Thanks for looking into this, Ming. From a quick look, looks good
to me. I'm going to queue it up for some testing, and hope that
we can funnel this into 4.11 for the next pull request.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH] cfq-iosched: fix the delay of cfq_group's vdisktime under iops mode
From: Hou Tao @ 2017-03-03 13:20 UTC (permalink / raw)
  To: Jan Kara; +Cc: axboe, linux-block, jmoyer, stable, Vivek Goyal
In-Reply-To: <20170302102945.GA31792@quack2.suse.cz>


On 2017/3/2 18:29, Jan Kara wrote:
> On Wed 01-03-17 10:07:44, Hou Tao wrote:
>> When adding a cfq_group into the cfq service tree, we use CFQ_IDLE_DELAY
>> as the delay of cfq_group's vdisktime if there have been other cfq_groups
>> already.
>>
>> When cfq is under iops mode, commit 9a7f38c42c2b ("cfq-iosched: Convert
>> from jiffies to nanoseconds") could result in a large iops delay and
>> lead to an abnormal io schedule delay for the added cfq_group. To fix
>> it, we just need to revert to the old CFQ_IDLE_DELAY value: HZ / 5
>> when iops mode is enabled.
>>
>> Cc: <stable@vger.kernel.org> # 4.8+
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
> 
> OK, I agree my commit broke the logic in this case. Thanks for the fix.
> Please add also tag:
> 
> Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4
> 
> I somewhat disagree with the fix though. See below:
> 
>> +static inline u64 cfq_get_cfqg_vdisktime_delay(struct cfq_data *cfqd)
>> +{
>> +	if (!iops_mode(cfqd))
>> +		return CFQ_IDLE_DELAY;
>> +	else
>> +		return nsecs_to_jiffies64(CFQ_IDLE_DELAY);
>> +}
>> +
> 
> So using nsecs_to_jiffies64(CFQ_IDLE_DELAY) when in iops mode just does not
> make any sense. AFAIU the code in cfq_group_notify_queue_add() we just want
> to add the cfqg as the last one in the tree. So returning 1 from
> cfq_get_cfqg_vdisktime_delay() in iops mode should be fine as well.
Yes, nsecs_to_jiffies64(CFQ_IDLE_DELAY) is odd here, the better way is to
define a new macro with a value of 1 or 200 and use it directly, but I still
prefer to use 200 to be consistent with the no-hrtimer configuration.

> Frankly, vdisktime is in fixed-point precision shifted by
> CFQ_SERVICE_SHIFT so using CFQ_IDLE_DELAY does not make much sense in any
> case and just adding 1 to maximum vdisktime should be fine in all the
> cases. But that would require more testing whether I did not miss anything
> subtle.
Although the current implementation has done this, I don't think we should
add the cfq_group as the last one in the service tree. In some test cases,
I found that the delayed vdisktime of cfq_group is smaller than its last
vdisktime when the cfq_group was removed from the service_tree, and I think
it hurts the fairness. Maybe we can learn from CFS and calculate the delay
dynamically, but it would be the topic of another thread.

Regards,

Tao

> 
>>  static void
>>  cfq_group_notify_queue_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
>>  {
>> @@ -1380,7 +1388,8 @@ cfq_group_notify_queue_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
>>  	n = rb_last(&st->rb);
>>  	if (n) {
>>  		__cfqg = rb_entry_cfqg(n);
>> -		cfqg->vdisktime = __cfqg->vdisktime + CFQ_IDLE_DELAY;
>> +		cfqg->vdisktime = __cfqg->vdisktime +
>> +			cfq_get_cfqg_vdisktime_delay(cfqd);
>>  	} else
>>  		cfqg->vdisktime = st->min_vdisktime;
>>  	cfq_group_service_tree_add(st, cfqg);
>> -- 
>> 2.5.0
>>

^ permalink raw reply

* Re: [PATCH v2 09/13] md: raid1: use bio_segments_all()
From: Ming Lei @ 2017-03-03  6:22 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <CACVXFVP1gNEsUFdh1eKW9d01azis2+GMNO_kZjN8_ZQEbe8BxQ@mail.gmail.com>

On Fri, Mar 3, 2017 at 10:20 AM, Ming Lei <tom.leiming@gmail.com> wrote:
> On Thu, Mar 2, 2017 at 3:52 PM, Shaohua Li <shli@kernel.org> wrote:
>> On Thu, Mar 02, 2017 at 10:34:25AM +0800, Ming Lei wrote:
>>> Hi Shaohua,
>>>
>>> On Wed, Mar 1, 2017 at 7:42 AM, Shaohua Li <shli@kernel.org> wrote:
>>> > On Tue, Feb 28, 2017 at 11:41:39PM +0800, Ming Lei wrote:
>>> >> Use this helper, instead of direct access to .bi_vcnt.
>>> >
>>> > what We really need to do for the behind IO is:
>>> > - allocate memory and copy bio data to the memory
>>> > - let behind bio do IO against the memory
>>> >
>>> > The behind bio doesn't need to have the exactly same bio_vec setting. If we
>>> > just track the new memory, we don't need use the bio_segments_all and access
>>> > bio_vec too.
>>>
>>> But we need to figure out how many vecs(each vec store one page) to be
>>> allocated for the cloned/behind bio, and that is the only value of
>>> bio_segments_all() here. Or you have idea to avoid that?
>>
>> As I said, the behind bio doesn't need to have the exactly same bio_vec
>> setting. We just allocate memory and copy original bio data to the memory,
>> then do IO against the new memory. The behind bio
>> segments == (bio->bi_iter.bi_size + PAGE_SIZE - 1) >> PAGE_SHIFT
>
> The equation isn't always correct, especially when bvec includes just
> part of page, and it is quite often in case of mkfs, in which one bvec often
> includes 512byte buffer.

Think it further, your idea could be workable and more clean, but the change
can be a bit big, looks we need to switch handling write behind into
the following way:

1) replace bio_clone_bioset_partial() with bio_allocate(nr_vecs), and 'nr_vecs'
is computed with your equation;

2) allocate 'nr_vecs' pages once and share them among all created bio in 1)

3) for each created bio, add each page into the bio via bio_add_page()

4) only for the 1st created bio, call bio_copy_data() to copy data from
master bio.

Let me know if you are OK with the above implementaion.


Thanks,
Ming Lei

^ permalink raw reply

* [PATCH] block: use put_io_context_active() to disassociate bio from a task
From: Hou Tao @ 2017-03-03  7:01 UTC (permalink / raw)
  To: linux-block; +Cc: axboe, tj

bio_associate_current() invokes get_io_context_active() to tell
CFQ scheduler that the current io_context is still issuing IOs
by increasing active_ref. When the bio is done, we also need
to invoke put_io_context_active() to decrease active_ref else
the associated io_context will always be active.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 block/bio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index 5eec5e0..d8ed36f 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -2072,7 +2072,7 @@ EXPORT_SYMBOL_GPL(bio_associate_current);
 void bio_disassociate_task(struct bio *bio)
 {
 	if (bio->bi_ioc) {
-		put_io_context(bio->bi_ioc);
+		put_io_context_active(bio->bi_ioc);
 		bio->bi_ioc = NULL;
 	}
 	if (bio->bi_css) {
-- 
2.5.0

^ permalink raw reply related

* [GIT PULL] Block fixes for 4.11-rc1
From: Jens Axboe @ 2017-03-03  4:07 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org

Hi Linus,

A collection of fixes for this merge window, either fixes for existing
issues, or parts that were waiting for acks to come in. This pull
request contains:

- Allocation of nvme queues on the right node from Shaohua. This was
  ready long before the merge window, but waiting on an ack from Bjorn
  on the PCI bit. Now that we have that, the three patches can go in.

- Two fixes for blk-mq-sched with nvmeof, which uses hctx specific
  request allocations. This caused an oops. One part from Sagi, one part
  from Omar.

- A loop partition scan deadlock fix from Omar, fixing a regression in
  this merge window.

- A 3 patch series from Keith, closing up a hole on clearing out
  requests on shutdown/resume.

- A stable fix for nbd from Josef, fixing a leak of sockets.

- Two fixes for a regression in this window from Jan, fixing a problem
  with one of his earlier patches dealing with queue vs bdi life times.

- A fix for a regression with virtio-blk, causing an IO stall if
  scheduling is used. From me.

- A fix for an io context lock ordering problem. From me.

Please pull!


  git://git.kernel.dk/linux-block.git for-linus


----------------------------------------------------------------
Jan Kara (2):
      block: Initialize bd_bdi on inode initialization
      block: Move bdi_unregister() to del_gendisk()

Jens Axboe (2):
      block: don't call ioc_exit_icq() with the queue lock held for blk-mq
      blk-mq: ensure that bd->last is always set correctly

Josef Bacik (1):
      nbd: stop leaking sockets

Keith Busch (3):
      blk-mq: Export blk_mq_freeze_queue_wait
      blk-mq: Provide freeze queue timeout
      nvme: Complete all stuck requests

Omar Sandoval (4):
      blk-mq: make blk_mq_alloc_request_hctx() allocate a scheduler request
      blk-mq: kill blk_mq_set_alloc_data()
      blk-mq: move update of tags->rqs to __blk_mq_alloc_request()
      loop: fix LO_FLAGS_PARTSCAN hang

Sagi Grimberg (1):
      blk-mq-sched: Allocate sched reserved tags as specified in the original queue tagset

Shaohua Li (3):
      blk-mq: allocate blk_mq_tags and requests in correct node
      PCI: add an API to get node from vector
      nvme: allocate nvme_queue in correct node

 block/blk-core.c         |   1 -
 block/blk-ioc.c          |  44 ++++++++++++-----
 block/blk-mq-sched.c     |  16 +++----
 block/blk-mq-tag.c       |   2 +-
 block/blk-mq-tag.h       |   6 +++
 block/blk-mq.c           | 120 ++++++++++++++++++++++++++++++++++-------------
 block/blk-mq.h           |  10 ----
 block/blk-sysfs.c        |   2 -
 block/elevator.c         |   2 -
 block/genhd.c            |   5 ++
 drivers/block/loop.c     |  15 +++---
 drivers/block/nbd.c      |   4 +-
 drivers/nvme/host/core.c |  47 +++++++++++++++++++
 drivers/nvme/host/nvme.h |   4 ++
 drivers/nvme/host/pci.c  |  45 ++++++++++++++----
 drivers/pci/msi.c        |  16 +++++++
 fs/block_dev.c           |   6 ++-
 include/linux/blk-mq.h   |   3 ++
 include/linux/pci.h      |   6 +++
 19 files changed, 265 insertions(+), 89 deletions(-)

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH v2 06/13] md: raid1: don't use bio's vec table to manage resync pages
From: Ming Lei @ 2017-03-03  2:11 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <20170302174803.ds45hdbzkofl5hkh@kernel.org>

On Fri, Mar 3, 2017 at 1:48 AM, Shaohua Li <shli@kernel.org> wrote:
> On Thu, Mar 02, 2017 at 10:25:10AM +0800, Ming Lei wrote:
>> Hi Shaohua,
>>
>> On Wed, Mar 1, 2017 at 7:37 AM, Shaohua Li <shli@kernel.org> wrote:
>> > On Tue, Feb 28, 2017 at 11:41:36PM +0800, Ming Lei wrote:
>> >> Now we allocate one page array for managing resync pages, instead
>> >> of using bio's vec table to do that, and the old way is very hacky
>> >> and won't work any more if multipage bvec is enabled.
>> >>
>> >> The introduced cost is that we need to allocate (128 + 16) * raid_disks
>> >> bytes per r1_bio, and it is fine because the inflight r1_bio for
>> >> resync shouldn't be much, as pointed by Shaohua.
>> >>
>> >> Also the bio_reset() in raid1_sync_request() is removed because
>> >> all bios are freshly new now and not necessary to reset any more.
>> >>
>> >> This patch can be thought as a cleanup too
>> >>
>> >> Suggested-by: Shaohua Li <shli@kernel.org>
>> >> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
>> >> ---
>> >>  drivers/md/raid1.c | 83 ++++++++++++++++++++++++++++++++++--------------------
>> >>  1 file changed, 53 insertions(+), 30 deletions(-)
>> >>
>> >> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> >> index c442b4657e2f..900144f39630 100644
>> >> --- a/drivers/md/raid1.c
>> >> +++ b/drivers/md/raid1.c
>> >> @@ -77,6 +77,16 @@ static void lower_barrier(struct r1conf *conf, sector_t sector_nr);
>> >>  #define raid1_log(md, fmt, args...)                          \
>> >>       do { if ((md)->queue) blk_add_trace_msg((md)->queue, "raid1 " fmt, ##args); } while (0)
>> >>
>> >> +static inline struct resync_pages *get_resync_pages(struct bio *bio)
>> >> +{
>> >> +     return bio->bi_private;
>> >> +}
>> >> +
>> >> +static inline struct r1bio *get_resync_r1bio(struct bio *bio)
>> >> +{
>> >> +     return get_resync_pages(bio)->raid_bio;
>> >> +}
>> >
>> > This is a weird between bio, r1bio and the resync_pages. I'd like the pages are
>>
>> It is only a bit weird inside allocating and freeing r1bio, once all
>> are allocated, you
>> can see everthing is clean and simple:
>>
>>     - r1bio includes lots of bioes,
>>     - and one bio is attached by one resync_pages via .bi_private
>
> I don't how complex to let r1bio pointer to the pages, but that's the nartual
> way. r1bio owns the pages, not the pages own r1bio, so we should let r1bio
> points to the pages. The bio.bi_private still points to r1bio.

Actually it is bio which owns the pages for doing its own I/O, and the only
thing related with r10bio is that bios may share these pages, but using
page refcount trick will make the relation quite implicit.

The only reason to allocate all resync_pages together is for sake of efficiency,
and just for avoiding to allocate one resync_pages one time for each bio.

We have to make .bi_private point to resync_pages(per bio), otherwise we
can't fetch pages into one bio at all, thinking about where to store the index
for each bio's pre-allocated pages, and it has to be per bio.

>
>> > embedded in r1bio. Maybe a pointer of r1bio to the pages. It's cleaner and more
>> > straightforward.
>>
>> In theory, the cleanest way is to allocate one resync_pages for each resync bio,
>> but that isn't efficient. That is why this patch allocates all
>> resync_pages together
>> in r1buf_pool_alloc(), and split them into bio.
>>
>> BTW, the only trick is just that the whole page array is stored in .bi_private
>> of the 1st bio, then we can avoid to add one pointer into r1bio.
>
> You will need to add pointer in resync_pages which points to r1bio

Yeah, that is just what this patchset is doing.

>
>> >
>> > I think the patch 6, 7 8 should be in a same patch. Otherwise bisect will be broken.
>> >
>>
>> No, it won't. Both patch 7 and patch 8 just replacing reading from bvec table
>> with reading from the pre-allocated page array.  Both should be fine, but the
>> latter is cleaner and simpler to do.
>
> Ok, sounds good, though I doubt it's really worthy.

Small patch with one purpose is always easy to review, do one thing,
do it better,
:-)


Thanks,
Ming Lei

^ permalink raw reply

* Re: [PATCH v2 13/13] md: raid10: avoid direct access to bvec table in handle_reshape_read_error
From: Ming Lei @ 2017-03-03  2:30 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <20170302074753.6zsdspg5e67l24xu@kernel.org>

On Thu, Mar 2, 2017 at 3:47 PM, Shaohua Li <shli@kernel.org> wrote:
> On Thu, Mar 02, 2017 at 10:37:49AM +0800, Ming Lei wrote:
>> Hi Shaohua,
>>
>> On Wed, Mar 1, 2017 at 7:46 AM, Shaohua Li <shli@kernel.org> wrote:
>> > On Tue, Feb 28, 2017 at 11:41:43PM +0800, Ming Lei wrote:
>> >> The cost is 128bytes(8*16) stack space in kernel thread context, and
>> >> just use the bio helper to retrieve pages from bio.
>> >>
>> >> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
>> >> ---
>> >>  drivers/md/raid10.c | 12 ++++++++++--
>> >>  1 file changed, 10 insertions(+), 2 deletions(-)
>> >>
>> >> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> >> index 0b97631e3905..6ffb64ab45f8 100644
>> >> --- a/drivers/md/raid10.c
>> >> +++ b/drivers/md/raid10.c
>> >> @@ -4670,7 +4670,15 @@ static int handle_reshape_read_error(struct mddev *mddev,
>> >>       struct r10bio *r10b = &on_stack.r10_bio;
>> >>       int slot = 0;
>> >>       int idx = 0;
>> >> -     struct bio_vec *bvec = r10_bio->master_bio->bi_io_vec;
>> >> +     struct bio_vec *bvl;
>> >> +     struct page *pages[RESYNC_PAGES];
>> >> +
>> >> +     /*
>> >> +      * This bio is allocated in reshape_request(), and size
>> >> +      * is still RESYNC_PAGES
>> >> +      */
>> >> +     bio_for_each_segment_all(bvl, r10_bio->master_bio, idx)
>> >> +             pages[idx] = bvl->bv_page;
>> >
>> > The reshape bio is doing IO against the memory we allocated for r10_bio, I'm
>> > wondering why we can't get the pages from r10_bio. In this way, we don't need
>> > access the bio_vec any more.
>>
>> Reshap read is special and the bio(r10_bio->master_bio) isn't allocated from
>> .r10buf_pool, please see reshape_request().
>
> Why does this matter? The bio is still doing IO to the memory allocated in
> r10_bio. bi_private->r10_bio->the pages.

Good question, :-)

>
> My guess why you think the reshape read is special is the bio->bi_private
> doesn't pointer to resync_pages. And since you let resync_pages pointer to
> r10_bio and not vice versa, we can't find r10_bio, correct? I think this is
> another reason why r10_bio embeds resync_pages (I mean a pointer to
> resync_pages). With this way I suggested, the reshape read isn't special at

We still can get the resync_pages for r10_bio->master_bio via
r10_bio->devs[0].bio in handle_reshape_read_error(), will do that in v3.

> all. If we allocate memory for r10_bio/r1_bio, we shouldn't need access any bio
> bvec.

Right.



Thanks,
Ming Lei

^ permalink raw reply

* Re: [PATCH v2 09/13] md: raid1: use bio_segments_all()
From: Ming Lei @ 2017-03-03  2:20 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <20170302075202.ipnx64sodjzxisqs@kernel.org>

On Thu, Mar 2, 2017 at 3:52 PM, Shaohua Li <shli@kernel.org> wrote:
> On Thu, Mar 02, 2017 at 10:34:25AM +0800, Ming Lei wrote:
>> Hi Shaohua,
>>
>> On Wed, Mar 1, 2017 at 7:42 AM, Shaohua Li <shli@kernel.org> wrote:
>> > On Tue, Feb 28, 2017 at 11:41:39PM +0800, Ming Lei wrote:
>> >> Use this helper, instead of direct access to .bi_vcnt.
>> >
>> > what We really need to do for the behind IO is:
>> > - allocate memory and copy bio data to the memory
>> > - let behind bio do IO against the memory
>> >
>> > The behind bio doesn't need to have the exactly same bio_vec setting. If we
>> > just track the new memory, we don't need use the bio_segments_all and access
>> > bio_vec too.
>>
>> But we need to figure out how many vecs(each vec store one page) to be
>> allocated for the cloned/behind bio, and that is the only value of
>> bio_segments_all() here. Or you have idea to avoid that?
>
> As I said, the behind bio doesn't need to have the exactly same bio_vec
> setting. We just allocate memory and copy original bio data to the memory,
> then do IO against the new memory. The behind bio
> segments == (bio->bi_iter.bi_size + PAGE_SIZE - 1) >> PAGE_SHIFT

The equation isn't always correct, especially when bvec includes just
part of page, and it is quite often in case of mkfs, in which one bvec often
includes 512byte buffer.


Thanks,
Ming Lei

^ permalink raw reply

* Re: [PATCH v2 04/13] md: prepare for managing resync I/O pages in clean way
From: Ming Lei @ 2017-03-03  1:54 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <20170302175543.ta2tmjveatjiiapc@kernel.org>

On Fri, Mar 3, 2017 at 1:55 AM, Shaohua Li <shli@kernel.org> wrote:
> On Thu, Mar 02, 2017 at 10:09:14AM +0800, Ming Lei wrote:
>> Hi Shaohua,
>>
>> On Wed, Mar 1, 2017 at 7:30 AM, Shaohua Li <shli@kernel.org> wrote:
>> > On Tue, Feb 28, 2017 at 11:41:34PM +0800, Ming Lei wrote:
>> >> Now resync I/O use bio's bec table to manage pages,
>> >> this way is very hacky, and may not work any more
>> >> once multipage bvec is introduced.
>> >>
>> >> So introduce helpers and new data structure for
>> >> managing resync I/O pages more cleanly.
>> >>
>> >> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
>> >> ---
>> >>  drivers/md/md.h | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >>  1 file changed, 54 insertions(+)
>> >>
>> >> diff --git a/drivers/md/md.h b/drivers/md/md.h
>> >> index 1d63239a1be4..b5a638d85cb4 100644
>> >> --- a/drivers/md/md.h
>> >> +++ b/drivers/md/md.h
>> >> @@ -720,4 +720,58 @@ static inline void mddev_check_writesame(struct mddev *mddev, struct bio *bio)
>> >>  #define RESYNC_BLOCK_SIZE (64*1024)
>> >>  #define RESYNC_PAGES ((RESYNC_BLOCK_SIZE + PAGE_SIZE-1) / PAGE_SIZE)
>> >>
>> >> +/* for managing resync I/O pages */
>> >> +struct resync_pages {
>> >> +     unsigned        idx;    /* for get/put page from the pool */
>> >> +     void            *raid_bio;
>> >> +     struct page     *pages[RESYNC_PAGES];
>> >> +};
>> >
>> > I'd like this to be embedded into r1bio directly. Not sure if we really need a
>> > structure.
>>
>> There are two reasons we can't put this into r1bio:
>> - r1bio is used in both normal and resync I/O, not fair to allocate more
>> in normal I/O, and that is why this patch wouldn't like to touch r1bio or r10bio
>>
>> - the count of 'struct resync_pages' instance depends on raid_disks(raid1)
>> or copies(raid10), both can't be decided during compiling.
>
> Yes, I said it should be a pointer of r1bio pointing to the pages in other emails.

OK, got it, :-)

Actually I did that in my local tree, and finally I figured out not
necessary to introduce this pointor, which only usage is for freeing,
and we can get the
pointor from .bi_private of the 1st bio. And this stuff can be commented
clearly.

>
>> >
>> >> +}
>> >> +
>> >> +static inline bool resync_page_available(struct resync_pages *rp)
>> >> +{
>> >> +     return rp->idx < RESYNC_PAGES;
>> >> +}
>> >
>> > Then we don't need this obscure API.
>>
>> That is fine.
>>
>>
>> Thanks,
>> Ming Lei
>
>> >
>> >> +
>> >> +static inline int resync_alloc_pages(struct resync_pages *rp,
>> >> +                                  gfp_t gfp_flags)
>> >> +{
>> >> +     int i;
>> >> +
>> >> +     for (i = 0; i < RESYNC_PAGES; i++) {
>> >> +             rp->pages[i] = alloc_page(gfp_flags);
>> >> +             if (!rp->pages[i])
>> >> +                     goto out_free;
>> >> +     }
>> >> +
>> >> +     return 0;
>> >> +
>> >> + out_free:
>> >> +     while (--i >= 0)
>> >> +             __free_page(rp->pages[i]);
>> >> +     return -ENOMEM;
>> >> +}
>> >> +
>> >> +static inline void resync_free_pages(struct resync_pages *rp)
>> >> +{
>> >> +     int i;
>> >> +
>> >> +     for (i = 0; i < RESYNC_PAGES; i++)
>> >> +             __free_page(rp->pages[i]);
>> >
>> > Since we will use get_page, shouldn't this be put_page?
>>
>> You are right, will fix in v3.
>>
>> >
>> >> +}
>> >> +
>> >> +static inline void resync_get_all_pages(struct resync_pages *rp)
>> >> +{
>> >> +     int i;
>> >> +
>> >> +     for (i = 0; i < RESYNC_PAGES; i++)
>> >> +             get_page(rp->pages[i]);
>> >> +}
>> >> +
>> >> +static inline struct page *resync_fetch_page(struct resync_pages *rp)
>> >> +{
>> >> +     if (WARN_ON_ONCE(rp->idx >= RESYNC_PAGES))
>> >> +             return NULL;
>> >> +     return rp->pages[rp->idx++];
>> >
>> > I'd like the caller explicitly specify the index instead of a global variable
>> > to track it, which will make the code more understandable and less error prone.
>>
>> That is fine, but the index has to be per bio, and finally the index
>> has to be stored
>> in 'struct resync_pages', so every user has to call it in the following way:
>>
>>           resync_fetch_page(rp, rp->idx);
>>
>> then looks no benefit to pass index explicitly.
>
> I suppose the callers always iterate though page 0 to RESYNC_PAGES - 1. Then
> the callers can use an index by themselves. That will clearly show which page
> the callers are using. The resync_fetch_page() wrap is a blackbox we always
> need to refer to the definition.

OK, understood.


Thanks,
Ming Lei

^ permalink raw reply

* Re: [PATCH v2 05/13] md: raid1: simplify r1buf_pool_free()
From: Ming Lei @ 2017-03-03  1:57 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <20170302174942.ad3lsrqgvi2erkrl@kernel.org>

On Fri, Mar 3, 2017 at 1:49 AM, Shaohua Li <shli@kernel.org> wrote:
> On Thu, Mar 02, 2017 at 10:11:54AM +0800, Ming Lei wrote:
>> Hi Shaohua,
>>
>> On Wed, Mar 1, 2017 at 7:31 AM, Shaohua Li <shli@kernel.org> wrote:
>> > On Tue, Feb 28, 2017 at 11:41:35PM +0800, Ming Lei wrote:
>> >> This patch gets each page's reference of each bio for resync,
>> >> then r1buf_pool_free() gets simplified a lot.
>> >>
>> >> The same policy has been taken in raid10's buf pool allocation/free
>> >> too.
>> >
>> > We are going to delete the code, this simplify isn't really required.
>>
>> Could you explain it a bit? Or I can put this patch into this patchset if you
>> can provide one?
>
> I mean you will replace the code soon, with the resync_pages stuff. So I
> thought this isn't really required.

OK,  one benefit of this patch is to make the following one easier to review,
but not a big deal, I may merge this into the following patch.


Thanks,
Ming Lei

^ permalink raw reply

* Re: [PATCH v2 13/13] md: raid10: avoid direct access to bvec table in handle_reshape_read_error
From: Shaohua Li @ 2017-03-02  7:47 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <CACVXFVN8aE_Cd=ssBuhmVRK5YYRe2q4+jXoxsfztFLZiimH_xA@mail.gmail.com>

On Thu, Mar 02, 2017 at 10:37:49AM +0800, Ming Lei wrote:
> Hi Shaohua,
> 
> On Wed, Mar 1, 2017 at 7:46 AM, Shaohua Li <shli@kernel.org> wrote:
> > On Tue, Feb 28, 2017 at 11:41:43PM +0800, Ming Lei wrote:
> >> The cost is 128bytes(8*16) stack space in kernel thread context, and
> >> just use the bio helper to retrieve pages from bio.
> >>
> >> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
> >> ---
> >>  drivers/md/raid10.c | 12 ++++++++++--
> >>  1 file changed, 10 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> >> index 0b97631e3905..6ffb64ab45f8 100644
> >> --- a/drivers/md/raid10.c
> >> +++ b/drivers/md/raid10.c
> >> @@ -4670,7 +4670,15 @@ static int handle_reshape_read_error(struct mddev *mddev,
> >>       struct r10bio *r10b = &on_stack.r10_bio;
> >>       int slot = 0;
> >>       int idx = 0;
> >> -     struct bio_vec *bvec = r10_bio->master_bio->bi_io_vec;
> >> +     struct bio_vec *bvl;
> >> +     struct page *pages[RESYNC_PAGES];
> >> +
> >> +     /*
> >> +      * This bio is allocated in reshape_request(), and size
> >> +      * is still RESYNC_PAGES
> >> +      */
> >> +     bio_for_each_segment_all(bvl, r10_bio->master_bio, idx)
> >> +             pages[idx] = bvl->bv_page;
> >
> > The reshape bio is doing IO against the memory we allocated for r10_bio, I'm
> > wondering why we can't get the pages from r10_bio. In this way, we don't need
> > access the bio_vec any more.
> 
> Reshap read is special and the bio(r10_bio->master_bio) isn't allocated from
> .r10buf_pool, please see reshape_request().

Why does this matter? The bio is still doing IO to the memory allocated in
r10_bio. bi_private->r10_bio->the pages.

My guess why you think the reshape read is special is the bio->bi_private
doesn't pointer to resync_pages. And since you let resync_pages pointer to
r10_bio and not vice versa, we can't find r10_bio, correct? I think this is
another reason why r10_bio embeds resync_pages (I mean a pointer to
resync_pages). With this way I suggested, the reshape read isn't special at
all. If we allocate memory for r10_bio/r1_bio, we shouldn't need access any bio
bvec.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v2] blkcg: allocate struct blkcg_gq outside request queue spinlock
From: Tahsin Erdogan @ 2017-03-02 22:33 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, linux-block, David Rientjes, linux-kernel
In-Reply-To: <20170302193205.GB8519@wtj.duckdns.org>

On Thu, Mar 2, 2017 at 11:32 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Tahsin.
>
> On Wed, Mar 01, 2017 at 03:43:19PM -0800, Tahsin Erdogan wrote:
>> @@ -258,18 +258,22 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
>>  struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
>> -                                 struct request_queue *q)
>> +                                 struct request_queue *q, bool wait_ok)
>
> I'm okay with this direction but it probably would be better if the
> parameter is gfp_mask and we branch on __GFP_WAIT in the function.

Agreed, gfp mask is better as a parameter.

>
>>  {
>>       struct blkcg_gq *blkg;
>>
>> @@ -300,7 +304,30 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
>>                       parent = blkcg_parent(parent);
>>               }
>>
>> -             blkg = blkg_create(pos, q, NULL);
>> +             if (wait_ok) {
>> +                     struct blkcg_gq *new_blkg;
>> +
>> +                     spin_unlock_irq(q->queue_lock);
>> +                     rcu_read_unlock();
>> +
>> +                     new_blkg = blkg_alloc(pos, q, GFP_KERNEL);
>> +
>> +                     rcu_read_lock();
>> +                     spin_lock_irq(q->queue_lock);
>> +
>> +                     if (unlikely(!new_blkg))
>> +                             return ERR_PTR(-ENOMEM);
>> +
>> +                     if (unlikely(blk_queue_bypass(q))) {
>> +                             blkg_free(new_blkg);
>> +                             return ERR_PTR(blk_queue_dying(q) ?
>> +                                                     -ENODEV : -EBUSY);
>> +                     }
>> +
>> +                     blkg = blkg_create(pos, q, new_blkg);
>> +             } else
>> +                     blkg = blkg_create(pos, q, NULL);
>
> So, while I'm okay with the approach, now we're creating a hybrid
> approach where we have both pre-allocation and allocation mode
> altering mechanisms.  If we're going to take this route, I think the
> right thing to do is passing down @gfp_mask all the way down to
> blkg_create().
>
>> @@ -789,6 +816,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
>>  {
>>       struct gendisk *disk;
>>       struct blkcg_gq *blkg;
>> +     struct request_queue *q;
>>       struct module *owner;
>>       unsigned int major, minor;
>>       int key_len, part, ret;
>> @@ -812,18 +840,27 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
>>               return -ENODEV;
>>       }
>>
>> +     q = disk->queue;
>> +
>>       rcu_read_lock();
>> -     spin_lock_irq(disk->queue->queue_lock);
>> +     spin_lock_irq(q->queue_lock);
>>
>> -     if (blkcg_policy_enabled(disk->queue, pol))
>> -             blkg = blkg_lookup_create(blkcg, disk->queue);
>> -     else
>> +     if (blkcg_policy_enabled(q, pol)) {
>> +             blkg = blkg_lookup_create(blkcg, q, true /* wait_ok */);
>> +
>> +             /*
>> +              * blkg_lookup_create() may have dropped and reacquired the
>> +              * queue lock. Check policy enabled state again.
>> +              */
>> +             if (!IS_ERR(blkg) && unlikely(!blkcg_policy_enabled(q, pol)))
>> +                     blkg = ERR_PTR(-EOPNOTSUPP);
>
> And let blkg_create() verify these conditions after releasing and
> regrabbing the lock.
>
> This also means that the init path can simply pass in GFP_KERNEL.

I tried that approach, but I encountered two issues that complicate things:

1) Pushing down blk_queue_bypass(q) check in blkg_create() doesn't
quite work because when blkcg_init_queue() calls blkg_create(), the
queue is still in bypassing mode.

2) Pushing down blkcg_policy_enabled() doesn't work well either,
because blkcg_init_queue() doesn't have a policy to pass down. We
could let it pass a NULL parameter but that would make blkg_create
more ugly.

^ permalink raw reply

* Re: [PATCH] blk-mq-sched: don't hold queue_lock when calling exit_icq
From: Paolo Valente @ 2017-03-02 20:31 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Omar Sandoval, linux-block, kernel-team
In-Reply-To: <d182fad4-d108-2d3d-06ea-72748ea166b4@fb.com>


> Il giorno 02 mar 2017, alle ore 19:25, Jens Axboe <axboe@fb.com> ha =
scritto:
>=20
> On 03/02/2017 11:07 AM, Paolo Valente wrote:
>>=20
>>> Il giorno 02 mar 2017, alle ore 17:13, Jens Axboe <axboe@fb.com> ha =
scritto:
>>>=20
>>> On 03/02/2017 09:07 AM, Paolo Valente wrote:
>>>>=20
>>>>> Il giorno 02 mar 2017, alle ore 16:13, Jens Axboe <axboe@fb.com> =
ha scritto:
>>>>>=20
>>>>> On 03/02/2017 08:00 AM, Jens Axboe wrote:
>>>>>> On 03/02/2017 03:28 AM, Paolo Valente wrote:
>>>>>>>=20
>>>>>>>> Il giorno 15 feb 2017, alle ore 19:04, Jens Axboe =
<axboe@fb.com> ha scritto:
>>>>>>>>=20
>>>>>>>> On 02/15/2017 10:58 AM, Jens Axboe wrote:
>>>>>>>>> On 02/15/2017 10:24 AM, Paolo Valente wrote:
>>>>>>>>>>=20
>>>>>>>>>>> Il giorno 10 feb 2017, alle ore 19:32, Omar Sandoval =
<osandov@osandov.com> ha scritto:
>>>>>>>>>>>=20
>>>>>>>>>>> From: Omar Sandoval <osandov@fb.com>
>>>>>>>>>>>=20
>>>>>>>>>>> None of the other blk-mq elevator hooks are called with this =
lock held.
>>>>>>>>>>> Additionally, it can lead to circular locking dependencies =
between
>>>>>>>>>>> queue_lock and the private scheduler lock.
>>>>>>>>>>>=20
>>>>>>>>>>=20
>>>>>>>>>> Hi Omar,
>>>>>>>>>> I'm sorry but it seems that a new potential deadlock has =
showed up.
>>>>>>>>>> See lockdep splat below.
>>>>>>>>>>=20
>>>>>>>>>> I've tried to think about different solutions than turning =
back to
>>>>>>>>>> deferring the body of exit_icq, but at no avail.
>>>>>>>>>=20
>>>>>>>>> Looks like a interaction between bfqd->lock and q->queue_lock. =
Since the
>>>>>>>>> core has no notion of you bfqd->lock, the naturally dependency =
here
>>>>>>>>> would be to nest bfqd->lock inside q->queue_lock. Is that =
possible for
>>>>>>>>> you?
>>>>>>>>>=20
>>>>>>>>> Looking at the code a bit, maybe it'd just be simpler to get =
rid of
>>>>>>>>> holding the queue lock for that spot. For the mq scheduler, we =
really
>>>>>>>>> don't want places where we invoke with that held already. Does =
the below
>>>>>>>>> work for you?
>>>>>>>>=20
>>>>>>>> Would need to remove one more lockdep assert. And only test =
this for
>>>>>>>> the mq parts, we'd need to spread a bit of love on the classic
>>>>>>>> scheduling icq exit path for this to work on that side.
>>>>>>>>=20
>>>>>>>=20
>>>>>>>=20
>>>>>>> Jens,
>>>>>>> here is the reply I anticipated in my previous email: after =
rebasing
>>>>>>> against master, I'm getting again the deadlock that this patch =
of
>>>>>>> yours solved (together with some changes in bfq-mq).  I thought =
you added a
>>>>>>> sort of equivalent commit (now) to the mainline branch.  Am I =
wrong?
>>>>>>=20
>>>>>> The patch I posted was never pulled to completion, it wasn't =
clear
>>>>>> to me if it fixed your issue or not. Maybe I missed a reply on
>>>>>> that?
>>>>>>=20
>>>>>> Let me take another stab at it today, I'll send you a version to =
test
>>>>>> on top of my for-linus branch.
>>>>>=20
>>>>> I took at look at my last posting, and I think it was only missing =
a
>>>>> lock grab in cfq, since we don't hold that for icq exit anymore.  =
Paolo,
>>>>> it'd be great if you could verify that this works. Not for bfq-mq =
(we
>>>>> already know it does), but for CFQ.
>>>>=20
>>>> No luck, sorry :(
>>>>=20
>>>> It looks like to have just not to take q->queue_lock in =
cfq_exit_icq.
>>>=20
>>> I was worried about that. How about the below? We need to grab the =
lock
>>> at some point for legacy scheduling, but the ordering should be =
correct.
>>>=20
>>>=20
>>=20
>> It seems to be: no more crashes or lockdep complains after several =
tests, boots and shutdowns :)
>=20
> Great! Can I add your Tested-by?

Sure!

Thanks,
Paolo

>=20
> --=20
> Jens Axboe

^ permalink raw reply

* Re: [PATCH] blk-mq-sched: don't hold queue_lock when calling exit_icq
From: Jens Axboe @ 2017-03-02 20:59 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: Paolo Valente, linux-block, kernel-team
In-Reply-To: <20170302205322.GB7792@vader>

On 03/02/2017 01:53 PM, Omar Sandoval wrote:
> On Thu, Mar 02, 2017 at 09:13:30AM -0700, Jens Axboe wrote:
>> I was worried about that. How about the below? We need to grab the lock
>> at some point for legacy scheduling, but the ordering should be correct.
> 
> Makes sense, now the locking is consistent with the other place we call
> ioc_exit_icq(). One nit below, and you can add
> 
> Reviewed-by: Omar Sandoval <osandov@fb.com>
> 
>> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
>> index b12f9c87b4c3..6fd633b5d567 100644
>> --- a/block/blk-ioc.c
>> +++ b/block/blk-ioc.c
>> @@ -54,7 +54,7 @@ static void ioc_exit_icq(struct io_cq *icq)
>>  	icq->flags |= ICQ_EXITED;
>>  }
>>  
>> -/* Release an icq.  Called with both ioc and q locked. */
>> +/* Release an icq.  Called with ioc locked. */
> 
> For ioc_exit_icq(), we have the more explicit comment
> 
> /*
>  * Exit an icq. Called with both ioc and q locked for sq, only ioc locked for
>  * mq.
>  */
> 
> Could you document that here, too?

Done, I've synced the two comments now. Thanks for the review!

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH] blk-mq-sched: don't hold queue_lock when calling exit_icq
From: Omar Sandoval @ 2017-03-02 20:53 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Paolo Valente, linux-block, kernel-team
In-Reply-To: <75978d1d-8022-75c2-7799-aa65132fdcdd@fb.com>

On Thu, Mar 02, 2017 at 09:13:30AM -0700, Jens Axboe wrote:
> I was worried about that. How about the below? We need to grab the lock
> at some point for legacy scheduling, but the ordering should be correct.

Makes sense, now the locking is consistent with the other place we call
ioc_exit_icq(). One nit below, and you can add

Reviewed-by: Omar Sandoval <osandov@fb.com>

> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> index b12f9c87b4c3..6fd633b5d567 100644
> --- a/block/blk-ioc.c
> +++ b/block/blk-ioc.c
> @@ -54,7 +54,7 @@ static void ioc_exit_icq(struct io_cq *icq)
>  	icq->flags |= ICQ_EXITED;
>  }
>  
> -/* Release an icq.  Called with both ioc and q locked. */
> +/* Release an icq.  Called with ioc locked. */

For ioc_exit_icq(), we have the more explicit comment

/*
 * Exit an icq. Called with both ioc and q locked for sq, only ioc locked for
 * mq.
 */

Could you document that here, too?

>  static void ioc_destroy_icq(struct io_cq *icq)
>  {
>  	struct io_context *ioc = icq->ioc;
> @@ -62,7 +62,6 @@ static void ioc_destroy_icq(struct io_cq *icq)
>  	struct elevator_type *et = q->elevator->type;
>  
>  	lockdep_assert_held(&ioc->lock);
> -	lockdep_assert_held(q->queue_lock);
>  
>  	radix_tree_delete(&ioc->icq_tree, icq->q->id);
>  	hlist_del_init(&icq->ioc_node);
> @@ -222,24 +221,40 @@ void exit_io_context(struct task_struct *task)
>  	put_io_context_active(ioc);
>  }

^ permalink raw reply

* Re: [PATCH] blk-mq-sched: don't hold queue_lock when calling exit_icq
From: Jens Axboe @ 2017-03-02 18:25 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Omar Sandoval, linux-block, kernel-team
In-Reply-To: <54B1C78E-F58F-4F59-A31B-2E9E64444773@linaro.org>

On 03/02/2017 11:07 AM, Paolo Valente wrote:
> 
>> Il giorno 02 mar 2017, alle ore 17:13, Jens Axboe <axboe@fb.com> ha scritto:
>>
>> On 03/02/2017 09:07 AM, Paolo Valente wrote:
>>>
>>>> Il giorno 02 mar 2017, alle ore 16:13, Jens Axboe <axboe@fb.com> ha scritto:
>>>>
>>>> On 03/02/2017 08:00 AM, Jens Axboe wrote:
>>>>> On 03/02/2017 03:28 AM, Paolo Valente wrote:
>>>>>>
>>>>>>> Il giorno 15 feb 2017, alle ore 19:04, Jens Axboe <axboe@fb.com> ha scritto:
>>>>>>>
>>>>>>> On 02/15/2017 10:58 AM, Jens Axboe wrote:
>>>>>>>> On 02/15/2017 10:24 AM, Paolo Valente wrote:
>>>>>>>>>
>>>>>>>>>> Il giorno 10 feb 2017, alle ore 19:32, Omar Sandoval <osandov@osandov.com> ha scritto:
>>>>>>>>>>
>>>>>>>>>> From: Omar Sandoval <osandov@fb.com>
>>>>>>>>>>
>>>>>>>>>> None of the other blk-mq elevator hooks are called with this lock held.
>>>>>>>>>> Additionally, it can lead to circular locking dependencies between
>>>>>>>>>> queue_lock and the private scheduler lock.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Omar,
>>>>>>>>> I'm sorry but it seems that a new potential deadlock has showed up.
>>>>>>>>> See lockdep splat below.
>>>>>>>>>
>>>>>>>>> I've tried to think about different solutions than turning back to
>>>>>>>>> deferring the body of exit_icq, but at no avail.
>>>>>>>>
>>>>>>>> Looks like a interaction between bfqd->lock and q->queue_lock. Since the
>>>>>>>> core has no notion of you bfqd->lock, the naturally dependency here
>>>>>>>> would be to nest bfqd->lock inside q->queue_lock. Is that possible for
>>>>>>>> you?
>>>>>>>>
>>>>>>>> Looking at the code a bit, maybe it'd just be simpler to get rid of
>>>>>>>> holding the queue lock for that spot. For the mq scheduler, we really
>>>>>>>> don't want places where we invoke with that held already. Does the below
>>>>>>>> work for you?
>>>>>>>
>>>>>>> Would need to remove one more lockdep assert. And only test this for
>>>>>>> the mq parts, we'd need to spread a bit of love on the classic
>>>>>>> scheduling icq exit path for this to work on that side.
>>>>>>>
>>>>>>
>>>>>>
>>>>>> Jens,
>>>>>> here is the reply I anticipated in my previous email: after rebasing
>>>>>> against master, I'm getting again the deadlock that this patch of
>>>>>> yours solved (together with some changes in bfq-mq).  I thought you added a
>>>>>> sort of equivalent commit (now) to the mainline branch.  Am I wrong?
>>>>>
>>>>> The patch I posted was never pulled to completion, it wasn't clear
>>>>> to me if it fixed your issue or not. Maybe I missed a reply on
>>>>> that?
>>>>>
>>>>> Let me take another stab at it today, I'll send you a version to test
>>>>> on top of my for-linus branch.
>>>>
>>>> I took at look at my last posting, and I think it was only missing a
>>>> lock grab in cfq, since we don't hold that for icq exit anymore.  Paolo,
>>>> it'd be great if you could verify that this works. Not for bfq-mq (we
>>>> already know it does), but for CFQ.
>>>
>>> No luck, sorry :(
>>>
>>> It looks like to have just not to take q->queue_lock in cfq_exit_icq.
>>
>> I was worried about that. How about the below? We need to grab the lock
>> at some point for legacy scheduling, but the ordering should be correct.
>>
>>
> 
> It seems to be: no more crashes or lockdep complains after several tests, boots and shutdowns :)

Great! Can I add your Tested-by?

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH v2] blkcg: allocate struct blkcg_gq outside request queue spinlock
From: Tejun Heo @ 2017-03-02 19:32 UTC (permalink / raw)
  To: Tahsin Erdogan; +Cc: Jens Axboe, linux-block, David Rientjes, linux-kernel
In-Reply-To: <20170301234319.29584-1-tahsin@google.com>

Hello, Tahsin.

On Wed, Mar 01, 2017 at 03:43:19PM -0800, Tahsin Erdogan wrote:
> @@ -258,18 +258,22 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
>  struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
> -				    struct request_queue *q)
> +				    struct request_queue *q, bool wait_ok)

I'm okay with this direction but it probably would be better if the
parameter is gfp_mask and we branch on __GFP_WAIT in the function.

>  {
>  	struct blkcg_gq *blkg;
>  
> @@ -300,7 +304,30 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
>  			parent = blkcg_parent(parent);
>  		}
>  
> -		blkg = blkg_create(pos, q, NULL);
> +		if (wait_ok) {
> +			struct blkcg_gq *new_blkg;
> +
> +			spin_unlock_irq(q->queue_lock);
> +			rcu_read_unlock();
> +
> +			new_blkg = blkg_alloc(pos, q, GFP_KERNEL);
> +
> +			rcu_read_lock();
> +			spin_lock_irq(q->queue_lock);
> +
> +			if (unlikely(!new_blkg))
> +				return ERR_PTR(-ENOMEM);
> +
> +			if (unlikely(blk_queue_bypass(q))) {
> +				blkg_free(new_blkg);
> +				return ERR_PTR(blk_queue_dying(q) ?
> +							-ENODEV : -EBUSY);
> +			}
> +
> +			blkg = blkg_create(pos, q, new_blkg);
> +		} else
> +			blkg = blkg_create(pos, q, NULL);

So, while I'm okay with the approach, now we're creating a hybrid
approach where we have both pre-allocation and allocation mode
altering mechanisms.  If we're going to take this route, I think the
right thing to do is passing down @gfp_mask all the way down to
blkg_create().

> @@ -789,6 +816,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
>  {
>  	struct gendisk *disk;
>  	struct blkcg_gq *blkg;
> +	struct request_queue *q;
>  	struct module *owner;
>  	unsigned int major, minor;
>  	int key_len, part, ret;
> @@ -812,18 +840,27 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
>  		return -ENODEV;
>  	}
>  
> +	q = disk->queue;
> +
>  	rcu_read_lock();
> -	spin_lock_irq(disk->queue->queue_lock);
> +	spin_lock_irq(q->queue_lock);
>  
> -	if (blkcg_policy_enabled(disk->queue, pol))
> -		blkg = blkg_lookup_create(blkcg, disk->queue);
> -	else
> +	if (blkcg_policy_enabled(q, pol)) {
> +		blkg = blkg_lookup_create(blkcg, q, true /* wait_ok */);
> +
> +		/*
> +		 * blkg_lookup_create() may have dropped and reacquired the
> +		 * queue lock. Check policy enabled state again.
> +		 */
> +		if (!IS_ERR(blkg) && unlikely(!blkcg_policy_enabled(q, pol)))
> +			blkg = ERR_PTR(-EOPNOTSUPP);

And let blkg_create() verify these conditions after releasing and
regrabbing the lock.

This also means that the init path can simply pass in GFP_KERNEL.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH v2 06/13] md: raid1: don't use bio's vec table to manage resync pages
From: Shaohua Li @ 2017-03-02 17:48 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <CACVXFVPqFyvsdwjdcHbKm8J8Ni=d5isSxguRVJZoF3m2kaO98Q@mail.gmail.com>

On Thu, Mar 02, 2017 at 10:25:10AM +0800, Ming Lei wrote:
> Hi Shaohua,
> 
> On Wed, Mar 1, 2017 at 7:37 AM, Shaohua Li <shli@kernel.org> wrote:
> > On Tue, Feb 28, 2017 at 11:41:36PM +0800, Ming Lei wrote:
> >> Now we allocate one page array for managing resync pages, instead
> >> of using bio's vec table to do that, and the old way is very hacky
> >> and won't work any more if multipage bvec is enabled.
> >>
> >> The introduced cost is that we need to allocate (128 + 16) * raid_disks
> >> bytes per r1_bio, and it is fine because the inflight r1_bio for
> >> resync shouldn't be much, as pointed by Shaohua.
> >>
> >> Also the bio_reset() in raid1_sync_request() is removed because
> >> all bios are freshly new now and not necessary to reset any more.
> >>
> >> This patch can be thought as a cleanup too
> >>
> >> Suggested-by: Shaohua Li <shli@kernel.org>
> >> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
> >> ---
> >>  drivers/md/raid1.c | 83 ++++++++++++++++++++++++++++++++++--------------------
> >>  1 file changed, 53 insertions(+), 30 deletions(-)
> >>
> >> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> >> index c442b4657e2f..900144f39630 100644
> >> --- a/drivers/md/raid1.c
> >> +++ b/drivers/md/raid1.c
> >> @@ -77,6 +77,16 @@ static void lower_barrier(struct r1conf *conf, sector_t sector_nr);
> >>  #define raid1_log(md, fmt, args...)                          \
> >>       do { if ((md)->queue) blk_add_trace_msg((md)->queue, "raid1 " fmt, ##args); } while (0)
> >>
> >> +static inline struct resync_pages *get_resync_pages(struct bio *bio)
> >> +{
> >> +     return bio->bi_private;
> >> +}
> >> +
> >> +static inline struct r1bio *get_resync_r1bio(struct bio *bio)
> >> +{
> >> +     return get_resync_pages(bio)->raid_bio;
> >> +}
> >
> > This is a weird between bio, r1bio and the resync_pages. I'd like the pages are
> 
> It is only a bit weird inside allocating and freeing r1bio, once all
> are allocated, you
> can see everthing is clean and simple:
> 
>     - r1bio includes lots of bioes,
>     - and one bio is attached by one resync_pages via .bi_private

I don't how complex to let r1bio pointer to the pages, but that's the nartual
way. r1bio owns the pages, not the pages own r1bio, so we should let r1bio
points to the pages. The bio.bi_private still points to r1bio.

> > embedded in r1bio. Maybe a pointer of r1bio to the pages. It's cleaner and more
> > straightforward.
> 
> In theory, the cleanest way is to allocate one resync_pages for each resync bio,
> but that isn't efficient. That is why this patch allocates all
> resync_pages together
> in r1buf_pool_alloc(), and split them into bio.
> 
> BTW, the only trick is just that the whole page array is stored in .bi_private
> of the 1st bio, then we can avoid to add one pointer into r1bio.

You will need to add pointer in resync_pages which points to r1bio
 
> >
> > I think the patch 6, 7 8 should be in a same patch. Otherwise bisect will be broken.
> >
> 
> No, it won't. Both patch 7 and patch 8 just replacing reading from bvec table
> with reading from the pre-allocated page array.  Both should be fine, but the
> latter is cleaner and simpler to do.

Ok, sounds good, though I doubt it's really worthy.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH] blk-mq-sched: don't hold queue_lock when calling exit_icq
From: Jens Axboe @ 2017-03-02 16:13 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Omar Sandoval, linux-block, kernel-team
In-Reply-To: <6C03BF2A-0902-431A-A545-933F56D3E134@linaro.org>

On 03/02/2017 09:07 AM, Paolo Valente wrote:
> 
>> Il giorno 02 mar 2017, alle ore 16:13, Jens Axboe <axboe@fb.com> ha scritto:
>>
>> On 03/02/2017 08:00 AM, Jens Axboe wrote:
>>> On 03/02/2017 03:28 AM, Paolo Valente wrote:
>>>>
>>>>> Il giorno 15 feb 2017, alle ore 19:04, Jens Axboe <axboe@fb.com> ha scritto:
>>>>>
>>>>> On 02/15/2017 10:58 AM, Jens Axboe wrote:
>>>>>> On 02/15/2017 10:24 AM, Paolo Valente wrote:
>>>>>>>
>>>>>>>> Il giorno 10 feb 2017, alle ore 19:32, Omar Sandoval <osandov@osandov.com> ha scritto:
>>>>>>>>
>>>>>>>> From: Omar Sandoval <osandov@fb.com>
>>>>>>>>
>>>>>>>> None of the other blk-mq elevator hooks are called with this lock held.
>>>>>>>> Additionally, it can lead to circular locking dependencies between
>>>>>>>> queue_lock and the private scheduler lock.
>>>>>>>>
>>>>>>>
>>>>>>> Hi Omar,
>>>>>>> I'm sorry but it seems that a new potential deadlock has showed up.
>>>>>>> See lockdep splat below.
>>>>>>>
>>>>>>> I've tried to think about different solutions than turning back to
>>>>>>> deferring the body of exit_icq, but at no avail.
>>>>>>
>>>>>> Looks like a interaction between bfqd->lock and q->queue_lock. Since the
>>>>>> core has no notion of you bfqd->lock, the naturally dependency here
>>>>>> would be to nest bfqd->lock inside q->queue_lock. Is that possible for
>>>>>> you?
>>>>>>
>>>>>> Looking at the code a bit, maybe it'd just be simpler to get rid of
>>>>>> holding the queue lock for that spot. For the mq scheduler, we really
>>>>>> don't want places where we invoke with that held already. Does the below
>>>>>> work for you?
>>>>>
>>>>> Would need to remove one more lockdep assert. And only test this for
>>>>> the mq parts, we'd need to spread a bit of love on the classic
>>>>> scheduling icq exit path for this to work on that side.
>>>>>
>>>>
>>>>
>>>> Jens,
>>>> here is the reply I anticipated in my previous email: after rebasing
>>>> against master, I'm getting again the deadlock that this patch of
>>>> yours solved (together with some changes in bfq-mq).  I thought you added a
>>>> sort of equivalent commit (now) to the mainline branch.  Am I wrong?
>>>
>>> The patch I posted was never pulled to completion, it wasn't clear
>>> to me if it fixed your issue or not. Maybe I missed a reply on
>>> that?
>>>
>>> Let me take another stab at it today, I'll send you a version to test
>>> on top of my for-linus branch.
>>
>> I took at look at my last posting, and I think it was only missing a
>> lock grab in cfq, since we don't hold that for icq exit anymore.  Paolo,
>> it'd be great if you could verify that this works. Not for bfq-mq (we
>> already know it does), but for CFQ.
> 
> No luck, sorry :(
> 
> It looks like to have just not to take q->queue_lock in cfq_exit_icq.

I was worried about that. How about the below? We need to grab the lock
at some point for legacy scheduling, but the ordering should be correct.


diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index b12f9c87b4c3..6fd633b5d567 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -54,7 +54,7 @@ static void ioc_exit_icq(struct io_cq *icq)
 	icq->flags |= ICQ_EXITED;
 }
 
-/* Release an icq.  Called with both ioc and q locked. */
+/* Release an icq.  Called with ioc locked. */
 static void ioc_destroy_icq(struct io_cq *icq)
 {
 	struct io_context *ioc = icq->ioc;
@@ -62,7 +62,6 @@ static void ioc_destroy_icq(struct io_cq *icq)
 	struct elevator_type *et = q->elevator->type;
 
 	lockdep_assert_held(&ioc->lock);
-	lockdep_assert_held(q->queue_lock);
 
 	radix_tree_delete(&ioc->icq_tree, icq->q->id);
 	hlist_del_init(&icq->ioc_node);
@@ -222,24 +221,40 @@ void exit_io_context(struct task_struct *task)
 	put_io_context_active(ioc);
 }
 
+static void __ioc_clear_queue(struct list_head *icq_list)
+{
+	unsigned long flags;
+
+	while (!list_empty(icq_list)) {
+		struct io_cq *icq = list_entry(icq_list->next,
+					       struct io_cq, q_node);
+		struct io_context *ioc = icq->ioc;
+
+		spin_lock_irqsave(&ioc->lock, flags);
+		ioc_destroy_icq(icq);
+		spin_unlock_irqrestore(&ioc->lock, flags);
+	}
+}
+
 /**
  * ioc_clear_queue - break any ioc association with the specified queue
  * @q: request_queue being cleared
  *
- * Walk @q->icq_list and exit all io_cq's.  Must be called with @q locked.
+ * Walk @q->icq_list and exit all io_cq's.
  */
 void ioc_clear_queue(struct request_queue *q)
 {
-	lockdep_assert_held(q->queue_lock);
+	LIST_HEAD(icq_list);
 
-	while (!list_empty(&q->icq_list)) {
-		struct io_cq *icq = list_entry(q->icq_list.next,
-					       struct io_cq, q_node);
-		struct io_context *ioc = icq->ioc;
+	spin_lock_irq(q->queue_lock);
+	list_splice_init(&q->icq_list, &icq_list);
 
-		spin_lock(&ioc->lock);
-		ioc_destroy_icq(icq);
-		spin_unlock(&ioc->lock);
+	if (q->mq_ops) {
+		spin_unlock_irq(q->queue_lock);
+		__ioc_clear_queue(&icq_list);
+	} else {
+		__ioc_clear_queue(&icq_list);
+		spin_unlock_irq(q->queue_lock);
 	}
 }
 
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 002af836aa87..c44b321335f3 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -815,9 +815,7 @@ static void blk_release_queue(struct kobject *kobj)
 	blkcg_exit_queue(q);
 
 	if (q->elevator) {
-		spin_lock_irq(q->queue_lock);
 		ioc_clear_queue(q);
-		spin_unlock_irq(q->queue_lock);
 		elevator_exit(q->elevator);
 	}
 
diff --git a/block/elevator.c b/block/elevator.c
index ac1c9f481a98..01139f549b5b 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -983,9 +983,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 		if (old_registered)
 			elv_unregister_queue(q);
 
-		spin_lock_irq(q->queue_lock);
 		ioc_clear_queue(q);
-		spin_unlock_irq(q->queue_lock);
 	}
 
 	/* allocate, init and register new elevator */

-- 
Jens Axboe

^ permalink raw reply related

* Re: [PATCH] blk-mq-sched: don't hold queue_lock when calling exit_icq
From: Paolo Valente @ 2017-03-02 18:07 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Omar Sandoval, linux-block, kernel-team
In-Reply-To: <75978d1d-8022-75c2-7799-aa65132fdcdd@fb.com>


> Il giorno 02 mar 2017, alle ore 17:13, Jens Axboe <axboe@fb.com> ha =
scritto:
>=20
> On 03/02/2017 09:07 AM, Paolo Valente wrote:
>>=20
>>> Il giorno 02 mar 2017, alle ore 16:13, Jens Axboe <axboe@fb.com> ha =
scritto:
>>>=20
>>> On 03/02/2017 08:00 AM, Jens Axboe wrote:
>>>> On 03/02/2017 03:28 AM, Paolo Valente wrote:
>>>>>=20
>>>>>> Il giorno 15 feb 2017, alle ore 19:04, Jens Axboe <axboe@fb.com> =
ha scritto:
>>>>>>=20
>>>>>> On 02/15/2017 10:58 AM, Jens Axboe wrote:
>>>>>>> On 02/15/2017 10:24 AM, Paolo Valente wrote:
>>>>>>>>=20
>>>>>>>>> Il giorno 10 feb 2017, alle ore 19:32, Omar Sandoval =
<osandov@osandov.com> ha scritto:
>>>>>>>>>=20
>>>>>>>>> From: Omar Sandoval <osandov@fb.com>
>>>>>>>>>=20
>>>>>>>>> None of the other blk-mq elevator hooks are called with this =
lock held.
>>>>>>>>> Additionally, it can lead to circular locking dependencies =
between
>>>>>>>>> queue_lock and the private scheduler lock.
>>>>>>>>>=20
>>>>>>>>=20
>>>>>>>> Hi Omar,
>>>>>>>> I'm sorry but it seems that a new potential deadlock has showed =
up.
>>>>>>>> See lockdep splat below.
>>>>>>>>=20
>>>>>>>> I've tried to think about different solutions than turning back =
to
>>>>>>>> deferring the body of exit_icq, but at no avail.
>>>>>>>=20
>>>>>>> Looks like a interaction between bfqd->lock and q->queue_lock. =
Since the
>>>>>>> core has no notion of you bfqd->lock, the naturally dependency =
here
>>>>>>> would be to nest bfqd->lock inside q->queue_lock. Is that =
possible for
>>>>>>> you?
>>>>>>>=20
>>>>>>> Looking at the code a bit, maybe it'd just be simpler to get rid =
of
>>>>>>> holding the queue lock for that spot. For the mq scheduler, we =
really
>>>>>>> don't want places where we invoke with that held already. Does =
the below
>>>>>>> work for you?
>>>>>>=20
>>>>>> Would need to remove one more lockdep assert. And only test this =
for
>>>>>> the mq parts, we'd need to spread a bit of love on the classic
>>>>>> scheduling icq exit path for this to work on that side.
>>>>>>=20
>>>>>=20
>>>>>=20
>>>>> Jens,
>>>>> here is the reply I anticipated in my previous email: after =
rebasing
>>>>> against master, I'm getting again the deadlock that this patch of
>>>>> yours solved (together with some changes in bfq-mq).  I thought =
you added a
>>>>> sort of equivalent commit (now) to the mainline branch.  Am I =
wrong?
>>>>=20
>>>> The patch I posted was never pulled to completion, it wasn't clear
>>>> to me if it fixed your issue or not. Maybe I missed a reply on
>>>> that?
>>>>=20
>>>> Let me take another stab at it today, I'll send you a version to =
test
>>>> on top of my for-linus branch.
>>>=20
>>> I took at look at my last posting, and I think it was only missing a
>>> lock grab in cfq, since we don't hold that for icq exit anymore.  =
Paolo,
>>> it'd be great if you could verify that this works. Not for bfq-mq =
(we
>>> already know it does), but for CFQ.
>>=20
>> No luck, sorry :(
>>=20
>> It looks like to have just not to take q->queue_lock in cfq_exit_icq.
>=20
> I was worried about that. How about the below? We need to grab the =
lock
> at some point for legacy scheduling, but the ordering should be =
correct.
>=20
>=20

It seems to be: no more crashes or lockdep complains after several =
tests, boots and shutdowns :)

Thanks,
Paolo


> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> index b12f9c87b4c3..6fd633b5d567 100644
> --- a/block/blk-ioc.c
> +++ b/block/blk-ioc.c
> @@ -54,7 +54,7 @@ static void ioc_exit_icq(struct io_cq *icq)
> 	icq->flags |=3D ICQ_EXITED;
> }
>=20
> -/* Release an icq.  Called with both ioc and q locked. */
> +/* Release an icq.  Called with ioc locked. */
> static void ioc_destroy_icq(struct io_cq *icq)
> {
> 	struct io_context *ioc =3D icq->ioc;
> @@ -62,7 +62,6 @@ static void ioc_destroy_icq(struct io_cq *icq)
> 	struct elevator_type *et =3D q->elevator->type;
>=20
> 	lockdep_assert_held(&ioc->lock);
> -	lockdep_assert_held(q->queue_lock);
>=20
> 	radix_tree_delete(&ioc->icq_tree, icq->q->id);
> 	hlist_del_init(&icq->ioc_node);
> @@ -222,24 +221,40 @@ void exit_io_context(struct task_struct *task)
> 	put_io_context_active(ioc);
> }
>=20
> +static void __ioc_clear_queue(struct list_head *icq_list)
> +{
> +	unsigned long flags;
> +
> +	while (!list_empty(icq_list)) {
> +		struct io_cq *icq =3D list_entry(icq_list->next,
> +					       struct io_cq, q_node);
> +		struct io_context *ioc =3D icq->ioc;
> +
> +		spin_lock_irqsave(&ioc->lock, flags);
> +		ioc_destroy_icq(icq);
> +		spin_unlock_irqrestore(&ioc->lock, flags);
> +	}
> +}
> +
> /**
>  * ioc_clear_queue - break any ioc association with the specified =
queue
>  * @q: request_queue being cleared
>  *
> - * Walk @q->icq_list and exit all io_cq's.  Must be called with @q =
locked.
> + * Walk @q->icq_list and exit all io_cq's.
>  */
> void ioc_clear_queue(struct request_queue *q)
> {
> -	lockdep_assert_held(q->queue_lock);
> +	LIST_HEAD(icq_list);
>=20
> -	while (!list_empty(&q->icq_list)) {
> -		struct io_cq *icq =3D list_entry(q->icq_list.next,
> -					       struct io_cq, q_node);
> -		struct io_context *ioc =3D icq->ioc;
> +	spin_lock_irq(q->queue_lock);
> +	list_splice_init(&q->icq_list, &icq_list);
>=20
> -		spin_lock(&ioc->lock);
> -		ioc_destroy_icq(icq);
> -		spin_unlock(&ioc->lock);
> +	if (q->mq_ops) {
> +		spin_unlock_irq(q->queue_lock);
> +		__ioc_clear_queue(&icq_list);
> +	} else {
> +		__ioc_clear_queue(&icq_list);
> +		spin_unlock_irq(q->queue_lock);
> 	}
> }
>=20
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 002af836aa87..c44b321335f3 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -815,9 +815,7 @@ static void blk_release_queue(struct kobject =
*kobj)
> 	blkcg_exit_queue(q);
>=20
> 	if (q->elevator) {
> -		spin_lock_irq(q->queue_lock);
> 		ioc_clear_queue(q);
> -		spin_unlock_irq(q->queue_lock);
> 		elevator_exit(q->elevator);
> 	}
>=20
> diff --git a/block/elevator.c b/block/elevator.c
> index ac1c9f481a98..01139f549b5b 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -983,9 +983,7 @@ static int elevator_switch(struct request_queue =
*q, struct elevator_type *new_e)
> 		if (old_registered)
> 			elv_unregister_queue(q);
>=20
> -		spin_lock_irq(q->queue_lock);
> 		ioc_clear_queue(q);
> -		spin_unlock_irq(q->queue_lock);
> 	}
>=20
> 	/* allocate, init and register new elevator */
>=20
> --=20
> Jens Axboe

^ permalink raw reply

* Re: [PATCH] blk-mq-sched: don't hold queue_lock when calling exit_icq
From: Paolo Valente @ 2017-03-02 16:07 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Omar Sandoval, linux-block, kernel-team
In-Reply-To: <80de4554-a14b-4ce6-7c6e-ec66f70b14b1@fb.com>


> Il giorno 02 mar 2017, alle ore 16:13, Jens Axboe <axboe@fb.com> ha =
scritto:
>=20
> On 03/02/2017 08:00 AM, Jens Axboe wrote:
>> On 03/02/2017 03:28 AM, Paolo Valente wrote:
>>>=20
>>>> Il giorno 15 feb 2017, alle ore 19:04, Jens Axboe <axboe@fb.com> ha =
scritto:
>>>>=20
>>>> On 02/15/2017 10:58 AM, Jens Axboe wrote:
>>>>> On 02/15/2017 10:24 AM, Paolo Valente wrote:
>>>>>>=20
>>>>>>> Il giorno 10 feb 2017, alle ore 19:32, Omar Sandoval =
<osandov@osandov.com> ha scritto:
>>>>>>>=20
>>>>>>> From: Omar Sandoval <osandov@fb.com>
>>>>>>>=20
>>>>>>> None of the other blk-mq elevator hooks are called with this =
lock held.
>>>>>>> Additionally, it can lead to circular locking dependencies =
between
>>>>>>> queue_lock and the private scheduler lock.
>>>>>>>=20
>>>>>>=20
>>>>>> Hi Omar,
>>>>>> I'm sorry but it seems that a new potential deadlock has showed =
up.
>>>>>> See lockdep splat below.
>>>>>>=20
>>>>>> I've tried to think about different solutions than turning back =
to
>>>>>> deferring the body of exit_icq, but at no avail.
>>>>>=20
>>>>> Looks like a interaction between bfqd->lock and q->queue_lock. =
Since the
>>>>> core has no notion of you bfqd->lock, the naturally dependency =
here
>>>>> would be to nest bfqd->lock inside q->queue_lock. Is that possible =
for
>>>>> you?
>>>>>=20
>>>>> Looking at the code a bit, maybe it'd just be simpler to get rid =
of
>>>>> holding the queue lock for that spot. For the mq scheduler, we =
really
>>>>> don't want places where we invoke with that held already. Does the =
below
>>>>> work for you?
>>>>=20
>>>> Would need to remove one more lockdep assert. And only test this =
for
>>>> the mq parts, we'd need to spread a bit of love on the classic
>>>> scheduling icq exit path for this to work on that side.
>>>>=20
>>>=20
>>>=20
>>> Jens,
>>> here is the reply I anticipated in my previous email: after rebasing
>>> against master, I'm getting again the deadlock that this patch of
>>> yours solved (together with some changes in bfq-mq).  I thought you =
added a
>>> sort of equivalent commit (now) to the mainline branch.  Am I wrong?
>>=20
>> The patch I posted was never pulled to completion, it wasn't clear
>> to me if it fixed your issue or not. Maybe I missed a reply on
>> that?
>>=20
>> Let me take another stab at it today, I'll send you a version to test
>> on top of my for-linus branch.
>=20
> I took at look at my last posting, and I think it was only missing a
> lock grab in cfq, since we don't hold that for icq exit anymore.  =
Paolo,
> it'd be great if you could verify that this works. Not for bfq-mq (we
> already know it does), but for CFQ.

No luck, sorry :(

It looks like to have just not to take q->queue_lock in cfq_exit_icq.

[    9.329501] =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=

[    9.336132] [ INFO: possible recursive locking detected ]
[    9.343801] 4.10.0-bfq-mq+ #56 Not tainted
[    9.353359] ---------------------------------------------
[    9.354101] ata_id/160 is trying to acquire lock:
[    9.354750]  (&(&q->__queue_lock)->rlock){..-...}, at: =
[<ffffffffaf48501a>] cfq_exit_icq+0x2a/0x80
[    9.355986]=20
[    9.355986] but task is already holding lock:
[    9.364221]  (&(&q->__queue_lock)->rlock){..-...}, at: =
[<ffffffffaf45a686>] put_io_context_active+0x86/0xe0
[    9.382980]=20
[    9.382980] other info that might help us debug this:
[    9.386460]  Possible unsafe locking scenario:
[    9.386460]=20
[    9.397070]        CPU0
[    9.397468]        ----
[    9.397811]   lock(&(&q->__queue_lock)->rlock);
[    9.398440]   lock(&(&q->__queue_lock)->rlock);
[    9.406265]=20
[    9.406265]  *** DEADLOCK ***
[    9.406265]=20
[    9.409589]  May be due to missing lock nesting notation
[    9.409589]=20
[    9.413040] 2 locks held by ata_id/160:
[    9.413576]  #0:  (&(&ioc->lock)->rlock/1){......}, at: =
[<ffffffffaf45a638>] put_io_context_active+0x38/0xe0
[    9.418069]  #1:  (&(&q->__queue_lock)->rlock){..-...}, at: =
[<ffffffffaf45a686>] put_io_context_active+0x86/0xe0
[    9.441118]=20
[    9.441118] stack backtrace:
[    9.445113] CPU: 0 PID: 160 Comm: ata_id Not tainted 4.10.0-bfq-mq+ =
#56
[    9.446210] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS =
VirtualBox 12/01/2006
[    9.457223] Call Trace:
[    9.457636]  dump_stack+0x85/0xc2
[    9.458183]  __lock_acquire+0x1834/0x1890
[    9.458839]  ? __lock_acquire+0x4af/0x1890
[    9.464206]  lock_acquire+0x11b/0x220
[    9.471703]  ? cfq_exit_icq+0x2a/0x80
[    9.472225]  _raw_spin_lock+0x3d/0x80
[    9.472784]  ? cfq_exit_icq+0x2a/0x80
[    9.476336]  cfq_exit_icq+0x2a/0x80
[    9.486750]  ioc_exit_icq+0x38/0x50
[    9.487245]  put_io_context_active+0x92/0xe0
[    9.488049]  exit_io_context+0x48/0x50
[    9.488423]  do_exit+0x7a8/0xc80
[    9.488797]  do_group_exit+0x50/0xd0
[    9.496655]  SyS_exit_group+0x14/0x20
[    9.497170]  entry_SYSCALL_64_fastpath+0x23/0xc6
[    9.497808] RIP: 0033:0x7f224d1c1b98
[    9.498302] RSP: 002b:00007ffd342666f8 EFLAGS: 00000246 ORIG_RAX: =
00000000000000e7
[    9.499360] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: =
00007f224d1c1b98
[    9.511706] RDX: 0000000000000000 RSI: 000000000000003c RDI: =
0000000000000000
[    9.516390] RBP: 00007f224db02690 R08: 00000000000000e7 R09: =
ffffffffffffff90
[    9.523044] R10: 00007f224d6d7250 R11: 0000000000000246 R12: =
0000000000000016
[    9.524015] R13: 0000000000000001 R14: 000055938a1c3010 R15: =
00007ffd34266170

Thanks,
Paolo

>  Run with lockdep enabled and see if
> it spits out anything. We'll hit this exit path very quickly, so the
> test doesn't need to be anything exhaustive.
>=20
>=20
> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> index b12f9c87b4c3..546ff8f81ede 100644
> --- a/block/blk-ioc.c
> +++ b/block/blk-ioc.c
> @@ -54,7 +54,7 @@ static void ioc_exit_icq(struct io_cq *icq)
> 	icq->flags |=3D ICQ_EXITED;
> }
>=20
> -/* Release an icq.  Called with both ioc and q locked. */
> +/* Release an icq.  Called with ioc locked. */
> static void ioc_destroy_icq(struct io_cq *icq)
> {
> 	struct io_context *ioc =3D icq->ioc;
> @@ -62,7 +62,6 @@ static void ioc_destroy_icq(struct io_cq *icq)
> 	struct elevator_type *et =3D q->elevator->type;
>=20
> 	lockdep_assert_held(&ioc->lock);
> -	lockdep_assert_held(q->queue_lock);
>=20
> 	radix_tree_delete(&ioc->icq_tree, icq->q->id);
> 	hlist_del_init(&icq->ioc_node);
> @@ -222,25 +221,34 @@ void exit_io_context(struct task_struct *task)
> 	put_io_context_active(ioc);
> }
>=20
> +static void __ioc_clear_queue(struct list_head *icq_list)
> +{
> +	while (!list_empty(icq_list)) {
> +		struct io_cq *icq =3D list_entry(icq_list->next,
> +					       struct io_cq, q_node);
> +		struct io_context *ioc =3D icq->ioc;
> +
> +		spin_lock_irq(&ioc->lock);
> +		ioc_destroy_icq(icq);
> +		spin_unlock_irq(&ioc->lock);
> +	}
> +}
> +
> /**
>  * ioc_clear_queue - break any ioc association with the specified =
queue
>  * @q: request_queue being cleared
>  *
> - * Walk @q->icq_list and exit all io_cq's.  Must be called with @q =
locked.
> + * Walk @q->icq_list and exit all io_cq's.
>  */
> void ioc_clear_queue(struct request_queue *q)
> {
> -	lockdep_assert_held(q->queue_lock);
> +	LIST_HEAD(icq_list);
>=20
> -	while (!list_empty(&q->icq_list)) {
> -		struct io_cq *icq =3D list_entry(q->icq_list.next,
> -					       struct io_cq, q_node);
> -		struct io_context *ioc =3D icq->ioc;
> +	spin_lock_irq(q->queue_lock);
> +	list_splice_init(&q->icq_list, &icq_list);
> +	spin_unlock_irq(q->queue_lock);
>=20
> -		spin_lock(&ioc->lock);
> -		ioc_destroy_icq(icq);
> -		spin_unlock(&ioc->lock);
> -	}
> +	__ioc_clear_queue(&icq_list);
> }
>=20
> int create_task_io_context(struct task_struct *task, gfp_t gfp_flags, =
int node)
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 002af836aa87..c44b321335f3 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -815,9 +815,7 @@ static void blk_release_queue(struct kobject =
*kobj)
> 	blkcg_exit_queue(q);
>=20
> 	if (q->elevator) {
> -		spin_lock_irq(q->queue_lock);
> 		ioc_clear_queue(q);
> -		spin_unlock_irq(q->queue_lock);
> 		elevator_exit(q->elevator);
> 	}
>=20
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 137944777859..4a71c79de037 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -3658,6 +3658,8 @@ static void cfq_exit_icq(struct io_cq *icq)
> 	struct cfq_io_cq *cic =3D icq_to_cic(icq);
> 	struct cfq_data *cfqd =3D cic_to_cfqd(cic);
>=20
> +	spin_lock(cfqd->queue->queue_lock);
> +
> 	if (cic_to_cfqq(cic, false)) {
> 		cfq_exit_cfqq(cfqd, cic_to_cfqq(cic, false));
> 		cic_set_cfqq(cic, NULL, false);
> @@ -3667,6 +3669,8 @@ static void cfq_exit_icq(struct io_cq *icq)
> 		cfq_exit_cfqq(cfqd, cic_to_cfqq(cic, true));
> 		cic_set_cfqq(cic, NULL, true);
> 	}
> +
> +	spin_unlock(cfqd->queue->queue_lock);
> }
>=20
> static void cfq_init_prio_data(struct cfq_queue *cfqq, struct =
cfq_io_cq *cic)
> diff --git a/block/elevator.c b/block/elevator.c
> index ac1c9f481a98..01139f549b5b 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -983,9 +983,7 @@ static int elevator_switch(struct request_queue =
*q, struct elevator_type *new_e)
> 		if (old_registered)
> 			elv_unregister_queue(q);
>=20
> -		spin_lock_irq(q->queue_lock);
> 		ioc_clear_queue(q);
> -		spin_unlock_irq(q->queue_lock);
> 	}
>=20
> 	/* allocate, init and register new elevator */
>=20
> --=20
> Jens Axboe

^ permalink raw reply

* Re: [PATCH v2 09/13] md: raid1: use bio_segments_all()
From: Shaohua Li @ 2017-03-02  7:52 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <CACVXFVOLLJ1Og7ezOsDTbuBsJg+qxJdqmf0MO+VDwDZiSoFn1g@mail.gmail.com>

On Thu, Mar 02, 2017 at 10:34:25AM +0800, Ming Lei wrote:
> Hi Shaohua,
> 
> On Wed, Mar 1, 2017 at 7:42 AM, Shaohua Li <shli@kernel.org> wrote:
> > On Tue, Feb 28, 2017 at 11:41:39PM +0800, Ming Lei wrote:
> >> Use this helper, instead of direct access to .bi_vcnt.
> >
> > what We really need to do for the behind IO is:
> > - allocate memory and copy bio data to the memory
> > - let behind bio do IO against the memory
> >
> > The behind bio doesn't need to have the exactly same bio_vec setting. If we
> > just track the new memory, we don't need use the bio_segments_all and access
> > bio_vec too.
> 
> But we need to figure out how many vecs(each vec store one page) to be
> allocated for the cloned/behind bio, and that is the only value of
> bio_segments_all() here. Or you have idea to avoid that?

As I said, the behind bio doesn't need to have the exactly same bio_vec
setting. We just allocate memory and copy original bio data to the memory,
then do IO against the new memory. The behind bio
segments == (bio->bi_iter.bi_size + PAGE_SIZE - 1) >> PAGE_SHIFT

Thanks,
Shaohua

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox