Linux block layer
 help / color / mirror / Atom feed
* Re: [PATCH] blk-mq-sched: don't hold queue_lock when calling exit_icq
From: Paolo Valente @ 2017-03-02 16:07 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Omar Sandoval, linux-block, kernel-team
In-Reply-To: <80de4554-a14b-4ce6-7c6e-ec66f70b14b1@fb.com>


> Il giorno 02 mar 2017, alle ore 16:13, Jens Axboe <axboe@fb.com> ha =
scritto:
>=20
> On 03/02/2017 08:00 AM, Jens Axboe wrote:
>> On 03/02/2017 03:28 AM, Paolo Valente wrote:
>>>=20
>>>> Il giorno 15 feb 2017, alle ore 19:04, Jens Axboe <axboe@fb.com> ha =
scritto:
>>>>=20
>>>> On 02/15/2017 10:58 AM, Jens Axboe wrote:
>>>>> On 02/15/2017 10:24 AM, Paolo Valente wrote:
>>>>>>=20
>>>>>>> Il giorno 10 feb 2017, alle ore 19:32, Omar Sandoval =
<osandov@osandov.com> ha scritto:
>>>>>>>=20
>>>>>>> From: Omar Sandoval <osandov@fb.com>
>>>>>>>=20
>>>>>>> None of the other blk-mq elevator hooks are called with this =
lock held.
>>>>>>> Additionally, it can lead to circular locking dependencies =
between
>>>>>>> queue_lock and the private scheduler lock.
>>>>>>>=20
>>>>>>=20
>>>>>> Hi Omar,
>>>>>> I'm sorry but it seems that a new potential deadlock has showed =
up.
>>>>>> See lockdep splat below.
>>>>>>=20
>>>>>> I've tried to think about different solutions than turning back =
to
>>>>>> deferring the body of exit_icq, but at no avail.
>>>>>=20
>>>>> Looks like a interaction between bfqd->lock and q->queue_lock. =
Since the
>>>>> core has no notion of you bfqd->lock, the naturally dependency =
here
>>>>> would be to nest bfqd->lock inside q->queue_lock. Is that possible =
for
>>>>> you?
>>>>>=20
>>>>> Looking at the code a bit, maybe it'd just be simpler to get rid =
of
>>>>> holding the queue lock for that spot. For the mq scheduler, we =
really
>>>>> don't want places where we invoke with that held already. Does the =
below
>>>>> work for you?
>>>>=20
>>>> Would need to remove one more lockdep assert. And only test this =
for
>>>> the mq parts, we'd need to spread a bit of love on the classic
>>>> scheduling icq exit path for this to work on that side.
>>>>=20
>>>=20
>>>=20
>>> Jens,
>>> here is the reply I anticipated in my previous email: after rebasing
>>> against master, I'm getting again the deadlock that this patch of
>>> yours solved (together with some changes in bfq-mq).  I thought you =
added a
>>> sort of equivalent commit (now) to the mainline branch.  Am I wrong?
>>=20
>> The patch I posted was never pulled to completion, it wasn't clear
>> to me if it fixed your issue or not. Maybe I missed a reply on
>> that?
>>=20
>> Let me take another stab at it today, I'll send you a version to test
>> on top of my for-linus branch.
>=20
> I took at look at my last posting, and I think it was only missing a
> lock grab in cfq, since we don't hold that for icq exit anymore.  =
Paolo,
> it'd be great if you could verify that this works. Not for bfq-mq (we
> already know it does), but for CFQ.

No luck, sorry :(

It looks like to have just not to take q->queue_lock in cfq_exit_icq.

[    9.329501] =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=

[    9.336132] [ INFO: possible recursive locking detected ]
[    9.343801] 4.10.0-bfq-mq+ #56 Not tainted
[    9.353359] ---------------------------------------------
[    9.354101] ata_id/160 is trying to acquire lock:
[    9.354750]  (&(&q->__queue_lock)->rlock){..-...}, at: =
[<ffffffffaf48501a>] cfq_exit_icq+0x2a/0x80
[    9.355986]=20
[    9.355986] but task is already holding lock:
[    9.364221]  (&(&q->__queue_lock)->rlock){..-...}, at: =
[<ffffffffaf45a686>] put_io_context_active+0x86/0xe0
[    9.382980]=20
[    9.382980] other info that might help us debug this:
[    9.386460]  Possible unsafe locking scenario:
[    9.386460]=20
[    9.397070]        CPU0
[    9.397468]        ----
[    9.397811]   lock(&(&q->__queue_lock)->rlock);
[    9.398440]   lock(&(&q->__queue_lock)->rlock);
[    9.406265]=20
[    9.406265]  *** DEADLOCK ***
[    9.406265]=20
[    9.409589]  May be due to missing lock nesting notation
[    9.409589]=20
[    9.413040] 2 locks held by ata_id/160:
[    9.413576]  #0:  (&(&ioc->lock)->rlock/1){......}, at: =
[<ffffffffaf45a638>] put_io_context_active+0x38/0xe0
[    9.418069]  #1:  (&(&q->__queue_lock)->rlock){..-...}, at: =
[<ffffffffaf45a686>] put_io_context_active+0x86/0xe0
[    9.441118]=20
[    9.441118] stack backtrace:
[    9.445113] CPU: 0 PID: 160 Comm: ata_id Not tainted 4.10.0-bfq-mq+ =
#56
[    9.446210] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS =
VirtualBox 12/01/2006
[    9.457223] Call Trace:
[    9.457636]  dump_stack+0x85/0xc2
[    9.458183]  __lock_acquire+0x1834/0x1890
[    9.458839]  ? __lock_acquire+0x4af/0x1890
[    9.464206]  lock_acquire+0x11b/0x220
[    9.471703]  ? cfq_exit_icq+0x2a/0x80
[    9.472225]  _raw_spin_lock+0x3d/0x80
[    9.472784]  ? cfq_exit_icq+0x2a/0x80
[    9.476336]  cfq_exit_icq+0x2a/0x80
[    9.486750]  ioc_exit_icq+0x38/0x50
[    9.487245]  put_io_context_active+0x92/0xe0
[    9.488049]  exit_io_context+0x48/0x50
[    9.488423]  do_exit+0x7a8/0xc80
[    9.488797]  do_group_exit+0x50/0xd0
[    9.496655]  SyS_exit_group+0x14/0x20
[    9.497170]  entry_SYSCALL_64_fastpath+0x23/0xc6
[    9.497808] RIP: 0033:0x7f224d1c1b98
[    9.498302] RSP: 002b:00007ffd342666f8 EFLAGS: 00000246 ORIG_RAX: =
00000000000000e7
[    9.499360] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: =
00007f224d1c1b98
[    9.511706] RDX: 0000000000000000 RSI: 000000000000003c RDI: =
0000000000000000
[    9.516390] RBP: 00007f224db02690 R08: 00000000000000e7 R09: =
ffffffffffffff90
[    9.523044] R10: 00007f224d6d7250 R11: 0000000000000246 R12: =
0000000000000016
[    9.524015] R13: 0000000000000001 R14: 000055938a1c3010 R15: =
00007ffd34266170

Thanks,
Paolo

>  Run with lockdep enabled and see if
> it spits out anything. We'll hit this exit path very quickly, so the
> test doesn't need to be anything exhaustive.
>=20
>=20
> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> index b12f9c87b4c3..546ff8f81ede 100644
> --- a/block/blk-ioc.c
> +++ b/block/blk-ioc.c
> @@ -54,7 +54,7 @@ static void ioc_exit_icq(struct io_cq *icq)
> 	icq->flags |=3D ICQ_EXITED;
> }
>=20
> -/* Release an icq.  Called with both ioc and q locked. */
> +/* Release an icq.  Called with ioc locked. */
> static void ioc_destroy_icq(struct io_cq *icq)
> {
> 	struct io_context *ioc =3D icq->ioc;
> @@ -62,7 +62,6 @@ static void ioc_destroy_icq(struct io_cq *icq)
> 	struct elevator_type *et =3D q->elevator->type;
>=20
> 	lockdep_assert_held(&ioc->lock);
> -	lockdep_assert_held(q->queue_lock);
>=20
> 	radix_tree_delete(&ioc->icq_tree, icq->q->id);
> 	hlist_del_init(&icq->ioc_node);
> @@ -222,25 +221,34 @@ void exit_io_context(struct task_struct *task)
> 	put_io_context_active(ioc);
> }
>=20
> +static void __ioc_clear_queue(struct list_head *icq_list)
> +{
> +	while (!list_empty(icq_list)) {
> +		struct io_cq *icq =3D list_entry(icq_list->next,
> +					       struct io_cq, q_node);
> +		struct io_context *ioc =3D icq->ioc;
> +
> +		spin_lock_irq(&ioc->lock);
> +		ioc_destroy_icq(icq);
> +		spin_unlock_irq(&ioc->lock);
> +	}
> +}
> +
> /**
>  * ioc_clear_queue - break any ioc association with the specified =
queue
>  * @q: request_queue being cleared
>  *
> - * Walk @q->icq_list and exit all io_cq's.  Must be called with @q =
locked.
> + * Walk @q->icq_list and exit all io_cq's.
>  */
> void ioc_clear_queue(struct request_queue *q)
> {
> -	lockdep_assert_held(q->queue_lock);
> +	LIST_HEAD(icq_list);
>=20
> -	while (!list_empty(&q->icq_list)) {
> -		struct io_cq *icq =3D list_entry(q->icq_list.next,
> -					       struct io_cq, q_node);
> -		struct io_context *ioc =3D icq->ioc;
> +	spin_lock_irq(q->queue_lock);
> +	list_splice_init(&q->icq_list, &icq_list);
> +	spin_unlock_irq(q->queue_lock);
>=20
> -		spin_lock(&ioc->lock);
> -		ioc_destroy_icq(icq);
> -		spin_unlock(&ioc->lock);
> -	}
> +	__ioc_clear_queue(&icq_list);
> }
>=20
> int create_task_io_context(struct task_struct *task, gfp_t gfp_flags, =
int node)
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 002af836aa87..c44b321335f3 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -815,9 +815,7 @@ static void blk_release_queue(struct kobject =
*kobj)
> 	blkcg_exit_queue(q);
>=20
> 	if (q->elevator) {
> -		spin_lock_irq(q->queue_lock);
> 		ioc_clear_queue(q);
> -		spin_unlock_irq(q->queue_lock);
> 		elevator_exit(q->elevator);
> 	}
>=20
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 137944777859..4a71c79de037 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -3658,6 +3658,8 @@ static void cfq_exit_icq(struct io_cq *icq)
> 	struct cfq_io_cq *cic =3D icq_to_cic(icq);
> 	struct cfq_data *cfqd =3D cic_to_cfqd(cic);
>=20
> +	spin_lock(cfqd->queue->queue_lock);
> +
> 	if (cic_to_cfqq(cic, false)) {
> 		cfq_exit_cfqq(cfqd, cic_to_cfqq(cic, false));
> 		cic_set_cfqq(cic, NULL, false);
> @@ -3667,6 +3669,8 @@ static void cfq_exit_icq(struct io_cq *icq)
> 		cfq_exit_cfqq(cfqd, cic_to_cfqq(cic, true));
> 		cic_set_cfqq(cic, NULL, true);
> 	}
> +
> +	spin_unlock(cfqd->queue->queue_lock);
> }
>=20
> static void cfq_init_prio_data(struct cfq_queue *cfqq, struct =
cfq_io_cq *cic)
> diff --git a/block/elevator.c b/block/elevator.c
> index ac1c9f481a98..01139f549b5b 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -983,9 +983,7 @@ static int elevator_switch(struct request_queue =
*q, struct elevator_type *new_e)
> 		if (old_registered)
> 			elv_unregister_queue(q);
>=20
> -		spin_lock_irq(q->queue_lock);
> 		ioc_clear_queue(q);
> -		spin_unlock_irq(q->queue_lock);
> 	}
>=20
> 	/* allocate, init and register new elevator */
>=20
> --=20
> Jens Axboe

^ permalink raw reply

* Re: [PATCH v2 09/13] md: raid1: use bio_segments_all()
From: Shaohua Li @ 2017-03-02  7:52 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <CACVXFVOLLJ1Og7ezOsDTbuBsJg+qxJdqmf0MO+VDwDZiSoFn1g@mail.gmail.com>

On Thu, Mar 02, 2017 at 10:34:25AM +0800, Ming Lei wrote:
> Hi Shaohua,
> 
> On Wed, Mar 1, 2017 at 7:42 AM, Shaohua Li <shli@kernel.org> wrote:
> > On Tue, Feb 28, 2017 at 11:41:39PM +0800, Ming Lei wrote:
> >> Use this helper, instead of direct access to .bi_vcnt.
> >
> > what We really need to do for the behind IO is:
> > - allocate memory and copy bio data to the memory
> > - let behind bio do IO against the memory
> >
> > The behind bio doesn't need to have the exactly same bio_vec setting. If we
> > just track the new memory, we don't need use the bio_segments_all and access
> > bio_vec too.
> 
> But we need to figure out how many vecs(each vec store one page) to be
> allocated for the cloned/behind bio, and that is the only value of
> bio_segments_all() here. Or you have idea to avoid that?

As I said, the behind bio doesn't need to have the exactly same bio_vec
setting. We just allocate memory and copy original bio data to the memory,
then do IO against the new memory. The behind bio
segments == (bio->bi_iter.bi_size + PAGE_SIZE - 1) >> PAGE_SHIFT

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v2 04/13] md: prepare for managing resync I/O pages in clean way
From: Shaohua Li @ 2017-03-02 17:55 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <CACVXFVMJ-h+TwKT8eVbg3AxUjBzzEsCV0SeuDUaQ3FvFdfni3w@mail.gmail.com>

On Thu, Mar 02, 2017 at 10:09:14AM +0800, Ming Lei wrote:
> Hi Shaohua,
> 
> On Wed, Mar 1, 2017 at 7:30 AM, Shaohua Li <shli@kernel.org> wrote:
> > On Tue, Feb 28, 2017 at 11:41:34PM +0800, Ming Lei wrote:
> >> Now resync I/O use bio's bec table to manage pages,
> >> this way is very hacky, and may not work any more
> >> once multipage bvec is introduced.
> >>
> >> So introduce helpers and new data structure for
> >> managing resync I/O pages more cleanly.
> >>
> >> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
> >> ---
> >>  drivers/md/md.h | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 54 insertions(+)
> >>
> >> diff --git a/drivers/md/md.h b/drivers/md/md.h
> >> index 1d63239a1be4..b5a638d85cb4 100644
> >> --- a/drivers/md/md.h
> >> +++ b/drivers/md/md.h
> >> @@ -720,4 +720,58 @@ static inline void mddev_check_writesame(struct mddev *mddev, struct bio *bio)
> >>  #define RESYNC_BLOCK_SIZE (64*1024)
> >>  #define RESYNC_PAGES ((RESYNC_BLOCK_SIZE + PAGE_SIZE-1) / PAGE_SIZE)
> >>
> >> +/* for managing resync I/O pages */
> >> +struct resync_pages {
> >> +     unsigned        idx;    /* for get/put page from the pool */
> >> +     void            *raid_bio;
> >> +     struct page     *pages[RESYNC_PAGES];
> >> +};
> >
> > I'd like this to be embedded into r1bio directly. Not sure if we really need a
> > structure.
> 
> There are two reasons we can't put this into r1bio:
> - r1bio is used in both normal and resync I/O, not fair to allocate more
> in normal I/O, and that is why this patch wouldn't like to touch r1bio or r10bio
> 
> - the count of 'struct resync_pages' instance depends on raid_disks(raid1)
> or copies(raid10), both can't be decided during compiling.

Yes, I said it should be a pointer of r1bio pointing to the pages in other emails.
 
> >
> >> +}
> >> +
> >> +static inline bool resync_page_available(struct resync_pages *rp)
> >> +{
> >> +     return rp->idx < RESYNC_PAGES;
> >> +}
> >
> > Then we don't need this obscure API.
> 
> That is fine.
> 
> 
> Thanks,
> Ming Lei
 
> >
> >> +
> >> +static inline int resync_alloc_pages(struct resync_pages *rp,
> >> +                                  gfp_t gfp_flags)
> >> +{
> >> +     int i;
> >> +
> >> +     for (i = 0; i < RESYNC_PAGES; i++) {
> >> +             rp->pages[i] = alloc_page(gfp_flags);
> >> +             if (!rp->pages[i])
> >> +                     goto out_free;
> >> +     }
> >> +
> >> +     return 0;
> >> +
> >> + out_free:
> >> +     while (--i >= 0)
> >> +             __free_page(rp->pages[i]);
> >> +     return -ENOMEM;
> >> +}
> >> +
> >> +static inline void resync_free_pages(struct resync_pages *rp)
> >> +{
> >> +     int i;
> >> +
> >> +     for (i = 0; i < RESYNC_PAGES; i++)
> >> +             __free_page(rp->pages[i]);
> >
> > Since we will use get_page, shouldn't this be put_page?
> 
> You are right, will fix in v3.
> 
> >
> >> +}
> >> +
> >> +static inline void resync_get_all_pages(struct resync_pages *rp)
> >> +{
> >> +     int i;
> >> +
> >> +     for (i = 0; i < RESYNC_PAGES; i++)
> >> +             get_page(rp->pages[i]);
> >> +}
> >> +
> >> +static inline struct page *resync_fetch_page(struct resync_pages *rp)
> >> +{
> >> +     if (WARN_ON_ONCE(rp->idx >= RESYNC_PAGES))
> >> +             return NULL;
> >> +     return rp->pages[rp->idx++];
> >
> > I'd like the caller explicitly specify the index instead of a global variable
> > to track it, which will make the code more understandable and less error prone.
> 
> That is fine, but the index has to be per bio, and finally the index
> has to be stored
> in 'struct resync_pages', so every user has to call it in the following way:
> 
>           resync_fetch_page(rp, rp->idx);
> 
> then looks no benefit to pass index explicitly.

I suppose the callers always iterate though page 0 to RESYNC_PAGES - 1. Then
the callers can use an index by themselves. That will clearly show which page
the callers are using. The resync_fetch_page() wrap is a blackbox we always
need to refer to the definition.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH v2 05/13] md: raid1: simplify r1buf_pool_free()
From: Shaohua Li @ 2017-03-02 17:49 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <CACVXFVO3Nk4NgwTyD5eY9fYHAoshbGx=vnTFFQ_VpmED55=Bjg@mail.gmail.com>

On Thu, Mar 02, 2017 at 10:11:54AM +0800, Ming Lei wrote:
> Hi Shaohua,
> 
> On Wed, Mar 1, 2017 at 7:31 AM, Shaohua Li <shli@kernel.org> wrote:
> > On Tue, Feb 28, 2017 at 11:41:35PM +0800, Ming Lei wrote:
> >> This patch gets each page's reference of each bio for resync,
> >> then r1buf_pool_free() gets simplified a lot.
> >>
> >> The same policy has been taken in raid10's buf pool allocation/free
> >> too.
> >
> > We are going to delete the code, this simplify isn't really required.
> 
> Could you explain it a bit? Or I can put this patch into this patchset if you
> can provide one?

I mean you will replace the code soon, with the resync_pages stuff. So I
thought this isn't really required.

Thanks,
Shaohua

^ permalink raw reply

* [PATCH] block: Initialize bd_bdi on inode initialization
From: Jan Kara @ 2017-03-02 15:50 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Al Viro, Dmitry Vyukov, linux-block, Tejun Heo, Jan Kara

So far we initialized bd_bdi only in bdget(). That is fine for normal
bdev inodes however for the special case of the root inode of
blockdev_superblock that function is never called and thus bd_bdi is
left uninitialized. As a result bdev_evict_inode() may oops doing
bdi_put(root->bd_bdi) on that inode as can be seen when doing:

mount -t bdev none /mnt

Fix the problem by initializing bd_bdi when first allocating the inode
and then reinitializing bd_bdi in bdev_evict_inode().

Thanks to syzkaller team for finding the problem.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: b1d2dc5659b41741f5a29b2ade76ffb4e5bb13d8
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 77c30f15a02c..2eca00ec4370 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -870,6 +870,7 @@ static void init_once(void *foo)
 #ifdef CONFIG_SYSFS
 	INIT_LIST_HEAD(&bdev->bd_holder_disks);
 #endif
+	bdev->bd_bdi = &noop_backing_dev_info;
 	inode_init_once(&ei->vfs_inode);
 	/* Initialize mutex for freeze. */
 	mutex_init(&bdev->bd_fsfreeze_mutex);
@@ -884,8 +885,10 @@ static void bdev_evict_inode(struct inode *inode)
 	spin_lock(&bdev_lock);
 	list_del_init(&bdev->bd_list);
 	spin_unlock(&bdev_lock);
-	if (bdev->bd_bdi != &noop_backing_dev_info)
+	if (bdev->bd_bdi != &noop_backing_dev_info) {
 		bdi_put(bdev->bd_bdi);
+		bdev->bd_bdi = &noop_backing_dev_info;
+	}
 }
 
 static const struct super_operations bdev_sops = {
@@ -988,7 +991,6 @@ struct block_device *bdget(dev_t dev)
 		bdev->bd_contains = NULL;
 		bdev->bd_super = NULL;
 		bdev->bd_inode = inode;
-		bdev->bd_bdi = &noop_backing_dev_info;
 		bdev->bd_block_size = i_blocksize(inode);
 		bdev->bd_part_count = 0;
 		bdev->bd_invalidated = 0;
-- 
2.10.2

^ permalink raw reply related

* Re: [PATCH] blk-mq-sched: don't hold queue_lock when calling exit_icq
From: Jens Axboe @ 2017-03-02 15:13 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Omar Sandoval, linux-block, kernel-team
In-Reply-To: <dc4dc011-7038-2c05-b1c0-f6f7a07fa9ae@fb.com>

On 03/02/2017 08:00 AM, Jens Axboe wrote:
> On 03/02/2017 03:28 AM, Paolo Valente wrote:
>>
>>> Il giorno 15 feb 2017, alle ore 19:04, Jens Axboe <axboe@fb.com> ha scritto:
>>>
>>> On 02/15/2017 10:58 AM, Jens Axboe wrote:
>>>> On 02/15/2017 10:24 AM, Paolo Valente wrote:
>>>>>
>>>>>> Il giorno 10 feb 2017, alle ore 19:32, Omar Sandoval <osandov@osandov.com> ha scritto:
>>>>>>
>>>>>> From: Omar Sandoval <osandov@fb.com>
>>>>>>
>>>>>> None of the other blk-mq elevator hooks are called with this lock held.
>>>>>> Additionally, it can lead to circular locking dependencies between
>>>>>> queue_lock and the private scheduler lock.
>>>>>>
>>>>>
>>>>> Hi Omar,
>>>>> I'm sorry but it seems that a new potential deadlock has showed up.
>>>>> See lockdep splat below.
>>>>>
>>>>> I've tried to think about different solutions than turning back to
>>>>> deferring the body of exit_icq, but at no avail.
>>>>
>>>> Looks like a interaction between bfqd->lock and q->queue_lock. Since the
>>>> core has no notion of you bfqd->lock, the naturally dependency here
>>>> would be to nest bfqd->lock inside q->queue_lock. Is that possible for
>>>> you?
>>>>
>>>> Looking at the code a bit, maybe it'd just be simpler to get rid of
>>>> holding the queue lock for that spot. For the mq scheduler, we really
>>>> don't want places where we invoke with that held already. Does the below
>>>> work for you?
>>>
>>> Would need to remove one more lockdep assert. And only test this for
>>> the mq parts, we'd need to spread a bit of love on the classic
>>> scheduling icq exit path for this to work on that side.
>>>
>>
>>
>> Jens,
>> here is the reply I anticipated in my previous email: after rebasing
>> against master, I'm getting again the deadlock that this patch of
>> yours solved (together with some changes in bfq-mq).  I thought you added a
>> sort of equivalent commit (now) to the mainline branch.  Am I wrong?
> 
> The patch I posted was never pulled to completion, it wasn't clear
> to me if it fixed your issue or not. Maybe I missed a reply on
> that?
> 
> Let me take another stab at it today, I'll send you a version to test
> on top of my for-linus branch.

I took at look at my last posting, and I think it was only missing a
lock grab in cfq, since we don't hold that for icq exit anymore.  Paolo,
it'd be great if you could verify that this works. Not for bfq-mq (we
already know it does), but for CFQ. Run with lockdep enabled and see if
it spits out anything. We'll hit this exit path very quickly, so the
test doesn't need to be anything exhaustive.


diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index b12f9c87b4c3..546ff8f81ede 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -54,7 +54,7 @@ static void ioc_exit_icq(struct io_cq *icq)
 	icq->flags |= ICQ_EXITED;
 }
 
-/* Release an icq.  Called with both ioc and q locked. */
+/* Release an icq.  Called with ioc locked. */
 static void ioc_destroy_icq(struct io_cq *icq)
 {
 	struct io_context *ioc = icq->ioc;
@@ -62,7 +62,6 @@ static void ioc_destroy_icq(struct io_cq *icq)
 	struct elevator_type *et = q->elevator->type;
 
 	lockdep_assert_held(&ioc->lock);
-	lockdep_assert_held(q->queue_lock);
 
 	radix_tree_delete(&ioc->icq_tree, icq->q->id);
 	hlist_del_init(&icq->ioc_node);
@@ -222,25 +221,34 @@ void exit_io_context(struct task_struct *task)
 	put_io_context_active(ioc);
 }
 
+static void __ioc_clear_queue(struct list_head *icq_list)
+{
+	while (!list_empty(icq_list)) {
+		struct io_cq *icq = list_entry(icq_list->next,
+					       struct io_cq, q_node);
+		struct io_context *ioc = icq->ioc;
+
+		spin_lock_irq(&ioc->lock);
+		ioc_destroy_icq(icq);
+		spin_unlock_irq(&ioc->lock);
+	}
+}
+
 /**
  * ioc_clear_queue - break any ioc association with the specified queue
  * @q: request_queue being cleared
  *
- * Walk @q->icq_list and exit all io_cq's.  Must be called with @q locked.
+ * Walk @q->icq_list and exit all io_cq's.
  */
 void ioc_clear_queue(struct request_queue *q)
 {
-	lockdep_assert_held(q->queue_lock);
+	LIST_HEAD(icq_list);
 
-	while (!list_empty(&q->icq_list)) {
-		struct io_cq *icq = list_entry(q->icq_list.next,
-					       struct io_cq, q_node);
-		struct io_context *ioc = icq->ioc;
+	spin_lock_irq(q->queue_lock);
+	list_splice_init(&q->icq_list, &icq_list);
+	spin_unlock_irq(q->queue_lock);
 
-		spin_lock(&ioc->lock);
-		ioc_destroy_icq(icq);
-		spin_unlock(&ioc->lock);
-	}
+	__ioc_clear_queue(&icq_list);
 }
 
 int create_task_io_context(struct task_struct *task, gfp_t gfp_flags, int node)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 002af836aa87..c44b321335f3 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -815,9 +815,7 @@ static void blk_release_queue(struct kobject *kobj)
 	blkcg_exit_queue(q);
 
 	if (q->elevator) {
-		spin_lock_irq(q->queue_lock);
 		ioc_clear_queue(q);
-		spin_unlock_irq(q->queue_lock);
 		elevator_exit(q->elevator);
 	}
 
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 137944777859..4a71c79de037 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -3658,6 +3658,8 @@ static void cfq_exit_icq(struct io_cq *icq)
 	struct cfq_io_cq *cic = icq_to_cic(icq);
 	struct cfq_data *cfqd = cic_to_cfqd(cic);
 
+	spin_lock(cfqd->queue->queue_lock);
+
 	if (cic_to_cfqq(cic, false)) {
 		cfq_exit_cfqq(cfqd, cic_to_cfqq(cic, false));
 		cic_set_cfqq(cic, NULL, false);
@@ -3667,6 +3669,8 @@ static void cfq_exit_icq(struct io_cq *icq)
 		cfq_exit_cfqq(cfqd, cic_to_cfqq(cic, true));
 		cic_set_cfqq(cic, NULL, true);
 	}
+
+	spin_unlock(cfqd->queue->queue_lock);
 }
 
 static void cfq_init_prio_data(struct cfq_queue *cfqq, struct cfq_io_cq *cic)
diff --git a/block/elevator.c b/block/elevator.c
index ac1c9f481a98..01139f549b5b 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -983,9 +983,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 		if (old_registered)
 			elv_unregister_queue(q);
 
-		spin_lock_irq(q->queue_lock);
 		ioc_clear_queue(q);
-		spin_unlock_irq(q->queue_lock);
 	}
 
 	/* allocate, init and register new elevator */

-- 
Jens Axboe

^ permalink raw reply related

* Re: [PATCH 3/8] nowait aio: return if direct write will trigger writeback
From: Jan Kara @ 2017-03-02 15:22 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jan Kara, Christoph Hellwig, Goldwyn Rodrigues, jack,
	linux-fsdevel, linux-block, linux-btrfs, linux-ext4, linux-xfs,
	Goldwyn Rodrigues
In-Reply-To: <20170302141245.GO16328@bombadil.infradead.org>

On Thu 02-03-17 06:12:45, Matthew Wilcox wrote:
> On Thu, Mar 02, 2017 at 11:38:45AM +0100, Jan Kara wrote:
> > On Wed 01-03-17 07:38:57, Christoph Hellwig wrote:
> > > On Tue, Feb 28, 2017 at 07:46:06PM -0800, Matthew Wilcox wrote:
> > > > But what's going to kick these pages out of cache?  Shouldn't we rather
> > > > find the pages, kick them out if clean, start writeback if not, and *then*
> > > > return -EAGAIN?
> > > 
> > > As pointed out in the last round of these patches I think we really
> > > need to pass a flags argument to filemap_write_and_wait_range to
> > > communicate the non-blocking nature and only return -EAGAIN if we'd
> > > block.  As a bonus that can indeed start to kick the pages out.
> > 
> > Aren't flags to filemap_write_and_wait_range() unnecessary complication?
> > Realistically, most users wanting performance from AIO DIO so badly that
> > they bother with this API won't have any pages to write / evict. If they do
> > by some bad accident, they can fall back to standard "blocking" AIO DIO.
> > So I don't see much value in teaching filemap_write_and_wait_range() about
> > a non-blocking mode...
> 
> That lets me execute a DoS against a user using this API.  All I have
> to do is open the file they're using read-only and read a byte from it.
> Page goes into page-cache, and they'll only get -EAGAIN from calling
> this syscall until the page ages out.

It will not be a DoS. This non-blocking AIO can always return EAGAIN when
it feels like it and the caller is required to fall back to a blocking
version in that case if he wants to guarantee forward progress. It is just
a performance optimization which allows user (database) to submit IO from a
computation thread instead of having to offload it to an IO thread...

> Also, I don't understand why this is a flag.  Isn't the point of AIO to
> be non-blocking?  Why isn't this just a change to how we do AIO?

Because this is an API change and the caller has to implement some handling
to guarantee a forward progress of non-blocking IO...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [WIP BRANCH] cgroups support in bfq-mq WIP branch
From: Jens Axboe @ 2017-03-02 15:19 UTC (permalink / raw)
  To: Paolo Valente
  Cc: ulf.hansson@linaro.org, tj@kernel.org,
	linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
	broonie@kernel.org, linus.walleij@linaro.org, Bart Van Assche
In-Reply-To: <3410FACF-D8AF-41FA-B38B-9EFAE6B301F0@linaro.org>

On 03/02/2017 03:15 AM, Paolo Valente wrote:
> 
>> Il giorno 25 feb 2017, alle ore 19:52, Jens Axboe <axboe@kernel.dk> ha scritto:
>>
>> On 02/25/2017 10:44 AM, Paolo Valente wrote:
>>> Hi,
>>> I've just completed cgroups support, and I'd like to highlight the
>>> main blk-mq issue that I have found along the way.  I have pushed the
>>> commit that completes the support for cgroups to the usual WIP branch
>>> [1].  Before moving to this issue, I have preliminary question about
>>> the scheduler name, since I'm about to start preparing the patch
>>> series for submission.  So far, I have used bfq-mq as a temporary
>>> name.  Are we fine with it, or should I change it, for example, to
>>> just bfq?  Jens?
>>
>> Just call it 'bfq', that doesn't conflict with anything that's
>> in the kernel already.
>>
> 
> ok
> 
>>> I've found a sort of circular dependency in blk-mq, related to
>>> scheduler initialization.  To describe both the issue and how I've
>>> addressed it, I'm pasting the message of the new commit.
>>
>> Rebase your patches on top of Linus current master, some of them
>> will need to change and some can be dropped.
>>
> 
> Done, but the last deadlock issue shows up again :( To help you get
> context, I'm going to reply to the email in which your sent the patch that
> solved it.

OK, I got that sent to you. When you have tested it, just add it as
a prep patch in your series. If it works for you, then let me know
and I'll add your Tested-by: to that patch and post it for more
thorough review.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH] blk-mq-sched: don't hold queue_lock when calling exit_icq
From: Jens Axboe @ 2017-03-02 15:00 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Omar Sandoval, linux-block, kernel-team
In-Reply-To: <1720FEB5-FBBA-4EAA-8292-E820AA15389D@linaro.org>

On 03/02/2017 03:28 AM, Paolo Valente wrote:
> 
>> Il giorno 15 feb 2017, alle ore 19:04, Jens Axboe <axboe@fb.com> ha scritto:
>>
>> On 02/15/2017 10:58 AM, Jens Axboe wrote:
>>> On 02/15/2017 10:24 AM, Paolo Valente wrote:
>>>>
>>>>> Il giorno 10 feb 2017, alle ore 19:32, Omar Sandoval <osandov@osandov.com> ha scritto:
>>>>>
>>>>> From: Omar Sandoval <osandov@fb.com>
>>>>>
>>>>> None of the other blk-mq elevator hooks are called with this lock held.
>>>>> Additionally, it can lead to circular locking dependencies between
>>>>> queue_lock and the private scheduler lock.
>>>>>
>>>>
>>>> Hi Omar,
>>>> I'm sorry but it seems that a new potential deadlock has showed up.
>>>> See lockdep splat below.
>>>>
>>>> I've tried to think about different solutions than turning back to
>>>> deferring the body of exit_icq, but at no avail.
>>>
>>> Looks like a interaction between bfqd->lock and q->queue_lock. Since the
>>> core has no notion of you bfqd->lock, the naturally dependency here
>>> would be to nest bfqd->lock inside q->queue_lock. Is that possible for
>>> you?
>>>
>>> Looking at the code a bit, maybe it'd just be simpler to get rid of
>>> holding the queue lock for that spot. For the mq scheduler, we really
>>> don't want places where we invoke with that held already. Does the below
>>> work for you?
>>
>> Would need to remove one more lockdep assert. And only test this for
>> the mq parts, we'd need to spread a bit of love on the classic
>> scheduling icq exit path for this to work on that side.
>>
> 
> 
> Jens,
> here is the reply I anticipated in my previous email: after rebasing
> against master, I'm getting again the deadlock that this patch of
> yours solved (together with some changes in bfq-mq).  I thought you added a
> sort of equivalent commit (now) to the mainline branch.  Am I wrong?

The patch I posted was never pulled to completion, it wasn't clear
to me if it fixed your issue or not. Maybe I missed a reply on
that?

Let me take another stab at it today, I'll send you a version to test
on top of my for-linus branch.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH 3/8] nowait aio: return if direct write will trigger writeback
From: Matthew Wilcox @ 2017-03-02 14:12 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Goldwyn Rodrigues, jack, linux-fsdevel,
	linux-block, linux-btrfs, linux-ext4, linux-xfs,
	Goldwyn Rodrigues
In-Reply-To: <20170302103845.GB31792@quack2.suse.cz>

On Thu, Mar 02, 2017 at 11:38:45AM +0100, Jan Kara wrote:
> On Wed 01-03-17 07:38:57, Christoph Hellwig wrote:
> > On Tue, Feb 28, 2017 at 07:46:06PM -0800, Matthew Wilcox wrote:
> > > But what's going to kick these pages out of cache?  Shouldn't we rather
> > > find the pages, kick them out if clean, start writeback if not, and *then*
> > > return -EAGAIN?
> > 
> > As pointed out in the last round of these patches I think we really
> > need to pass a flags argument to filemap_write_and_wait_range to
> > communicate the non-blocking nature and only return -EAGAIN if we'd
> > block.  As a bonus that can indeed start to kick the pages out.
> 
> Aren't flags to filemap_write_and_wait_range() unnecessary complication?
> Realistically, most users wanting performance from AIO DIO so badly that
> they bother with this API won't have any pages to write / evict. If they do
> by some bad accident, they can fall back to standard "blocking" AIO DIO.
> So I don't see much value in teaching filemap_write_and_wait_range() about
> a non-blocking mode...

That lets me execute a DoS against a user using this API.  All I have
to do is open the file they're using read-only and read a byte from it.
Page goes into page-cache, and they'll only get -EAGAIN from calling
this syscall until the page ages out.

Also, I don't understand why this is a flag.  Isn't the point of AIO to
be non-blocking?  Why isn't this just a change to how we do AIO?

^ permalink raw reply

* Re: [PATCH] blk-mq-sched: don't hold queue_lock when calling exit_icq
From: Paolo Valente @ 2017-03-02 10:28 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Omar Sandoval, linux-block, kernel-team
In-Reply-To: <9fe1b3a9-2a80-8891-1044-f83558e28d15@fb.com>


> Il giorno 15 feb 2017, alle ore 19:04, Jens Axboe <axboe@fb.com> ha =
scritto:
>=20
> On 02/15/2017 10:58 AM, Jens Axboe wrote:
>> On 02/15/2017 10:24 AM, Paolo Valente wrote:
>>>=20
>>>> Il giorno 10 feb 2017, alle ore 19:32, Omar Sandoval =
<osandov@osandov.com> ha scritto:
>>>>=20
>>>> From: Omar Sandoval <osandov@fb.com>
>>>>=20
>>>> None of the other blk-mq elevator hooks are called with this lock =
held.
>>>> Additionally, it can lead to circular locking dependencies between
>>>> queue_lock and the private scheduler lock.
>>>>=20
>>>=20
>>> Hi Omar,
>>> I'm sorry but it seems that a new potential deadlock has showed up.
>>> See lockdep splat below.
>>>=20
>>> I've tried to think about different solutions than turning back to
>>> deferring the body of exit_icq, but at no avail.
>>=20
>> Looks like a interaction between bfqd->lock and q->queue_lock. Since =
the
>> core has no notion of you bfqd->lock, the naturally dependency here
>> would be to nest bfqd->lock inside q->queue_lock. Is that possible =
for
>> you?
>>=20
>> Looking at the code a bit, maybe it'd just be simpler to get rid of
>> holding the queue lock for that spot. For the mq scheduler, we really
>> don't want places where we invoke with that held already. Does the =
below
>> work for you?
>=20
> Would need to remove one more lockdep assert. And only test this for
> the mq parts, we'd need to spread a bit of love on the classic
> scheduling icq exit path for this to work on that side.
>=20


Jens,
here is the reply I anticipated in my previous email: after rebasing
against master, I'm getting again the deadlock that this patch of
yours solved (together with some changes in bfq-mq).  I thought you =
added a
sort of equivalent commit (now) to the mainline branch.  Am I wrong?

Thanks,
Paolo

> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> index b12f9c87b4c3..546ff8f81ede 100644
> --- a/block/blk-ioc.c
> +++ b/block/blk-ioc.c
> @@ -54,7 +54,7 @@ static void ioc_exit_icq(struct io_cq *icq)
> 	icq->flags |=3D ICQ_EXITED;
> }
>=20
> -/* Release an icq.  Called with both ioc and q locked. */
> +/* Release an icq.  Called with ioc locked. */
> static void ioc_destroy_icq(struct io_cq *icq)
> {
> 	struct io_context *ioc =3D icq->ioc;
> @@ -62,7 +62,6 @@ static void ioc_destroy_icq(struct io_cq *icq)
> 	struct elevator_type *et =3D q->elevator->type;
>=20
> 	lockdep_assert_held(&ioc->lock);
> -	lockdep_assert_held(q->queue_lock);
>=20
> 	radix_tree_delete(&ioc->icq_tree, icq->q->id);
> 	hlist_del_init(&icq->ioc_node);
> @@ -222,25 +221,34 @@ void exit_io_context(struct task_struct *task)
> 	put_io_context_active(ioc);
> }
>=20
> +static void __ioc_clear_queue(struct list_head *icq_list)
> +{
> +	while (!list_empty(icq_list)) {
> +		struct io_cq *icq =3D list_entry(icq_list->next,
> +					       struct io_cq, q_node);
> +		struct io_context *ioc =3D icq->ioc;
> +
> +		spin_lock_irq(&ioc->lock);
> +		ioc_destroy_icq(icq);
> +		spin_unlock_irq(&ioc->lock);
> +	}
> +}
> +
> /**
>  * ioc_clear_queue - break any ioc association with the specified =
queue
>  * @q: request_queue being cleared
>  *
> - * Walk @q->icq_list and exit all io_cq's.  Must be called with @q =
locked.
> + * Walk @q->icq_list and exit all io_cq's.
>  */
> void ioc_clear_queue(struct request_queue *q)
> {
> -	lockdep_assert_held(q->queue_lock);
> +	LIST_HEAD(icq_list);
>=20
> -	while (!list_empty(&q->icq_list)) {
> -		struct io_cq *icq =3D list_entry(q->icq_list.next,
> -					       struct io_cq, q_node);
> -		struct io_context *ioc =3D icq->ioc;
> +	spin_lock_irq(q->queue_lock);
> +	list_splice_init(&q->icq_list, &icq_list);
> +	spin_unlock_irq(q->queue_lock);
>=20
> -		spin_lock(&ioc->lock);
> -		ioc_destroy_icq(icq);
> -		spin_unlock(&ioc->lock);
> -	}
> +	__ioc_clear_queue(&icq_list);
> }
>=20
> int create_task_io_context(struct task_struct *task, gfp_t gfp_flags, =
int node)
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 070d81bae1d5..1944aa1cb899 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -815,9 +815,7 @@ static void blk_release_queue(struct kobject =
*kobj)
> 	blkcg_exit_queue(q);
>=20
> 	if (q->elevator) {
> -		spin_lock_irq(q->queue_lock);
> 		ioc_clear_queue(q);
> -		spin_unlock_irq(q->queue_lock);
> 		elevator_exit(q->elevator);
> 	}
>=20
> diff --git a/block/elevator.c b/block/elevator.c
> index a25bdd90b270..aaa1e9836512 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -985,9 +985,7 @@ static int elevator_switch(struct request_queue =
*q, struct elevator_type *new_e)
> 		if (old_registered)
> 			elv_unregister_queue(q);
>=20
> -		spin_lock_irq(q->queue_lock);
> 		ioc_clear_queue(q);
> -		spin_unlock_irq(q->queue_lock);
> 	}
>=20
> 	/* allocate, init and register new elevator */
>=20
> --=20
> Jens Axboe

^ permalink raw reply

* Re: [PATCH 3/8] nowait aio: return if direct write will trigger writeback
From: Jan Kara @ 2017-03-02 10:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matthew Wilcox, Goldwyn Rodrigues, jack, linux-fsdevel,
	linux-block, linux-btrfs, linux-ext4, linux-xfs,
	Goldwyn Rodrigues
In-Reply-To: <20170301153857.GC30631@infradead.org>

On Wed 01-03-17 07:38:57, Christoph Hellwig wrote:
> On Tue, Feb 28, 2017 at 07:46:06PM -0800, Matthew Wilcox wrote:
> > Ugh, this is pretty inefficient.  If that's all you want to know, then
> > using the radix tree directly will be far more efficient than spinning
> > up all the pagevec machinery only to discard the pages found.
> > 
> > But what's going to kick these pages out of cache?  Shouldn't we rather
> > find the pages, kick them out if clean, start writeback if not, and *then*
> > return -EAGAIN?
> > 
> > So maybe we want to spin up the pagevec machinery after all so we can
> > do that extra work?
> 
> As pointed out in the last round of these patches I think we really
> need to pass a flags argument to filemap_write_and_wait_range to
> communicate the non-blocking nature and only return -EAGAIN if we'd
> block.  As a bonus that can indeed start to kick the pages out.

Aren't flags to filemap_write_and_wait_range() unnecessary complication?
Realistically, most users wanting performance from AIO DIO so badly that
they bother with this API won't have any pages to write / evict. If they do
by some bad accident, they can fall back to standard "blocking" AIO DIO.
So I don't see much value in teaching filemap_write_and_wait_range() about
a non-blocking mode...

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH] cfq-iosched: fix the delay of cfq_group's vdisktime under iops mode
From: Jan Kara @ 2017-03-02 10:29 UTC (permalink / raw)
  To: Hou Tao; +Cc: axboe, linux-block, jmoyer, jack, stable
In-Reply-To: <1488334064-34883-1-git-send-email-houtao1@huawei.com>

On Wed 01-03-17 10:07:44, Hou Tao wrote:
> When adding a cfq_group into the cfq service tree, we use CFQ_IDLE_DELAY
> as the delay of cfq_group's vdisktime if there have been other cfq_groups
> already.
> 
> When cfq is under iops mode, commit 9a7f38c42c2b ("cfq-iosched: Convert
> from jiffies to nanoseconds") could result in a large iops delay and
> lead to an abnormal io schedule delay for the added cfq_group. To fix
> it, we just need to revert to the old CFQ_IDLE_DELAY value: HZ / 5
> when iops mode is enabled.
> 
> Cc: <stable@vger.kernel.org> # 4.8+
> Signed-off-by: Hou Tao <houtao1@huawei.com>

OK, I agree my commit broke the logic in this case. Thanks for the fix.
Please add also tag:

Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4

I somewhat disagree with the fix though. See below:

> +static inline u64 cfq_get_cfqg_vdisktime_delay(struct cfq_data *cfqd)
> +{
> +	if (!iops_mode(cfqd))
> +		return CFQ_IDLE_DELAY;
> +	else
> +		return nsecs_to_jiffies64(CFQ_IDLE_DELAY);
> +}
> +

So using nsecs_to_jiffies64(CFQ_IDLE_DELAY) when in iops mode just does not
make any sense. AFAIU the code in cfq_group_notify_queue_add() we just want
to add the cfqg as the last one in the tree. So returning 1 from
cfq_get_cfqg_vdisktime_delay() in iops mode should be fine as well.

Frankly, vdisktime is in fixed-point precision shifted by
CFQ_SERVICE_SHIFT so using CFQ_IDLE_DELAY does not make much sense in any
case and just adding 1 to maximum vdisktime should be fine in all the
cases. But that would require more testing whether I did not miss anything
subtle.

								Honza

>  static void
>  cfq_group_notify_queue_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
>  {
> @@ -1380,7 +1388,8 @@ cfq_group_notify_queue_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
>  	n = rb_last(&st->rb);
>  	if (n) {
>  		__cfqg = rb_entry_cfqg(n);
> -		cfqg->vdisktime = __cfqg->vdisktime + CFQ_IDLE_DELAY;
> +		cfqg->vdisktime = __cfqg->vdisktime +
> +			cfq_get_cfqg_vdisktime_delay(cfqd);
>  	} else
>  		cfqg->vdisktime = st->min_vdisktime;
>  	cfq_group_service_tree_add(st, cfqg);
> -- 
> 2.5.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [WIP BRANCH] cgroups support in bfq-mq WIP branch
From: Paolo Valente @ 2017-03-02 10:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: ulf.hansson@linaro.org, tj@kernel.org,
	linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
	broonie@kernel.org, linus.walleij@linaro.org, Bart Van Assche
In-Reply-To: <97eadcb9-5049-399a-e5a7-d4f8b821756b@kernel.dk>


> Il giorno 25 feb 2017, alle ore 19:52, Jens Axboe <axboe@kernel.dk> ha =
scritto:
>=20
> On 02/25/2017 10:44 AM, Paolo Valente wrote:
>> Hi,
>> I've just completed cgroups support, and I'd like to highlight the
>> main blk-mq issue that I have found along the way.  I have pushed the
>> commit that completes the support for cgroups to the usual WIP branch
>> [1].  Before moving to this issue, I have preliminary question about
>> the scheduler name, since I'm about to start preparing the patch
>> series for submission.  So far, I have used bfq-mq as a temporary
>> name.  Are we fine with it, or should I change it, for example, to
>> just bfq?  Jens?
>=20
> Just call it 'bfq', that doesn't conflict with anything that's
> in the kernel already.
>=20

ok

>> I've found a sort of circular dependency in blk-mq, related to
>> scheduler initialization.  To describe both the issue and how I've
>> addressed it, I'm pasting the message of the new commit.
>=20
> Rebase your patches on top of Linus current master, some of them
> will need to change and some can be dropped.
>=20

Done, but the last deadlock issue shows up again :( To help you get
context, I'm going to reply to the email in which your sent the patch =
that
solved it.


> And disentangle it completely from the old bfq, I don't want to see
> nasty stuff like includes of .c files with prior defines modifying
> behavior of functions.
>=20

Of course.

> When that's done, get it posted for review asap. I would imagine
> we will go through a few postings and review cycles, and if we're
> targeting 4.12 with this, then we should get the ball rolling
> on that side.
>=20

I was about to to submit, but bumped into the above regression.

Thanks,
Paolo

> --=20
> Jens Axboe
>=20

^ permalink raw reply

* Re: [PATCH 3/3] nvme: Complete all stuck requests
From: Sagi Grimberg @ 2017-03-02  7:15 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe, Christoph Hellwig, linux-nvme,
	linux-block
  Cc: Marc MERLIN, Jens Axboe
In-Reply-To: <1488396132-11369-4-git-send-email-keith.busch@intel.com>

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply

* Re: blk_integrity_revalidate() clears BDI_CAP_STABLE_WRITES
From: Martin K. Petersen @ 2017-03-02  3:24 UTC (permalink / raw)
  To: Ilya Dryomov
  Cc: Jens Axboe, Sagi Grimberg, Martin K. Petersen, Mike Snitzer,
	linux-nvme, Christoph Hellwig, linux-block, Ceph Development,
	Dan Williams
In-Reply-To: <CAOi1vP_g-MMe9XvWDgkVL4R6UjckMQVC_CDF9+hQEpZbxwx0Jw@mail.gmail.com>

>>>>> "Ilya" == Ilya Dryomov <idryomov@gmail.com> writes:

Ilya,

Ilya> Given the above, I'm not sure what the baseline is --
Ilya> blk_integrity code isn't invoked for data-only lbafs.  Could nvme
Ilya> folks please look at this?  rbd regression caused by integrity
Ilya> revalidate change goes back to 4.4 and I'd really like to get it
Ilya> fixed.  The proposed patch is attached to my earlier reply on
Ilya> linux-block.

I've had a couple of fires to attend to this week. I'll try and give
your patch a spin on my FC setup tomorrow or Friday.

-- 
Martin K. Petersen	Oracle Linux Engineering

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply

* Re: [PATCH 1/2] loop: fix LO_FLAGS_PARTSCAN hang
From: Jens Axboe @ 2017-03-02  2:26 UTC (permalink / raw)
  To: Omar Sandoval, linux-block; +Cc: kernel-team, Ming Lei, stable
In-Reply-To: <7e888cad9e6c35835559281d3ab9e05ea48836d0.1488393750.git.osandov@fb.com>

On 03/01/2017 11:42 AM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> loop_reread_partitions() needs to do I/O, but we just froze the queue,
> so we end up waiting forever. This can easily be reproduced with losetup
> -P. Fix it by moving the reread to after we unfreeze the queue.

Thanks Omar, applied. That's a brown paper bag bug.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH 2/2] blk-mq-debugfs: add q->mq_freeze_depth output
From: Omar Sandoval @ 2017-03-02  3:07 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, kernel-team
In-Reply-To: <6fb18656-799e-6221-6b82-44f14cd59418@fb.com>

On Wed, Mar 01, 2017 at 07:25:59PM -0700, Jens Axboe wrote:
> On 03/01/2017 11:42 AM, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > This can be used to diagnose freeze-related hangs.
> 
> Do we really need this? For the loop case, it was painfully
> obvious that it was a count issue. And I can't think of any
> cases where it would have helped us, outside of rewriting
> core parts (like the first edition of the shadow requests
> for the scheduler).
> 
> Not strongly opposed to it, feel free to convince me.

I don't feel a burning need for it, let's just drop it.

^ permalink raw reply

* Re: [PATCH 1/2] loop: fix LO_FLAGS_PARTSCAN hang
From: Ming Lei @ 2017-03-02  1:57 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: Jens Axboe, linux-block, FB Kernel Team, stable
In-Reply-To: <7e888cad9e6c35835559281d3ab9e05ea48836d0.1488393750.git.osandov@fb.com>

On Thu, Mar 2, 2017 at 2:42 AM, Omar Sandoval <osandov@osandov.com> wrote:
> From: Omar Sandoval <osandov@fb.com>
>
> loop_reread_partitions() needs to do I/O, but we just froze the queue,
> so we end up waiting forever. This can easily be reproduced with losetup
> -P. Fix it by moving the reread to after we unfreeze the queue.
>
> Fixes: ecdd09597a57 ("block/loop: fix race between I/O and set_status")
> Reported-by: Tejun Heo <tj@kernel.org>
> Cc: Ming Lei <tom.leiming@gmail.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Reviewed-by: Ming Lei <tom.leiming@gmail.com>

Thanks,
Ming

^ permalink raw reply

* Re: [PATCH 2/2] blk-mq-debugfs: add q->mq_freeze_depth output
From: Jens Axboe @ 2017-03-02  2:25 UTC (permalink / raw)
  To: Omar Sandoval, linux-block; +Cc: kernel-team
In-Reply-To: <d4ab26006c6b454db94d548df10da3ce8908c8a1.1488393750.git.osandov@fb.com>

On 03/01/2017 11:42 AM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> This can be used to diagnose freeze-related hangs.

Do we really need this? For the loop case, it was painfully
obvious that it was a count issue. And I can't think of any
cases where it would have helped us, outside of rewriting
core parts (like the first edition of the shadow requests
for the scheduler).

Not strongly opposed to it, feel free to convince me.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH v2 13/13] md: raid10: avoid direct access to bvec table in handle_reshape_read_error
From: Ming Lei @ 2017-03-02  2:37 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <20170228234627.3h6tospuz33yhdig@kernel.org>

Hi Shaohua,

On Wed, Mar 1, 2017 at 7:46 AM, Shaohua Li <shli@kernel.org> wrote:
> On Tue, Feb 28, 2017 at 11:41:43PM +0800, Ming Lei wrote:
>> The cost is 128bytes(8*16) stack space in kernel thread context, and
>> just use the bio helper to retrieve pages from bio.
>>
>> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
>> ---
>>  drivers/md/raid10.c | 12 ++++++++++--
>>  1 file changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> index 0b97631e3905..6ffb64ab45f8 100644
>> --- a/drivers/md/raid10.c
>> +++ b/drivers/md/raid10.c
>> @@ -4670,7 +4670,15 @@ static int handle_reshape_read_error(struct mddev *mddev,
>>       struct r10bio *r10b = &on_stack.r10_bio;
>>       int slot = 0;
>>       int idx = 0;
>> -     struct bio_vec *bvec = r10_bio->master_bio->bi_io_vec;
>> +     struct bio_vec *bvl;
>> +     struct page *pages[RESYNC_PAGES];
>> +
>> +     /*
>> +      * This bio is allocated in reshape_request(), and size
>> +      * is still RESYNC_PAGES
>> +      */
>> +     bio_for_each_segment_all(bvl, r10_bio->master_bio, idx)
>> +             pages[idx] = bvl->bv_page;
>
> The reshape bio is doing IO against the memory we allocated for r10_bio, I'm
> wondering why we can't get the pages from r10_bio. In this way, we don't need
> access the bio_vec any more.

Reshap read is special and the bio(r10_bio->master_bio) isn't allocated from
.r10buf_pool, please see reshape_request().

Thanks,
Ming Lei

^ permalink raw reply

* Re: [PATCH v2 09/13] md: raid1: use bio_segments_all()
From: Ming Lei @ 2017-03-02  2:34 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <20170228234212.3wjwxsegg4v57nxu@kernel.org>

Hi Shaohua,

On Wed, Mar 1, 2017 at 7:42 AM, Shaohua Li <shli@kernel.org> wrote:
> On Tue, Feb 28, 2017 at 11:41:39PM +0800, Ming Lei wrote:
>> Use this helper, instead of direct access to .bi_vcnt.
>
> what We really need to do for the behind IO is:
> - allocate memory and copy bio data to the memory
> - let behind bio do IO against the memory
>
> The behind bio doesn't need to have the exactly same bio_vec setting. If we
> just track the new memory, we don't need use the bio_segments_all and access
> bio_vec too.

But we need to figure out how many vecs(each vec store one page) to be
allocated for the cloned/behind bio, and that is the only value of
bio_segments_all() here. Or you have idea to avoid that?

Thanks,
Ming

^ permalink raw reply

* Re: [PATCH v2 06/13] md: raid1: don't use bio's vec table to manage resync pages
From: Ming Lei @ 2017-03-02  2:25 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <20170228233716.v3izfseg3dpdyexc@kernel.org>

Hi Shaohua,

On Wed, Mar 1, 2017 at 7:37 AM, Shaohua Li <shli@kernel.org> wrote:
> On Tue, Feb 28, 2017 at 11:41:36PM +0800, Ming Lei wrote:
>> Now we allocate one page array for managing resync pages, instead
>> of using bio's vec table to do that, and the old way is very hacky
>> and won't work any more if multipage bvec is enabled.
>>
>> The introduced cost is that we need to allocate (128 + 16) * raid_disks
>> bytes per r1_bio, and it is fine because the inflight r1_bio for
>> resync shouldn't be much, as pointed by Shaohua.
>>
>> Also the bio_reset() in raid1_sync_request() is removed because
>> all bios are freshly new now and not necessary to reset any more.
>>
>> This patch can be thought as a cleanup too
>>
>> Suggested-by: Shaohua Li <shli@kernel.org>
>> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
>> ---
>>  drivers/md/raid1.c | 83 ++++++++++++++++++++++++++++++++++--------------------
>>  1 file changed, 53 insertions(+), 30 deletions(-)
>>
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index c442b4657e2f..900144f39630 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -77,6 +77,16 @@ static void lower_barrier(struct r1conf *conf, sector_t sector_nr);
>>  #define raid1_log(md, fmt, args...)                          \
>>       do { if ((md)->queue) blk_add_trace_msg((md)->queue, "raid1 " fmt, ##args); } while (0)
>>
>> +static inline struct resync_pages *get_resync_pages(struct bio *bio)
>> +{
>> +     return bio->bi_private;
>> +}
>> +
>> +static inline struct r1bio *get_resync_r1bio(struct bio *bio)
>> +{
>> +     return get_resync_pages(bio)->raid_bio;
>> +}
>
> This is a weird between bio, r1bio and the resync_pages. I'd like the pages are

It is only a bit weird inside allocating and freeing r1bio, once all
are allocated, you
can see everthing is clean and simple:

    - r1bio includes lots of bioes,
    - and one bio is attached by one resync_pages via .bi_private

> embedded in r1bio. Maybe a pointer of r1bio to the pages. It's cleaner and more
> straightforward.

In theory, the cleanest way is to allocate one resync_pages for each resync bio,
but that isn't efficient. That is why this patch allocates all
resync_pages together
in r1buf_pool_alloc(), and split them into bio.

BTW, the only trick is just that the whole page array is stored in .bi_private
of the 1st bio, then we can avoid to add one pointer into r1bio.

>
> I think the patch 6, 7 8 should be in a same patch. Otherwise bisect will be broken.
>

No, it won't. Both patch 7 and patch 8 just replacing reading from bvec table
with reading from the pre-allocated page array.  Both should be fine, but the
latter is cleaner and simpler to do.

Thanks,
Ming

^ permalink raw reply

* Re: [PATCH v2 05/13] md: raid1: simplify r1buf_pool_free()
From: Ming Lei @ 2017-03-02  2:11 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <20170228233105.npftquaf3fr5sfpl@kernel.org>

Hi Shaohua,

On Wed, Mar 1, 2017 at 7:31 AM, Shaohua Li <shli@kernel.org> wrote:
> On Tue, Feb 28, 2017 at 11:41:35PM +0800, Ming Lei wrote:
>> This patch gets each page's reference of each bio for resync,
>> then r1buf_pool_free() gets simplified a lot.
>>
>> The same policy has been taken in raid10's buf pool allocation/free
>> too.
>
> We are going to delete the code, this simplify isn't really required.

Could you explain it a bit? Or I can put this patch into this patchset if you
can provide one?

Thanks,
Ming Lei

^ permalink raw reply

* Re: [PATCH v2 04/13] md: prepare for managing resync I/O pages in clean way
From: Ming Lei @ 2017-03-02  2:09 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Jens Axboe, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	linux-block, Christoph Hellwig
In-Reply-To: <20170228233011.rdqtde22zwsimbz7@kernel.org>

Hi Shaohua,

On Wed, Mar 1, 2017 at 7:30 AM, Shaohua Li <shli@kernel.org> wrote:
> On Tue, Feb 28, 2017 at 11:41:34PM +0800, Ming Lei wrote:
>> Now resync I/O use bio's bec table to manage pages,
>> this way is very hacky, and may not work any more
>> once multipage bvec is introduced.
>>
>> So introduce helpers and new data structure for
>> managing resync I/O pages more cleanly.
>>
>> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
>> ---
>>  drivers/md/md.h | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 54 insertions(+)
>>
>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>> index 1d63239a1be4..b5a638d85cb4 100644
>> --- a/drivers/md/md.h
>> +++ b/drivers/md/md.h
>> @@ -720,4 +720,58 @@ static inline void mddev_check_writesame(struct mddev *mddev, struct bio *bio)
>>  #define RESYNC_BLOCK_SIZE (64*1024)
>>  #define RESYNC_PAGES ((RESYNC_BLOCK_SIZE + PAGE_SIZE-1) / PAGE_SIZE)
>>
>> +/* for managing resync I/O pages */
>> +struct resync_pages {
>> +     unsigned        idx;    /* for get/put page from the pool */
>> +     void            *raid_bio;
>> +     struct page     *pages[RESYNC_PAGES];
>> +};
>
> I'd like this to be embedded into r1bio directly. Not sure if we really need a
> structure.

There are two reasons we can't put this into r1bio:
- r1bio is used in both normal and resync I/O, not fair to allocate more
in normal I/O, and that is why this patch wouldn't like to touch r1bio or r10bio

- the count of 'struct resync_pages' instance depends on raid_disks(raid1)
or copies(raid10), both can't be decided during compiling.

>
>> +
>> +static inline int resync_alloc_pages(struct resync_pages *rp,
>> +                                  gfp_t gfp_flags)
>> +{
>> +     int i;
>> +
>> +     for (i = 0; i < RESYNC_PAGES; i++) {
>> +             rp->pages[i] = alloc_page(gfp_flags);
>> +             if (!rp->pages[i])
>> +                     goto out_free;
>> +     }
>> +
>> +     return 0;
>> +
>> + out_free:
>> +     while (--i >= 0)
>> +             __free_page(rp->pages[i]);
>> +     return -ENOMEM;
>> +}
>> +
>> +static inline void resync_free_pages(struct resync_pages *rp)
>> +{
>> +     int i;
>> +
>> +     for (i = 0; i < RESYNC_PAGES; i++)
>> +             __free_page(rp->pages[i]);
>
> Since we will use get_page, shouldn't this be put_page?

You are right, will fix in v3.

>
>> +}
>> +
>> +static inline void resync_get_all_pages(struct resync_pages *rp)
>> +{
>> +     int i;
>> +
>> +     for (i = 0; i < RESYNC_PAGES; i++)
>> +             get_page(rp->pages[i]);
>> +}
>> +
>> +static inline struct page *resync_fetch_page(struct resync_pages *rp)
>> +{
>> +     if (WARN_ON_ONCE(rp->idx >= RESYNC_PAGES))
>> +             return NULL;
>> +     return rp->pages[rp->idx++];
>
> I'd like the caller explicitly specify the index instead of a global variable
> to track it, which will make the code more understandable and less error prone.

That is fine, but the index has to be per bio, and finally the index
has to be stored
in 'struct resync_pages', so every user has to call it in the following way:

          resync_fetch_page(rp, rp->idx);

then looks no benefit to pass index explicitly.

>
>> +}
>> +
>> +static inline bool resync_page_available(struct resync_pages *rp)
>> +{
>> +     return rp->idx < RESYNC_PAGES;
>> +}
>
> Then we don't need this obscure API.

That is fine.


Thanks,
Ming Lei

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox