Linux block layer

Linux block layer
 help / color / mirror / Atom feed

* Re: [PATCH] brd: make rd_size static
From: Jens Axboe @ 2017-03-11 22:29 UTC (permalink / raw)
  To: Jason Yan, linux-block; +Cc: miaoxie, zhaohongjiang
In-Reply-To: <1489131159-840-1-git-send-email-yanaijie@huawei.com>

On 03/10/2017 12:32 AM, Jason Yan wrote:
> Fixes the following sparse warning:
> 
> drivers/block/brd.c:411:15: warning: symbol 'rd_size' was not declared.
> Should it be static?

If you do a search on this topic, you'll find others that attempted
to do the same. Arm uses it for tag parsing, for some reason, your
patch below would break it.

It'd be great if this was fixed up for real, though.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH 0/5] Updates following recent generic_make_request improvement
From: Jens Axboe @ 2017-03-11 22:32 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-block, Mike Snitzer, LKML, linux-raid,
	device-mapper development, Mikulas Patocka, Pavel Machek,
	Jack Wang, Lars Ellenberg, Kent Overstreet
In-Reply-To: <148912539296.4002.219258660543808741.stgit@noble>

On 03/09/2017 11:00 PM, NeilBrown wrote:
> This is a rebase of the series I sent earlier, based on the
> very latest from Linus, which included my first patch.
> 
> The first fixes a problem that patch introduced, and so should go to
> Linux promptly.
> The others are more general improvements and can go in the normal
> course of events.
> 
> It is possible that the changes to btrfs and xfs can just be dropped
> as a subsequent patch will be needed to revert them anyway.  They are
> there only to be able to say that "blk: make the bioset
> rescue_workqueue optional." doesn't change any functionality at all.

I have applied the first patch for this series, and added the Fixes
tag. We'll let the rest simmer a bit, then aim for 4.12 for those.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH v5] blkcg: allocate struct blkcg_gq outside request queue spinlock
From: Jens Axboe @ 2017-03-11 22:42 UTC (permalink / raw)
  To: Tahsin Erdogan, Tejun Heo; +Cc: linux-block, David Rientjes, linux-kernel
In-Reply-To: <20170309080531.9048-1-tahsin@google.com>

> @@ -185,31 +187,53 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
>  		goto err_free_blkg;
>  	}
>  
> +	if (drop_locks) {
> +		spin_unlock_irq(q->queue_lock);
> +		rcu_read_unlock();
> +	}

I have a general dislike for code like that, where you conditionally
drop locks. And this one seems even worse, since the knowledge of
whether the locks should/could be dropped or not is embedded in the gfp
flags.

> +/**
> + * blkg_lookup_create - lookup blkg, try to create one if not there
> + *
> + * Performs an initial queue bypass check and then passes control to
> + * __blkg_lookup_create().
> + */
> +struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
> +				    struct request_queue *q, gfp_t gfp,
> +				    const struct blkcg_policy *pol)
> +{
> +	WARN_ON_ONCE(!rcu_read_lock_held());
> +	lockdep_assert_held(q->queue_lock);

This seems problematic, as blkcg_bio_issue_check() calls with the rcu
read lock held.

-- 
Jens Axboe

^ permalink raw reply

* Re: 4.11.0-rc1 boot resulted in WARNING: CPU: 14 PID: 1722 at fs/sysfs/dir.c:31 .sysfs_warn_dup+0x78/0xb0
From: Jens Axboe @ 2017-03-11 22:46 UTC (permalink / raw)
  To: Brian Foster, Abdul Haleem
  Cc: linux-xfs, mpe, linuxppc-dev, linux-kernel, linux-block
In-Reply-To: <20170309125919.GA16713@bfoster.bfoster>

On 03/09/2017 05:59 AM, Brian Foster wrote:
> cc linux-block
> 
> On Thu, Mar 09, 2017 at 04:20:06PM +0530, Abdul Haleem wrote:
>> On Wed, 2017-03-08 at 08:17 -0500, Brian Foster wrote:
>>> On Tue, Mar 07, 2017 at 10:01:04PM +0530, Abdul Haleem wrote:
>>>>
>>>> Hi,
>>>>
>>>> Today's mainline (4.11.0-rc1) booted with warnings on Power7 LPAR.
>>>>
>>>> Issue is not reproducible all the time.

Is that still the case with -git as of yesterday? Check that you
have this merge:

34bbce9e344b47e8871273409632f525973afad4

in your tree.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH v5] blkcg: allocate struct blkcg_gq outside request queue spinlock
From: Jens Axboe @ 2017-03-11 22:52 UTC (permalink / raw)
  To: Tahsin Erdogan, Tejun Heo; +Cc: linux-block, David Rientjes, linux-kernel
In-Reply-To: <bb97baf4-b08a-b909-1bff-7c933b95be4e@kernel.dk>

On 03/11/2017 03:42 PM, Jens Axboe wrote:
>> @@ -185,31 +187,53 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
>>  		goto err_free_blkg;
>>  	}
>>  
>> +	if (drop_locks) {
>> +		spin_unlock_irq(q->queue_lock);
>> +		rcu_read_unlock();
>> +	}
> 
> I have a general dislike for code like that, where you conditionally
> drop locks. And this one seems even worse, since the knowledge of
> whether the locks should/could be dropped or not is embedded in the gfp
> flags.

Talked to Tejun about this as well, and we both agree that the splitting
this into separate init/alloc paths would be much cleaner. I can't
apply the current patch, sorry, it's just too ugly to live.

>> +/**
>> + * blkg_lookup_create - lookup blkg, try to create one if not there
>> + *
>> + * Performs an initial queue bypass check and then passes control to
>> + * __blkg_lookup_create().
>> + */
>> +struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
>> +				    struct request_queue *q, gfp_t gfp,
>> +				    const struct blkcg_policy *pol)
>> +{
>> +	WARN_ON_ONCE(!rcu_read_lock_held());
>> +	lockdep_assert_held(q->queue_lock);
> 
> This seems problematic, as blkcg_bio_issue_check() calls with the rcu
> read lock held.

Brain fart, that part is fine.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH v5] blkcg: allocate struct blkcg_gq outside request queue spinlock
From: Tahsin Erdogan @ 2017-03-12  4:35 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Tejun Heo, linux-block, David Rientjes, linux-kernel
In-Reply-To: <e7d7aaab-3a32-61f6-44dd-3680d7f0c8db@kernel.dk>

On Sat, Mar 11, 2017 at 2:52 PM, Jens Axboe <axboe@kernel.dk> wrote:

>
> Talked to Tejun about this as well, and we both agree that the splitting
> this into separate init/alloc paths would be much cleaner. I can't
> apply the current patch, sorry, it's just too ugly to live.

Do you mean, you prefer the approach that was taken in v1 patch or
something else?

^ permalink raw reply

* Re: [dm-devel] [PATCH 0/5] Updates following recent generic_make_request improvement
From: NeilBrown @ 2017-03-12 21:52 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-raid, Mike Snitzer, LKML, linux-block,
	device-mapper development, Mikulas Patocka, Pavel Machek,
	Jack Wang, Lars Ellenberg, Kent Overstreet
In-Reply-To: <3c5e786a-ba42-ada8-c943-17e8b72c63d1@kernel.dk>

[-- Attachment #1: Type: text/plain, Size: 866 bytes --]

On Sat, Mar 11 2017, Jens Axboe wrote:

> On 03/09/2017 11:00 PM, NeilBrown wrote:
>> This is a rebase of the series I sent earlier, based on the
>> very latest from Linus, which included my first patch.
>> 
>> The first fixes a problem that patch introduced, and so should go to
>> Linux promptly.
>> The others are more general improvements and can go in the normal
>> course of events.
>> 
>> It is possible that the changes to btrfs and xfs can just be dropped
>> as a subsequent patch will be needed to revert them anyway.  They are
>> there only to be able to say that "blk: make the bioset
>> rescue_workqueue optional." doesn't change any functionality at all.
>
> I have applied the first patch for this series, and added the Fixes
> tag. We'll let the rest simmer a bit, then aim for 4.12 for those.
>

Sounds good, thanks.
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH] brd: make rd_size static
From: Jason Yan @ 2017-03-13  0:56 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: miaoxie, zhaohongjiang
In-Reply-To: <4dd2367b-61b9-d390-5bb1-955e90fef0b0@kernel.dk>


On 2017/3/12 6:29, Jens Axboe wrote:
> On 03/10/2017 12:32 AM, Jason Yan wrote:
>> Fixes the following sparse warning:
>>
>> drivers/block/brd.c:411:15: warning: symbol 'rd_size' was not declared.
>> Should it be static?
>
> If you do a search on this topic, you'll find others that attempted
> to do the same. Arm uses it for tag parsing, for some reason, your
> patch below would break it.
>
> It'd be great if this was fixed up for real, though.
>

how about fix this like this, looks ugly but works:

#ifndef CONFIG_ARM
static
#endif

^ permalink raw reply

* Re: [PATCH] brd: make rd_size static
From: Jens Axboe @ 2017-03-13  2:56 UTC (permalink / raw)
  To: Jason Yan, linux-block; +Cc: miaoxie, zhaohongjiang
In-Reply-To: <58C5EE51.4000703@huawei.com>

On 03/12/2017 06:56 PM, Jason Yan wrote:
> 
> On 2017/3/12 6:29, Jens Axboe wrote:
>> On 03/10/2017 12:32 AM, Jason Yan wrote:
>>> Fixes the following sparse warning:
>>>
>>> drivers/block/brd.c:411:15: warning: symbol 'rd_size' was not declared.
>>> Should it be static?
>>
>> If you do a search on this topic, you'll find others that attempted
>> to do the same. Arm uses it for tag parsing, for some reason, your
>> patch below would break it.
>>
>> It'd be great if this was fixed up for real, though.
>>
> 
> how about fix this like this, looks ugly but works:
> 
> #ifndef CONFIG_ARM
> static
> #endif

No, that solves nothing, and is arguably worse. Either leave it
alone, perhaps with a comment to prevent others from making the
same error, or fix it for real.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH rfc 04/10] block: Add a non-selective polling interface
From: Sagi Grimberg @ 2017-03-13  8:15 UTC (permalink / raw)
  To: Bart Van Assche, linux-rdma@vger.kernel.org,
	linux-block@vger.kernel.org, linux-nvme@lists.infradead.org,
	target-devel@vger.kernel.org
In-Reply-To: <1489076712.2597.1.camel@sandisk.com>


>> +int blk_mq_poll_batch(struct request_queue *q, unsigned int batch)
>> +{
>> +	struct blk_mq_hw_ctx *hctx;
>> +
>> +	if (!q->mq_ops || !q->mq_ops->poll_batch)
>> +		return 0;
>> +
>> +	hctx = blk_mq_map_queue(q, smp_processor_id());
>> +	return q->mq_ops->poll_batch(hctx, batch);
>> +}
>> +EXPORT_SYMBOL_GPL(blk_mq_poll_batch);
>
> A new exported function without any documentation? Wow.

I just copied blk_mq_poll export...

> Please add a header
> above this function that documents at least which other completion processing
> code can execute concurrently with this function and from which contexts the
> other completion processing code can be called (e.g. blk_mq_poll() and
> .complete()).

I can do that, I'll document blk_mq_poll too..

> Why to return if (!q->mq_ops || !q->mq_ops->poll_batch)? Shouldn't that be a
> WARN_ON_ONCE() instead? I think it is an error to calling blk_mq_poll_batch()
> against a queue that does not define .poll_batch().

Not really, we don't know if the block driver actually supports 
poll_batch (or poll for that matter). Instead of conditioning in the
call-site, we condition within the call.

Its not really a bug, its harmless.

> Additionally, I think making the hardware context an argument of this function
> instead of using blk_mq_map_queue(q, smp_processor_id()) would make this
> function much more versatile.

What do you mean? remember that the callers interface to the device is
a request queue, it doesn't even know if its a blk-mq device. Can you
explain in more details what you would like to see?

^ permalink raw reply

* Re: [PATCH rfc 06/10] IB/cq: Don't force IB_POLL_DIRECT poll context for ib_process_cq_direct
From: Sagi Grimberg @ 2017-03-13  8:24 UTC (permalink / raw)
  To: Bart Van Assche, linux-rdma@vger.kernel.org,
	linux-block@vger.kernel.org, linux-nvme@lists.infradead.org,
	target-devel@vger.kernel.org
In-Reply-To: <1489077040.2597.3.camel@sandisk.com>


>> polling the completion queue directly does not interfere
>> with the existing polling logic, hence drop the requirement.
>>
>> This can be used for polling mode ULPs.
>>
>> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
>> ---
>>  drivers/infiniband/core/cq.c | 2 --
>>  1 file changed, 2 deletions(-)
>>
>> diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
>> index 21d1a38af489..7f6ae0ecb0c5 100644
>> --- a/drivers/infiniband/core/cq.c
>> +++ b/drivers/infiniband/core/cq.c
>> @@ -64,8 +64,6 @@ static int __ib_process_cq(struct ib_cq *cq, int budget)
>>   */
>>  int ib_process_cq_direct(struct ib_cq *cq, int budget)
>>  {
>> -	WARN_ON_ONCE(cq->poll_ctx != IB_POLL_DIRECT);
>> -
>>  	return __ib_process_cq(cq, budget);
>>  }
>>  EXPORT_SYMBOL(ib_process_cq_direct);
>
> Using ib_process_cq_direct() for queues that have another type than
> IB_POLL_DIRECT may trigger concurrent CQ processing.

Correct.

> Before this patch
> the completions from each CQ were processed sequentially. That's a big
> change so I think this should be mentioned clearly in the header above
> ib_process_cq_direct().

Note that I now see that the cq lock is not sufficient for mutual
exclusion here because we're referencing cq->wc array outside of it.

There are three options I see here:
1. we'll need to allocate a different wc array for polling mode,
perhaps a smaller one?
2. Export __ib_process_cq (in some form) with an option to pass in a wc
array.
3. Simply not support non-selective polling but it seems like a shame...

Any thoughts?

^ permalink raw reply

* Re: [PATCH rfc 04/10] block: Add a non-selective polling interface
From: Sagi Grimberg @ 2017-03-13  8:26 UTC (permalink / raw)
  To: Damien Le Moal, Johannes Thumshirn, linux-block, linux-nvme,
	linux-rdma, target-devel
In-Reply-To: <c5ef3bd9-42dc-53c9-d2f7-2e49399f7a8d@wdc.com>


>> Quite some additional newlines and I'm not really fond of the
>> ->poll_batch() name. It's a bit confusing with ->poll() and we also have
>> irq_poll(). But the only thing that would come to my mind is
>> complete_batch() which "races" with ->complete().
>
> What about ->check_completions()? After all, that is what both ->poll()
> and ->poll_batch do but with a different stop condition, no ?
> So it would also be easy to merge the two: both tag and batch are
> unsigned int which could be called "cookie", and add a flag to tell how
> to interpret it (as a tag or a batch limit).
> e.g. something like:
>
> +typedef int (check_completions_fn)(struct blk_mq_hw_ctx *,
> 		enum blk_mq_check_flags, /* flag (TAG or BATCH) */
> 		unsigned int); /* Target tag or batch limit */
>

I'd rather not to unite poll/poll_batch, but if this is something
that people want I can definitely do it.

^ permalink raw reply

* blk_init_queue vs blk_alloc_queue
From: Umesh Patel @ 2017-03-13  9:14 UTC (permalink / raw)
  To: linux-block@vger.kernel.org

Hello,

I Would like to know difference between blk_init_queue and blk_alloc_queue =
API.
Can anyone explain when to use what ?


Thanks

^ permalink raw reply

* Re: [PATCH 06/16] mmc: core: replace waitqueue with worker
From: Adrian Hunter @ 2017-03-13  9:25 UTC (permalink / raw)
  To: Jens Axboe, Linus Walleij
  Cc: linux-mmc@vger.kernel.org, Ulf Hansson, Paolo Valente,
	Chunyan Zhang, Baolin Wang, linux-block, Christoph Hellwig,
	Arnd Bergmann
In-Reply-To: <8f1f3d4c-410c-2632-c776-4ced363bdb0d@kernel.dk>

On 11/03/17 00:05, Jens Axboe wrote:
> On 03/10/2017 07:21 AM, Adrian Hunter wrote:
>>> Essentially I take out that thread and replace it with this one worker
>>> introduced in this very patch. I agree the driver can block in many ways
>>> and that is why I need to have it running in process context, and this
>>> is what the worker introduced here provides.
>>
>> The last time I looked at the blk-mq I/O scheduler code, it pulled up to
>> qdepth requests from the I/O scheduler and left them on a local list while
>> running ->queue_rq().  That means blocking in ->queue_rq() leaves some
>> number of requests in limbo (not issued but also not in the I/O scheduler)
>> for that time.
> 
> Look again, if we're not handling the requeued dispatches, we pull one
> at the time from the scheduler.
> 

That's good :-)

Now the next thing ;-)

It looks like we either set BLK_MQ_F_BLOCKING and miss the possibility of
issuing synchronous requests immediately, or we don't set BLK_MQ_F_BLOCKING
in which case we are never allowed to sleep in ->queue_rq().  Is that true?

^ permalink raw reply

* Getting kernel crash
From: Umesh Patel @ 2017-03-13 10:05 UTC (permalink / raw)
  To: linux-block@vger.kernel.org

Hi,

I'm calling blk_queue_split kernel function from my driver to split larger =
BIO and that is working fine for block size 4K and 8K but getting below cra=
sh for block size 1M.
Using stable 4.10.1 kernel version


[  890.227865] BUG: unable to handle kernel NULL pointer dereference at 000=
0000000000724
[  890.305384] IP: _raw_spin_lock_irqsave+0x22/0x3f
[  890.343896] PGD 0
[  891.381744] Call Trace:
[  891.415424]  try_to_wake_up+0x40/0x3e0
[  891.449074]  ? mempool_free_slab+0x17/0x20
[  891.483213]  wake_up_process+0x15/0x20
[  891.517492]  blkdev_bio_end_io_simple+0x1d/0x20
[  891.551363]  bio_endio+0x56/0x60


Thx.

^ permalink raw reply

* Re: blk_init_queue vs blk_alloc_queue
From: Christoph Hellwig @ 2017-03-13 13:36 UTC (permalink / raw)
  To: Umesh Patel; +Cc: linux-block@vger.kernel.org
In-Reply-To: <CY1PR0401MB1353F4C37A081FCB3E6E145EE1250@CY1PR0401MB1353.namprd04.prod.outlook.com>

On Mon, Mar 13, 2017 at 09:14:12AM +0000, Umesh Patel wrote:
> Hello,
> 
> I Would like to know difference between blk_init_queue and blk_alloc_queue API.
> Can anyone explain when to use what ?

The short answer is that you should neither.  The long answer is that
you should send a link to your driver source if you want anyone to help
you.

^ permalink raw reply

* [PATCH] blk-mq: Fix tagset reinit in the presence of cpu hot-unplug
From: Sagi Grimberg @ 2017-03-13 14:10 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-nvme

In case cpu was unplugged, we need to make sure not to assume
that the tags for that cpu are still allocated. so check
for null tags when reinitializing a tagset.

Reported-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 block/blk-mq-tag.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index e48bc2c72615..9d97bfc4d465 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -295,6 +295,9 @@ int blk_mq_reinit_tagset(struct blk_mq_tag_set *set)
 	for (i = 0; i < set->nr_hw_queues; i++) {
 		struct blk_mq_tags *tags = set->tags[i];
 
+		if (!tags)
+			continue;
+
 		for (j = 0; j < tags->nr_tags; j++) {
 			if (!tags->static_rqs[j])
 				continue;
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH 06/16] mmc: core: replace waitqueue with worker
From: Jens Axboe @ 2017-03-13 14:19 UTC (permalink / raw)
  To: Adrian Hunter, Linus Walleij
  Cc: linux-mmc@vger.kernel.org, Ulf Hansson, Paolo Valente,
	Chunyan Zhang, Baolin Wang, linux-block, Christoph Hellwig,
	Arnd Bergmann
In-Reply-To: <e4b2c4bc-8627-2234-354d-c986f69b220f@intel.com>

On 03/13/2017 03:25 AM, Adrian Hunter wrote:
> On 11/03/17 00:05, Jens Axboe wrote:
>> On 03/10/2017 07:21 AM, Adrian Hunter wrote:
>>>> Essentially I take out that thread and replace it with this one worker
>>>> introduced in this very patch. I agree the driver can block in many ways
>>>> and that is why I need to have it running in process context, and this
>>>> is what the worker introduced here provides.
>>>
>>> The last time I looked at the blk-mq I/O scheduler code, it pulled up to
>>> qdepth requests from the I/O scheduler and left them on a local list while
>>> running ->queue_rq().  That means blocking in ->queue_rq() leaves some
>>> number of requests in limbo (not issued but also not in the I/O scheduler)
>>> for that time.
>>
>> Look again, if we're not handling the requeued dispatches, we pull one
>> at the time from the scheduler.
>>
> 
> That's good :-)
> 
> Now the next thing ;-)
> 
> It looks like we either set BLK_MQ_F_BLOCKING and miss the possibility of
> issuing synchronous requests immediately, or we don't set BLK_MQ_F_BLOCKING
> in which case we are never allowed to sleep in ->queue_rq().  Is that true?

Only one of those statements is true - if you don't set BLK_MQ_F_BLOCKING,
then you may never block in your ->queue_rq() function. But if you do set it,
it does not preclude immediate issue of sync requests.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH] blk-mq: Fix tagset reinit in the presence of cpu hot-unplug
From: Jens Axboe @ 2017-03-13 14:23 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-block, linux-nvme
In-Reply-To: <1489414211-30437-1-git-send-email-sagi@grimberg.me>

On 03/13/2017 08:10 AM, Sagi Grimberg wrote:
> In case cpu was unplugged, we need to make sure not to assume
> that the tags for that cpu are still allocated. so check
> for null tags when reinitializing a tagset.

Thanks Sagi, that seems reasonable. Applied for 4.11.

-- 
Jens Axboe

^ permalink raw reply

* RE: blk_init_queue vs blk_alloc_queue
From: Umesh Patel @ 2017-03-13 14:23 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-block@vger.kernel.org
In-Reply-To: <20170313133629.GA29232@infradead.org>

Currently we are using below API to initialize request queues for our devic=
e which is not calling blk_queue_split. But there is one more function blk_=
init_queues which is internally calling blk_queue_split.=20
So I was thinking to replace blk_alloc_queue with blk_init_queue but since =
I was not aware of impact I shoot this mail.
I wanted to use blk_queue_split to split BIO that block layer is sending si=
nce it it too large.

Thx

dev->osdev.queue =3D blk_alloc_queue(GFP_KERNEL);

-----Original Message-----
From: Christoph Hellwig [mailto:hch@infradead.org]=20
Sent: Monday, March 13, 2017 7:06 PM
To: Umesh Patel
Cc: linux-block@vger.kernel.org
Subject: Re: blk_init_queue vs blk_alloc_queue

On Mon, Mar 13, 2017 at 09:14:12AM +0000, Umesh Patel wrote:
> Hello,
>=20
> I Would like to know difference between blk_init_queue and blk_alloc_queu=
e API.
> Can anyone explain when to use what ?

The short answer is that you should neither.  The long answer is that you s=
hould send a link to your driver source if you want anyone to help you.

^ permalink raw reply

* Re: [PATCH v5] blkcg: allocate struct blkcg_gq outside request queue spinlock
From: Jens Axboe @ 2017-03-13 14:32 UTC (permalink / raw)
  To: Tahsin Erdogan; +Cc: Tejun Heo, linux-block, David Rientjes, linux-kernel
In-Reply-To: <CAAeU0aN6615BUP+ikqZeTm1=BSEN48uDu3Ucz8o9QrmETnkHSA@mail.gmail.com>

On 03/11/2017 09:35 PM, Tahsin Erdogan wrote:
> On Sat, Mar 11, 2017 at 2:52 PM, Jens Axboe <axboe@kernel.dk> wrote:
> 
>>
>> Talked to Tejun about this as well, and we both agree that the splitting
>> this into separate init/alloc paths would be much cleaner. I can't
>> apply the current patch, sorry, it's just too ugly to live.
> 
> Do you mean, you prefer the approach that was taken in v1 patch or
> something else?

I can no longer find v1 of the patch, just v2 and on. Can you send a
link to it?

-- 
Jens Axboe

^ permalink raw reply

* [PATCH 01/11] block: Fix bdi assignment to bdev inode when racing with disk delete
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara
In-Reply-To: <20170313151410.5586-1-jack@suse.cz>

When disk->fops->open() in __blkdev_get() returns -ERESTARTSYS, we
restart the process of opening the block device. However we forget to
switch bdev->bd_bdi back to noop_backing_dev_info and as a result bdev
inode will be pointing to a stale bdi. Fix the problem by setting
bdev->bd_bdi later when __blkdev_get() is already guaranteed to succeed.

Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2eca00ec4370..53e2389ae4d4 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1556,8 +1556,6 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 		bdev->bd_disk = disk;
 		bdev->bd_queue = disk->queue;
 		bdev->bd_contains = bdev;
-		if (bdev->bd_bdi == &noop_backing_dev_info)
-			bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
 
 		if (!partno) {
 			ret = -ENXIO;
@@ -1622,6 +1620,9 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 			}
 			bd_set_size(bdev, (loff_t)bdev->bd_part->nr_sects << 9);
 		}
+
+		if (bdev->bd_bdi == &noop_backing_dev_info)
+			bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
 	} else {
 		if (bdev->bd_contains == bdev) {
 			ret = 0;
@@ -1653,8 +1654,6 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 	bdev->bd_disk = NULL;
 	bdev->bd_part = NULL;
 	bdev->bd_queue = NULL;
-	bdi_put(bdev->bd_bdi);
-	bdev->bd_bdi = &noop_backing_dev_info;
 	if (bdev != bdev->bd_contains)
 		__blkdev_put(bdev->bd_contains, mode, 1);
 	bdev->bd_contains = NULL;
-- 
2.10.2

^ permalink raw reply related

* [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara
In-Reply-To: <20170313151410.5586-1-jack@suse.cz>

blkdev_open() may race with gendisk shutdown in two different ways.
Either del_gendisk() has already unhashed block device inode (and thus
bd_acquire() will end up creating new block device inode) however
gen_gendisk() will still return the gendisk that is being destroyed.
Or bdev returned by bd_acquire() will get unhashed and gendisk destroyed
before we get to get_gendisk() and get_gendisk() will return new gendisk
that got allocated for device that reused the device number.

In both cases this will result in possible inconsistencies between
bdev->bd_disk and bdev->bd_bdi (in the first case after gendisk gets
destroyed and device number reused, in the second case immediately).

Fix the problem by checking whether the gendisk is still alive and inode
hashed when associating bdev inode with it and its bdi. That way we are
sure that we will not associate bdev inode with disk that got past
blk_unregister_region() in del_gendisk() (and thus device number can get
reused). Similarly, we will not associate bdev that was once associated
with gendisk that is going away (and thus the corresponding bdev inode
will get unhashed in del_gendisk()) with a new gendisk that is just
reusing the device number.

Also add a warning that will tell us about unexpected inconsistencies
between bdi associated with the bdev inode and bdi associated with the
disk.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 53e2389ae4d4..5ec8750f5332 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1560,7 +1560,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 		if (!partno) {
 			ret = -ENXIO;
 			bdev->bd_part = disk_get_part(disk, partno);
-			if (!bdev->bd_part)
+			if (!(disk->flags & GENHD_FL_UP) || !bdev->bd_part ||
+			    inode_unhashed(bdev->bd_inode))
 				goto out_clear;

 			ret = 0;
@@ -1614,7 +1615,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 			bdev->bd_contains = whole;
 			bdev->bd_part = disk_get_part(disk, partno);
 			if (!(disk->flags & GENHD_FL_UP) ||
-			    !bdev->bd_part || !bdev->bd_part->nr_sects) {
+			    !bdev->bd_part || !bdev->bd_part->nr_sects ||
+			    inode_unhashed(bdev->bd_inode)) {
 				ret = -ENXIO;
 				goto out_clear;
 			}
@@ -1623,6 +1625,9 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)

 		if (bdev->bd_bdi == &noop_backing_dev_info)
 			bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
+		else
+			WARN_ON_ONCE(bdev->bd_bdi !=
+				     disk->queue->backing_dev_info);
 	} else {
 		if (bdev->bd_contains == bdev) {
 			ret = 0;
-- 
2.10.2

^ permalink raw reply related

* [PATCH 03/11] bdi: Mark congested->bdi as internal
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara
In-Reply-To: <20170313151410.5586-1-jack@suse.cz>

congested->bdi pointer is used only to be able to remove congested
structure from bdi->cgwb_congested_tree on structure release. Moreover
the pointer can become NULL when we unregister the bdi. Rename the field
to __bdi and add a comment to make it more explicit this is internal
stuff of memcg writeback code and people should not use the field as
such use will be likely race prone.

We do not bother with converting congested->bdi to a proper refcounted
reference. It will be slightly ugly to special-case bdi->wb.congested to
avoid effectively a cyclic reference of bdi to itself and the reference
gets cleared from bdi_unregister() making it impossible to reference
a freed bdi.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/backing-dev-defs.h |  4 +++-
 mm/backing-dev.c                 | 10 +++++-----
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index ad955817916d..8fb3dcdebc80 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -54,7 +54,9 @@ struct bdi_writeback_congested {
 	atomic_t refcnt;		/* nr of attached wb's and blkg */
 
 #ifdef CONFIG_CGROUP_WRITEBACK
-	struct backing_dev_info *bdi;	/* the associated bdi */
+	struct backing_dev_info *__bdi;	/* the associated bdi, set to NULL
+					 * on bdi unregistration. For memcg-wb
+					 * internal use only! */
 	int blkcg_id;			/* ID of the associated blkcg */
 	struct rb_node rb_node;		/* on bdi->cgwb_congestion_tree */
 #endif
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index c6f2a37028c2..12408f86783c 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -438,7 +438,7 @@ wb_congested_get_create(struct backing_dev_info *bdi, int blkcg_id, gfp_t gfp)
 		return NULL;
 
 	atomic_set(&new_congested->refcnt, 0);
-	new_congested->bdi = bdi;
+	new_congested->__bdi = bdi;
 	new_congested->blkcg_id = blkcg_id;
 	goto retry;
 
@@ -466,10 +466,10 @@ void wb_congested_put(struct bdi_writeback_congested *congested)
 	}
 
 	/* bdi might already have been destroyed leaving @congested unlinked */
-	if (congested->bdi) {
+	if (congested->__bdi) {
 		rb_erase(&congested->rb_node,
-			 &congested->bdi->cgwb_congested_tree);
-		congested->bdi = NULL;
+			 &congested->__bdi->cgwb_congested_tree);
+		congested->__bdi = NULL;
 	}
 
 	spin_unlock_irqrestore(&cgwb_lock, flags);
@@ -752,7 +752,7 @@ static void cgwb_bdi_exit(struct backing_dev_info *bdi)
 			rb_entry(rbn, struct bdi_writeback_congested, rb_node);
 
 		rb_erase(rbn, &bdi->cgwb_congested_tree);
-		congested->bdi = NULL;	/* mark @congested unlinked */
+		congested->__bdi = NULL;	/* mark @congested unlinked */
 	}
 	spin_unlock_irq(&cgwb_lock);
 }
-- 
2.10.2

^ permalink raw reply related

* [PATCH 04/11] bdi: Make wb->bdi a proper reference
From: Jan Kara @ 2017-03-13 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Tejun Heo, Tahsin Erdogan, Omar Sandoval,
	Jan Kara
In-Reply-To: <20170313151410.5586-1-jack@suse.cz>

Make wb->bdi a proper refcounted reference to bdi for all bdi_writeback
structures except for the one embedded inside struct backing_dev_info.
That will allow us to simplify bdi unregistration.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/backing-dev.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 12408f86783c..03d4ba27c133 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -294,6 +294,8 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 
 	memset(wb, 0, sizeof(*wb));
 
+	if (wb != &bdi->wb)
+		bdi_get(bdi);
 	wb->bdi = bdi;
 	wb->last_old_flush = jiffies;
 	INIT_LIST_HEAD(&wb->b_dirty);
@@ -314,8 +316,10 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 	wb->dirty_sleep = jiffies;
 
 	wb->congested = wb_congested_get_create(bdi, blkcg_id, gfp);
-	if (!wb->congested)
-		return -ENOMEM;
+	if (!wb->congested) {
+		err = -ENOMEM;
+		goto out_put_bdi;
+	}
 
 	err = fprop_local_init_percpu(&wb->completions, gfp);
 	if (err)
@@ -335,6 +339,9 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 	fprop_local_destroy_percpu(&wb->completions);
 out_put_cong:
 	wb_congested_put(wb->congested);
+out_put_bdi:
+	if (wb != &bdi->wb)
+		bdi_put(bdi);
 	return err;
 }
 
@@ -372,6 +379,8 @@ static void wb_exit(struct bdi_writeback *wb)
 
 	fprop_local_destroy_percpu(&wb->completions);
 	wb_congested_put(wb->congested);
+	if (wb != &wb->bdi->wb)
+		bdi_put(wb->bdi);
 }
 
 #ifdef CONFIG_CGROUP_WRITEBACK
-- 
2.10.2

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox