Linux block layer

* Re: [PATCH 03/23] sd: implement REQ_OP_WRITE_ZEROES
From: Paolo Bonzini @ 2017-03-29 14:51 UTC (permalink / raw)
  To: Bart Van Assche, agk@redhat.com, lars.ellenberg@linbit.com,
	snitzer@redhat.com, hch@lst.de, martin.petersen@oracle.com,
	philipp.reisner@linbit.com, axboe@kernel.dk, shli@kernel.org
  Cc: linux-scsi@vger.kernel.org, dm-devel@redhat.com,
	drbd-dev@lists.linbit.com, linux-block@vger.kernel.org,
	linux-raid@vger.kernel.org
In-Reply-To: <1490726988.2573.16.camel@sandisk.com>

On 28/03/2017 20:50, Bart Van Assche wrote:
> 
> This means that just like the start and end of a discard must be aligned on a
> discard_granularity boundary, WRITE SAME commands with the UNMAP bit set must
> also respect that granularity. I think this means that either
> __blkdev_issue_zeroout() has to be modified such that it rejects unaligned
> REQ_OP_WRITE_ZEROES operations or that blk_bio_write_same_split() has to be
> modified such that it generates REQ_OP_WRITEs for the unaligned start and tail.

I don't think this is the case.

Rather, Linux should try to align the WRITE SAME commands to the optimal
unmap granularity if the zeroed area requires performing more than one
WRITE SAME command (i.e. > maximum write same length or too large to fit
in the CDB).  However, even in that case it can use WRITE SAME with
UNMAP for the unaligned start and tail.  Unlike the UNMAP command, the
SCSI standard does guarantee that zeroes are written in the unaligned parts.
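
As a rough illustration of the alignment suggested above, the sketch below picks the length of each WRITE SAME command so that, when a range needs more than one command, every command except the last ends on a granularity boundary. It is plain user-space C, not kernel code. The function name next_write_same_len and its parameters are hypothetical, and the unaligned first and last pieces are simply sent as they are, on the assumption stated above that the device writes zeroes wherever it cannot unmap.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code: choose the length (in sectors) of
 * the next WRITE SAME command for a zeroing range that begins at `start`
 * and has `remaining` sectors left.  `max_len` is the largest length one
 * command may cover, `granularity` is the optimal unmap granularity.
 */
uint64_t next_write_same_len(uint64_t start, uint64_t remaining,
                             uint64_t max_len, uint64_t granularity)
{
    uint64_t len, end;

    assert(max_len > 0 && granularity > 0);

    /* The whole remainder fits in one command: nothing to align. */
    if (remaining <= max_len)
        return remaining;

    /*
     * More commands will follow, so trim this one to end on a granularity
     * boundary, unless trimming would leave nothing to send.
     */
    len = max_len;
    end = start + len;
    if (end % granularity != 0 && end - end % granularity > start)
        len -= end % granularity;
    return len;
}
```

With a start of 3, 100 sectors remaining, a 32-sector limit and a granularity of 8, the first command covers 29 sectors, so the following command starts at sector 32.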

Paolo

^ permalink raw reply

* Re: [PATCH 23/23] block: remove the discard_zeroes_data flag
From: Paolo Bonzini @ 2017-03-29 14:52 UTC (permalink / raw)
  To: Bart Van Assche, agk@redhat.com, lars.ellenberg@linbit.com,
	snitzer@redhat.com, hch@lst.de, martin.petersen@oracle.com,
	philipp.reisner@linbit.com, axboe@kernel.dk, shli@kernel.org
  Cc: linux-scsi@vger.kernel.org, dm-devel@redhat.com,
	drbd-dev@lists.linbit.com, linux-block@vger.kernel.org,
	linux-raid@vger.kernel.org
In-Reply-To: <1490720411.2573.11.camel@sandisk.com>



On 28/03/2017 19:00, Bart Van Assche wrote:
> On Thu, 2017-03-23 at 10:33 -0400, Christoph Hellwig wrote:
>> Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can
>> kill this hack.
>>
>> [ ... ]
>>
>> diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block
>> index 2da04ce6aeef..dea212db9df3 100644
>> --- a/Documentation/ABI/testing/sysfs-block
>> +++ b/Documentation/ABI/testing/sysfs-block
>> @@ -213,14 +213,8 @@ What:		/sys/block/<disk>/queue/discard_zeroes_data
>>  Date:		May 2011
>>  Contact:	Martin K. Petersen <martin.petersen@oracle.com>
>>  Description:
>> -		Devices that support discard functionality may return
>> -		stale or random data when a previously discarded block
>> -		is read back. This can cause problems if the filesystem
>> -		expects discarded blocks to be explicitly cleared. If a
>> -		device reports that it deterministically returns zeroes
>> -		when a discarded area is read the discard_zeroes_data
>> -		parameter will be set to one. Otherwise it will be 0 and
>> -		the result of reading a discarded area is undefined.
>> +		Will always return 0.  Don't rely on any specific behavior
>> +		for discards, and don't read this file.
>>  
>>  What:		/sys/block/<disk>/queue/write_same_max_bytes
>>  Date:		January 2012
>>
>> [ ... ]
>>
>> --- a/block/blk-sysfs.c
>> +++ b/block/blk-sysfs.c
>> @@ -208,7 +208,7 @@ static ssize_t queue_discard_max_store(struct request_queue *q,
>>  
>>  static ssize_t queue_discard_zeroes_data_show(struct request_queue *q, char *page)
>>  {
>> -	return queue_var_show(queue_discard_zeroes_data(q), page);
>> +	return 0;
>>  }
> 
> Hello Christoph,
> 
> It seems to me like the documentation in Documentation/ABI/testing/sysfs-block
> and the above code are not in sync. I think the above code will cause reading
> from the discard_zeroes_data attribute to return an empty string ("") instead
> of "0\n".
> 
> BTW, my personal preference is to remove this attribute entirely because keeping
> it will cause confusion, no matter how well we document the behavior of this
> attribute.

If you remove it, you should probably remove the BLKDISCARDZEROES ioctl too.
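
Bart's remark about the empty string, quoted above, follows from how sysfs show callbacks work: the return value is the number of bytes placed in the page buffer. The user-space sketch below contrasts the two behaviours. Both function names and SKETCH_PAGE_SIZE are made up for illustration, and neither function is the kernel's code.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

#define SKETCH_PAGE_SIZE 4096  /* stand-in for the kernel's PAGE_SIZE */

/* Mirrors the patched function: nothing is written, zero bytes reported. */
ssize_t show_returning_zero(char *page)
{
    (void)page;
    return 0;
}

/* Writes "0\n" into the buffer and reports its length (2 bytes). */
ssize_t show_printing_zero(char *page)
{
    return snprintf(page, SKETCH_PAGE_SIZE, "%d\n", 0);
}
```

A reader of the attribute sees only as many bytes as the callback reports, so the first variant reads as an empty file while the second reads as "0" followed by a newline.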

That said, the issue with discard_zeroes_data is that it is badly
defined; it was defined as "if I unmap X, will it read as zeroes?" but
this is not how the SCSI standard defines e.g. the UNMAP command with
LBPRZ=1.  But knowing something like LBPRZ ("if something is unmapped,
will it read as zeroes?") _would_ actually be useful for userspace.
This will be especially true once sd maps lseek(SEEK_HOLE/SEEK_DATA) to
the SCSI GET LBA STATUS command, or once dm-thin supports them.

Secondarily, if the former returns 1, userspace is also interested in
knowing "can REQ_OP_WRITE_ZEROES+REQ_UNMAP ever unmap anything?", i.e.
whether BLKDEV_ZERO_NOFALLBACK will ever return anything but
-EOPNOTSUPP.  For SCSI, this should intuitively mean whether LBPWS or
LBPWS10 are set, but the details depend on how the sd driver implements
REQ_OP_WRITE_ZEROES with REQ_UNMAP.

Paolo

^ permalink raw reply

* Re: RFC: always use REQ_OP_WRITE_ZEROES for zeroing offload
From: Paolo Bonzini @ 2017-03-29 14:57 UTC (permalink / raw)
  To: Mike Snitzer, Christoph Hellwig, Jens Axboe, Martin K. Petersen,
	Alasdair G Kergon, shli@kernel.org, philipp.reisner@linbit.com,
	linux-block@vger.kernel.org, Linux SCSI List,
	drbd-dev@lists.linbit.com, dm, linux-raid@vger.kernel.org
In-Reply-To: <20170323225256.GK1138@soda.linbit>

On 23/03/2017 23:53, Lars Ellenberg wrote:
> Thin does not claim to zero data on discard.  which is ok, and correct,
> because it only punches holes on full chunks (or whatever you call
> them), and leaves the rest in the mapping tree as is.
> 
> And that behaviour would prevent DRBD from exposing discards if
> configured on top of thin. (see above)
> 
> But thin *could* easily guarantee zeroing, by simply punching holes
> where it can, and zeroing out the not fully-aligned partial start and
> end of the range.

That's the difference between REQ_OP_DISCARD (only punches holes on full
chunks) and REQ_OP_WRITE_ZEROES with the REQ_UNMAP flag (punches holes +
zeroes incomplete chunks).

dm-thinp's REQ_OP_DISCARD should not do anything for unaligned parts.
Instead, layers above should use REQ_OP_WRITE_ZEROES (with or without
REQ_UNMAP, as required) if they need zeroes.  dm-thinp would have to
split off the partial chunks, and zero them in the lower-level device
with REQ_OP_WRITE_ZEROES.
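
The splitting described above can be sketched in user-space C as follows. The struct zero_split type and the split_zero_range function are hypothetical names, offsets and lengths are in arbitrary units, and `chunk` stands for the thin-pool chunk size. This is only an illustration of the arithmetic, not dm-thinp code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of how one zeroing request is split up. */
struct zero_split {
    uint64_t head_len;   /* partial chunk at the start: has to be zeroed */
    uint64_t punch_len;  /* whole chunks in the middle: can be unmapped  */
    uint64_t tail_len;   /* partial chunk at the end: has to be zeroed   */
};

struct zero_split split_zero_range(uint64_t start, uint64_t len, uint64_t chunk)
{
    struct zero_split s = { 0, 0, 0 };
    uint64_t end = start + len;
    uint64_t first_full, last_full;

    assert(chunk > 0);

    first_full = (start + chunk - 1) / chunk * chunk;  /* round start up */
    last_full = end / chunk * chunk;                   /* round end down */

    if (first_full >= last_full) {
        /* The range covers no whole chunk: all of it has to be zeroed. */
        s.head_len = len;
        return s;
    }
    s.head_len = first_full - start;
    s.punch_len = last_full - first_full;
    s.tail_len = end - last_full;
    return s;
}
```

For a request starting at 5 with length 30 and a chunk size of 8, this gives a 3-unit head, 24 units of whole chunks to punch, and a 3-unit tail.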

Paolo

^ permalink raw reply

* Re: [PATCH 12/23] sd: handle REQ_UNMAP
From: Paolo Bonzini @ 2017-03-29 14:57 UTC (permalink / raw)
  To: Bart Van Assche, Alasdair G Kergon, lars.ellenberg@linbit.com,
	Mike Snitzer, Christoph Hellwig, Martin K. Petersen,
	philipp.reisner@linbit.com, Jens Axboe, shli@kernel.org
  Cc: linux-block@vger.kernel.org, linux-raid@vger.kernel.org, dm,
	Linux SCSI List, drbd-dev@lists.linbit.com
In-Reply-To: <1490719722.2573.8.camel@sandisk.com>



On 28/03/2017 18:48, Bart Van Assche wrote:
>> +	if (rq->cmd_flags & REQ_UNMAP) {
>> +		switch (sdkp->provisioning_mode) {
>> +		case SD_LBP_WS16:
>> +			return sd_setup_write_same16_cmnd(cmd, true);
>> +		case SD_LBP_WS10:
>> +			return sd_setup_write_same10_cmnd(cmd, true);
>> +		}
>> +	}
>> +
>>  	if (sdp->no_write_same)
>>  		return BLKPREP_INVALID;
>>  	if (sdkp->ws16 || sector > 0xffffffff || nr_sectors > 0xffff)
> Users can change the provisioning mode from user space from SD_LBP_WS16 into
> SD_LBP_WS10 so I'm not sure it's safe to skip the (sdkp->ws16 || sector >
> 0xffffffff || nr_sectors > 0xffff) check if REQ_UNMAP is set.

Yeah, if REQ_UNMAP is set you should probably check sdkp->provisioning_mode
instead of sdkp->ws16, but apart from this it should still go through the
checks below.

Plus, if the provisioning mode is not ws10 or ws16, should
sd_setup_write_zeroes_cmnd:

1) do a WRITE SAME without UNMAP (what Christoph's code does)

2) return BLKPREP_INVALID

3) ignore provisioning mode and do a WRITE SAME with UNMAP

4) do a WRITE SAME without UNMAP for SD_LBP_{ZERO,FULL,DISABLE},
do a WRITE SAME with UNMAP for SD_LBP_{WS10,WS16,UNMAP}.

I'm in favor of (4).  The distinction between SD_LBP_UNMAP, SD_LBP_WS10
and SD_LBP_WS16 is as problematic as discard_zeroes_data in my opinion.
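
To make option (4) concrete, here is a small user-space sketch. The enum below is a local stand-in for the provisioning modes named above rather than the sd driver's own definition, and both function names are invented for the example. The second helper mirrors the size check from the quoted patch, which follows from WRITE SAME (10) carrying a 32-bit LBA and a 16-bit block count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the provisioning modes discussed in this thread. */
enum lbp_mode { LBP_FULL, LBP_UNMAP, LBP_WS16, LBP_WS10, LBP_ZERO, LBP_DISABLE };

/* Option (4): set the UNMAP bit only for the modes that permit unmapping. */
bool write_zeroes_wants_unmap(enum lbp_mode mode, bool req_unmap)
{
    if (!req_unmap)
        return false;
    switch (mode) {
    case LBP_UNMAP:
    case LBP_WS16:
    case LBP_WS10:
        return true;
    default:
        return false;
    }
}

/* True when the request cannot be expressed as a WRITE SAME (10) command. */
bool needs_write_same16(bool ws16, uint64_t sector, uint64_t nr_sectors)
{
    return ws16 || sector > 0xffffffffULL || nr_sectors > 0xffff;
}
```

Keeping the two decisions separate means the size check still applies whether or not the UNMAP bit ends up set.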

Thanks,

Paolo

^ permalink raw reply

* Re: [PATCH 03/23] sd: implement REQ_OP_WRITE_ZEROES
From: Bart Van Assche @ 2017-03-29 16:28 UTC (permalink / raw)
  To: agk@redhat.com, lars.ellenberg@linbit.com, snitzer@redhat.com,
	hch@lst.de, martin.petersen@oracle.com,
	philipp.reisner@linbit.com, axboe@kernel.dk, pbonzini@redhat.com,
	shli@kernel.org
  Cc: linux-scsi@vger.kernel.org, dm-devel@redhat.com,
	drbd-dev@lists.linbit.com, linux-block@vger.kernel.org,
	linux-raid@vger.kernel.org
In-Reply-To: <e19ca1bf-0908-1eb0-cb04-cd306692f216@redhat.com>

On Wed, 2017-03-29 at 16:51 +0200, Paolo Bonzini wrote:
> On 28/03/2017 20:50, Bart Van Assche wrote:
> > This means that just like the start and end of a discard must be aligned on a
> > discard_granularity boundary, WRITE SAME commands with the UNMAP bit set must
> > also respect that granularity. I think this means that either
> > __blkdev_issue_zeroout() has to be modified such that it rejects unaligned
> > REQ_OP_WRITE_ZEROES operations or that blk_bio_write_same_split() has to be
> > modified such that it generates REQ_OP_WRITEs for the unaligned start and tail.
>
> I don't think this is the case.

Hello Paolo,

Can you cite the section(s) from the SCSI specs that support your view? I
reread the "5.49 WRITE SAME (10) command" and "4.7.3.4.4 WRITE SAME command
and unmap operations" sections but I have not found any explicit statement
that specifies the behavior for unaligned WRITE SAME commands with the UNMAP
bit set. It seems to me like the OPTIMAL UNMAP GRANULARITY parameter was
overlooked when both sections were written. Should we ask the T10 committee
for a clarification?

Another question is, if the specification of WRITE SAME + UNMAP would be
made unambiguous in the SBC document, whether or not we should take the risk
to trigger behavior that is not what we expect by sending unaligned WRITE
SAME + UNMAP commands to SCSI devices?

Thanks,

Bart.

^ permalink raw reply

* Re: [PATCH 03/23] sd: implement REQ_OP_WRITE_ZEROES
From: Paolo Bonzini @ 2017-03-29 16:53 UTC (permalink / raw)
  To: Bart Van Assche, agk@redhat.com, lars.ellenberg@linbit.com,
	snitzer@redhat.com, hch@lst.de, martin.petersen@oracle.com,
	philipp.reisner@linbit.com, axboe@kernel.dk, shli@kernel.org
  Cc: linux-scsi@vger.kernel.org, dm-devel@redhat.com,
	drbd-dev@lists.linbit.com, linux-block@vger.kernel.org,
	linux-raid@vger.kernel.org
In-Reply-To: <1490804856.3551.3.camel@sandisk.com>



On 29/03/2017 18:28, Bart Van Assche wrote:
> On Wed, 2017-03-29 at 16:51 +0200, Paolo Bonzini wrote:
>> On 28/03/2017 20:50, Bart Van Assche wrote:
>>> This means that just like the start and end of a discard must be aligned on a
>>> discard_granularity boundary, WRITE SAME commands with the UNMAP bit set must
>>> also respect that granularity. I think this means that either
>>> __blkdev_issue_zeroout() has to be modified such that it rejects unaligned
>>> REQ_OP_WRITE_ZEROES operations or that blk_bio_write_same_split() has to be
>>> modified such that it generates REQ_OP_WRITEs for the unaligned start and tail.
>>
>> I don't think this is the case.
> 
> Hello Paolo,
> 
> Can you cite the section(s) from the SCSI specs that support your view? I
> reread the "5.49 WRITE SAME (10) command" and "4.7.3.4.4 WRITE SAME command
> and unmap operations" sections but I have not found any explicit statement
> that specifies the behavior for unaligned WRITE SAME commands with the UNMAP
> bit set. It seems to me like the OPTIMAL UNMAP GRANULARITY parameter was
> overlooked when both sections were written. Should we ask the T10 committee
> for a clarification?

From 4.7.3.4.4:

------
If unmap operations are requested in a WRITE SAME command,
then for each specified LBA:

if the Data-Out Buffer of the WRITE SAME command is the same as the
logical block data returned by a read operation from that LBA while in
the unmapped state (see 4.7.4.5), then:

1) the device server performs the actions described in table 6; and

2) if an unmap operation is not performed in step 1), then the device
server shall perform the specified write operation to that LBA;
------

and from the description of WRITE SAME (10): "subsequent read operations
behave as if the device server wrote the single block of user data
received from the Data-Out Buffer to each logical block without
modification" (I have a slightly older copy though, it's 5.45 here).

It's pretty unambiguous that if the device cannot unmap (including the
case where the request is misaligned with respect to the granularity) it
does a write.

> Another question is, if the specification of WRITE SAME + UNMAP would be
> made unambiguous in the SBC document, whether or not we should take the risk
> to trigger behavior that is not what we expect by sending unaligned WRITE
> SAME + UNMAP commands to SCSI devices?

Yes, I think we should.

Paolo

^ permalink raw reply

* kmemleak complaints on request queue stats (virtio)
From: Sagi Grimberg @ 2017-03-29 17:02 UTC (permalink / raw)
  To: linux-block@vger.kernel.org

Hi,

I just got the below kmemleak report. Just thought I'd send it
out as I don't have time to look into it just now...

--
unreferenced object 0xffff8ab236717920 (size 32):
   comm "swapper/0", pid 1, jiffies 4294892551 (age 9966.044s)
   hex dump (first 32 bytes):
     20 79 71 36 b2 8a ff ff 20 79 71 36 b2 8a ff ff   yq6.... yq6....
     00 00 00 00 ff ff ff ff e0 c1 4b 9b ff ff ff ff  ..........K.....
   backtrace:
     [<ffffffff9b84212a>] kmemleak_alloc+0x4a/0xa0
     [<ffffffff9b1ff780>] kmem_cache_alloc_trace+0x110/0x230
     [<ffffffff9b3cd74f>] blk_alloc_queue_stats+0x1f/0x40
     [<ffffffff9b3bacd4>] blk_alloc_queue_node+0x94/0x2e0
     [<ffffffff9b3cbdd0>] blk_mq_init_queue+0x20/0x60
     [<ffffffff9b593665>] loop_add+0xe5/0x270
     [<ffffffff9bfe348b>] loop_init+0x10b/0x149
     [<ffffffff9b002193>] do_one_initcall+0x53/0x1a0
     [<ffffffff9bf8b133>] kernel_init_freeable+0x16d/0x1f3
     [<ffffffff9b83ed0e>] kernel_init+0xe/0x100
     [<ffffffff9b84d27c>] ret_from_fork+0x2c/0x40
     [<ffffffffffffffff>] 0xffffffffffffffff
unreferenced object 0xffff8ab23666d440 (size 32):
   comm "swapper/0", pid 1, jiffies 4294892552 (age 9966.040s)
   hex dump (first 32 bytes):
     40 d4 66 36 b2 8a ff ff 40 d4 66 36 b2 8a ff ff  @.f6....@.f6....
     00 00 00 00 00 00 00 00 2a 21 84 9b ff ff ff ff  ........*!......
   backtrace:
     [<ffffffff9b84212a>] kmemleak_alloc+0x4a/0xa0
     [<ffffffff9b1ff780>] kmem_cache_alloc_trace+0x110/0x230
     [<ffffffff9b3cd74f>] blk_alloc_queue_stats+0x1f/0x40
     [<ffffffff9b3bacd4>] blk_alloc_queue_node+0x94/0x2e0
     [<ffffffff9b3cbdd0>] blk_mq_init_queue+0x20/0x60
     [<ffffffff9b596382>] virtblk_probe+0x172/0x700
     [<ffffffff9b4e3193>] virtio_dev_probe+0x143/0x1f0
     [<ffffffff9b56eb7f>] driver_probe_device+0x2bf/0x460
     [<ffffffff9b56edff>] __driver_attach+0xdf/0xf0
     [<ffffffff9b56c914>] bus_for_each_dev+0x64/0xa0
     [<ffffffff9b56e3ce>] driver_attach+0x1e/0x20
     [<ffffffff9b56decd>] bus_add_driver+0x1fd/0x270
     [<ffffffff9b56f910>] driver_register+0x60/0xe0
     [<ffffffff9b4e2e10>] register_virtio_driver+0x20/0x30
     [<ffffffff9bfe351a>] init+0x51/0x7e
     [<ffffffff9b002193>] do_one_initcall+0x53/0x1a0
--

^ permalink raw reply

* Re: kmemleak complaints on request queue stats (virtio)
From: Jens Axboe @ 2017-03-29 17:04 UTC (permalink / raw)
  To: Sagi Grimberg, linux-block@vger.kernel.org
In-Reply-To: <c1659bc3-82e1-7d05-cec7-d5a6e78bd8e5@grimberg.me>

On 03/29/2017 11:02 AM, Sagi Grimberg wrote:
> Hi,
> 
> I just got the below kmemleak report. Just thought I'd send it
> out as I don't have time to look into it just now...
> 
> --
> unreferenced object 0xffff8ab236717920 (size 32):
>    comm "swapper/0", pid 1, jiffies 4294892551 (age 9966.044s)
>    hex dump (first 32 bytes):
>      20 79 71 36 b2 8a ff ff 20 79 71 36 b2 8a ff ff   yq6.... yq6....
>      00 00 00 00 ff ff ff ff e0 c1 4b 9b ff ff ff ff  ..........K.....
>    backtrace:
>      [<ffffffff9b84212a>] kmemleak_alloc+0x4a/0xa0
>      [<ffffffff9b1ff780>] kmem_cache_alloc_trace+0x110/0x230
>      [<ffffffff9b3cd74f>] blk_alloc_queue_stats+0x1f/0x40
>      [<ffffffff9b3bacd4>] blk_alloc_queue_node+0x94/0x2e0
>      [<ffffffff9b3cbdd0>] blk_mq_init_queue+0x20/0x60
>      [<ffffffff9b593665>] loop_add+0xe5/0x270
>      [<ffffffff9bfe348b>] loop_init+0x10b/0x149
>      [<ffffffff9b002193>] do_one_initcall+0x53/0x1a0
>      [<ffffffff9bf8b133>] kernel_init_freeable+0x16d/0x1f3
>      [<ffffffff9b83ed0e>] kernel_init+0xe/0x100
>      [<ffffffff9b84d27c>] ret_from_fork+0x2c/0x40
>      [<ffffffffffffffff>] 0xffffffffffffffff

You don't mention what you are running? But I'm assuming it was my 4.12
branch. If so, this is fixed in a later revision of it.  If you pull an
update, it should go away.

-- 
Jens Axboe

^ permalink raw reply

* [PATCH] blk-mq-pci: Fix two spelling mistakes
From: Sagi Grimberg @ 2017-03-29 17:04 UTC (permalink / raw)
  To: linux-block

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 block/blk-mq-pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq-pci.c b/block/blk-mq-pci.c
index 966c2169762e..0c3354cf3552 100644
--- a/block/blk-mq-pci.c
+++ b/block/blk-mq-pci.c
@@ -23,7 +23,7 @@
  * @pdev:	PCI device associated with @set.
  *
  * This function assumes the PCI device @pdev has at least as many available
- * interrupt vetors as @set has queues.  It will then queuery the vector
+ * interrupt vectors as @set has queues.  It will then query the vector
  * corresponding to each queue for it's affinity mask and built queue mapping
  * that maps a queue to the CPUs that have irq affinity for the corresponding
  * vector.
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH v2 1/3] block: warn if sharing request queue across gendisks
From: Omar Sandoval @ 2017-03-29 17:06 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, kernel-team
In-Reply-To: <20170329141035.GB8461@kernel.dk>

On Wed, Mar 29, 2017 at 08:10:35AM -0600, Jens Axboe wrote:
> On Tue, Mar 28 2017, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > Now that the remaining drivers have been converted to one request queue
> > per gendisk, let's warn if a request queue gets registered more than
> > once. This will catch future drivers which might do it inadvertently or
> > any old drivers that I may have missed.
> 
> Added 1-3 for 4.12, thanks. Final nail in the coffin for shared queues.

Thanks!

> BTW, can you include a cover letter when it's more than one patch? Makes
> the flow a bit more straight forward.

Yeah, I'll make sure to do that in the future.

^ permalink raw reply

* Re: kmemleak complaints on request queue stats (virtio)
From: Sagi Grimberg @ 2017-03-29 17:06 UTC (permalink / raw)
  To: Jens Axboe, linux-block@vger.kernel.org
In-Reply-To: <401a0b33-c8ea-d71d-7a9e-3dc47c9129ef@kernel.dk>


> You don't mention what you are running? But I'm assuming it was my 4.12
> branch.

Ehh, details...

> If so, this is fixed in a later revision of it.  If you pull an
> update, it should go away.

Will try, thanks Jens.

^ permalink raw reply

* Re: [PATCH] blk-mq-pci: Fix two spelling mistakes
From: Jens Axboe @ 2017-03-29 17:11 UTC (permalink / raw)
  To: Sagi Grimberg, linux-block
In-Reply-To: <1490807076-23246-1-git-send-email-sagi@grimberg.me>

On 03/29/2017 11:04 AM, Sagi Grimberg wrote:
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>

Applied, thanks.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH] block-mq: don't re-queue if we get a queue error
From: Sagi Grimberg @ 2017-03-29 18:01 UTC (permalink / raw)
  To: Josef Bacik, linux-block, kernel-team
In-Reply-To: <1490733472-3088-1-git-send-email-jbacik@fb.com>

Looks good,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply

* Re: [PATCH] block-mq: don't re-queue if we get a queue error
From: Jens Axboe @ 2017-03-29 19:19 UTC (permalink / raw)
  To: Josef Bacik, linux-block, kernel-team
In-Reply-To: <1490733472-3088-1-git-send-email-jbacik@fb.com>

On 03/28/2017 02:37 PM, Josef Bacik wrote:
> When we try to issue a request directly and fail, we requeue the
> request, but call blk_mq_end_request() as well.  This leads to the
> completed request being on a queuelist and getting ended twice, which
> causes list corruption in schedulers and other shenanigans.

I think this is purely a cosmetic issue, as it should cause no
corruption. But it doesn't make sense to issue a requeue trace,
for instance, if we're just going to end the IO anyway. I have
applied it for 4.12.

-- 
Jens Axboe

^ permalink raw reply

* [PATCH] blk-mq: Export queue state through /sys/kernel/debug/block/*/state
From: Bart Van Assche @ 2017-03-29 20:20 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Omar Sandoval, Hannes Reinecke, linux-block@vger.kernel.org

Make it possible to check whether or not a block layer queue has
been stopped. Make it possible to run a blk-mq queue from user
space.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
---
 block/blk-mq-debugfs.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 84 insertions(+)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 4b3f962a9c7a..cff780c47d88 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -43,6 +43,87 @@ static int blk_mq_debugfs_seq_open(struct inode *inode, struct file *file,
 	return ret;
 }
 
+static const char *const blk_queue_flag_name[] = {
+	[QUEUE_FLAG_QUEUED]	 = "QUEUED",
+	[QUEUE_FLAG_STOPPED]	 = "STOPPED",
+	[QUEUE_FLAG_SYNCFULL]	 = "SYNCFULL",
+	[QUEUE_FLAG_ASYNCFULL]	 = "ASYNCFULL",
+	[QUEUE_FLAG_DYING]	 = "DYING",
+	[QUEUE_FLAG_BYPASS]	 = "BYPASS",
+	[QUEUE_FLAG_BIDI]	 = "BIDI",
+	[QUEUE_FLAG_NOMERGES]	 = "NOMERGES",
+	[QUEUE_FLAG_SAME_COMP]	 = "SAME_COMP",
+	[QUEUE_FLAG_FAIL_IO]	 = "FAIL_IO",
+	[QUEUE_FLAG_STACKABLE]	 = "STACKABLE",
+	[QUEUE_FLAG_NONROT]	 = "NONROT",
+	[QUEUE_FLAG_VIRT]	 = "VIRT",
+	[QUEUE_FLAG_IO_STAT]	 = "IO_STAT",
+	[QUEUE_FLAG_DISCARD]	 = "DISCARD",
+	[QUEUE_FLAG_NOXMERGES]	 = "NOXMERGES",
+	[QUEUE_FLAG_ADD_RANDOM]	 = "ADD_RANDOM",
+	[QUEUE_FLAG_SECERASE]	 = "SECERASE",
+	[QUEUE_FLAG_SAME_FORCE]	 = "SAME_FORCE",
+	[QUEUE_FLAG_DEAD]	 = "DEAD",
+	[QUEUE_FLAG_INIT_DONE]	 = "INIT_DONE",
+	[QUEUE_FLAG_NO_SG_MERGE] = "NO_SG_MERGE",
+	[QUEUE_FLAG_POLL]	 = "POLL",
+	[QUEUE_FLAG_WC]		 = "WC",
+	[QUEUE_FLAG_FUA]	 = "FUA",
+	[QUEUE_FLAG_FLUSH_NQ]	 = "FLUSH_NQ",
+	[QUEUE_FLAG_DAX]	 = "DAX",
+	[QUEUE_FLAG_STATS]	 = "STATS",
+	[QUEUE_FLAG_RESTART]	 = "RESTART",
+	[QUEUE_FLAG_POLL_STATS]	 = "POLL_STATS",
+};
+
+static int blk_queue_flags_show(struct seq_file *m, void *v)
+{
+	struct request_queue *q = m->private;
+	bool sep = false;
+	int i;
+
+	for (i = 0; i < sizeof(q->queue_flags) * BITS_PER_BYTE; i++) {
+		if (!(q->queue_flags & BIT(i)))
+			continue;
+		if (sep)
+			seq_puts(m, " ");
+		sep = true;
+		if (blk_queue_flag_name[i])
+			seq_puts(m, blk_queue_flag_name[i]);
+		else
+			seq_printf(m, "%d", i);
+	}
+	seq_puts(m, "\n");
+	return 0;
+}
+
+static ssize_t blk_queue_flags_store(struct file *file, const char __user *ubuf,
+				     size_t len, loff_t *offp)
+{
+	struct request_queue *q = file_inode(file)->i_private;
+
+	blk_mq_run_hw_queues(q, true);
+	return len;
+}
+
+static int blk_queue_flags_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, blk_queue_flags_show, inode->i_private);
+}
+
+static const struct file_operations blk_queue_flags_fops = {
+	.open		= blk_queue_flags_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+	.write		= blk_queue_flags_store,
+};
+
+static const struct blk_mq_debugfs_attr blk_queue_attrs[] = {
+	{"state", 0600, &blk_queue_flags_fops},
+	{},
+};
+
 static void print_stat(struct seq_file *m, struct blk_rq_stat *stat)
 {
 	if (stat->nr_samples) {
@@ -735,6 +816,9 @@ int blk_mq_debugfs_register_hctxs(struct request_queue *q)
 	if (!q->debugfs_dir)
 		return -ENOENT;
 
+	if (!debugfs_create_files(q->debugfs_dir, q, blk_queue_attrs))
+		goto err;
+
 	q->mq_debugfs_dir = debugfs_create_dir("mq", q->debugfs_dir);
 	if (!q->mq_debugfs_dir)
 		goto err;
-- 
2.12.0

^ permalink raw reply related

* Re: [PATCH] blk-mq: Export queue state through /sys/kernel/debug/block/*/state
From: Jens Axboe @ 2017-03-29 20:31 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Omar Sandoval, Hannes Reinecke, linux-block@vger.kernel.org
In-Reply-To: <1D08B61A9CF0974AA09887BE32D889DA12B851@ULS-OP-MBXIP03.sdcorp.global.sandisk.com>

On 03/29/2017 02:20 PM, Bart Van Assche wrote:
> Make it possible to check whether or not a block layer queue has
> been stopped. Make it possible to run a blk-mq queue from user
> space.

I like this, I've had run-this-queue wired up as well from sysfs
in the past. Maybe we should push it one further, and also allow
things like running a stopped queue?

Would probably be nicer if the file accepted input like "run" (which
would be your current run-this-queue) and "start" (start stopped
queues).

Then we could also EINVAL writes that we don't grok, instead of
just blindly always running the queue.
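
One way the suggested input handling could look is sketched below in user-space C. The enum queue_cmd type and the parse_queue_cmd function are hypothetical, and the sketch only parses the text. It does not touch any queue, and it is not the interface that was eventually merged.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

enum queue_cmd { QUEUE_CMD_RUN, QUEUE_CMD_START };

/*
 * Hypothetical parser: accepts "run" or "start", optionally followed by
 * the single newline that echo(1) appends.  Returns 0 and stores the
 * command on success, or -EINVAL for any other input.
 */
int parse_queue_cmd(const char *buf, size_t len, enum queue_cmd *cmd)
{
    if (len > 0 && buf[len - 1] == '\n')
        len--;
    if (len == 3 && memcmp(buf, "run", 3) == 0) {
        *cmd = QUEUE_CMD_RUN;
        return 0;
    }
    if (len == 5 && memcmp(buf, "start", 5) == 0) {
        *cmd = QUEUE_CMD_START;
        return 0;
    }
    return -EINVAL;
}
```

Under this scheme a write of "1" would be rejected with -EINVAL instead of running the queue.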

-- 
Jens Axboe

^ permalink raw reply

* Re: v4.11-rc blk-mq lockup?
From: Bart Van Assche @ 2017-03-29 20:36 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block@vger.kernel.org
In-Reply-To: <1757f63c-7603-86e8-afde-0cb948ba8f66@kernel.dk>

On 03/28/2017 09:30 AM, Jens Axboe wrote:
> On 03/28/2017 10:25 AM, Bart Van Assche wrote:
>> I do not know whether it would be possible to modify the test such that only
>> the dm driver is involved but no SCSI code.
>
> How about the other way around? Just SCSI, but no dm?

Hello Jens,

Sorry but it could take a long time to figure out how to reproduce this
issue if I start modifying the test. BTW, the patch I just posted
("blk-mq: Export queue state through /sys/kernel/debug/block/*/state")
allows me to trigger a blk-mq queue run from user space. If the lockup
occurs and I use that facility to trigger a queue run the test proceeds.
The command I used to trigger a queue run is as follows:

for a in /sys/kernel/debug/block/*/state; do echo 1 >"$a"; done

> Thanks for running it again, but it's the wrong state file. I should have
> been more clear. The one I'm interested in is in the mq/<num>/ directories,
> like the 'tags' etc files.
> =0A=
> Ala:=0A=
>  =0A=
>  	        for f in "$d"/{dispatch,state,tags*,cpu*/rq_list}; do=0A=
=0A=
Ah, thanks, that makes it clear :-)=0A=
 =0A=
> Also, can you include the involved dm devices as well for this state=0A=
> dump?=0A=
=0A=
I would like to, but the 02-sq-on-mq test configures the dm device =0A=
nodes in single queue mode and there is only information available =0A=
under /sys/kernel/debug/block/ for blk-mq devices ...=0A=
=0A=
Anyway, the updated script:=0A=
=0A=
#!/bin/bash=0A=
=0A=
show_state() {=0A=
    local a dev=3D$1=0A=
=0A=
    for a in device/state queue/scheduler; do=0A=
	[ -e "$dev/$a" ] && grep -aH '' "$dev/$a"=0A=
    done=0A=
}=0A=
=0A=
cd /sys/class/block || exit $?=0A=
for dev in *; do=0A=
    if [ -e "$dev/mq" ]; then=0A=
	echo "$dev"=0A=
	pending=3D0=0A=
	for f in "$dev"/mq/*/{pending,*/rq_list}; do=0A=
	    [ -e "$f" ] || continue=0A=
	    if { read -r line1 && read -r line2; } <"$f"; then=0A=
		echo "$f"=0A=
		echo "$line1 $line2" >/dev/null=0A=
		head -n 9 "$f"=0A=
		((pending++))=0A=
	    fi=0A=
	done=0A=
	(=0A=
	    busy=3D0=0A=
	    cd /sys/kernel/debug/block >&/dev/null &&=0A=
	    for d in "$dev"/mq/*; do=0A=
		[ ! -d "$d" ] && continue=0A=
		grep -q '^busy=3D0$' "$d/tags" && continue=0A=
		((busy++))=0A=
	        for f in "$d"/{dispatch,state,tags*,cpu*/rq_list}; do=0A=
		    [ -e "$f" ] && grep -aH '' "$f"=0A=
		done=0A=
	    done=0A=
	    exit $busy=0A=
	)=0A=
	pending=3D$((pending+$?))=0A=
	if [ "$pending" -gt 0 ]; then=0A=
	    grep -aH '' /sys/kernel/debug/block/"$dev"/state=0A=
	    show_state "$dev"=0A=
	fi=0A=
    fi=0A=
done=0A=
=0A=
=0A=
And the output for the test run of today:=0A=
=0A=
sda=0A=
sdb=0A=
sdd=0A=
sdd/mq/0/dispatch:ffff88036437d140 {.cmd_flags=3D0xca01, .rq_flags=3D0x2040=
, .tag=3D53, .internal_tag=3D-1}=0A=
sdd/mq/0/state:0x4=0A=
sdd/mq/0/tags:nr_tags=3D62=0A=
sdd/mq/0/tags:nr_reserved_tags=3D0=0A=
sdd/mq/0/tags:active_queues=3D0=0A=
sdd/mq/0/tags:=0A=
sdd/mq/0/tags:bitmap_tags:=0A=
sdd/mq/0/tags:depth=3D62=0A=
sdd/mq/0/tags:busy=3D31=0A=
sdd/mq/0/tags:bits_per_word=3D8=0A=
sdd/mq/0/tags:map_nr=3D8=0A=
sdd/mq/0/tags:alloc_hint=3D{48, 48, 38, 44, 54, 6, 52, 23, 30, 6, 51, 26, 6=
1, 45, 9, 56, 55, 13, 44, 45, 12, 12, 23, 42, 44, 24, 41, 0, 54, 4, 4, 45}=
=0A=
sdd/mq/0/tags:wake_batch=3D7=0A=
sdd/mq/0/tags:wake_index=3D0=0A=
sdd/mq/0/tags:ws=3D{=0A=
sdd/mq/0/tags:	{.wait_cnt=3D7, .wait=3Dinactive},=0A=
sdd/mq/0/tags:	{.wait_cnt=3D7, .wait=3Dinactive},=0A=
sdd/mq/0/tags:	{.wait_cnt=3D7, .wait=3Dinactive},=0A=
sdd/mq/0/tags:	{.wait_cnt=3D7, .wait=3Dinactive},=0A=
sdd/mq/0/tags:	{.wait_cnt=7, .wait=inactive},
sdd/mq/0/tags:	{.wait_cnt=7, .wait=inactive},
sdd/mq/0/tags:	{.wait_cnt=7, .wait=inactive},
sdd/mq/0/tags:	{.wait_cnt=7, .wait=inactive},
sdd/mq/0/tags:}
sdd/mq/0/tags:round_robin=0
sdd/mq/0/tags_bitmap:00000000: ffff 7f00 0000 e01f
sdd/mq/0/cpu7/rq_list:ffff88036437e880 {.cmd_flags=0xca01, .rq_flags=0x2040, .tag=54, .internal_tag=-1}
sdd/mq/0/cpu7/rq_list:ffff8803f7ef0000 {.cmd_flags=0xca01, .rq_flags=0x2040, .tag=55, .internal_tag=-1}
sdd/mq/0/cpu7/rq_list:ffff8803f7ef1740 {.cmd_flags=0xca01, .rq_flags=0x2040, .tag=56, .internal_tag=-1}
sdd/mq/0/cpu7/rq_list:ffff8803f7ef2e80 {.cmd_flags=0xca01, .rq_flags=0x2040, .tag=57, .internal_tag=-1}
sdd/mq/0/cpu7/rq_list:ffff8803f7ef45c0 {.cmd_flags=0xca01, .rq_flags=0x2040, .tag=58, .internal_tag=-1}
sdd/mq/0/cpu7/rq_list:ffff8803f7ef5d00 {.cmd_flags=0xca01, .rq_flags=0x2040, .tag=59, .internal_tag=-1}
sdd/mq/0/cpu7/rq_list:ffff8803f7ef7440 {.cmd_flags=0xca01, .rq_flags=0x2040, .tag=60, .internal_tag=-1}
sdd/mq/0/cpu7/rq_list:ffff880386760000 {.cmd_flags=0xca01, .rq_flags=0x2040, .tag=0, .internal_tag=-1}
sdd/mq/0/cpu7/rq_list:ffff880386761740 {.cmd_flags=0xca01, .rq_flags=0x2040, .tag=1, .internal_tag=-1}
sdd/mq/0/cpu7/rq_list:ffff880386762e80 {.cmd_flags=0xca01, .rq_flags=0x2040, .tag=2, .internal_tag=-1}
sdd/mq/0/cpu7/rq_list:ffff8803867645c0 {.cmd_flags=0xca01, .rq_flags=0x2040, .tag=3, .internal_tag=-1}
sdd/mq/0/cpu7/rq_list:ffff880386765d00 {.cmd_flags=0xca01, .rq_flags=0x2040, .tag=4, .internal_tag=-1}
sdd/mq/0/cpu7/rq_list:ffff880386767440 {.cmd_flags=0xca01, .rq_flags=0x2040, .tag=5, .internal_tag=-1}
sdd/mq/0/cpu7/rq_list:ffff880386768b80 {.cmd_flags=0xca01, .rq_flags=0x2040, .tag=6, .internal_tag=-1}
sdd/mq/0/cpu7/rq_list:ffff88038676a2c0 {.cmd_flags=0xca01, .rq_flags=0x2040, .tag=7, .internal_tag=-1}
sdd/mq/0/cpu7/rq_list:ffff88038676ba00 {.cmd_flags=0xca01, .rq_flags=0x2040, .tag=8, .internal_tag=-1}
sdd/mq/0/cpu7/rq_list:ffff88038676d140 {.cmd_flags=0xca01, .rq_flags=0x2040, .tag=9, .internal_tag=-1}
sdd/mq/0/cpu7/rq_list:ffff88038676e880 {.cmd_flags=0xca01, .rq_flags=0x2040, .tag=10, .internal_tag=-1}
sdd/mq/0/cpu7/rq_list:ffff8803f8650000 {.cmd_flags=0xca01, .rq_flags=0x2040, .tag=11, .internal_tag=-1}
sdd/mq/0/cpu7/rq_list:ffff8803f8651740 {.cmd_flags=0xca01, .rq_flags=0x2040, .tag=12, .internal_tag=-1}
sdd/mq/0/cpu7/rq_list:ffff8803f8652e80 {.cmd_flags=0xca01, .rq_flags=0x2040, .tag=13, .internal_tag=-1}
sdd/mq/0/cpu7/rq_list:ffff8803f86545c0 {.cmd_flags=0xca01, .rq_flags=0x2040, .tag=14, .internal_tag=-1}
sdd/mq/0/cpu7/rq_list:ffff8803f8655d00 {.cmd_flags=0xca01, .rq_flags=0x2040, .tag=15, .internal_tag=-1}
sdd/mq/0/cpu7/rq_list:ffff8803f8657440 {.cmd_flags=0xca01, .rq_flags=0x2040, .tag=16, .internal_tag=-1}
sdd/mq/0/cpu7/rq_list:ffff8803f8658b80 {.cmd_flags=0xca01, .rq_flags=0x2040, .tag=17, .internal_tag=-1}
sdd/mq/0/cpu7/rq_list:ffff8803f865a2c0 {.cmd_flags=0xca01, .rq_flags=0x2040, .tag=18, .internal_tag=-1}
sdd/mq/0/cpu7/rq_list:ffff8803f865ba00 {.cmd_flags=0xca01, .rq_flags=0x2040, .tag=19, .internal_tag=-1}
sdd/mq/0/cpu7/rq_list:ffff8803f865d140 {.cmd_flags=0xca01, .rq_flags=0x2040, .tag=20, .internal_tag=-1}
sdd/mq/0/cpu7/rq_list:ffff8803f865e880 {.cmd_flags=0xca01, .rq_flags=0x2040, .tag=21, .internal_tag=-1}
sdd/mq/0/cpu7/rq_list:ffff8803fb630000 {.cmd_flags=0xca01, .rq_flags=0x2040, .tag=22, .internal_tag=-1}
/sys/kernel/debug/block/sdd/state:SAME_COMP STACKABLE IO_STAT INIT_DONE POLL
sdd/device/state:running
sdd/queue/scheduler:[none]
sde
sde/mq/0/state:0x0
sde/mq/0/tags:nr_tags=62
sde/mq/0/tags:nr_reserved_tags=0
sde/mq/0/tags:active_queues=0
sde/mq/0/tags:
sde/mq/0/tags:bitmap_tags:
sde/mq/0/tags:depth=62
sde/mq/0/tags:busy=31
sde/mq/0/tags:bits_per_word=8
sde/mq/0/tags:map_nr=8
sde/mq/0/tags:alloc_hint={48, 48, 38, 44, 54, 6, 52, 23, 30, 6, 51, 26, 61, 45, 9, 56, 55, 13, 44, 45, 12, 12, 23, 42, 44, 24, 41, 0, 54, 4, 4, 45}
sde/mq/0/tags:wake_batch=7
sde/mq/0/tags:wake_index=0
sde/mq/0/tags:ws={
sde/mq/0/tags:	{.wait_cnt=7, .wait=inactive},
sde/mq/0/tags:	{.wait_cnt=7, .wait=inactive},
sde/mq/0/tags:	{.wait_cnt=7, .wait=inactive},
sde/mq/0/tags:	{.wait_cnt=7, .wait=inactive},
sde/mq/0/tags:	{.wait_cnt=7, .wait=inactive},
sde/mq/0/tags:	{.wait_cnt=7, .wait=inactive},
sde/mq/0/tags:	{.wait_cnt=7, .wait=inactive},
sde/mq/0/tags:	{.wait_cnt=7, .wait=inactive},
sde/mq/0/tags:}
sde/mq/0/tags:round_robin=0
sde/mq/0/tags_bitmap:00000000: ffff 7f00 0000 e01f
/sys/kernel/debug/block/sde/state:SAME_COMP STACKABLE IO_STAT INIT_DONE POLL
sde/device/state:running
sde/queue/scheduler:[none]
sdf
sdg
sdh
sdi
sdj
sdk
sr0


* [GIT PULL] Block fixes for 4.11-rc
From: Jens Axboe @ 2017-03-29 21:27 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org

Hi Linus,

5 fixes for this series. This pull request contains:

- A fix from me to ensure that blk-mq drivers that terminate IO in their
  ->queue_rq() handler by returning QUEUE_ERROR don't stall with a
  scheduler enabled.

- 4 nbd fixes from Josef and Ratna, fixing various problems that are
  critical enough to go in for this cycle. They have been well tested.

Please pull!


  git://git.kernel.dk/linux-block.git for-linus


----------------------------------------------------------------
Jens Axboe (1):
      blk-mq: include errors in did_work calculation

Josef Bacik (3):
      nbd: handle ERESTARTSYS properly
      nbd: set rq->errors to actual error code
      nbd: set queue timeout properly

Ratna Manoj Bolla (1):
      nbd: replace kill_bdev() with __invalidate_device()

 block/blk-mq.c      |   7 +--
 drivers/block/nbd.c | 136 +++++++++++++++++++++++++++++++++++++++-------------
 2 files changed, 107 insertions(+), 36 deletions(-)

-- 
Jens Axboe


* [PATCH] blk-mq: Export queue state through /sys/kernel/debug/block/*/state
From: Bart Van Assche @ 2017-03-29 21:32 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Omar Sandoval, Hannes Reinecke, linux-block@vger.kernel.org

Make it possible to check whether or not a block layer queue has
been stopped. Make it possible to start and to run a blk-mq queue
from user space.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.com>

---

Changes compared to v1:
- Constified blk_queue_flag_name.
- Left out QUEUE_FLAG_VIRT because it is a synonym of QUEUE_FLAG_NONROT.
- Check array size before reading from blk_queue_flag_name[].
- Add functionality to restart a block layer queue.

---
 block/blk-mq-debugfs.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 97 insertions(+)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 4b3f962a9c7a..f8b97d6306af 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -43,6 +43,100 @@ static int blk_mq_debugfs_seq_open(struct inode *inode, struct file *file,
 	return ret;
 }
 
+static const char *const blk_queue_flag_name[] = {
+	[QUEUE_FLAG_QUEUED]	 = "QUEUED",
+	[QUEUE_FLAG_STOPPED]	 = "STOPPED",
+	[QUEUE_FLAG_SYNCFULL]	 = "SYNCFULL",
+	[QUEUE_FLAG_ASYNCFULL]	 = "ASYNCFULL",
+	[QUEUE_FLAG_DYING]	 = "DYING",
+	[QUEUE_FLAG_BYPASS]	 = "BYPASS",
+	[QUEUE_FLAG_BIDI]	 = "BIDI",
+	[QUEUE_FLAG_NOMERGES]	 = "NOMERGES",
+	[QUEUE_FLAG_SAME_COMP]	 = "SAME_COMP",
+	[QUEUE_FLAG_FAIL_IO]	 = "FAIL_IO",
+	[QUEUE_FLAG_STACKABLE]	 = "STACKABLE",
+	[QUEUE_FLAG_NONROT]	 = "NONROT",
+	[QUEUE_FLAG_IO_STAT]	 = "IO_STAT",
+	[QUEUE_FLAG_DISCARD]	 = "DISCARD",
+	[QUEUE_FLAG_NOXMERGES]	 = "NOXMERGES",
+	[QUEUE_FLAG_ADD_RANDOM]	 = "ADD_RANDOM",
+	[QUEUE_FLAG_SECERASE]	 = "SECERASE",
+	[QUEUE_FLAG_SAME_FORCE]	 = "SAME_FORCE",
+	[QUEUE_FLAG_DEAD]	 = "DEAD",
+	[QUEUE_FLAG_INIT_DONE]	 = "INIT_DONE",
+	[QUEUE_FLAG_NO_SG_MERGE] = "NO_SG_MERGE",
+	[QUEUE_FLAG_POLL]	 = "POLL",
+	[QUEUE_FLAG_WC]		 = "WC",
+	[QUEUE_FLAG_FUA]	 = "FUA",
+	[QUEUE_FLAG_FLUSH_NQ]	 = "FLUSH_NQ",
+	[QUEUE_FLAG_DAX]	 = "DAX",
+	[QUEUE_FLAG_STATS]	 = "STATS",
+	[QUEUE_FLAG_RESTART]	 = "RESTART",
+	[QUEUE_FLAG_POLL_STATS]	 = "POLL_STATS",
+};
+
+static int blk_queue_flags_show(struct seq_file *m, void *v)
+{
+	struct request_queue *q = m->private;
+	bool sep = false;
+	int i;
+
+	for (i = 0; i < sizeof(q->queue_flags) * BITS_PER_BYTE; i++) {
+		if (!(q->queue_flags & BIT(i)))
+			continue;
+		if (sep)
+			seq_puts(m, " ");
+		sep = true;
+		if (i < ARRAY_SIZE(blk_queue_flag_name) &&
+		    blk_queue_flag_name[i])
+			seq_puts(m, blk_queue_flag_name[i]);
+		else
+			seq_printf(m, "%d", i);
+	}
+	seq_puts(m, "\n");
+	return 0;
+}
+
+static ssize_t blk_queue_flags_store(struct file *file, const char __user *ubuf,
+				     size_t len, loff_t *offp)
+{
+	struct request_queue *q = file_inode(file)->i_private;
+	char op[16] = { }, *s;
+
+	len = min(len, sizeof(op) - 1);
+	if (copy_from_user(op, ubuf, len))
+		return -EFAULT;
+	s = op;
+	strsep(&s, " \t\n"); /* strip trailing whitespace */
+	if (strcmp(op, "run") == 0) {
+		blk_mq_run_hw_queues(q, true);
+	} else if (strcmp(op, "start") == 0) {
+		blk_mq_start_stopped_hw_queues(q, true);
+	} else {
+		pr_err("%s: unsupported operation %s\n", __func__, op);
+		return -EINVAL;
+	}
+	return len;
+}
+
+static int blk_queue_flags_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, blk_queue_flags_show, inode->i_private);
+}
+
+static const struct file_operations blk_queue_flags_fops = {
+	.open		= blk_queue_flags_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+	.write		= blk_queue_flags_store,
+};
+
+static const struct blk_mq_debugfs_attr blk_queue_attrs[] = {
+	{"state", 0600, &blk_queue_flags_fops},
+	{},
+};
+
 static void print_stat(struct seq_file *m, struct blk_rq_stat *stat)
 {
 	if (stat->nr_samples) {
@@ -735,6 +829,9 @@ int blk_mq_debugfs_register_hctxs(struct request_queue *q)
 	if (!q->debugfs_dir)
 		return -ENOENT;
 
+	if (!debugfs_create_files(q->debugfs_dir, q, blk_queue_attrs))
+		goto err;
+
 	q->mq_debugfs_dir = debugfs_create_dir("mq", q->debugfs_dir);
 	if (!q->mq_debugfs_dir)
 		goto err;
-- 
2.12.0

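The flag-printing loop in the patch above can be tried outside the kernel. What follows is only a userspace sketch of the same idea: walk every bit of a mask, print the name when the table has one, and fall back to the bit number otherwise. The `demo_*` names and the flag numbering are made up for illustration and do not match the kernel's QUEUE_FLAG_* values.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical flag numbering for illustration; not the kernel's values. */
enum { DEMO_FLAG_STOPPED = 1, DEMO_FLAG_SAME_COMP = 3, DEMO_FLAG_POLL = 5 };

/* Sparse name table: unnamed slots are NULL, as with designated initializers
 * in the patch. */
static const char *const demo_flag_name[] = {
	[DEMO_FLAG_STOPPED]   = "STOPPED",
	[DEMO_FLAG_SAME_COMP] = "SAME_COMP",
	[DEMO_FLAG_POLL]      = "POLL",
};

#define DEMO_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/*
 * Write the space-separated names of all set bits in 'flags' into 'buf'.
 * Bits without a name are printed as their bit number. Returns the length
 * of the resulting string; a token that does not fit is dropped entirely.
 */
static size_t demo_format_flags(unsigned long flags, char *buf, size_t size)
{
	size_t pos = 0;
	unsigned int i;

	if (size == 0)
		return 0;
	buf[0] = '\0';
	for (i = 0; i < sizeof(flags) * 8; i++) {
		int n;

		if (!(flags & (1UL << i)))
			continue;
		/* Check the array bound before indexing, then check for a name. */
		if (i < DEMO_ARRAY_SIZE(demo_flag_name) && demo_flag_name[i])
			n = snprintf(buf + pos, size - pos, "%s%s",
				     pos ? " " : "", demo_flag_name[i]);
		else
			n = snprintf(buf + pos, size - pos, "%s%u",
				     pos ? " " : "", i);
		if (n < 0 || (size_t)n >= size - pos) {
			buf[pos] = '\0'; /* drop the partial token */
			break;
		}
		pos += (size_t)n;
	}
	return pos;
}
```

Calling `demo_format_flags()` with bits 1 and 5 set yields a string of the same shape as the `state` lines in the dumps earlier in this thread. The bounds check ahead of the table lookup corresponds to the "Check array size before reading" item in the v2 changelog.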

* Re: [PATCH 03/23] sd: implement REQ_OP_WRITE_ZEROES
From: Martin K. Petersen @ 2017-03-30  2:25 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: agk@redhat.com, lars.ellenberg@linbit.com, snitzer@redhat.com,
	hch@lst.de, martin.petersen@oracle.com,
	philipp.reisner@linbit.com, axboe@kernel.dk, shli@kernel.org,
	linux-scsi@vger.kernel.org, dm-devel@redhat.com,
	drbd-dev@lists.linbit.com, linux-block@vger.kernel.org,
	linux-raid@vger.kernel.org
In-Reply-To: <1490726988.2573.16.camel@sandisk.com>

Bart Van Assche <Bart.VanAssche@sandisk.com> writes:

Hi Bart,

> A quote from SBC: "An OPTIMAL UNMAP GRANULARITY field set to a
> non-zero value indicates the optimal granularity in logical blocks for
> unmap requests (e.g., an UNMAP command or a WRITE SAME (16) command
> with the UNMAP bit set to one).  An unmap request with a number of
> logical blocks that is not a multiple of this value may result in
> unmap operations on fewer LBAs than requested."

Indeed. Fewer LBAs than requested may be *unmapped*. That does not imply
that they are not *written*.

> This means that just like the start and end of a discard must be
> aligned on a discard_granularity boundary, WRITE SAME commands with
> the UNMAP bit set must also respect that granularity. I think this
> means that either __blkdev_issue_zeroout() has to be modified such
> that it rejects unaligned REQ_OP_WRITE_ZEROES operations or that
> blk_bio_write_same_split() has to be modified such that it generates
> REQ_OP_WRITEs for the unaligned start and tail.

No, that's not correct. SBC states:

"a) if the Data-Out Buffer of the WRITE SAME command is the same as the
   logical block data returned by a read operation from that LBA while
   in the unmapped state, then:

   1) the device server performs the actions described in table 6
      [unmap]; and

   2) if an unmap operation is not performed in step 1), then the device
      server shall perform the specified write operation to that LBA;"

I.e. With WRITE SAME it is the responsibility of the device server to
write any LBAs described by the command that were not successfully
unmapped.
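
To make the distinction concrete, here is a small userspace model. It is a sketch under a stated assumption, not a description of any real device: it assumes the device server unmaps only granules that the request covers completely and writes zeroes to whatever is left at the unaligned start and tail. The names `ws_result` and `model_write_same_unmap` are invented for this example.

```c
#include <stdint.h>

/* How a hypothetical device server disposes of the blocks of one
 * WRITE SAME command that has the UNMAP bit set. */
struct ws_result {
	uint64_t written_head; /* blocks written before the first whole granule */
	uint64_t unmapped;     /* blocks deallocated in whole granules */
	uint64_t written_tail; /* blocks written after the last whole granule */
};

/*
 * Assumed policy: unmap only granules fully covered by [lba, lba + nr_blocks)
 * and write the rest. A granularity of 0 is treated as 1 in this model.
 * Overflow of lba + nr_blocks is not handled.
 */
static struct ws_result model_write_same_unmap(uint64_t lba, uint64_t nr_blocks,
					       uint64_t granularity)
{
	struct ws_result r = { 0, 0, 0 };
	uint64_t g = granularity ? granularity : 1;
	uint64_t end = lba + nr_blocks;
	uint64_t first = ((lba + g - 1) / g) * g; /* round start up */
	uint64_t last = (end / g) * g;            /* round end down */

	if (first >= last) {
		/* No whole granule inside the range: everything is written. */
		r.written_head = nr_blocks;
		return r;
	}
	r.written_head = first - lba;
	r.unmapped = last - first;
	r.written_tail = end - last;
	return r;
}
```

Whatever the alignment, the three counts always add up to `nr_blocks`, so every LBA in the range reads back as zeroes. That is the property that lets the block layer send unaligned WRITE SAME commands with UNMAP set.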

-- 
Martin K. Petersen	Oracle Linux Engineering


* [PATCH] block: do not put mq context in blk_mq_alloc_request_hctx
From: Minchan Kim @ 2017-03-30  5:20 UTC (permalink / raw)
  To: Jens Axboe
  Cc: kernel-team, linux-block, linux-kernel, Minchan Kim,
	Sagi Grimberg, Omar Sandoval

In blk_mq_alloc_request_hctx, blk_mq_sched_get_request doesn't
get the sw context, so we don't need to put the context with
blk_mq_put_ctx. Otherwise, we will see a preempt counter underflow.

Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---

Someone may already have fixed this, but I noticed the preempt counter
underflow problem a few weeks ago and still see it with
linux-next.

 block/blk-mq.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a4546f060e80..a6f3998dc4ee 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -321,7 +321,6 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q, int rw,
 
 	rq = blk_mq_sched_get_request(q, NULL, rw, &alloc_data);
 
-	blk_mq_put_ctx(alloc_data.ctx);
 	blk_queue_exit(q);
 
 	if (!rq)
-- 
2.7.4
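
The imbalance described in the commit message can be modelled in a few lines of userspace C. This is only an illustrative sketch: `demo_preempt_count` stands in for the preempt counter, and the `demo_*` helpers are hypothetical stand-ins that assume a "get" increments the counter and a "put" decrements it.

```c
#include <stdbool.h>

/* Stands in for the per-task preempt counter in this model. */
static int demo_preempt_count;

/* Models taking the software context: bumps the counter. */
static void demo_get_ctx(void)
{
	demo_preempt_count++;
}

/* Models releasing the context. Returns false instead of underflowing. */
static bool demo_put_ctx(void)
{
	if (demo_preempt_count == 0)
		return false; /* a put without a matching get */
	demo_preempt_count--;
	return true;
}

/* A path that takes the context itself: one get, one put, balanced. */
static bool demo_alloc_request(void)
{
	demo_get_ctx();
	return demo_put_ctx() && demo_preempt_count == 0;
}

/*
 * A path modelled on the report: nothing was taken on the way in.
 * 'extra_put' selects the pre-patch behaviour with the stray put.
 */
static bool demo_alloc_request_hctx(bool extra_put)
{
	if (extra_put)
		return demo_put_ctx(); /* fails: nothing to release */
	return demo_preempt_count == 0;
}
```

In this model `demo_alloc_request_hctx(true)` reports the failure, while `demo_alloc_request_hctx(false)` leaves the counter at zero. The second case corresponds to removing the `blk_mq_put_ctx()` call in the diff above.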


* Re: [PATCH] blk-mq: Export queue state through /sys/kernel/debug/block/*/state
From: Hannes Reinecke @ 2017-03-30  5:50 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe; +Cc: Omar Sandoval, linux-block@vger.kernel.org
In-Reply-To: <1D08B61A9CF0974AA09887BE32D889DA12B851@ULS-OP-MBXIP03.sdcorp.global.sandisk.com>

On 03/29/2017 10:20 PM, Bart Van Assche wrote:
> Make it possible to check whether or not a block layer queue has
> been stopped. Make it possible to run a blk-mq queue from user
> space.
> 
> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
> Cc: Omar Sandoval <osandov@fb.com>
> Cc: Hannes Reinecke <hare@suse.com>
> ---
>  block/blk-mq-debugfs.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 84 insertions(+)
> 
About bloody time :-)

Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


* Re: [PATCH 1/7] sd: split sd_setup_discard_cmnd
From: hch @ 2017-03-30  8:49 UTC (permalink / raw)
  To: axboe@kernel.dk
  Cc: Bart Van Assche, hch@lst.de, tj@kernel.org,
	martin.petersen@oracle.com, linux-scsi@vger.kernel.org,
	linux-block@vger.kernel.org, linux-ide@vger.kernel.org
In-Reply-To: <20170328140509.GA27578@kernel.dk>

On Tue, Mar 28, 2017 at 08:05:09AM -0600, axboe@kernel.dk wrote:
> > Although I know this is an issue in the existing code and not something
> > introduced by you: please consider using logical_to_sectors() instead of
> > open-coding this function. Otherwise this patch looks fine to me.
> 
> The downside of doing that is that we still call ilog2() twice, which
> sucks. Would be faster to cache ilog2(sector_size) and use that in the
> shift calculation.

I suspect that gcc is smart enough to optimize it away.  That being said,
while this looks like a nice cleanup, this patch is just supposed to move
code, so I'd rather not add the change here and leave it for a separate
submission.
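
For reference, the two variants under discussion can be sketched in userspace as below. This is an illustration only: `demo_ilog2` is a portable stand-in for the kernel's ilog2(), and `struct demo_sdev` with its cached `shift` member is hypothetical, not a field that exists in the SCSI device structure.

```c
#include <stdint.h>

/* floor(log2(v)) for v > 0; a portable stand-in for the kernel's ilog2(). */
static unsigned int demo_ilog2(uint32_t v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* Hypothetical device descriptor for this example. */
struct demo_sdev {
	uint32_t sector_size; /* logical block size in bytes, power of two >= 512 */
	unsigned int shift;   /* cached ilog2(sector_size) - 9 */
};

static void demo_sdev_init(struct demo_sdev *d, uint32_t sector_size)
{
	d->sector_size = sector_size;
	d->shift = demo_ilog2(sector_size) - 9;
}

/* Recomputes the logarithm on every call. */
static uint64_t demo_logical_to_sectors(const struct demo_sdev *d,
					uint64_t blocks)
{
	return blocks << (demo_ilog2(d->sector_size) - 9);
}

/* Uses the shift computed once at init time. */
static uint64_t demo_logical_to_sectors_cached(const struct demo_sdev *d,
					       uint64_t blocks)
{
	return blocks << d->shift;
}
```

Both helpers convert a count of logical blocks into 512-byte sectors and return the same value. They differ only in whether the logarithm is recomputed on each call, and that matters only if the compiler does not fold the repeated computation itself.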


* Re: [PATCH 01/23] block: renumber REQ_OP_WRITE_ZEROES
From: hch @ 2017-03-30  8:53 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: agk@redhat.com, lars.ellenberg@linbit.com, snitzer@redhat.com,
	hch@lst.de, martin.petersen@oracle.com,
	philipp.reisner@linbit.com, axboe@kernel.dk, shli@kernel.org,
	linux-scsi@vger.kernel.org, dm-devel@redhat.com,
	drbd-dev@lists.linbit.com, linux-block@vger.kernel.org,
	linux-raid@vger.kernel.org
In-Reply-To: <1490717553.2573.4.camel@sandisk.com>

On Tue, Mar 28, 2017 at 04:12:46PM +0000, Bart Van Assche wrote:
> Since REQ_OP_WRITE_ZEROES was introduced in kernel v4.10, do we need
> "Cc: stable" and "Fixes: a6f0788ec2881" tags for this patch?

No.  This just works around the way scsi_setup_cmnd sets up the data
direction.  Before this series it's not an issue because no one used
the req_op data direction for setting up the dma direction.


* Re: [PATCH 11/23] block_dev: use blkdev_issue_zerout for hole punches
From: hch @ 2017-03-30  8:59 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: agk@redhat.com, lars.ellenberg@linbit.com, snitzer@redhat.com,
	hch@lst.de, martin.petersen@oracle.com,
	philipp.reisner@linbit.com, axboe@kernel.dk, shli@kernel.org,
	linux-scsi@vger.kernel.org, dm-devel@redhat.com,
	drbd-dev@lists.linbit.com, linux-block@vger.kernel.org,
	linux-raid@vger.kernel.org
In-Reply-To: <1490719834.2573.9.camel@sandisk.com>

On Tue, Mar 28, 2017 at 04:50:47PM +0000, Bart Van Assche wrote:
> On Thu, 2017-03-23 at 10:33 -0400, Christoph Hellwig wrote:
> > This gets us support for non-discard efficient write of zeroes (e.g. NVMe)
> > and preparse for removing the discard_zeroes_data flag.
> 
> Hello Christoph,
> 
> "preparse" probably should have been "prepare"?

Yes, fixed.

