Linux block layer


* Re: [PATCH 2/7] bio-integrity: save original iterator for verify stage
From: Dmitry Monakhov @ 2017-04-04 12:15 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel, linux-block, martin.petersen
In-Reply-To: <20170404070122.GC12008@infradead.org>

Christoph Hellwig <hch@infradead.org> writes:

> This is a pretty big increase in the bio_integrity_payload size,
> but I guess we can't get around it..

Yes, everybody hates this solution, me too, but I started with another
approach and it turned out to be very ugly.

My idea was that we have two types of iterator incrementors: bio_advance()
and bio_xx_complete. The first is called during a split, the latter is called
on completion ( req_bio_endio() ). So we can add a new field "bi_done" to the
iterator, which has a similar meaning to bi_bvec_done, but at full iterator
scope.  It is incremented during completion, but before end_io.
Chained bios will propagate bi_done to the parent bio.
In ->verify_fn() the iterator will be rewound (the counterpart of bvec_advance) by
iter->bi_done bytes, so we will get the original iterator.
I even prepared a patch for this idea and it looks big and awful.
Even worse, it does not work if chained bios overlap (raid1, raid10,
etc).

But... while writing this email I realized that I do not
care about what happens with chained bios. The only thing that matters
is the parent bio and how far it was advanced. If bi_done is incremented
inside bvec_iter_advance() I can be sure that at the time of ->bi_end_io()
the original position can be restored by rewinding back by bi_done bytes.

I'll try to implement this.
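
The bi_done idea above can be sketched in user space as follows. This is only a model under stated assumptions: struct iter_model and both helpers are hypothetical names, not the kernel's struct bvec_iter or its API, and the model tracks a flat byte offset rather than a bio_vec array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model of an iterator; not the kernel structure. */
struct iter_model {
	size_t pos;	/* bytes consumed from the start of the payload */
	size_t size;	/* bytes still remaining */
	size_t done;	/* total bytes advanced so far (the "bi_done" counter) */
};

/* Advance by 'bytes' and record the progress so it can be undone later. */
static void iter_advance(struct iter_model *it, size_t bytes)
{
	assert(bytes <= it->size);
	it->pos += bytes;
	it->size -= bytes;
	it->done += bytes;
}

/* Rewind by everything recorded in 'done', restoring the original position. */
static void iter_rewind_all(struct iter_model *it)
{
	assert(it->done <= it->pos);
	it->pos -= it->done;
	it->size += it->done;
	it->done = 0;
}
```

Because every advance goes through one function, the counter always equals the distance from the original position, which is what lets the completion path rewind without a saved copy of the iterator.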

>
> Reviewed-by: Christoph Hellwig <hch@lst.de>


* [PATCH 2/2] scsi: Add template flag 'host_tagset'
From: Hannes Reinecke @ 2017-04-04 12:07 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Omar Sandoval, Martin K. Petersen, James Bottomley,
	Christoph Hellwig, Bart van Assche, linux-block, linux-scsi,
	Hannes Reinecke, Hannes Reinecke
In-Reply-To: <1491307665-47656-1-git-send-email-hare@suse.de>

Add a host template flag 'host_tagset' to enable the use of a
global tagmap for block-mq.

Signed-off-by: Hannes Reinecke <hare@suse.com>
---
 drivers/scsi/scsi_lib.c  | 2 ++
 include/scsi/scsi_host.h | 5 +++++
 2 files changed, 7 insertions(+)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index ba22866..00036cb 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2193,6 +2193,8 @@ int scsi_mq_setup_tags(struct Scsi_Host *shost)
 	shost->tag_set.cmd_size = cmd_size;
 	shost->tag_set.numa_node = NUMA_NO_NODE;
 	shost->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+	if (shost->hostt->host_tagset)
+		shost->tag_set.flags |= BLK_MQ_F_GLOBAL_TAGS;
 	shost->tag_set.flags |=
 		BLK_ALLOC_POLICY_TO_MQ_FLAG(shost->hostt->tag_alloc_policy);
 	shost->tag_set.driver_data = shost;
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 3cd8c3b..dff3ec1 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -457,6 +457,11 @@ struct scsi_host_template {
 	unsigned no_async_abort:1;
 
 	/*
+	 * True if the host supports a host-wide tagspace
+	 */
+	unsigned host_tagset:1;
+
+	/*
 	 * Countdown for host blocking with no commands outstanding.
 	 */
 	unsigned int max_host_blocked;
-- 
1.8.5.6


* [PATCH 1/2] block: Implement global tagset
From: Hannes Reinecke @ 2017-04-04 12:07 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Omar Sandoval, Martin K. Petersen, James Bottomley,
	Christoph Hellwig, Bart van Assche, linux-block, linux-scsi,
	Hannes Reinecke, Hannes Reinecke
In-Reply-To: <1491307665-47656-1-git-send-email-hare@suse.de>

Most legacy HBAs have a tagset per HBA, not per queue. To map
these devices onto block-mq this patch implements a new tagset
flag BLK_MQ_F_GLOBAL_TAGS, which will cause the tag allocator
to use just one tagset for all hardware queues.

Signed-off-by: Hannes Reinecke <hare@suse.com>
---
 block/blk-mq-tag.c     | 12 ++++++++----
 block/blk-mq.c         | 10 ++++++++--
 include/linux/blk-mq.h |  1 +
 3 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index e48bc2c..a14e76c 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -276,9 +276,11 @@ static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 		busy_tag_iter_fn *fn, void *priv)
 {
-	int i;
+	int i, lim = tagset->nr_hw_queues;
 
-	for (i = 0; i < tagset->nr_hw_queues; i++) {
+	if (tagset->flags & BLK_MQ_F_GLOBAL_TAGS)
+		lim = 1;
+	for (i = 0; i < lim; i++) {
 		if (tagset->tags && tagset->tags[i])
 			blk_mq_all_tag_busy_iter(tagset->tags[i], fn, priv);
 	}
@@ -287,12 +289,14 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 
 int blk_mq_reinit_tagset(struct blk_mq_tag_set *set)
 {
-	int i, j, ret = 0;
+	int i, j, ret = 0, lim = set->nr_hw_queues;
 
 	if (!set->ops->reinit_request)
 		goto out;
 
-	for (i = 0; i < set->nr_hw_queues; i++) {
+	if (set->flags & BLK_MQ_F_GLOBAL_TAGS)
+		lim = 1;
+	for (i = 0; i < lim; i++) {
 		struct blk_mq_tags *tags = set->tags[i];
 
 		for (j = 0; j < tags->nr_tags; j++) {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 159187a..db96ed0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2061,6 +2061,10 @@ static bool __blk_mq_alloc_rq_map(struct blk_mq_tag_set *set, int hctx_idx)
 {
 	int ret = 0;
 
+	if ((set->flags & BLK_MQ_F_GLOBAL_TAGS) && hctx_idx != 0) {
+		set->tags[hctx_idx] = set->tags[0];
+		return true;
+	}
 	set->tags[hctx_idx] = blk_mq_alloc_rq_map(set, hctx_idx,
 					set->queue_depth, set->reserved_tags);
 	if (!set->tags[hctx_idx])
@@ -2080,8 +2084,10 @@ static void blk_mq_free_map_and_requests(struct blk_mq_tag_set *set,
 					 unsigned int hctx_idx)
 {
 	if (set->tags[hctx_idx]) {
-		blk_mq_free_rqs(set, set->tags[hctx_idx], hctx_idx);
-		blk_mq_free_rq_map(set->tags[hctx_idx]);
+		if (!(set->flags & BLK_MQ_F_GLOBAL_TAGS) || hctx_idx == 0) {
+			blk_mq_free_rqs(set, set->tags[hctx_idx], hctx_idx);
+			blk_mq_free_rq_map(set->tags[hctx_idx]);
+		}
 		set->tags[hctx_idx] = NULL;
 	}
 }
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index b296a90..eee27b016 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -155,6 +155,7 @@ enum {
 	BLK_MQ_F_DEFER_ISSUE	= 1 << 4,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
 	BLK_MQ_F_NO_SCHED	= 1 << 6,
+	BLK_MQ_F_GLOBAL_TAGS	= 1 << 7,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
 	BLK_MQ_F_ALLOC_POLICY_BITS = 1,
 
-- 
1.8.5.6


* [RFC PATCH 0/2] block,scsi: support host-wide tagset
From: Hannes Reinecke @ 2017-04-04 12:07 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Omar Sandoval, Martin K. Petersen, James Bottomley,
	Christoph Hellwig, Bart van Assche, linux-block, linux-scsi,
	Hannes Reinecke

Hi all,

as discussed recently most existing HBAs have a host-wide tagset which
does not map easily onto the per-queue tagset model of block mq.
This patchset implements a flag BLK_MQ_F_GLOBAL_TAGS for block-mq, which
enables the use of a shared tagset for all hardware queues.
The second patch adds a flag 'host_tagset' to the SCSI host template,
which allows drivers to enable the use of the global tagset.

This patchset probably has some performance implications as
there is a quite high probability of cache-bouncing when allocating
tags. Also I'm not quite sure if the implemented tagset sharing
is the correct way to handle things.
So this can be considered an RFC.

As usual, comments and reviews are welcome.
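
The sharing scheme in patch 1/2 can be modelled in user space roughly as below. This is a sketch, not the blk-mq code: the MODEL_* names, struct tag_set_model and both helpers are hypothetical, and unlike the patch the allocation helper reports failure if queue 0 has no map yet.

```c
#include <assert.h>
#include <stdlib.h>

#define MODEL_F_GLOBAL_TAGS	(1u << 7)	/* mirrors BLK_MQ_F_GLOBAL_TAGS */
#define MODEL_MAX_QUEUES	8

struct tags_model {
	unsigned int nr_tags;
};

struct tag_set_model {
	unsigned int flags;
	unsigned int nr_hw_queues;
	struct tags_model *tags[MODEL_MAX_QUEUES];
};

/* Allocate a map for queue 'idx'; with global tags, alias queue 0's map. */
static int set_alloc_map(struct tag_set_model *set, unsigned int idx,
			 unsigned int depth)
{
	if ((set->flags & MODEL_F_GLOBAL_TAGS) && idx != 0) {
		set->tags[idx] = set->tags[0];
		return set->tags[0] != NULL;
	}
	set->tags[idx] = malloc(sizeof(*set->tags[idx]));
	if (!set->tags[idx])
		return 0;
	set->tags[idx]->nr_tags = depth;
	return 1;
}

/*
 * Drop the map for queue 'idx'. A shared map is released only through
 * queue 0; the other queues just forget their alias.
 * Returns 1 if memory was actually freed.
 */
static int set_free_map(struct tag_set_model *set, unsigned int idx)
{
	int freed = 0;

	if (!set->tags[idx])
		return 0;
	if (!(set->flags & MODEL_F_GLOBAL_TAGS) || idx == 0) {
		free(set->tags[idx]);
		freed = 1;
	}
	set->tags[idx] = NULL;
	return freed;
}
```

The point of the model is the ownership rule: every slot points at the same map, so exactly one slot (index 0) may free it, otherwise the map would be freed once per hardware queue.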

Hannes Reinecke (2):
  block: Implement global tagset
  scsi: Add template flag 'host_tagset'

 block/blk-mq-tag.c       | 12 ++++++++----
 block/blk-mq.c           | 10 ++++++++--
 drivers/scsi/scsi_lib.c  |  2 ++
 include/linux/blk-mq.h   |  1 +
 include/scsi/scsi_host.h |  5 +++++
 5 files changed, 24 insertions(+), 6 deletions(-)

-- 
1.8.5.6


* Re: [PATCH] loop: Add PF_LESS_THROTTLE to block/loop device thread.
From: Michal Hocko @ 2017-04-04 11:23 UTC (permalink / raw)
  To: NeilBrown; +Cc: Jens Axboe, linux-block, linux-mm, LKML
In-Reply-To: <871staffus.fsf@notabene.neil.brown.name>

On Mon 03-04-17 11:18:51, NeilBrown wrote:
> 
> When a filesystem is mounted from a loop device, writes are
> throttled by balance_dirty_pages() twice: once when writing
> to the filesystem and once when the loop_handle_cmd() writes
> to the backing file.  This double-throttling can trigger
> positive feedback loops that create significant delays.  The
> throttling at the lower level is seen by the upper level as
> a slow device, so it throttles extra hard.
> 
> The PF_LESS_THROTTLE flag was created to handle exactly this
> circumstance, though with an NFS filesystem mounted from a
> local NFS server.  It reduces the throttling on the lower
> layer so that it can proceed largely unthrottled.
> 
> To demonstrate this, create a filesystem on a loop device
> and write (e.g. with dd) several large files which combine
> to consume significantly more than the limit set by
> /proc/sys/vm/dirty_ratio or dirty_bytes.  Measure the total
> time taken.
> 
> When I do this directly on a device (no loop device) the
> total time for several runs (mkfs, mount, write 200 files,
> umount) is fairly stable: 28-35 seconds.
> When I do this over a loop device the times are much worse
> and less stable.  52-460 seconds.  Half below 100seconds,
> half above.
> When I apply this patch, the times become stable again,
> though not as fast as the no-loop-back case: 53-72 seconds.
> 
> There may be room for further improvement as the total overhead still
> seems too high, but this is a big improvement.

Yes this makes sense to me

> Signed-off-by: NeilBrown <neilb@suse.com>

Acked-by: Michal Hocko <mhocko@suse.com>

one nit below

> ---
>  drivers/block/loop.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 0ecb6461ed81..a7e1dd215fc2 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1694,8 +1694,11 @@ static void loop_queue_work(struct kthread_work *work)
>  {
>  	struct loop_cmd *cmd =
>  		container_of(work, struct loop_cmd, work);
> +	int oldflags = current->flags & PF_LESS_THROTTLE;
>  
> +	current->flags |= PF_LESS_THROTTLE;
>  	loop_handle_cmd(cmd);
> +	current->flags = (current->flags & ~PF_LESS_THROTTLE) | oldflags;

we have a helper for this tsk_restore_flags(). It is not used
consistently and maybe we want a dedicated api like we have for the
scope NOIO/NOFS but that is a separate thing. I would find
tsk_restore_flags easier to read.
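
For illustration, the save/set/restore pattern under discussion looks roughly like this in a user-space model. Everything here is an assumption for the sketch: struct task_model, model_restore_flags() and the flag value are stand-ins, with model_restore_flags() written to mimic how I understand tsk_restore_flags() to behave (restore only the masked bits from the saved value).

```c
#include <assert.h>

#define MODEL_PF_LESS_THROTTLE	0x00100000UL	/* illustrative bit value */

struct task_model {
	unsigned long flags;
};

/* Restore only the bits in 'mask' from 'orig_flags'; leave other bits alone. */
static void model_restore_flags(struct task_model *task,
				unsigned long orig_flags, unsigned long mask)
{
	task->flags &= ~mask;
	task->flags |= orig_flags & mask;
}

/* Example work item that changes an unrelated flag bit while it runs. */
static void sample_work(struct task_model *task)
{
	task->flags |= 0x1UL;
}

/* Run 'work' with the flag set, then put that one flag back as it was. */
static void run_less_throttled(struct task_model *task,
			       void (*work)(struct task_model *))
{
	unsigned long orig = task->flags;

	task->flags |= MODEL_PF_LESS_THROTTLE;
	work(task);
	model_restore_flags(task, orig, MODEL_PF_LESS_THROTTLE);
}
```

The helper form keeps the mask in one place, so bits that the work function changes on its own survive, and only the throttle bit is returned to its previous state.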

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH V2 04/16] block, bfq: modify the peak-rate estimator
From: Paolo Valente @ 2017-04-04 10:42 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: tj@kernel.org, axboe@kernel.dk, ulf.hansson@linaro.org,
	linux-kernel@vger.kernel.org, fchecconi@gmail.com,
	Arianna Avanzini, linux-block@vger.kernel.org,
	linus.walleij@linaro.org, broonie@kernel.org
In-Reply-To: <1490974279.2587.5.camel@sandisk.com>


> On 31 Mar 2017, at 17:31, Bart Van Assche <bart.vanassche@sandisk.com> wrote:
>
> On Fri, 2017-03-31 at 14:47 +0200, Paolo Valente wrote:
>> -static bool bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq,
>> -                                bool compensate)
>> +static bool bfq_bfqq_is_slow(struct bfq_data *bfqd, struct bfq_queue *bfqq,
>> +                                bool compensate, enum bfqq_expiration reason,
>> +                                unsigned long *delta_ms)
>>  {
>> -       u64 bw, usecs, expected, timeout;
>> -       ktime_t delta;
>> -       int update = 0;
>> +       ktime_t delta_ktime;
>> +       u32 delta_usecs;
>> +       bool slow = BFQQ_SEEKY(bfqq); /* if delta too short, use seekyness */
>>
>> -       if (!bfq_bfqq_sync(bfqq) || bfq_bfqq_budget_new(bfqq))
>> +       if (!bfq_bfqq_sync(bfqq))
>>                 return false;
>>
>>         if (compensate)
>> -               delta = bfqd->last_idling_start;
>> +               delta_ktime = bfqd->last_idling_start;
>>         else
>> -               delta = ktime_get();
>> -       delta = ktime_sub(delta, bfqd->last_budget_start);
>> -       usecs = ktime_to_us(delta);
>> -
>> -       /* Don't trust short/unrealistic values. */
>> -       if (usecs < 100 || usecs >= LONG_MAX)
>> -               return false;
>> -
>> -       /*
>> -        * Calculate the bandwidth for the last slice.  We use a 64 bit
>> -        * value to store the peak rate, in sectors per usec in fixed
>> -        * point math.  We do so to have enough precision in the estimate
>> -        * and to avoid overflows.
>> -        */
>> -       bw = (u64)bfqq->entity.service << BFQ_RATE_SHIFT;
>> -       do_div(bw, (unsigned long)usecs);
>> +               delta_ktime = ktime_get();
>> +       delta_ktime = ktime_sub(delta_ktime, bfqd->last_budget_start);
>> +       delta_usecs = ktime_to_us(delta_ktime);
>> +
>
> This patch changes the type of the variable in which the result of ktime_to_us()
> is stored from u64 into u32 and next compares that result with LONG_MAX. Since
> ktime_to_us() returns a signed 64-bit number, are you sure you want to store that
> result in a 32-bit variable? If ktime_to_us() would e.g. return 0xffffffff00000100
> or 0x100000100 then the assignment will truncate these numbers to 0x100.
>

The instruction above the assignment you highlight stores in
delta_ktime the difference between 'now' and the last budget start.
The latter may have happened at most about 100 ms before 'now'.  So
there should be no overflow issue.
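
To make the two positions concrete, here is a small stand-alone C sketch of the conversion being discussed. The function names are hypothetical; the only assumptions are that the microsecond count is a signed 64-bit value and the destination is an unsigned 32-bit variable, as stated in the review.

```c
#include <assert.h>
#include <stdint.h>

/* Store a signed 64-bit microsecond count in 32 bits, as a u32 assignment would. */
static uint32_t store_usecs_u32(int64_t usecs)
{
	return (uint32_t)usecs;	/* keeps only the low 32 bits */
}

/* True if the value survives the trip through a u32 unchanged. */
static int fits_in_u32(int64_t usecs)
{
	return usecs >= 0 && usecs <= (int64_t)UINT32_MAX;
}
```

With these, both example values from the review (0xffffffff00000100 and 0x100000100) come back as 0x100, while an interval of about 100 ms (100000 us) fits easily, since a u32 holds up to 4294967295 us, roughly 71 minutes. The assignment is therefore safe only as long as the bound on the interval really holds.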

Thanks,
Paolo

> Bart.


* Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request
From: Michael Wang @ 2017-04-04 10:23 UTC (permalink / raw)
  To: NeilBrown, linux-kernel@vger.kernel.org, linux-block, linux-raid
  Cc: Jens Axboe, Shaohua Li, Jinpu Wang
In-Reply-To: <871st8jyya.fsf@notabene.neil.brown.name>


On 04/04/2017 11:37 AM, NeilBrown wrote:
> On Tue, Apr 04 2017, Michael Wang wrote:
[snip]
>>>
>>> If sync_request_write() is using a bio that has already been used, it
>>> should call bio_reset() and fill in the details again.
>>> However I don't see how that would happen.
>>> Can you give specific details on the situation that triggers the bug?
>>
>> We have storage side mapping lv through scst to server, on server side
>> we assemble them into multipath device, and then assemble these dm into
>> two raid1.
>>
>> The test is firstly do mkfs.ext4 on raid1 then start fio on it, on storage
>> side we unmap all the lv (could during mkfs or fio), then on server side
>> we hit the BUG (reproducible).
> 
> So I assume the initial resync is still happening at this point?
> And you unmap *all* the lv's so you expect IO to fail?
> I can see that the code would behave strangely if you have a
> bad-block-list configured (which is the default).
> Do you have a bbl?  If you create the array without the bbl, does it
> still crash?

The resync does happen concurrently in this case. We are trying
to simulate the case where all the connections are dropped. The IO does
fail, and there is also a bunch of kernel log output like:

md: super_written gets error=-5
blk_update_request: I/O error, dev dm-3, sector 64184
md/raid1:md1: dm-2: unrecoverable I/O read error for block 46848

We expect that to happen, but the server should not crash on a BUG.

And we haven't done anything special regarding the bbl; the bad_blocks
files in sysfs are all empty.

> 
>>
>> The path of bio was confirmed by add tracing, it is reused in sync_request_write()
>> with 'bi_next' once chained inside blk_attempt_plug_merge().
> 
> I still don't see why it is re-used.
> I assume you didn't explicitly ask for a check/repair (i.e. didn't write
> to .../md/sync_action at all?).  In that case MD_RECOVERY_REQUESTED is
> not set.

We just unmapped the LVs on the storage side; nothing was done on the server side.

> So sync_request() sends only one bio to generic_make_request():
>    r1_bio->bios[r1_bio->read_disk];
> 
> then sync_request_write() *doesn't* send that bio again, but does send
> all the others.
> 
> So where does it reuse a bio?

If that's the design then it would be strange... the log does show the path
of that bio going through sync_request(); I will do more investigation.

> 
>>
>> We also tried to reset the bi_next inside sync_request_write() before
>> generic_make_request() which also works.
>>
>> The testing was done with 4.4, but we found upstream also left bi_next
>> chained after done in request, thus we post this RFC.
>>
>> Regarding raid1, we haven't found the place on path where the bio was
>> reset... where does it supposed to be?
> 
> I'm not sure what you mean.
> We only reset bios when they are being reused.
> One place is in process_checks() where bio_reset() is called before
> filling in all the details.
> 
> 
> Maybe, in sync_request_write(), before
> 
> 	wbio->bi_rw = WRITE;
> 
> add something like
>   if (wbio->bi_next)
>      printk("bi_next!= NULL i=%d read_disk=%d bi_end_io=%pf\n",
>           i, r1_bio->read_disk, wbio->bi_end_io);
> 
> that might help narrow down what is happening.

Just triggered it again on 4.4; dmesg shows:

[  399.240230] md: super_written gets error=-5
[  399.240286] md: super_written gets error=-5
[  399.240286] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
[  399.240300] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
[  399.240312] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
[  399.240323] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
[  399.240334] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
[  399.240341] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
[  399.240349] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
[  399.240352] bi_next!= NULL i=0 read_disk=0 bi_end_io=end_sync_write [raid1]
[  399.240363] ------------[ cut here ]------------
[  399.240364] kernel BUG at block/blk-core.c:2147!
[  399.240365] invalid opcode: 0000 [#1] SMP 
[  399.240378] Modules linked in: ib_srp scsi_transport_srp raid1 md_mod ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_core vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr ib_netlink iTCO_wdt iTCO_vendor_support dcdbas dell_smm_hwmon acpi_cpufreq x86_pkg_temp_thermal tpm_tis coretemp evdev tpm i2c_i801 crct10dif_pclmul serio_raw crc32_pclmul battery processor acpi_pad button kvm_intel kvm dm_round_robin irqbypass dm_multipath autofs4 sg sd_mod crc32c_intel ahci libahci psmouse libata mlx4_core scsi_mod xhci_pci xhci_hcd mlx_compat fan thermal [last unloaded: scsi_transport_srp]
[  399.240380] CPU: 1 PID: 2052 Comm: md0_raid1 Not tainted 4.4.50-1-pserver+ #26
[  399.240381] Hardware name: Dell Inc. Precision Tower 3620/09WH54, BIOS 1.3.6 05/26/2016
[  399.240381] task: ffff8804031b6200 ti: ffff8800d72b4000 task.ti: ffff8800d72b4000
[  399.240385] RIP: 0010:[<ffffffff813fcd9e>]  [<ffffffff813fcd9e>] generic_make_request+0x29e/0x2a0
[  399.240385] RSP: 0018:ffff8800d72b7d10  EFLAGS: 00010286
[  399.240386] RAX: ffff8804031b6200 RBX: ffff8800d2577e00 RCX: 000000003fffffff
[  399.240387] RDX: ffffffffc0000001 RSI: 0000000000000001 RDI: ffff8800d5e8c1e0
[  399.240387] RBP: ffff8800d72b7d50 R08: 0000000000000000 R09: 000000000000003f
[  399.240388] R10: 0000000000000004 R11: 00000000001db9ac R12: 00000000ffffffff
[  399.240388] R13: ffff8800d2748e00 R14: ffff88040a016400 R15: ffff8800d2748e40
[  399.240389] FS:  0000000000000000(0000) GS:ffff88041dc40000(0000) knlGS:0000000000000000
[  399.240390] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  399.240390] CR2: 00007fb49246a000 CR3: 000000040215c000 CR4: 00000000003406e0
[  399.240391] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  399.240391] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  399.240392] Stack:
[  399.240393]  ffff8800d72b7d18 ffff8800d72b7d30 0000000000000000 0000000000000000
[  399.240394]  ffffffffa079c290 ffff8800d2577e00 0000000000000000 ffff8800d2748e00
[  399.240395]  ffff8800d72b7e58 ffffffffa079e74c ffff88040b661c00 ffff8800d2577e00
[  399.240396] Call Trace:
[  399.240398]  [<ffffffffa079c290>] ? sync_request+0xb20/0xb20 [raid1]
[  399.240400]  [<ffffffffa079e74c>] raid1d+0x65c/0x1060 [raid1]
[  399.240403]  [<ffffffff810b6800>] ? trace_raw_output_itimer_expire+0x80/0x80
[  399.240407]  [<ffffffffa0772040>] md_thread+0x130/0x140 [md_mod]
[  399.240409]  [<ffffffff81094790>] ? wait_woken+0x80/0x80
[  399.240412]  [<ffffffffa0771f10>] ? find_pers+0x70/0x70 [md_mod]
[  399.240414]  [<ffffffff81075066>] kthread+0xd6/0xf0
[  399.240415]  [<ffffffff81074f90>] ? kthread_park+0x50/0x50
[  399.240417]  [<ffffffff8180411f>] ret_from_fork+0x3f/0x70
[  399.240418]  [<ffffffff81074f90>] ? kthread_park+0x50/0x50
[  399.240433] Code: 89 04 24 e9 2d ff ff ff 49 8d bd d8 07 00 00 f0 49 83 ad d8 07 00 00 01 74 05 e9 8b fe ff ff 41 ff 95 e8 07 00 00 e9 7f fe ff ff <0f> 0b 55 48 63 c7 48 89 e5 41 54 53 48 89 f3 48 83 ec 28 48 0b 
[  399.240434] RIP  [<ffffffff813fcd9e>] generic_make_request+0x29e/0x2a0
[  399.240435]  RSP <ffff8800d72b7d10>


Regards,
Michael Wang

> 
> NeilBrown
> 


* Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request
From: NeilBrown @ 2017-04-04  9:37 UTC (permalink / raw)
  To: Michael Wang, linux-kernel@vger.kernel.org, linux-block,
	linux-raid
  Cc: Jens Axboe, Shaohua Li, Jinpu Wang
In-Reply-To: <9be3ca00-d802-bf64-bcdc-1e76608147f0@profitbricks.com>


On Tue, Apr 04 2017, Michael Wang wrote:

> Hi, Neil
>
> On 04/03/2017 11:25 PM, NeilBrown wrote:
>> On Mon, Apr 03 2017, Michael Wang wrote:
>> 
>>> blk_attempt_plug_merge() try to merge bio into request and chain them
>>> by 'bi_next', while after the bio is done inside request, we forgot to
>>> reset the 'bi_next'.
>>>
>>> This lead into BUG while removing all the underlying devices from md-raid1,
>>> the bio once go through:
>>>
>>>   md_do_sync()
>>>     sync_request()
>>>       generic_make_request()
>> 
>> This is a read request from the "first" device.
>> 
>>>         blk_queue_bio()
>>>           blk_attempt_plug_merge()
>>>             CHAINED HERE
>>>
>>> will keep chained and reused by:
>>>
>>>   raid1d()
>>>     sync_request_write()
>>>       generic_make_request()
>> 
>> This is a write request to some other device, isn't it?
>> 
>> If sync_request_write() is using a bio that has already been used, it
>> should call bio_reset() and fill in the details again.
>> However I don't see how that would happen.
>> Can you give specific details on the situation that triggers the bug?
>
> We have storage side mapping lv through scst to server, on server side
> we assemble them into multipath device, and then assemble these dm into
> two raid1.
>
> The test is firstly do mkfs.ext4 on raid1 then start fio on it, on storage
> side we unmap all the lv (could during mkfs or fio), then on server side
> we hit the BUG (reproducible).

So I assume the initial resync is still happening at this point?
And you unmap *all* the lv's so you expect IO to fail?
I can see that the code would behave strangely if you have a
bad-block-list configured (which is the default).
Do you have a bbl?  If you create the array without the bbl, does it
still crash?

>
> The path of bio was confirmed by add tracing, it is reused in sync_request_write()
> with 'bi_next' once chained inside blk_attempt_plug_merge().

I still don't see why it is re-used.
I assume you didn't explicitly ask for a check/repair (i.e. didn't write
to .../md/sync_action at all?).  In that case MD_RECOVERY_REQUESTED is
not set.
So sync_request() sends only one bio to generic_make_request():
   r1_bio->bios[r1_bio->read_disk];

then sync_request_write() *doesn't* send that bio again, but does send
all the others.

So where does it reuse a bio?

>
> We also tried to reset the bi_next inside sync_request_write() before
> generic_make_request() which also works.
>
> The testing was done with 4.4, but we found upstream also left bi_next
> chained after done in request, thus we post this RFC.
>
> Regarding raid1, we haven't found the place on path where the bio was
> reset... where does it supposed to be?

I'm not sure what you mean.
We only reset bios when they are being reused.
One place is in process_checks() where bio_reset() is called before
filling in all the details.


Maybe, in sync_request_write(), before

	wbio->bi_rw = WRITE;

add something like
  if (wbio->bi_next)
     printk("bi_next!= NULL i=%d read_disk=%d bi_end_io=%pf\n",
          i, r1_bio->read_disk, wbio->bi_end_io);

that might help narrow down what is happening.

NeilBrown


* Re: [PATCH 6/8] nowait aio: ext4
From: Christoph Hellwig @ 2017-04-04  8:41 UTC (permalink / raw)
  To: Jan Kara
  Cc: Goldwyn Rodrigues, linux-fsdevel, jack, hch, linux-block,
	linux-btrfs, linux-ext4, linux-xfs, sagi, avi, axboe, linux-api,
	willy, tom.leiming, Goldwyn Rodrigues
In-Reply-To: <20170404075853.GB28522@quack2.suse.cz>

On Tue, Apr 04, 2017 at 09:58:53AM +0200, Jan Kara wrote:
> FS_NOWAIT looks a bit too generic given these are filesystem feature flags.
> Can we call it FS_NOWAIT_IO?

It's way too generic as it's a feature of the particular file_operations
instance.  But once we switch to using RWF_* we can just use the existing
per-op feature checks for those and the per-fs flag should just go away.


* Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request
From: Michael Wang @ 2017-04-04  8:13 UTC (permalink / raw)
  To: NeilBrown, linux-kernel@vger.kernel.org, linux-block, linux-raid
  Cc: Jens Axboe, Shaohua Li, Jinpu Wang
In-Reply-To: <877f31kwti.fsf@notabene.neil.brown.name>

Hi, Neil

On 04/03/2017 11:25 PM, NeilBrown wrote:
> On Mon, Apr 03 2017, Michael Wang wrote:
> 
>> blk_attempt_plug_merge() try to merge bio into request and chain them
>> by 'bi_next', while after the bio is done inside request, we forgot to
>> reset the 'bi_next'.
>>
>> This lead into BUG while removing all the underlying devices from md-raid1,
>> the bio once go through:
>>
>>   md_do_sync()
>>     sync_request()
>>       generic_make_request()
> 
> This is a read request from the "first" device.
> 
>>         blk_queue_bio()
>>           blk_attempt_plug_merge()
>>             CHAINED HERE
>>
>> will keep chained and reused by:
>>
>>   raid1d()
>>     sync_request_write()
>>       generic_make_request()
> 
> This is a write request to some other device, isn't it?
> 
> If sync_request_write() is using a bio that has already been used, it
> should call bio_reset() and fill in the details again.
> However I don't see how that would happen.
> Can you give specific details on the situation that triggers the bug?

On the storage side we map LVs through SCST to the server; on the server side
we assemble them into multipath devices, and then assemble these dm devices into
two raid1 arrays.

The test first runs mkfs.ext4 on the raid1 and then starts fio on it. On the storage
side we unmap all the LVs (possibly during mkfs or fio), and then on the server side
we hit the BUG (reproducible).

The path of the bio was confirmed by adding tracing: it is reused in sync_request_write()
with 'bi_next' still chained from blk_attempt_plug_merge().

We also tried resetting bi_next inside sync_request_write() before
generic_make_request(), which also works.

The testing was done with 4.4, but we found that upstream also leaves bi_next
chained after the bio is done in the request, thus we posted this RFC.

Regarding raid1, we haven't found the place on the path where the bio is
reset... where is it supposed to be?

BTW, fix_sync_read_error() is also invoked and succeeds before the BUG
is triggered.

Regards,
Michael Wang

> 
> Thanks,
> NeilBrown
> 
> 
>>         BUG_ON(bio->bi_next)
>>
>> After reset the 'bi_next' this can no longer happen.
>>
>> Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
>> ---
>>  block/blk-core.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 43b7d06..91223b2 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -2619,8 +2619,10 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
>>                 struct bio *bio = req->bio;
>>                 unsigned bio_bytes = min(bio->bi_iter.bi_size, nr_bytes);
>>
>> -               if (bio_bytes == bio->bi_iter.bi_size)
>> +               if (bio_bytes == bio->bi_iter.bi_size) {
>>                         req->bio = bio->bi_next;
>> +                       bio->bi_next = NULL;
>> +               }
>>
>>                 req_bio_endio(req, bio, bio_bytes, error);
>>
>> -- 
>> 2.5.0
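
The effect of the quoted change can be modelled outside the kernel as below. This is a sketch only: struct bio_model, struct request_model and request_complete() are hypothetical stand-ins for struct bio, struct request and the completion loop, and error handling and end_io callbacks are left out.

```c
#include <assert.h>
#include <stddef.h>

struct bio_model {
	struct bio_model *next;	/* stands in for bi_next */
	unsigned int size;	/* bytes left in this bio */
};

struct request_model {
	struct bio_model *bio;	/* head of the chain of merged bios */
};

/*
 * Complete 'nr_bytes' of the request. A fully completed bio is unlinked
 * from the chain and has its 'next' pointer cleared before it is handed
 * back to its owner. Returns the number of bios fully completed.
 */
static unsigned int request_complete(struct request_model *req,
				     unsigned int nr_bytes)
{
	unsigned int completed = 0;

	while (req->bio && nr_bytes) {
		struct bio_model *bio = req->bio;
		unsigned int bio_bytes =
			bio->size < nr_bytes ? bio->size : nr_bytes;

		bio->size -= bio_bytes;
		nr_bytes -= bio_bytes;
		if (bio->size != 0)
			break;		/* partially done, stays at the head */

		req->bio = bio->next;
		bio->next = NULL;	/* the reset the RFC adds */
		completed++;
	}
	return completed;
}
```

Without the cleared pointer, a completed bio would still point at whatever it was merged with, which is the stale state a later submission check such as BUG_ON(bio->bi_next) would trip over if the owner resubmitted the bio without resetting it.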


* Re: [PATCH 6/8] nowait aio: ext4
From: Jan Kara @ 2017-04-04  7:58 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-fsdevel, jack, hch, linux-block, linux-btrfs, linux-ext4,
	linux-xfs, sagi, avi, axboe, linux-api, willy, tom.leiming,
	Goldwyn Rodrigues
In-Reply-To: <20170403185307.6243-7-rgoldwyn@suse.de>

On Mon 03-04-17 13:53:05, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> Return EAGAIN if any of the following checks fail for direct I/O:
>  + i_rwsem is lockable
>  + Writing beyond end of file (will trigger allocation)
>  + Blocks are not allocated at the write location

Patches seem to be missing your Signed-off-by tag...

> @@ -235,9 +237,21 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  
>  	iocb->private = &overwrite;
>  	/* Check whether we do a DIO overwrite or not */
> -	if (o_direct && ext4_should_dioread_nolock(inode) && !unaligned_aio &&
> -	    ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from)))
> -		overwrite = 1;
> +	if (o_direct && !unaligned_aio) {
> +		struct ext4_map_blocks map;
> +		if (ext4_blocks_mapped(inode, iocb->ki_pos,
> +				      iov_iter_count(from), &map)) {
> +	 		/* To exclude unwritten extents, we need to check
> +			 * m_flags.
> +			 */
> +			if (ext4_should_dioread_nolock(inode) &&
> +			    (map.m_flags & EXT4_MAP_MAPPED))
> +				overwrite = 1;
> +		} else if (iocb->ki_flags & IOCB_NOWAIT) {
> +			ret = -EAGAIN;
> +			goto out;
> +		}
> +	}

Actually, overwriting unwritten extents is relatively complex in ext4 as
well. In particular we need to start a transaction and split out the
written part of the extent. So I don't think we can easily support this
without blocking. As a result I'd keep the condition for IOCB_NOWAIT the
same as for overwrite IO.
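
The rule suggested here can be written down as a tiny decision function. This is a model of the suggestion only, not ext4 code: enum extent_state and both functions are hypothetical, and the real check would also have to consider alignment and the range covered by the write.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical summary of what is mapped under the write range. */
enum extent_state { EXTENT_HOLE, EXTENT_UNWRITTEN, EXTENT_WRITTEN };

/* True only for a plain overwrite of blocks that are already written. */
static bool is_overwrite(enum extent_state state)
{
	return state == EXTENT_WRITTEN;
}

/*
 * Returns 0 if the direct write may proceed, or -EAGAIN if a nowait
 * write would need allocation or extent conversion and so might block.
 */
static int nowait_dio_check(enum extent_state state, bool nowait)
{
	if (is_overwrite(state))
		return 0;
	return nowait ? -EAGAIN : 0;
}
```

Under this rule, unwritten extents are treated like holes for nowait purposes: only a true overwrite is allowed through, and everything else is bounced back to the caller with -EAGAIN.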

> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -117,7 +117,7 @@ static struct file_system_type ext2_fs_type = {
>  	.name		= "ext2",
>  	.mount		= ext4_mount,
>  	.kill_sb	= kill_block_super,
> -	.fs_flags	= FS_REQUIRES_DEV,
> +	.fs_flags	= FS_REQUIRES_DEV | FS_NOWAIT,

FS_NOWAIT looks a bit too generic given these are filesystem feature flags.
Can we call it FS_NOWAIT_IO?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH rfc 0/6] Automatic affinity settings for nvme over rdma
From: Max Gurtovoy @ 2017-04-04  7:51 UTC (permalink / raw)
  To: Sagi Grimberg, linux-rdma, linux-nvme, linux-block
  Cc: netdev, Saeed Mahameed, Or Gerlitz, Christoph Hellwig
In-Reply-To: <1491140492-25703-1-git-send-email-sagi@grimberg.me>

>
> Any feedback is welcome.

Hi Sagi,

The patchset looks good, and of course we can add support for more
drivers in the future.
Have you run any performance tests with the nvmf initiator?


>
> Sagi Grimberg (6):
>   mlx5: convert to generic pci_alloc_irq_vectors
>   mlx5: move affinity hints assignments to generic code
>   RDMA/core: expose affinity mappings per completion vector
>   mlx5: support ->get_vector_affinity
>   block: Add rdma affinity based queue mapping helper
>   nvme-rdma: use intelligent affinity based queue mappings
>
>  block/Kconfig                                      |   5 +
>  block/Makefile                                     |   1 +
>  block/blk-mq-rdma.c                                |  56 +++++++++++
>  drivers/infiniband/hw/mlx5/main.c                  |  10 ++
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   5 +-
>  drivers/net/ethernet/mellanox/mlx5/core/eq.c       |   9 +-
>  drivers/net/ethernet/mellanox/mlx5/core/eswitch.c  |   2 +-
>  drivers/net/ethernet/mellanox/mlx5/core/health.c   |   2 +-
>  drivers/net/ethernet/mellanox/mlx5/core/main.c     | 106 +++------------------
>  .../net/ethernet/mellanox/mlx5/core/mlx5_core.h    |   1 -
>  drivers/nvme/host/rdma.c                           |  13 +++
>  include/linux/blk-mq-rdma.h                        |  10 ++
>  include/linux/mlx5/driver.h                        |   2 -
>  include/rdma/ib_verbs.h                            |  24 +++++
>  14 files changed, 138 insertions(+), 108 deletions(-)
>  create mode 100644 block/blk-mq-rdma.c
>  create mode 100644 include/linux/blk-mq-rdma.h
>

^ permalink raw reply

* Re: [PATCH rfc 5/6] block: Add rdma affinity based queue mapping helper
From: Max Gurtovoy @ 2017-04-04  7:46 UTC (permalink / raw)
  To: Sagi Grimberg, linux-rdma, linux-nvme, linux-block
  Cc: netdev, Saeed Mahameed, Or Gerlitz, Christoph Hellwig
In-Reply-To: <1491140492-25703-6-git-send-email-sagi@grimberg.me>


> diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
> new file mode 100644
> index 000000000000..d402f7c93528
> --- /dev/null
> +++ b/block/blk-mq-rdma.c
> @@ -0,0 +1,56 @@
> +/*
> + * Copyright (c) 2017 Sagi Grimberg.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + */

Shouldn't you include <linux/kobject.h> and <linux/blkdev.h>, like in
commit 8ec2ef2b66ea2f that fixes blk-mq-pci.c?

> +#include <linux/blk-mq.h>
> +#include <linux/blk-mq-rdma.h>
> +#include <rdma/ib_verbs.h>
> +#include <linux/module.h>
> +#include "blk-mq.h"

Is this include needed?


> +
> +/**
> + * blk_mq_rdma_map_queues - provide a default queue mapping for rdma device
> + * @set:	tagset to provide the mapping for
> + * @dev:	rdma device associated with @set.
> + * @first_vec:	first interrupt vectors to use for queues (usually 0)
> + *
> + * This function assumes the rdma device @dev has at least as many available
> + * interrupt vetors as @set has queues.  It will then query it's affinity mask
> + * and built queue mapping that maps a queue to the CPUs that have irq affinity
> + * for the corresponding vector.
> + *
> + * In case either the driver passed a @dev with less vectors than
> + * @set->nr_hw_queues, or @dev does not provide an affinity mask for a
> + * vector, we fallback to the naive mapping.
> + */
> +int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
> +		struct ib_device *dev, int first_vec)
> +{
> +	const struct cpumask *mask;
> +	unsigned int queue, cpu;
> +
> +	if (set->nr_hw_queues > dev->num_comp_vectors)
> +		goto fallback;
> +
> +	for (queue = 0; queue < set->nr_hw_queues; queue++) {
> +		mask = ib_get_vector_affinity(dev, first_vec + queue);
> +		if (!mask)
> +			goto fallback;

Christoph,
we can also use the fallback in blk-mq-pci.c in case
pci_irq_get_affinity fails, right?

> +
> +		for_each_cpu(cpu, mask)
> +			set->mq_map[cpu] = queue;
> +	}
> +
> +	return 0;
> +fallback:
> +	return blk_mq_map_queues(set);
> +}
> +EXPORT_SYMBOL_GPL(blk_mq_rdma_map_queues);
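
For readers following along, the map-or-fall-back pattern in the function above can be modelled in userspace roughly as below. This is only a sketch under loud assumptions: the device struct, the bitmask representation and the round-robin "naive" mapping are invented for the example and are not how blk_mq_map_queues() or ib_get_vector_affinity() are implemented. It also assumes nr_queues is at least 1.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NR_CPUS 8

/* Hypothetical device model: one CPU bitmask per completion vector. */
struct sketch_dev {
	unsigned int num_vectors;
	const unsigned long *affinity; /* NULL if the driver reports none */
};

/* Stand-in for the naive mapping: spread CPUs over queues round-robin. */
static void sketch_map_naive(unsigned int *mq_map, unsigned int nr_queues)
{
	for (unsigned int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
		mq_map[cpu] = cpu % nr_queues;
}

/* Returns 1 if the affinity-based mapping was used, 0 on fallback. */
static int sketch_map_queues(unsigned int *mq_map, unsigned int nr_queues,
			     const struct sketch_dev *dev)
{
	if (nr_queues > dev->num_vectors || !dev->affinity)
		goto fallback;

	for (unsigned int q = 0; q < nr_queues; q++) {
		unsigned long mask = dev->affinity[q];

		/* An empty mask plays the role of a missing affinity mask. */
		if (!mask)
			goto fallback;
		for (unsigned int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
			if (mask & (1UL << cpu))
				mq_map[cpu] = q;
	}
	return 1;
fallback:
	/* Rewrites every entry, so a partially filled map is harmless. */
	sketch_map_naive(mq_map, nr_queues);
	return 0;
}
```

The point of the single fallback label is that any failure, whether too few vectors or a missing mask halfway through the loop, ends in one complete and consistent mapping.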

Otherwise, looks good.

Reviewed-by: Max Gurtovoy <maxg@mellanox.com>

^ permalink raw reply

* Re: [PATCH] loop: Add PF_LESS_THROTTLE to block/loop device thread.
From: Christoph Hellwig @ 2017-04-04  7:10 UTC (permalink / raw)
  To: NeilBrown; +Cc: Jens Axboe, linux-block, linux-mm, LKML
In-Reply-To: <871staffus.fsf@notabene.neil.brown.name>

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

But if you actually care about performance in any way I'd suggest
to use the loop device in direct I/O mode..

^ permalink raw reply

* Re: [PATCH 6/7] T10: Move opencoded contants to common header
From: Christoph Hellwig @ 2017-04-04  7:09 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-kernel, linux-block, martin.petersen
In-Reply-To: <1491204212-9952-7-git-send-email-dmonakhov@openvz.org>

> -				if ((src->ref_tag == 0xffffffff) ||
> -				    (src->app_tag == 0xffff)) {
> +				if ((src->ref_tag == T10_REF_ESCAPE) ||
> +				    (src->app_tag == T10_APP_ESCAPE)) {

Please remove the inner parentheses while you're at it (also later in the
patch).

> index 9fba9dd..c96845c 100644
> --- a/include/linux/t10-pi.h
> +++ b/include/linux/t10-pi.h
> @@ -24,6 +24,9 @@ enum t10_dif_type {
>  	T10_PI_TYPE3_PROTECTION = 0x3,
>  };
>  
> +static const __be16 T10_APP_ESCAPE = (__force __be16) 0xffff;
> +static const __be32 T10_REF_ESCAPE = (__force __be32) 0xffffffff;

I'd do this as:

#define T10_APP_ESCAPE	cpu_to_be16(0xffff)
#define T10_REF_ESCAPE	cpu_to_be32(0xffffffff)

This avoids relying on the compiler to merge constants, and also gets
the endianness annotation right instead of force escaping it.
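
As a rough userspace illustration of the macro approach (a sketch, not kernel code): htons() and htonl() stand in for cpu_to_be16() and cpu_to_be32(), and the struct and helper names are hypothetical. Note that the macros carry no trailing semicolon, so they can be used inside an expression such as the comparison below.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-ins for cpu_to_be16()/cpu_to_be32(). */
#define SKETCH_APP_ESCAPE	htons(0xffff)
#define SKETCH_REF_ESCAPE	htonl(0xffffffff)

/* Hypothetical tuple whose tags are stored in big-endian order. */
struct sketch_pi_tuple {
	uint16_t guard_tag;
	uint16_t app_tag;
	uint32_t ref_tag;
};

/* True if either tag holds the escape value, i.e. checking is skipped. */
static bool sketch_pi_escaped(const struct sketch_pi_tuple *pi)
{
	return pi->ref_tag == SKETCH_REF_ESCAPE ||
	       pi->app_tag == SKETCH_APP_ESCAPE;
}
```

All-ones values look the same in either byte order, so the conversion changes nothing at run time here; in the kernel its value is that sparse sees the correct __be16/__be32 types without a __force cast.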

^ permalink raw reply

* Re: [PATCH 5/7] bio-integrity: add bio_integrity_setup helper
From: Christoph Hellwig @ 2017-04-04  7:06 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-kernel, linux-block, martin.petersen
In-Reply-To: <1491204212-9952-6-git-send-email-dmonakhov@openvz.org>

On Mon, Apr 03, 2017 at 11:23:30AM +0400, Dmitry Monakhov wrote:
> Currently all integrity prep hooks are open-coded, and if prepare fails
> we ignore it's code and fail bio with EIO. Let's return real error to
> upper layer, so later caller may react accordingly. For example retry in
> case of ENOMEM.

bio_integrity_enabled and bio_integrity_prep seem to be unused outside
of bio_integrity_setup, so they can be removed / folded into
bio_integrity_setup.  Which at this point might just keep the
bio_integrity_prep name to fit into the block layer traditions :)

Also please update Documentation/block/data-integrity.txt for your
changes and add a kerneldoc comment for the new function.

^ permalink raw reply

* Re: [PATCH 4/7] bio-integrity: fix interface for bio_integrity_trim
From: Christoph Hellwig @ 2017-04-04  7:03 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-kernel, linux-block, martin.petersen
In-Reply-To: <1491204212-9952-5-git-send-email-dmonakhov@openvz.org>

On Mon, Apr 03, 2017 at 11:23:29AM +0400, Dmitry Monakhov wrote:
> bio_integrity_trim inherent it's interface from bio_trim and accept
> offset and size, but this API is error prone because data offset
> must always be insync with bio's data offset. That is why we have
> integrity update hook in bio_advance()
> 
> So only meaningful offset is 0. Let's just remove it completely.

I think we can get rid of size as well and derive it from the bio,
can't we?

^ permalink raw reply

* Re: [PATCH 3/7] bio-integrity: bio_trim should truncate integrity vector accordingly
From: Christoph Hellwig @ 2017-04-04  7:01 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-kernel, linux-block, martin.petersen
In-Reply-To: <1491204212-9952-4-git-send-email-dmonakhov@openvz.org>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH 2/7] bio-integrity: save original iterator for verify stage
From: Christoph Hellwig @ 2017-04-04  7:01 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-kernel, linux-block, martin.petersen
In-Reply-To: <1491204212-9952-3-git-send-email-dmonakhov@openvz.org>

This is a pretty big increase in the bio_integrity_payload size,
but I guess we can't get around it..

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH 1/7] bio-integrity: Do not allocate integrity context for bio w/o data
From: Christoph Hellwig @ 2017-04-04  7:00 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-kernel, linux-block, martin.petersen
In-Reply-To: <1491204212-9952-2-git-send-email-dmonakhov@openvz.org>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH] blk-mq: Remove blk_mq_queue_data.list
From: Christoph Hellwig @ 2017-04-04  6:59 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Jens Axboe, linux-block@vger.kernel.org
In-Reply-To: <CY1PR0401MB15363601FA034CE8627DA41881080@CY1PR0401MB1536.namprd04.prod.outlook.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH 1/2] scsi: convert unrecovered read error to -EILSEQ
From: Christoph Hellwig @ 2017-04-04  6:57 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-block, linux-scsi
In-Reply-To: <1491221029-1609-1-git-send-email-dmonakhov@openvz.org>

I'm planning to introduce a new block-layer specific status code ASAP,
so I'd prefer not to add new errno special cases.

I'll port your patches to the new code and will send them out with
my series in a few days, though.

^ permalink raw reply

* Re: [PATCH 7/8] nowait aio: xfs
From: Christoph Hellwig @ 2017-04-04  6:52 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-fsdevel, jack, hch, linux-block, linux-btrfs, linux-ext4,
	linux-xfs, sagi, avi, axboe, linux-api, willy, tom.leiming,
	Goldwyn Rodrigues
In-Reply-To: <20170403185307.6243-8-rgoldwyn@suse.de>

> +	if (unaligned_io) {
> +		/* If we are going to wait for other DIO to finish, bail */
> +		if ((iocb->ki_flags & IOCB_NOWAIT) &&
> +		     atomic_read(&inode->i_dio_count))
> +			return -EAGAIN;
>  		inode_dio_wait(inode);

This checks i_dio_count twice in the nowait case. I think it should be:

	if (iocb->ki_flags & IOCB_NOWAIT) {
		if (atomic_read(&inode->i_dio_count))
			return -EAGAIN;
	} else {
		inode_dio_wait(inode);
	}
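
The same control flow as a self-contained userspace model, in case it helps to see it run. Everything here is a stand-in: the struct, the flag value and the fake wait function are assumptions for the example, and a real inode_dio_wait() sleeps instead of resetting the counter.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

#define SKETCH_NOWAIT 0x1  /* hypothetical stand-in for IOCB_NOWAIT */

struct sketch_inode {
	atomic_int dio_count;  /* in-flight direct I/O requests */
	int waited;            /* how many times we "waited" */
};

/* Stand-in for inode_dio_wait(): pretend all in-flight DIO finished. */
static void sketch_dio_wait(struct sketch_inode *inode)
{
	inode->waited++;
	atomic_store(&inode->dio_count, 0);
}

/* Either bail out with -EAGAIN or wait, never both. */
static int sketch_unaligned_prepare(struct sketch_inode *inode,
				    unsigned int flags)
{
	if (flags & SKETCH_NOWAIT) {
		if (atomic_load(&inode->dio_count))
			return -EAGAIN;
	} else {
		sketch_dio_wait(inode);
	}
	return 0;
}
```

In the nowait branch the counter is read exactly once and the wait function is never entered.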

>  	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) {
>  		if (flags & IOMAP_DIRECT) {
> +			/* A reflinked inode will result in CoW alloc */
> +			if (flags & IOMAP_NOWAIT) {
> +				error = -EAGAIN;
> +				goto out_unlock;
> +			}

This is a bit pessimistic: even if the inode has some shared extents,
we could still write into the unshared ones.  For now I think this
pessimistic check is fine, but the comment should be corrected.

^ permalink raw reply

* Re: [PATCH 5/8] nowait aio: return on congested block device
From: Christoph Hellwig @ 2017-04-04  6:49 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-fsdevel, jack, hch, linux-block, linux-btrfs, linux-ext4,
	linux-xfs, sagi, avi, axboe, linux-api, willy, tom.leiming,
	Goldwyn Rodrigues
In-Reply-To: <20170403185307.6243-6-rgoldwyn@suse.de>

Please make this a REQ_* flag so that it can be passed in the bio,
the request and as an argument to the get_request functions instead
of testing for a bio.

^ permalink raw reply

* Re: [PATCH 1/8] nowait aio: Introduce IOCB_RW_FLAG_NOWAIT
From: Christoph Hellwig @ 2017-04-04  6:48 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-fsdevel, jack, hch, linux-block, linux-btrfs, linux-ext4,
	linux-xfs, sagi, avi, axboe, linux-api, willy, tom.leiming,
	Goldwyn Rodrigues
In-Reply-To: <20170403185307.6243-2-rgoldwyn@suse.de>

On Mon, Apr 03, 2017 at 01:53:00PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> This flag informs kernel to bail out if an AIO request will block
> for reasons such as file allocations, or a writeback triggered,
> or would block while allocating requests while performing
> direct I/O.
> 
> Unfortunately, aio_flags is not checked for validity, which would
> break existing applications which have it set to anything besides zero
> or IOCB_FLAG_RESFD. So, we are using aio_reserved1 and renaming it
> to aio_rw_flags.
> 
> IOCB_RW_FLAG_NOWAIT is translated to IOCB_NOWAIT for
> iocb->ki_flags.

Please make this a flag in the RWF_* namespace, and as a preparation
support the existing RWF_* flags for aio.

^ permalink raw reply
