* Re: [PATCH 1/8] nowait aio: Introduce IOCB_FLAG_NOWAIT
From: Christoph Hellwig @ 2017-03-01 15:36 UTC (permalink / raw)
To: Goldwyn Rodrigues
Cc: jack, hch, linux-fsdevel, linux-block, linux-btrfs, linux-ext4,
linux-xfs, Goldwyn Rodrigues
In-Reply-To: <20170228233610.25456-2-rgoldwyn@suse.de>
On Tue, Feb 28, 2017 at 05:36:03PM -0600, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>
> This flag informs kernel to bail out if an AIO request will block
> for reasons such as file allocations, or a writeback triggered,
> or would block while allocating requests while performing
> direct I/O.
>
> IOCB_FLAG_NOWAIT is translated to IOCB_NOWAIT for
> iocb->ki_flags.
Given that we aren't validating aio_flags in older kernels we can't
just add this flag as it will be a no-op in older kernels. I think
we will have to add IOCB_CMD_PREADV2/IOCB_CMD_WRITEV2 opcodes that
properly validate all reserved fields or flags first.
Once we do that I'd really prefer to use the same flags values
as preadv2/pwritev2 so that we'll only need one set of flags over
sync/async read/write ops.
^ permalink raw reply
* Re: [PATCH 2/8] nowait aio: Return if cannot get hold of i_rwsem
From: Christoph Hellwig @ 2017-03-01 15:37 UTC (permalink / raw)
To: Goldwyn Rodrigues
Cc: jack, hch, linux-fsdevel, linux-block, linux-btrfs, linux-ext4,
linux-xfs, Goldwyn Rodrigues
In-Reply-To: <20170228233610.25456-3-rgoldwyn@suse.de>
Looks fine,
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply
* Re: [PATCH 0/13 v2] block: Fix block device shutdown related races
From: Jan Kara @ 2017-03-01 15:37 UTC (permalink / raw)
To: Tejun Heo
Cc: Jan Kara, Jens Axboe, linux-block, Christoph Hellwig,
Dan Williams, Thiago Jung Bauermann, Lekshmi Pillai, NeilBrown,
Omar Sandoval
In-Reply-To: <20170228165441.GH15287@htj.duckdns.org>
Hello,
On Tue 28-02-17 11:54:41, Tejun Heo wrote:
> It generally looks good to me.
Thanks for review!
> The only worry I have is around wb_shutdown() synchronization and if that
> is actually an issue it shouldn't be too difficult to fix.
Yeah, I'll have a look at that.
> The other thing which came to mind is that the congested->__bdi sever
> semantics. IIRC, that one was also to support the "bdi must go away now"
> behavior. As bdi is refcnted now, I think we can probably just let cong
> hold onto the bdi rather than try to sever the ref there.
So currently I get away with __bdi not being a proper refcounted reference.
If we were to remove the clearing of __bdi, we'd have to make it into
refcounted reference which is sligthly ugly as we need to special-case
embedded bdi_writeback_congested structures. Maybe it will be a worthwhile
cleanup but for now I left it alone...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply
* Re: [PATCH 07/12] xfs: implement failure-atomic writes
From: Christoph Hellwig @ 2017-03-01 15:17 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-fsdevel, linux-xfs, linux-block
In-Reply-To: <20170228230940.GB26319@birch.djwong.org>
On Tue, Feb 28, 2017 at 03:09:40PM -0800, Darrick J. Wong wrote:
> By the way, the copy on write code remembers the extents it has
> allocated for CoW staging in the refcount btree so that it can free them
> after a crash, which means that O_ATOMIC requires reflink to be enabled.
Yeah.
> There doesn't seem to be any explicit checking that reflink is even
> enabled, which will probably just lead to weird crashes on a pre-reflink
> xfs.
True. I had this earlier when I hat basic O_ATOMIC validity checking,
but that was dropped from the series I posted.
>
> FWIW I didn't see any checking anywhere (vfs or xfs) that the filesystem
> can actually support O_ATOMIC. If the FS doesn't support atomic writes,
> shouldn't the kernel send EINVAL or something back to userspace?
Older kernels can't check it, so having new ones check it creates even
more of a mess.
I'm still not feeling very well about O_ATOMIC - either we need an
open2 that checks for unknown flags, or I need to change this to
a per-op flag - RWF_ATOMIC for write (pwritev2 actually), and MAP_ATOMIC
for mmap. But given that pwritev2 isn't really supported in common
userland yet that might be rather painful.
> At the start of xfs_reflink.c is a long block comment describing how the
> copy on write mechanism works. Since O_ATOMIC is a variant on CoW (it's
> basically CoW with remapping deferred until fsync), please update the
> comment so that the comments capture the details of how atomic writes
> work.
>
> (IOWs: Dave asked me to leave the big comment, so I'm going to try to
> keep it fairly up to date.)
I'll add some information to it.
> I suppose it goes without saying that userspace will have to coordinate
> its O_ATOMIC writes to the file.
It does - but if you have multiple writers to a file they really need
to be coordinated anyway. If you have threads whose updates race
you'd need something like
open(O_TMPFILE)
clone file (or range) into tempfile
update tempfile
clone region you want atomically inserted back into the original file.
We can actually do that with existing primitives, but it's a bit more
heavyweight. We could opimize this a bit by checking if an extent
already points to the same physical blocks before replacing it in
clone_file_range.
> > + if (file->f_flags & O_ATOMIC)
> > + printk_ratelimited("O_ATOMIC!\n");
>
> Per above,
>
> if (file->f_flags & O_ATOMIC) {
> if (!xfs_sb_version_hasreflink(...))
> return -EPROTONOSUPPORT;
Yeah.
> printk_ratelimited("EXPERIMENTAL atomic writes feature in use!\n");
And that should just go away - it was a local debug aid :)
^ permalink raw reply
* Re: [PATCH 0/13 v2] block: Fix block device shutdown related races
From: Jan Kara @ 2017-03-01 15:11 UTC (permalink / raw)
To: Omar Sandoval
Cc: Jan Kara, Jens Axboe, linux-block, Christoph Hellwig,
Dan Williams, Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo,
NeilBrown
In-Reply-To: <20170301072653.GB9377@vader>
On Tue 28-02-17 23:26:53, Omar Sandoval wrote:
> On Tue, Feb 28, 2017 at 11:25:28PM -0800, Omar Sandoval wrote:
> > On Wed, Feb 22, 2017 at 11:24:25AM +0100, Jan Kara wrote:
> > > On Tue 21-02-17 10:19:28, Jens Axboe wrote:
> > > > On 02/21/2017 10:09 AM, Jan Kara wrote:
> > > > > Hello,
> > > > >
> > > > > this is a second revision of the patch set to fix several different races and
> > > > > issues I've found when testing device shutdown and reuse. The first three
> > > > > patches are fixes to problems in my previous series fixing BDI lifetime issues.
> > > > > Patch 4 fixes issues with reuse of BDI name with scsi devices. With it I cannot
> > > > > reproduce the BDI name reuse issues using Omar's stress test using scsi_debug
> > > > > so it can be used as a replacement of Dan's patches. Patches 5-11 fix oops that
> > > > > is triggered by __blkdev_put() calling inode_detach_wb() too early (the problem
> > > > > reported by Thiago). Patches 12 and 13 fix oops due to a bug in gendisk code
> > > > > where get_gendisk() can return already freed gendisk structure (again triggered
> > > > > by Omar's stress test).
> > > > >
> > > > > People, please have a look at patches. They are mostly simple however the
> > > > > interactions are rather complex so I may have missed something. Also I'm
> > > > > happy for any additional testing these patches can get - I've stressed them
> > > > > with Omar's script, tested memcg writeback, tested static (not udev managed)
> > > > > device inodes.
> > > > >
> > > > > Jens, I think at least patches 1-3 should go in together with my fixes you
> > > > > already have in your tree (or shortly after them). It is up to you whether
> > > > > you decide to delay my first fixes or pick these up quickly. Patch 4 is
> > > > > (IMHO a cleaner) replacement of Dan's patches so consider whether you want
> > > > > to use it instead of those patches.
> > > >
> > > > I have applied 1-3 to my for-linus branch, which will go in after
> > > > the initial pull request has been pulled by Linus. Consider fixing up
> > > > #4 so it applies, I like it.
> > >
> > > OK, attached is patch 4 rebased on top of Linus' tree from today which
> > > already has linux-block changes pulled in. I've left put_disk_devt() in
> > > blk_cleanup_queue() to maintain the logic in the original patch (now commit
> > > 0dba1314d4f8) that request_queue and gendisk each hold one devt reference.
> > > The bdi_unregister() call that is moved to del_gendisk() by this patch is
> > > now protected by the gendisk reference instead of the request_queue one
> > > so it still maintains the property that devt reference protects bdi
> > > registration-unregistration lifetime (as much as that is not needed anymore
> > > after this patch).
> > >
> > > I have also updated the comment in the code and the changelog - they were
> > > somewhat stale after changes to the whole series Tejun suggested.
> > >
> > > Honza
> >
> > Hey, Jan, I just tested this out when I was seeing similar crashes with
> > sr instead of sd, and this fixed it.
> >
> > Tested-by: Omar Sandoval <osandov@fb.com>
>
> Just realized it wasn't clear, I'm talking about patch 4 specifically.
Thanks for confirmation!
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply
* Re: [PATCH 05/12] fs: add a F_IOINFO fcntl
From: Christoph Hellwig @ 2017-03-01 15:11 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-fsdevel, linux-xfs, linux-block
In-Reply-To: <20170228165139.GI5297@birch.djwong.org>
On Tue, Feb 28, 2017 at 08:51:39AM -0800, Darrick J. Wong wrote:
> Hm... is fio_alignment is specified in units of bytes?
Yes.
> If so, then
> shouldn't this be a __u32 so that we can handle some weird future device
> that wants, say, 1MB alignment for its atomic IO?
That would be pretty useless. Anything bigger than sector / block
size would not really be usable for typical applications.
> Though, now that I look at the XFS ioinfo patch, I guess fio_alignment
> is set only for O_DIRECT files?
Yes.
> So it's really the required alignment
> for directio operations.
For buffered I/O we can write at byte granularity and still use the
atomic commits, but for direct I/O we can only COW at block size
granularity.
^ permalink raw reply
* Re: [RFC] failure atomic writes for file systems and block devices
From: Christoph Hellwig @ 2017-03-01 15:09 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-fsdevel, linux-xfs, linux-block
In-Reply-To: <20170228232204.GB5269@birch.djwong.org>
On Tue, Feb 28, 2017 at 03:22:04PM -0800, Darrick J. Wong wrote:
> (Assuming there's no syncv involved here...?)
No. While I think we could implement it for XFS similar how we roll
transactions over multiple inodes for a few transactions, the use case
is much more limited, and the potential pitfalls are much bigger.
> > have to check the F_IOINFO fcntl before, which is a bit of a killer.
> > Because of that I've also not implemented any other validity checks
> > yet, as they might make thing even worse when an open on a not supported
> > file system or device fails, but not on an old kernel. Maybe we need
> > a new open version that checks arguments properly first?
>
> Does fcntl(F_SETFL...) suffer from this?
Yes.
^ permalink raw reply
* Re: [RFC] failure atomic writes for file systems and block devices
From: Christoph Hellwig @ 2017-03-01 15:07 UTC (permalink / raw)
To: Amir Goldstein
Cc: Christoph Hellwig, linux-fsdevel, linux-xfs, linux-block,
linux-api, Michael Kerrisk (man-pages)
In-Reply-To: <CAOQ4uxjYwP+XGdDu-HiKmiAVEFydqYj6xfQbaKULB9WWUSuX5g@mail.gmail.com>
On Wed, Mar 01, 2017 at 01:21:41PM +0200, Amir Goldstein wrote:
> [CC += linux-api@vger.kernel.org] for that question and for the new API
We'll need to iterate over the API a few more times first I think..
^ permalink raw reply
* Re: [RFC] failure atomic writes for file systems and block devices
From: Christoph Hellwig @ 2017-03-01 15:07 UTC (permalink / raw)
To: Chris Mason; +Cc: Christoph Hellwig, linux-fsdevel, linux-xfs, linux-block
In-Reply-To: <e4bc2911-99ee-0049-a11d-3944c1770cff@fb.com>
On Tue, Feb 28, 2017 at 03:48:16PM -0500, Chris Mason wrote:
> One thing that isn't clear to me is how we're dealing with boundary bio
> mappings, which will get submitted by submit_page_section()
>
> sdio->boundary = buffer_boundary(map_bh);
The old dio code is not supported at all by this code at the moment.
We'll either need the new block dev direct I/O code on block
devices (and limit to BIO_MAX_PAGES, this is a bug in this patchset
if people ever have devices with > 1MB atomic write support. And thanks
to NVMe the failure case is silent, sigh..), or we need file system support
for out of place writes.
>
> In btrfs I'd just chain things together and do the extent pointer swap
> afterwards, but I didn't follow the XFS code well enough to see how its
> handled there. But either way it feels like an error prone surprise
> waiting for later, and one gap we really want to get right in the FS
> support is O_ATOMIC across a fragmented extent.
>
> If I'm reading the XFS patches right, the code always cows for atomic.
It doesn't really COW - it uses the COW infrastructure to write out of
place and then commit it into the file later. Because of that we don't
really care about things like boundary blocks (which XFS never used in
that form anyway) - data is written first, the cache is flushed and then
we swap around the extent pointers.
> Are
> you planning on adding an optimization to use atomic support in the device
> to skip COW when possible?
We could do that fairly easily for files that have a contiguous mapping
for the atomic write I/O. But at this point I have a lot more trust in
the fs code than the devices, especially due to the silent failure mode.
^ permalink raw reply
* RE: connect cmd error for nvme-rdma with eventual kernel crash
From: Parav Pandit @ 2017-03-01 15:06 UTC (permalink / raw)
To: Omar Sandoval
Cc: Jens Axboe, axboe@fb.com, linux-block@vger.kernel.org,
Sagi Grimberg, Christoph Hellwig, Max Gurtovoy
In-Reply-To: <20170301055045.GA27251@vader>
Thanks Omar.
> -----Original Message-----
> From: Omar Sandoval [mailto:osandov@osandov.com]
> Sent: Tuesday, February 28, 2017 11:51 PM
> To: Parav Pandit <parav@mellanox.com>
> Cc: Jens Axboe <axboe@kernel.dk>; axboe@fb.com; linux-
> block@vger.kernel.org; Sagi Grimberg <sagi@grimberg.me>; Christoph
> Hellwig <hch@lst.de>; Max Gurtovoy <maxg@mellanox.com>
> Subject: Re: connect cmd error for nvme-rdma with eventual kernel crash
>=20
> On Wed, Mar 01, 2017 at 04:55:23AM +0000, Parav Pandit wrote:
> > Hi Jens,
> >
> > > -----Original Message-----
> > > From: Jens Axboe [mailto:axboe@kernel.dk]
> > > Subject: Re: connect cmd error for nvme-rdma with eventual kernel
> > > crash
> > >
> > > > On Feb 28, 2017, at 5:57 PM, Parav Pandit <parav@mellanox.com>
> wrote:
> > > >
> > > > Hi Jens,
> > > >
> > > > With your commit 2af8cbe30531eca73c8f3ba277f155fc0020b01a in
> > > > linux-block git tree, There are two requests tables. Static and
> > > > dynamic of
> > > same size.
> > > > However function blk_mq_tag_to_rq() always try to get the tag from
> > > > the
> > > dynamic table which doesn't seem to be always initialized.
> > > >
> > > > I am running nvme-rdma initiator and it fails to find the request
> > > > for the
> > > given tag when command completes.
> > > > Command triggers error recovery with "tag not found" error.
> > > > Eventually kernel is crashing in blk_mq_queue_tag_busy_iter() with
> > > > NULL
> > > pointer. Seems to be additional bug in error recovery.
> > > >
> > > > To debug, I added initializing dynamic tags as well.
> > > >
> > > > blk_mq_alloc_rqs() {
> > > > tags->static_rqs[i] =3D rq;
> > > > + tags->rqs[i] =3D rq;
> > > >
> > > > This appears to resolve the issue. But that's not the fix.
> > > > It appears to me that nvme stack is broken in certain conditions
> > > > with recent
> > > static and dynamic rq tables change.
> > >
> > > Can you try my for-linus branch?
> >
> > I tried for-linus branch and it works.
> >
> > Seems like ac6e0c2d633ab0411810fe6b15a40808309041db fixes it.
> > __blk_mq_alloc_request()
> > data->hctx->tags->rqs[rq->tag] =3D rq;
> >
> > Commit says no functional difference but it is actually fixing this iss=
ue.
> >
> > Parav
>=20
> The fix is actually this one:
>=20
> commit f867f4804d55adeef42c68c89edad49cdf3058f7
> Author: Omar Sandoval <osandov@fb.com>
> Date: Mon Feb 27 10:28:27 2017 -0800
>=20
> blk-mq: make blk_mq_alloc_request_hctx() allocate a scheduler request
>=20
> blk_mq_alloc_request_hctx() allocates a driver request directly, unli=
ke
> its blk_mq_alloc_request() counterpart. It also crashes because it
> doesn't update the tags->rqs map.
>=20
> Fix it by making it allocate a scheduler request.
>=20
> Reported-by: Sagi Grimberg <sagi@grimberg.me>
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> Signed-off-by: Jens Axboe <axboe@fb.com>
> Tested-by: Sagi Grimberg <sagi@grimberg.me>
>=20
> The commit you pointed out would have also fixed it, but after this chang=
e
> it's a no-op.
^ permalink raw reply
* Re: [PATCH 15/16] mmc: queue: issue requests in massive parallel
From: Bartlomiej Zolnierkiewicz @ 2017-03-01 12:02 UTC (permalink / raw)
To: Linus Walleij
Cc: linux-mmc, Ulf Hansson, Adrian Hunter, Paolo Valente,
Chunyan Zhang, Baolin Wang, linux-block, Jens Axboe,
Christoph Hellwig, Arnd Bergmann
In-Reply-To: <20170209153403.9730-16-linus.walleij@linaro.org>
On Thursday, February 09, 2017 04:34:02 PM Linus Walleij wrote:
> This makes a crucial change to the issueing mechanism for the
> MMC requests:
>
> Before commit "mmc: core: move the asynchronous post-processing"
> some parallelism on the read/write requests was achieved by
> speculatively postprocessing a request and re-preprocess and
> re-issue the request if something went wrong, which we discover
> later when checking for an error.
>
> This is kind of ugly. Instead we need a mechanism like here:
>
> We issue requests, and when they come back from the hardware,
> we know if they finished successfully or not. If the request
> was successful, we complete the asynchronous request and let a
> new request immediately start on the hardware. If, and only if,
> it returned an error from the hardware we go down the error
> path.
>
> This is achieved by splitting the work path from the hardware
> in two: a successful path ending up calling down to
> mmc_blk_rw_done_success() and an errorpath calling down to
> mmc_blk_rw_done_error().
>
> This has a profound effect: we reintroduce the parallelism on
> the successful path as mmc_post_req() can now be called in
> while the next request is in transit (just like prior to
> commit "mmc: core: move the asynchronous post-processing")
> but ALSO we can call mmc_queue_bounce_post() and
> blk_end_request() in parallel.
>
> The latter has the profound effect of issuing a new request
> again so that we actually need to have at least three requests
> in transit at the same time: we haven't yet dropped the
> reference to our struct mmc_queue_req so we need at least
> three. I put the pool to 4 requests for now.
>
> I expect the imrovement to be noticeable on systems that use
> bounce buffers since they can now process requests in parallel
> with post-processing their bounce buffers, but I don't have a
> test target for that.
>
> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
> ---
> drivers/mmc/core/block.c | 61 +++++++++++++++++++++++++++++++++++++-----------
> drivers/mmc/core/block.h | 4 +++-
> drivers/mmc/core/core.c | 27 ++++++++++++++++++---
> drivers/mmc/core/queue.c | 2 +-
> 4 files changed, 75 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
> index acca15cc1807..f1008ce5376b 100644
> --- a/drivers/mmc/core/block.c
> +++ b/drivers/mmc/core/block.c
> @@ -1622,8 +1622,51 @@ static void mmc_blk_rw_try_restart(struct mmc_queue_req *mq_rq)
> mmc_restart_areq(mq->card->host, &mq_rq->areq);
> }
>
> -void mmc_blk_rw_done(struct mmc_async_req *areq,
> - enum mmc_blk_status status)
> +/**
> + * Final handling of an asynchronous request if there was no error.
> + * This is the common path that we take when everything is nice
> + * and smooth. The status from the command is always MMC_BLK_SUCCESS.
> + */
> +void mmc_blk_rw_done_success(struct mmc_async_req *areq)
> +{
> + struct mmc_queue_req *mq_rq;
> + struct mmc_blk_request *brq;
> + struct mmc_blk_data *md;
> + struct request *old_req;
> + bool req_pending;
> + int type;
> +
> + mq_rq = container_of(areq, struct mmc_queue_req, areq);
> + md = mq_rq->mq->blkdata;
> + brq = &mq_rq->brq;
> + old_req = mq_rq->req;
> + type = rq_data_dir(old_req) == READ ? MMC_BLK_READ : MMC_BLK_WRITE;
> +
> + mmc_queue_bounce_post(mq_rq);
> + mmc_blk_reset_success(md, type);
> + req_pending = blk_end_request(old_req, 0,
> + brq->data.bytes_xfered);
> + /*
> + * If the blk_end_request function returns non-zero even
> + * though all data has been transferred and no errors
> + * were returned by the host controller, it's a bug.
> + */
> + if (req_pending) {
> + pr_err("%s BUG rq_tot %d d_xfer %d\n",
> + __func__, blk_rq_bytes(old_req),
> + brq->data.bytes_xfered);
What has happened to mmc_blk_rw_cmd_abort() call?
> + return;
> + }
> +}
> +
> +/**
> + * Error, recapture, retry etc for asynchronous requests.
> + * This is the error path that we take when there is bad status
> + * coming back from the hardware and we need to do a bit of
> + * cleverness.
> + */
> +void mmc_blk_rw_done_error(struct mmc_async_req *areq,
> + enum mmc_blk_status status)
> {
> struct mmc_queue *mq;
> struct mmc_queue_req *mq_rq;
> @@ -1652,6 +1695,8 @@ void mmc_blk_rw_done(struct mmc_async_req *areq,
>
> switch (status) {
> case MMC_BLK_SUCCESS:
> + pr_err("%s: MMC_BLK_SUCCESS on error path\n", __func__);
> + /* This should not happen: anyway fall through */
> case MMC_BLK_PARTIAL:
> /*
> * A block was successfully transferred.
> @@ -1660,18 +1705,6 @@ void mmc_blk_rw_done(struct mmc_async_req *areq,
>
> req_pending = blk_end_request(old_req, 0,
> brq->data.bytes_xfered);
> - /*
> - * If the blk_end_request function returns non-zero even
> - * though all data has been transferred and no errors
> - * were returned by the host controller, it's a bug.
> - */
> - if (status == MMC_BLK_SUCCESS && req_pending) {
> - pr_err("%s BUG rq_tot %d d_xfer %d\n",
> - __func__, blk_rq_bytes(old_req),
> - brq->data.bytes_xfered);
> - mmc_blk_rw_cmd_abort(card, old_req);
> - return;
> - }
> break;
Best regards,
--
Bartlomiej Zolnierkiewicz
Samsung R&D Institute Poland
Samsung Electronics
^ permalink raw reply
* Re: [PATCH 11/16] mmc: block: shuffle retry and error handling
From: Bartlomiej Zolnierkiewicz @ 2017-03-01 11:45 UTC (permalink / raw)
To: Linus Walleij
Cc: linux-mmc, Ulf Hansson, Adrian Hunter, Paolo Valente,
Chunyan Zhang, Baolin Wang, linux-block, Jens Axboe,
Christoph Hellwig, Arnd Bergmann
In-Reply-To: <22219053.LlSbdLjSYi@amdc3058>
Hi,
On Tuesday, February 28, 2017 06:45:20 PM Bartlomiej Zolnierkiewicz wrote:
> On Thursday, February 09, 2017 04:33:58 PM Linus Walleij wrote:
> > Instead of doing retries at the same time as trying to submit new
> > requests, do the retries when the request is reported as completed
> > by the driver, in the finalization worker.
> >
> > This is achieved by letting the core worker call back into the block
> > layer using mmc_blk_rw_done(), that will read the status and repeatedly
> > try to hammer the request using single request etc by calling back to
> > the core layer using mmc_restart_areq()
> >
> > The beauty of it is that the completion will not complete until the
> > block layer has had the opportunity to hammer a bit at the card using
> > a bunch of different approaches in the while() loop in
> > mmc_blk_rw_done()
> >
> > The algorithm for recapture, retry and handle errors is essentially
> > identical to the one we used to have in mmc_blk_issue_rw_rq(),
> > only augmented to get called in another path.
> >
> > We have to add and initialize a pointer back to the struct mmc_queue
> > from the struct mmc_queue_req to find the queue from the asynchronous
> > request.
> >
> > Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
>
> It seems that after this change we can end up queuing more
> work for kthread from the kthread worker itself and wait
> inside it for this nested work to complete. I hope that
On the second look it seems that there is no waiting for
the retried areq to complete so I cannot see what protects
us from racing and trying to run two areq-s in parallel:
1st areq being retried (in the completion kthread):
mmc_blk_rw_done()->mmc_restart_areq()->__mmc_start_data_req()
2nd areq coming from the second request in the queue
(in the queuing kthread):
mmc_blk_issue_rw_rq()->mmc_start_areq()->__mmc_start_data_req()
(after mmc_blk_rw_done() is done in mmc_finalize_areq() 1st
areq is marked as completed by the completion kthread and
the waiting on host->areq in mmc_start_areq() of the queuing
kthread is done and 2nd areq is started while the 1st one
is still being retried)
?
Also retrying of areqs for MMC_BLK_RETRY status case got broken
(before change do {} while() loop increased retry variable,
now the loop is gone and retry variable will not be increased
correctly and we can loop forever).
Best regards,
--
Bartlomiej Zolnierkiewicz
Samsung R&D Institute Poland
Samsung Electronics
^ permalink raw reply
* Re: [RFC] failure atomic writes for file systems and block devices
From: Amir Goldstein @ 2017-03-01 11:21 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-fsdevel, linux-xfs, linux-block, linux-api,
Michael Kerrisk (man-pages)
In-Reply-To: <20170228145737.19016-1-hch@lst.de>
On Tue, Feb 28, 2017 at 4:57 PM, Christoph Hellwig <hch@lst.de> wrote:
> Hi all,
>
> this series implements a new O_ATOMIC flag for failure atomic writes
> to files. It is based on and tries to unify to earlier proposals,
> the first one for block devices by Chris Mason:
>
> https://lwn.net/Articles/573092/
>
> and the second one for regular files, published by HP Research at
> Usenix FAST 2015:
>
> https://www.usenix.org/conference/fast15/technical-sessions/presentation/verma
>
> It adds a new O_ATOMIC flag for open, which requests writes to be
> failure-atomic, that is either the whole write makes it to persistent
> storage, or none of it, even in case of power of other failures.
>
> There are two implementation various of this: on block devices O_ATOMIC
> must be combined with O_(D)SYNC so that storage devices that can handle
> large writes atomically can simply do that without any additional work.
> This case is supported by NVMe.
>
> The second case is for file systems, where we simply write new blocks
> out of places and then remap them into the file atomically on either
> completion of an O_(D)SYNC write or when fsync is called explicitly.
>
> The semantics of the latter case are explained in detail at the Usenix
> paper above.
>
> Last but not least a new fcntl is implemented to provide information
> about I/O restrictions such as alignment requirements and the maximum
> atomic write size.
>
> The implementation is simple and clean, but I'm rather unhappy about
> the interface as it has too many failure modes to bullet proof. For
> one old kernels ignore unknown open flags silently, so applications
> have to check the F_IOINFO fcntl before, which is a bit of a killer.
> Because of that I've also not implemented any other validity checks
> yet, as they might make thing even worse when an open on a not supported
> file system or device fails, but not on an old kernel. Maybe we need
> a new open version that checks arguments properly first?
>
[CC += linux-api@vger.kernel.org] for that question and for the new API
> Also I'm really worried about the NVMe failure modes - devices simply
> advertise an atomic write size, with no way for the device to know
> that the host requested a given write to be atomic, and thus no
> error reporting. This is made worse by NVMe 1.2 adding per-namespace
> atomic I/O parameters that devices can use to introduce additional
> odd alignment quirks - while there is some language in the spec
> requiring them not to weaken the per-controller guarantees it all
> looks rather weak and I'm not too confident in all implementations
> getting everything right.
>
> Last but not least this depends on a few XFS patches, so to actually
> apply / run the patches please use this git tree:
>
> git://git.infradead.org/users/hch/vfs.git O_ATOMIC
>
> Gitweb:
>
> http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/O_ATOMIC
^ permalink raw reply
* Re: [PATCHv2 2/2] nvme: Complete all stuck requests
From: Artur Paszkiewicz @ 2017-03-01 8:54 UTC (permalink / raw)
To: Keith Busch
Cc: Jens Axboe, Sagi Grimberg, Marc MERLIN, linux-nvme, linux-block,
Christoph Hellwig
In-Reply-To: <20170228165719.GA23236@localhost.localdomain>
On 02/28/2017 05:57 PM, Keith Busch wrote:
> On Tue, Feb 28, 2017 at 08:42:19AM +0100, Artur Paszkiewicz wrote:
>>
>> I'm observing the same thing when hibernating during mdraid resync on
>> nvme - it hangs in blk_mq_freeze_queue_wait() after "Disabling non-boot
>> CPUs ...".
>
> The patch guarantees forward progress for blk-mq's hot-cpu notifier on
> nvme request queues by failing all entered requests. It sounds like some
> part of your setup needs those requests to succeed in order to hibernate.
>
> If your mdraid uses a stacking request_queue that submits retries while
> it's request queue is entered, that may explain how you remain stuck
> at blk_mq_freeze_queue_wait.
>
>> This patch did not help but when I put nvme_wait_freeze()
>> right after nvme_start_freeze() it appeared to be working. Maybe the
>> difference here is that requests are submitted from a non-freezable
>> kernel thread (md sync_thread)?
>
> Wait freeze prior to quiescing the queue is ok when the controller is
> functioning, but it'd be impossible to complete a reset if the controller
> is in a failed or degraded state.
>
> We probably want to give those requests a chance to succeed, and I think
> we'd need to be able to timeout the freeze wait. Below are two patches
> I tested. Prior to these, the fio test would report IO errors from some
> of its jobs; no errors with these.
With these patches it works fine. I tested multiple iterations on 2
platforms and they were able to hibernate and resume without issues.
Thanks,
Artur
_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
^ permalink raw reply
* Re: [PATCH 0/13 v2] block: Fix block device shutdown related races
From: Omar Sandoval @ 2017-03-01 7:26 UTC (permalink / raw)
To: Jan Kara
Cc: Jens Axboe, linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo, NeilBrown
In-Reply-To: <20170301072528.GA9377@vader>
On Tue, Feb 28, 2017 at 11:25:28PM -0800, Omar Sandoval wrote:
> On Wed, Feb 22, 2017 at 11:24:25AM +0100, Jan Kara wrote:
> > On Tue 21-02-17 10:19:28, Jens Axboe wrote:
> > > On 02/21/2017 10:09 AM, Jan Kara wrote:
> > > > Hello,
> > > >
> > > > this is a second revision of the patch set to fix several different races and
> > > > issues I've found when testing device shutdown and reuse. The first three
> > > > patches are fixes to problems in my previous series fixing BDI lifetime issues.
> > > > Patch 4 fixes issues with reuse of BDI name with scsi devices. With it I cannot
> > > > reproduce the BDI name reuse issues using Omar's stress test using scsi_debug
> > > > so it can be used as a replacement of Dan's patches. Patches 5-11 fix oops that
> > > > is triggered by __blkdev_put() calling inode_detach_wb() too early (the problem
> > > > reported by Thiago). Patches 12 and 13 fix oops due to a bug in gendisk code
> > > > where get_gendisk() can return already freed gendisk structure (again triggered
> > > > by Omar's stress test).
> > > >
> > > > People, please have a look at patches. They are mostly simple however the
> > > > interactions are rather complex so I may have missed something. Also I'm
> > > > happy for any additional testing these patches can get - I've stressed them
> > > > with Omar's script, tested memcg writeback, tested static (not udev managed)
> > > > device inodes.
> > > >
> > > > Jens, I think at least patches 1-3 should go in together with my fixes you
> > > > already have in your tree (or shortly after them). It is up to you whether
> > > > you decide to delay my first fixes or pick these up quickly. Patch 4 is
> > > > (IMHO a cleaner) replacement of Dan's patches so consider whether you want
> > > > to use it instead of those patches.
> > >
> > > I have applied 1-3 to my for-linus branch, which will go in after
> > > the initial pull request has been pulled by Linus. Consider fixing up
> > > #4 so it applies, I like it.
> >
> > OK, attached is patch 4 rebased on top of Linus' tree from today which
> > already has linux-block changes pulled in. I've left put_disk_devt() in
> > blk_cleanup_queue() to maintain the logic in the original patch (now commit
> > 0dba1314d4f8) that request_queue and gendisk each hold one devt reference.
> > The bdi_unregister() call that is moved to del_gendisk() by this patch is
> > now protected by the gendisk reference instead of the request_queue one
> > so it still maintains the property that devt reference protects bdi
> > registration-unregistration lifetime (as much as that is not needed anymore
> > after this patch).
> >
> > I have also updated the comment in the code and the changelog - they were
> > somewhat stale after changes to the whole series Tejun suggested.
> >
> > Honza
>
> Hey, Jan, I just tested this out when I was seeing similar crashes with
> sr instead of sd, and this fixed it.
>
> Tested-by: Omar Sandoval <osandov@fb.com>
Just realized it wasn't clear, I'm talking about patch 4 specifically.
^ permalink raw reply
* Re: [PATCH 0/13 v2] block: Fix block device shutdown related races
From: Omar Sandoval @ 2017-03-01 7:25 UTC (permalink / raw)
To: Jan Kara
Cc: Jens Axboe, linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo, NeilBrown
In-Reply-To: <20170222102425.GB23312@quack2.suse.cz>
On Wed, Feb 22, 2017 at 11:24:25AM +0100, Jan Kara wrote:
> On Tue 21-02-17 10:19:28, Jens Axboe wrote:
> > On 02/21/2017 10:09 AM, Jan Kara wrote:
> > > Hello,
> > >
> > > this is a second revision of the patch set to fix several different races and
> > > issues I've found when testing device shutdown and reuse. The first three
> > > patches are fixes to problems in my previous series fixing BDI lifetime issues.
> > > Patch 4 fixes issues with reuse of BDI name with scsi devices. With it I cannot
> > > reproduce the BDI name reuse issues using Omar's stress test using scsi_debug
> > > so it can be used as a replacement of Dan's patches. Patches 5-11 fix oops that
> > > is triggered by __blkdev_put() calling inode_detach_wb() too early (the problem
> > > reported by Thiago). Patches 12 and 13 fix oops due to a bug in gendisk code
> > > where get_gendisk() can return already freed gendisk structure (again triggered
> > > by Omar's stress test).
> > >
> > > People, please have a look at patches. They are mostly simple however the
> > > interactions are rather complex so I may have missed something. Also I'm
> > > happy for any additional testing these patches can get - I've stressed them
> > > with Omar's script, tested memcg writeback, tested static (not udev managed)
> > > device inodes.
> > >
> > > Jens, I think at least patches 1-3 should go in together with my fixes you
> > > already have in your tree (or shortly after them). It is up to you whether
> > > you decide to delay my first fixes or pick these up quickly. Patch 4 is
> > > (IMHO a cleaner) replacement of Dan's patches so consider whether you want
> > > to use it instead of those patches.
> >
> > I have applied 1-3 to my for-linus branch, which will go in after
> > the initial pull request has been pulled by Linus. Consider fixing up
> > #4 so it applies, I like it.
>
> OK, attached is patch 4 rebased on top of Linus' tree from today which
> already has linux-block changes pulled in. I've left put_disk_devt() in
> blk_cleanup_queue() to maintain the logic in the original patch (now commit
> 0dba1314d4f8) that request_queue and gendisk each hold one devt reference.
> The bdi_unregister() call that is moved to del_gendisk() by this patch is
> now protected by the gendisk reference instead of the request_queue one
> so it still maintains the property that devt reference protects bdi
> registration-unregistration lifetime (as much as that is not needed anymore
> after this patch).
>
> I have also updated the comment in the code and the changelog - they were
> somewhat stale after changes to the whole series Tejun suggested.
>
> Honza
Hey, Jan, I just tested this out when I was seeing similar crashes with
sr instead of sd, and this fixed it.
Tested-by: Omar Sandoval <osandov@fb.com>
^ permalink raw reply
* Re: connect cmd error for nvme-rdma with eventual kernel crash
From: Omar Sandoval @ 2017-03-01 5:50 UTC (permalink / raw)
To: Parav Pandit
Cc: Jens Axboe, axboe@fb.com, linux-block@vger.kernel.org,
Sagi Grimberg, Christoph Hellwig, Max Gurtovoy
In-Reply-To: <VI1PR0502MB3008EA53FCF29CF9E993918BD1290@VI1PR0502MB3008.eurprd05.prod.outlook.com>
On Wed, Mar 01, 2017 at 04:55:23AM +0000, Parav Pandit wrote:
> Hi Jens,
>
> > -----Original Message-----
> > From: Jens Axboe [mailto:axboe@kernel.dk]
> > Subject: Re: connect cmd error for nvme-rdma with eventual kernel crash
> >
> > > On Feb 28, 2017, at 5:57 PM, Parav Pandit <parav@mellanox.com> wrote:
> > >
> > > Hi Jens,
> > >
> > > With your commit 2af8cbe30531eca73c8f3ba277f155fc0020b01a in
> > > linux-block git tree, There are two requests tables. Static and dynamic of
> > same size.
> > > However function blk_mq_tag_to_rq() always try to get the tag from the
> > dynamic table which doesn't seem to be always initialized.
> > >
> > > I am running nvme-rdma initiator and it fails to find the request for the
> > given tag when command completes.
> > > Command triggers error recovery with "tag not found" error.
> > > Eventually kernel is crashing in blk_mq_queue_tag_busy_iter() with NULL
> > pointer. Seems to be additional bug in error recovery.
> > >
> > > To debug, I added initializing dynamic tags as well.
> > >
> > > blk_mq_alloc_rqs() {
> > > tags->static_rqs[i] = rq;
> > > + tags->rqs[i] = rq;
> > >
> > > This appears to resolve the issue. But that's not the fix.
> > > It appears to me that nvme stack is broken in certain conditions with recent
> > static and dynamic rq tables change.
> >
> > Can you try my for-linus branch?
>
> I tried for-linus branch and it works.
>
> Seems like ac6e0c2d633ab0411810fe6b15a40808309041db fixes it.
> __blk_mq_alloc_request()
> data->hctx->tags->rqs[rq->tag] = rq;
>
> Commit says no functional difference but it is actually fixing this issue.
>
> Parav
The fix is actually this one:
commit f867f4804d55adeef42c68c89edad49cdf3058f7
Author: Omar Sandoval <osandov@fb.com>
Date: Mon Feb 27 10:28:27 2017 -0800
blk-mq: make blk_mq_alloc_request_hctx() allocate a scheduler request
blk_mq_alloc_request_hctx() allocates a driver request directly, unlike
its blk_mq_alloc_request() counterpart. It also crashes because it
doesn't update the tags->rqs map.
Fix it by making it allocate a scheduler request.
Reported-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tested-by: Sagi Grimberg <sagi@grimberg.me>
The commit you pointed out would have also fixed it, but after this
change it's a no-op.
^ permalink raw reply
* RE: connect cmd error for nvme-rdma with eventual kernel crash
From: Parav Pandit @ 2017-03-01 4:55 UTC (permalink / raw)
To: Jens Axboe
Cc: axboe@fb.com, linux-block@vger.kernel.org, Sagi Grimberg,
Christoph Hellwig, Max Gurtovoy
In-Reply-To: <A520FEF3-9BB7-4265-ADBA-29ADE118B971@kernel.dk>
Hi Jens,
> -----Original Message-----
> From: Jens Axboe [mailto:axboe@kernel.dk]
> Subject: Re: connect cmd error for nvme-rdma with eventual kernel crash
>=20
> > On Feb 28, 2017, at 5:57 PM, Parav Pandit <parav@mellanox.com> wrote:
> >
> > Hi Jens,
> >
> > With your commit 2af8cbe30531eca73c8f3ba277f155fc0020b01a in
> > linux-block git tree, There are two requests tables. Static and dynamic=
of
> same size.
> > However function blk_mq_tag_to_rq() always try to get the tag from the
> dynamic table which doesn't seem to be always initialized.
> >
> > I am running nvme-rdma initiator and it fails to find the request for t=
he
> given tag when command completes.
> > Command triggers error recovery with "tag not found" error.
> > Eventually kernel is crashing in blk_mq_queue_tag_busy_iter() with NULL
> pointer. Seems to be additional bug in error recovery.
> >
> > To debug, I added initializing dynamic tags as well.
> >
> > blk_mq_alloc_rqs() {
> > tags->static_rqs[i] =3D rq;
> > + tags->rqs[i] =3D rq;
> >
> > This appears to resolve the issue. But that's not the fix.
> > It appears to me that nvme stack is broken in certain conditions with r=
ecent
> static and dynamic rq tables change.
>=20
> Can you try my for-linus branch?
I tried for-linus branch and it works.
Seems like ac6e0c2d633ab0411810fe6b15a40808309041db fixes it.
__blk_mq_alloc_request()=20
data->hctx->tags->rqs[rq->tag] =3D rq;
Commit says no functional difference but it is actually fixing this issue.
Parav
^ permalink raw reply
* Re: [PATCH 3/8] nowait aio: return if direct write will trigger writeback
From: Matthew Wilcox @ 2017-03-01 3:46 UTC (permalink / raw)
To: Goldwyn Rodrigues
Cc: jack, hch, linux-fsdevel, linux-block, linux-btrfs, linux-ext4,
linux-xfs, Goldwyn Rodrigues
In-Reply-To: <20170228233610.25456-4-rgoldwyn@suse.de>
On Tue, Feb 28, 2017 at 05:36:05PM -0600, Goldwyn Rodrigues wrote:
> Find out if the write will trigger a wait due to writeback. If yes,
> return -EAGAIN.
>
> This introduces a new function filemap_range_has_page() which
> returns true if the file's mapping has a page within the range
> mentioned.
Ugh, this is pretty inefficient. If that's all you want to know, then
using the radix tree directly will be far more efficient than spinning
up all the pagevec machinery only to discard the pages found.
But what's going to kick these pages out of cache? Shouldn't we rather
find the pages, kick them out if clean, start writeback if not, and *then*
return -EAGAIN?
So maybe we want to spin up the pagevec machinery after all so we can
do that extra work?
^ permalink raw reply
* Re: [PATCH 0/3] Separate zone requests from medium access requests
From: Martin K. Petersen @ 2017-03-01 3:21 UTC (permalink / raw)
To: Damien Le Moal
Cc: Martin K. Petersen, Christoph Hellwig, Jens Axboe, linux-block,
linux-scsi, MPT-FusionLinux.pdl, Hannes Reinecke, Bart Van Assche
In-Reply-To: <6f1c8be1-9366-b32e-621e-8b05f22a1753@wdc.com>
>>>>> "Damien" == Damien Le Moal <damien.lemoal@wdc.com> writes:
Damien,
Damien> The problem remains that the mpt3sas driver needs fixing. As you
Damien> suggest, we can do that in sd, or directly in mpt3sas. I tried
Damien> to do a clean fix in sd, but always end up consuming a lot of
Damien> aspirin because of all the potential corner cases to deal
Damien> with.
What? You expected this to be easy? :)
FWIW, I'm perfectly happy with the desire to shuffle the ZBC-specifics
over to sd_zbc.c.
Damien> The file sd.h has the inline function
Damien> scsi_medium_access_command() defined. We could move that to
Damien> include/scsi/scsi.h (or scsi_proto.h) and use it in place of
Damien> blk_rq_accesses_medium() in the mpt3sas driver to not force
Damien> unaligned resid corrections for zone commands. Would that be
Damien> acceptable ?
I'd still rather keep it in sd. If you don't have sufficient supplies of
aspirin for the sd_completed_bytes() approach then I'm OK with a simple
tweak to sd_done(). We need something we can reasonably Cc: to stable
for 4.10.
You are welcome to key off of scsi_medium_access_command() if you like
but I don't think it's necessary. Just having a default for the req_op
switch should suffice.
I believe that was your original sd approach. If you post it again I'll
have another look.
And then I'll put the bigger completion rework back on my todo list.
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply
* Re: [PATCH 0/3] Separate zone requests from medium access requests
From: Damien Le Moal @ 2017-03-01 3:02 UTC (permalink / raw)
To: Martin K. Petersen, Christoph Hellwig
Cc: Jens Axboe, linux-block, linux-scsi, MPT-FusionLinux.pdl,
Hannes Reinecke, Bart Van Assche
In-Reply-To: <yq160jtu2s0.fsf@oracle.com>
Martin,
On 3/1/17 11:52, Martin K. Petersen wrote:
>>>>>> "Christoph" =3D=3D Christoph Hellwig <hch@lst.de> writes:
> =
> Christoph> I don't really like this too much - this is too many SCSI
> Christoph> specifics for the block layer to care. Maybe using bios for
> Christoph> the zone ops was a mistake after all, and we should just have
> Christoph> operations in struct block_device instead..
> =
> Yeah, I'm afraid I don't like this either.
> =
> Short term I still think we should adjust in sd. And then maybe longer
> term revisit whether these hybrid commands should be something other
> than bios.
OK. I will think about it.
The problem remains that the mpt3sas driver needs fixing. As you
suggest, we can do that in sd, or directly in mpt3sas.
I tried to do a clean fix in sd, but always end up consuming a lot of
aspirin because of all the potential corner cases to deal with. So I
would prefer really just a fix in mpt3sas for now and think of the sd
changes (if any) together with a redesign of the zone commands execution
(not as BIOs).
The file sd.h has the inline function scsi_medium_access_command()
defined. We could move that to include/scsi/scsi.h (or scsi_proto.h) and
use it in place of blk_rq_accesses_medium() in the mpt3sas driver to not
force unaligned resid corrections for zone commands. Would that be
acceptable ?
Best regards.
-- =
Damien Le Moal, Ph.D.
Sr. Manager, System Software Research Group,
Western Digital Corporation
Damien.LeMoal@wdc.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa,
Kanagawa, 252-0888 Japan
www.wdc.com, www.hgst.com
Western Digital Corporation (and its subsidiaries) E-mail Confidentiality N=
otice & Disclaimer:
This e-mail and any files transmitted with it may contain confidential or l=
egally privileged information of WDC and/or its affiliates, and are intende=
d solely for the use of the individual or entity to which they are addresse=
d. If you are not the intended recipient, any disclosure, copying, distribu=
tion or any action taken or omitted to be taken in reliance on it, is prohi=
bited. If you have received this e-mail in error, please notify the sender =
immediately and delete the e-mail in its entirety from your system.
^ permalink raw reply
* Re: [PATCH 0/3] Separate zone requests from medium access requests
From: Martin K. Petersen @ 2017-03-01 2:52 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Damien Le Moal, Jens Axboe, linux-block, Martin K . Petersen,
linux-scsi, MPT-FusionLinux.pdl, Hannes Reinecke, Bart Van Assche
In-Reply-To: <20170228160259.GA22755@lst.de>
>>>>> "Christoph" == Christoph Hellwig <hch@lst.de> writes:
Christoph> I don't really like this too much - this is too many SCSI
Christoph> specifics for the block layer to care. Maybe using bios for
Christoph> the zone ops was a mistake after all, and we should just have
Christoph> operations in struct block_device instead..
Yeah, I'm afraid I don't like this either.
Short term I still think we should adjust in sd. And then maybe longer
term revisit whether these hybrid commands should be something other
than bios.
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply
* [PATCH] cfq-iosched: fix the delay of cfq_group's vdisktime under iops mode
From: Hou Tao @ 2017-03-01 2:07 UTC (permalink / raw)
To: axboe; +Cc: linux-block, jmoyer, jack, stable
When adding a cfq_group into the cfq service tree, we use CFQ_IDLE_DELAY
as the delay of cfq_group's vdisktime if there have been other cfq_groups
already.
When cfq is under iops mode, commit 9a7f38c42c2b ("cfq-iosched: Convert
from jiffies to nanoseconds") could result in a large iops delay and
lead to an abnormal io schedule delay for the added cfq_group. To fix
it, we just need to revert to the old CFQ_IDLE_DELAY value: HZ / 5
when iops mode is enabled.
Cc: <stable@vger.kernel.org> # 4.8+
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
block/cfq-iosched.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 1379447..fdeb70b 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1361,6 +1361,14 @@ cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
cfqg->vfraction = max_t(unsigned, vfr, 1);
}
+static inline u64 cfq_get_cfqg_vdisktime_delay(struct cfq_data *cfqd)
+{
+ if (!iops_mode(cfqd))
+ return CFQ_IDLE_DELAY;
+ else
+ return nsecs_to_jiffies64(CFQ_IDLE_DELAY);
+}
+
static void
cfq_group_notify_queue_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
@@ -1380,7 +1388,8 @@ cfq_group_notify_queue_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
n = rb_last(&st->rb);
if (n) {
__cfqg = rb_entry_cfqg(n);
- cfqg->vdisktime = __cfqg->vdisktime + CFQ_IDLE_DELAY;
+ cfqg->vdisktime = __cfqg->vdisktime +
+ cfq_get_cfqg_vdisktime_delay(cfqd);
} else
cfqg->vdisktime = st->min_vdisktime;
cfq_group_service_tree_add(st, cfqg);
--
2.5.0
^ permalink raw reply related
* Re: connect cmd error for nvme-rdma with eventual kernel crash
From: Jens Axboe @ 2017-03-01 1:26 UTC (permalink / raw)
To: Parav Pandit
Cc: axboe@fb.com, linux-block@vger.kernel.org, Sagi Grimberg,
Christoph Hellwig, Max Gurtovoy
In-Reply-To: <VI1PR0502MB3008851A3D20ECF93FC70FCCD1290@VI1PR0502MB3008.eurprd05.prod.outlook.com>
> On Feb 28, 2017, at 5:57 PM, Parav Pandit <parav@mellanox.com> wrote:
>=20
> Hi Jens,
>=20
> With your commit 2af8cbe30531eca73c8f3ba277f155fc0020b01a in linux-block g=
it tree,
> There are two requests tables. Static and dynamic of same size.
> However function blk_mq_tag_to_rq() always try to get the tag from the dyn=
amic table which doesn't seem to be always initialized.
>=20
> I am running nvme-rdma initiator and it fails to find the request for the g=
iven tag when command completes.
> Command triggers error recovery with "tag not found" error.
> Eventually kernel is crashing in blk_mq_queue_tag_busy_iter() with NULL po=
inter. Seems to be additional bug in error recovery.
>=20
> To debug, I added initializing dynamic tags as well.
>=20
> blk_mq_alloc_rqs() {
> tags->static_rqs[i] =3D rq;
> + tags->rqs[i] =3D rq;
>=20
> This appears to resolve the issue. But that's not the fix.
> It appears to me that nvme stack is broken in certain conditions with rece=
nt static and dynamic rq tables change.
Can you try my for-linus branch?
^ permalink raw reply
* [PATCH] libsas: add sas_unregister_devs_sas_addr in flutter case
From: Jason Yan @ 2017-03-01 1:23 UTC (permalink / raw)
To: linux-scsi, martin.petersen
Cc: axboe, linux-block, miaoxie, fangwei1, wangyijing, zhaohongjiang,
Jason Yan
In sas_rediscover_dev when we call sas_get_phy_attached_dev to find the
device is ok and when in the flutter case when we call
sas_ex_phy_discover the device is gone, the sas_addr was changed to
zero.
[300247.584696] sas: ex 500e004aaaaaaa1f phy0 originated
BROADCAST(CHANGE)
[300247.663516] sas: ex 500e004aaaaaaa1f phy00:U:0 attached:
0000000000000000 (no device)
[300247.663518] sas: ex 500e004aaaaaaa1f phy 0x0 broadcast flutter
When the device is up again, the libsas checked that the old sas_addr
zero so just add a new device.
[300247.846326] sas: Expander phy change count has changed
[300247.846418] sas: ex 500e004aaaaaaa1f phy0 originated
BROADCAST(CHANGE)
[300247.846420] sas: ex 500e004aaaaaaa1f phy0 new device attached
[300247.846519] sas: ex 500e004aaaaaaa1f phy00:U:A attached:
500e004aaaaaaa00 (stp)
[300247.875873] sas: done REVALIDATING DOMAIN on port 0, pid:12565, res
0x0
This will cause a panic when these two device were destroyed. Fix this
by call sas_unregister_devs_sas_addr in the flutter case if the sas_addr
is zero.
Signed-off-by: Jason Yan <yanaijie@huawei.com>
---
drivers/scsi/libsas/sas_expander.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/scsi/libsas/sas_expander.c b/drivers/scsi/libsas/sas_expander.c
index 570b2cb..1a784d7 100644
--- a/drivers/scsi/libsas/sas_expander.c
+++ b/drivers/scsi/libsas/sas_expander.c
@@ -2032,6 +2032,11 @@ static int sas_rediscover_dev(struct domain_device *dev, int phy_id, bool last)
sas_ex_phy_discover(dev, phy_id);
+ if (!SAS_ADDR(phy->attached_sas_addr)) {
+ sas_unregister_devs_sas_addr(dev, phy_id, last);
+ return res;
+ }
+
if (ata_dev && phy->attached_dev_type == SAS_SATA_PENDING)
action = ", needs recovery";
SAS_DPRINTK("ex %016llx phy 0x%x broadcast flutter%s\n",
--
2.5.0
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox