* Re: [PATCH 0/8 v2] Non-blocking AIO
From: Avi Kivity @ 2017-03-06 18:17 UTC (permalink / raw)
To: Jens Axboe, Jan Kara
Cc: Goldwyn Rodrigues, jack, hch, linux-fsdevel, linux-block,
linux-btrfs, linux-ext4, linux-xfs
In-Reply-To: <7aabb6b4-df8d-8554-fbe3-90504887fb8e@kernel.dk>
On 03/06/2017 07:06 PM, Jens Axboe wrote:
> On 03/06/2017 09:59 AM, Avi Kivity wrote:
>>
>> On 03/06/2017 06:08 PM, Jens Axboe wrote:
>>> On 03/06/2017 08:59 AM, Avi Kivity wrote:
>>>> On 03/06/2017 05:38 PM, Jens Axboe wrote:
>>>>> On 03/06/2017 08:29 AM, Avi Kivity wrote:
>>>>>> On 03/06/2017 05:19 PM, Jens Axboe wrote:
>>>>>>> On 03/06/2017 01:25 AM, Jan Kara wrote:
>>>>>>>> On Sun 05-03-17 16:56:21, Avi Kivity wrote:
>>>>>>>>>> The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
>>>>>>>>>> any of these conditions are met. This way userspace can push most
>>>>>>>>>> of the write()s to the kernel to the best of its ability to complete
>>>>>>>>>> and if it returns -EAGAIN, can defer it to another thread.
>>>>>>>>>>
>>>>>>>>> Is it not possible to push the iocb to a workqueue? This will allow
>>>>>>>>> existing userspace to work with the new functionality, unchanged. Any
>>>>>>>>> userspace implementation would have to do the same thing, so it's not like
>>>>>>>>> we're saving anything by pushing it there.
>>>>>>>> That is not easy because until IO is fully submitted, you need some parts
>>>>>>>> of the context of the process which submits the IO (e.g. memory mappings,
>>>>>>>> but possibly also other credentials). So you would need to somehow transfer
>>>>>>>> this information to the workqueue.
>>>>>>> Outside of technical challenges, the API also needs to return EAGAIN or
>>>>>>> start blocking at some point. We can't expose a direct connection to
>>>>>>> queue work like that, and let any user potentially create millions of
>>>>>>> pending work items (and IOs).
>>>>>> You wouldn't expect more concurrent events than the maxevents parameter
>>>>>> that was supplied to io_setup syscall; it should have reserved any
>>>>>> resources needed.
>>>>> Doesn't matter what limit you apply, my point still stands - at some
>>>>> point you have to return EAGAIN, or block. Returning EAGAIN without
>>>>> the caller having flagged support for that change of behavior would
>>>>> be problematic.
>>>> Doesn't it already return EAGAIN (or some other error) if you exceed
>>>> maxevents?
>>> It's a setup thing. We check these limits when someone creates an IO
>>> context, and carve out the specified entries form our global pool. Then
>>> we free those "resources" when the io context is freed.
>>>
>>> Right now I can setup an IO context with 1000 entries on it, yet that
>>> number has NO bearing on when io_submit() would potentially block or
>>> return EAGAIN.
>>>
>>> We can have a huge gap on the intent signaled by io context setup, and
>>> the reality imposed by what actually happens on the IO submission side.
>> Isn't that a bug? Shouldn't that 1001st incomplete io_submit() return
>> EAGAIN?
>>
>> Just tested it, and maxevents is not respected for this:
>>
>> io_setup(1, [0x7fc64537f000]) = 0
>> io_submit(0x7fc64537f000, 10, [{pread, fildes=3, buf=0x1eb4000,
>> nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096,
>> offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
>> {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread,
>> fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3,
>> buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000,
>> nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096,
>> offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
>> {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}]) = 10
>>
>> which is unexpected, to me.
> ioctx_alloc()
> {
> [...]
>
> /*
> * We keep track of the number of available ringbuffer slots, to prevent
> * overflow (reqs_available), and we also use percpu counters for this.
> *
> * So since up to half the slots might be on other cpu's percpu counters
> * and unavailable, double nr_events so userspace sees what they
> * expected: additionally, we move req_batch slots to/from percpu
> * counters at a time, so make sure that isn't 0:
> */
> nr_events = max(nr_events, num_possible_cpus() * 4);
> nr_events *= 2;
> }
On a 4-lcore desktop:
io_setup(1, [0x7fc210041000]) = 0
io_submit(0x7fc210041000, 10000, [big array]) = 126
io_submit(0x7fc210041000, 10000, [big array]) = -1 EAGAIN (Resource
temporarily unavailable)
so, the user should already expect EAGAIN from io_submit() due to
resource limits. I'm sure the check could be tightened so that if we do
have to use a workqueue, we respect the user's limit rather than some
inflated number.
^ permalink raw reply
* Re: [bdi_unregister] 165a5e22fa INFO: task swapper:1 blocked for more than 120 seconds.
From: James Bottomley @ 2017-03-06 17:25 UTC (permalink / raw)
To: Jan Kara
Cc: Fengguang Wu, Jens Axboe, linux-block, linux-kernel, LKP,
Martin K. Petersen, linux-scsi
In-Reply-To: <20170306161345.GB32207@quack2.suse.cz>
On Mon, 2017-03-06 at 17:13 +0100, Jan Kara wrote:
> On Mon 06-03-17 07:44:55, James Bottomley wrote:
> > On Mon, 2017-03-06 at 16:14 +0100, Jan Kara wrote:
> > > On Mon 06-03-17 06:35:21, James Bottomley wrote:
> > > > On Mon, 2017-03-06 at 13:01 +0100, Jan Kara wrote:
> > > > > On Mon 06-03-17 11:27:33, Jan Kara wrote:
> > > > > > Hi,
> > > > > >
> > > > > > On Sun 05-03-17 10:21:11, Wu Fengguang wrote:
> > > > > > > FYI next-20170303 is good while mainline is bad with this
> > > > > > > error.
> > > > > > > The attached reproduce-* may help reproduce the issue.
> > > > > >
> > > > > > Thanks for report! So from the stacktrace we are in the
> > > > > > path
> > > > > > testing removal of a device immediately after it has been
> > > > > > probed
> > > > > > and for some reason bdi_unregister() hangs - likely waiting
> > > > > > for
> > > > > > cgroup-writeback references to drop. Given how early this
> > > > > > happens
> > > > > > my guess is we fail to initialize something but for now I
> > > > > > don't
> > > > > > see
> > > > > > how my patch could make a difference. I'm trying to
> > > > > > reproduce
> > > > > > this
> > > > > > to be able to debug more...
> > > > >
> > > > > OK, so after some debugging I think this is yet another
> > > > > problem
> > > > > in
> > > > > SCSI initialization / destruction code which my patch only
> > > > > makes
> > > > > visible (added relevant maintainers).
> > > > >
> > > > > I can reproduce the problem reliably with enabling:
> > > > >
> > > > > CONFIG_DEBUG_TEST_DRIVER_REMOVE=y
> > > > > CONFIG_SCSI_DEBUG=m
> > > > > CONFIG_BLK_CGROUP=y
> > > > > CONFIG_MEMCG=y
> > > > > (and thus CONFIG_CGROUP_WRITEBACK=y)
> > > > >
> > > > > then 'modprobe scsi_debug' is all it takes to reproduce hang.
> > > > > Relevant kernel messages with some of my debugging added
> > > > > (attached is
> > > > > a patch that adds those debug messages):
> > > >
> > > > This looks to be precisely the same problem Dan Williams was
> > > > debugging
> > > > for us.
> > > >
> > > > > [ 58.721765] scsi host0: scsi_debug: version 1.86
> > > > > [20160430]
> > > > > [ 58.721765] dev_size_mb=8, opts=0x0, submit_queues=1,
> > > > > statistics=0
> > > > > [ 58.728946] CGWB init ffff88007fbb2000
> > > > > [ 58.730095] Created sdev ffff880078e1a000
> > > > > [ 58.731611] scsi 0:0:0:0: Direct-Access Linux
> > > > > scsi_debug
> > > > > 0186 PQ : 0 ANSI: 7
> > > > > [ 58.782246] sd 0:0:0:0: [sda] 16384 512-byte logical
> > > > > blocks:
> > > > > (8.39
> > > > > MB/8.00 MiB)
> > > > > [ 58.789687] sd 0:0:0:0: [sda] Write Protect is off
> > > > > [ 58.791140] sd 0:0:0:0: [sda] Mode Sense: 73 00 10 08
> > > > > [ 58.800879] sd 0:0:0:0: [sda] Write cache: enabled, read
> > > > > cache:
> > > > > enabled, supports DPO and FUA
> > > > > [ 58.893738] sd 0:0:0:0: [sda] Attached SCSI disk
> > > > > [ 58.896808] Unreg1
> > > > > [ 58.897960] Unreg2
> > > > > [ 58.898637] Unreg3
> > > > > [ 58.899100] CGWB ffff88007fbb2000 usage_cnt: 0
> > > > > [ 58.900004] Unreg4
> > > > > [ 58.904976] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> > > >
> > > > OK, can you put a WARN_ON trace in sd_shutdown and tell us
> > > > where
> > > > this
> > > > is coming from. For the device to be reused after this we have
> > > > to
> > > > be
> > > > calling sd_shutdown() without going into SDEV_DEL.
> > >
> > > Sure. The call trace is:
> > >
> > > [ 41.919244] ------------[ cut here ]------------
> > > [ 41.919263] WARNING: CPU: 4 PID: 2335 at
> > > drivers/scsi/sd.c:3332
> > > sd_shutdown+0x2f/0x100
> > > [ 41.919268] Modules linked in: scsi_debug(+) netconsole loop
> > > btrfs
> > > raid6_pq zlib_deflate lzo_compress xor
> > > [ 41.919319] CPU: 4 PID: 2335 Comm: modprobe Not tainted 4.11.0
> > > -rc1
> > > -xen+ #49
> > > [ 41.919325] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> > > [ 41.919331] Call Trace:
> > > [ 41.919343] dump_stack+0x8e/0xf0
> > > [ 41.919354] __warn+0x116/0x120
> > > [ 41.919361] warn_slowpath_null+0x1d/0x20
> > > [ 41.919368] sd_shutdown+0x2f/0x100
> > > [ 41.919374] sd_remove+0x70/0xd0
> > >
> > > *** Here is the unexpected step I guess...
> > >
> > > [ 41.919383] driver_probe_device+0xe0/0x4c0
> >
> > Exactly. It's this, I think
> >
> > bool test_remove = IS_ENABLED(CONFIG_DEBUG_TEST_DRIVER_REMOVE)
> > &&
> > !drv->suppress_bind_attrs;
> >
> > You have that config option set.
>
> Yes - or better said 0-day testing has it set. Maybe that is not a
> sane default for 0-day tests? The option is explicitely marked as
> "unstable"... Fengguang?
>
> > So the drivers base layer is calling ->remove after probe and
> > triggering the destruction of the queue.
> >
> > What to do about this (apart from nuke such a stupid option) is
> > somewhat more problematic.
>
> I guess this is between you and Greg :).
Nice try, sport ... you qualify just behind Dan in the "not my problem"
olympic hurdles event. I'm annoyed because there's no indication in
the log that the driver add behaviour is radically altered, so I've
been staring at the wrong code for weeks. However, the unbind/rebind
should work, so the real issue is that your bdi changes (or perhaps
something else in block) have induced a regression in the unbinding of
upper layer drivers. If you're going to release the bdi in
del_gendisk, you have to have some mechanism for re-acquiring it on re
-probe (likely with the same name) otherwise rebind is broken for every
block driver.
The fact that the second rebind tried with a different name indicates
that sd_devt_release wasn't called, so some vestige of the devt remains
on the subsequent rebind. Here's the problem: the queue belongs to
SCSI (the lower layer), so it's not going to change because it doesn't
see the release. The gendisk and its allied stuff belongs to sd so it
gets freed and re-created for the same queue. Something in block is
very confused when this happens.
James
^ permalink raw reply
* Re: [PATCH 0/8 v2] Non-blocking AIO
From: Jens Axboe @ 2017-03-06 17:06 UTC (permalink / raw)
To: Avi Kivity, Jan Kara
Cc: Goldwyn Rodrigues, jack, hch, linux-fsdevel, linux-block,
linux-btrfs, linux-ext4, linux-xfs
In-Reply-To: <56ae3a64-5e27-d7d4-5ab5-f5f68eef8b78@scylladb.com>
On 03/06/2017 09:59 AM, Avi Kivity wrote:
>
>
> On 03/06/2017 06:08 PM, Jens Axboe wrote:
>> On 03/06/2017 08:59 AM, Avi Kivity wrote:
>>> On 03/06/2017 05:38 PM, Jens Axboe wrote:
>>>> On 03/06/2017 08:29 AM, Avi Kivity wrote:
>>>>> On 03/06/2017 05:19 PM, Jens Axboe wrote:
>>>>>> On 03/06/2017 01:25 AM, Jan Kara wrote:
>>>>>>> On Sun 05-03-17 16:56:21, Avi Kivity wrote:
>>>>>>>>> The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
>>>>>>>>> any of these conditions are met. This way userspace can push most
>>>>>>>>> of the write()s to the kernel to the best of its ability to complete
>>>>>>>>> and if it returns -EAGAIN, can defer it to another thread.
>>>>>>>>>
>>>>>>>> Is it not possible to push the iocb to a workqueue? This will allow
>>>>>>>> existing userspace to work with the new functionality, unchanged. Any
>>>>>>>> userspace implementation would have to do the same thing, so it's not like
>>>>>>>> we're saving anything by pushing it there.
>>>>>>> That is not easy because until IO is fully submitted, you need some parts
>>>>>>> of the context of the process which submits the IO (e.g. memory mappings,
>>>>>>> but possibly also other credentials). So you would need to somehow transfer
>>>>>>> this information to the workqueue.
>>>>>> Outside of technical challenges, the API also needs to return EAGAIN or
>>>>>> start blocking at some point. We can't expose a direct connection to
>>>>>> queue work like that, and let any user potentially create millions of
>>>>>> pending work items (and IOs).
>>>>> You wouldn't expect more concurrent events than the maxevents parameter
>>>>> that was supplied to io_setup syscall; it should have reserved any
>>>>> resources needed.
>>>> Doesn't matter what limit you apply, my point still stands - at some
>>>> point you have to return EAGAIN, or block. Returning EAGAIN without
>>>> the caller having flagged support for that change of behavior would
>>>> be problematic.
>>> Doesn't it already return EAGAIN (or some other error) if you exceed
>>> maxevents?
>> It's a setup thing. We check these limits when someone creates an IO
>> context, and carve out the specified entries form our global pool. Then
>> we free those "resources" when the io context is freed.
>>
>> Right now I can setup an IO context with 1000 entries on it, yet that
>> number has NO bearing on when io_submit() would potentially block or
>> return EAGAIN.
>>
>> We can have a huge gap on the intent signaled by io context setup, and
>> the reality imposed by what actually happens on the IO submission side.
>
> Isn't that a bug? Shouldn't that 1001st incomplete io_submit() return
> EAGAIN?
>
> Just tested it, and maxevents is not respected for this:
>
> io_setup(1, [0x7fc64537f000]) = 0
> io_submit(0x7fc64537f000, 10, [{pread, fildes=3, buf=0x1eb4000,
> nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096,
> offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
> {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread,
> fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3,
> buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000,
> nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096,
> offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
> {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}]) = 10
>
> which is unexpected, to me.
ioctx_alloc()
{
[...]
/*
* We keep track of the number of available ringbuffer slots, to prevent
* overflow (reqs_available), and we also use percpu counters for this.
*
* So since up to half the slots might be on other cpu's percpu counters
* and unavailable, double nr_events so userspace sees what they
* expected: additionally, we move req_batch slots to/from percpu
* counters at a time, so make sure that isn't 0:
*/
nr_events = max(nr_events, num_possible_cpus() * 4);
nr_events *= 2;
}
--
Jens Axboe
^ permalink raw reply
* Re: [PATCH 0/8 v2] Non-blocking AIO
From: Avi Kivity @ 2017-03-06 16:59 UTC (permalink / raw)
To: Jens Axboe, Jan Kara
Cc: Goldwyn Rodrigues, jack, hch, linux-fsdevel, linux-block,
linux-btrfs, linux-ext4, linux-xfs
In-Reply-To: <f1415241-ddd0-8918-4a68-c5e1adf06e09@kernel.dk>
On 03/06/2017 06:08 PM, Jens Axboe wrote:
> On 03/06/2017 08:59 AM, Avi Kivity wrote:
>> On 03/06/2017 05:38 PM, Jens Axboe wrote:
>>> On 03/06/2017 08:29 AM, Avi Kivity wrote:
>>>> On 03/06/2017 05:19 PM, Jens Axboe wrote:
>>>>> On 03/06/2017 01:25 AM, Jan Kara wrote:
>>>>>> On Sun 05-03-17 16:56:21, Avi Kivity wrote:
>>>>>>>> The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
>>>>>>>> any of these conditions are met. This way userspace can push most
>>>>>>>> of the write()s to the kernel to the best of its ability to complete
>>>>>>>> and if it returns -EAGAIN, can defer it to another thread.
>>>>>>>>
>>>>>>> Is it not possible to push the iocb to a workqueue? This will allow
>>>>>>> existing userspace to work with the new functionality, unchanged. Any
>>>>>>> userspace implementation would have to do the same thing, so it's not like
>>>>>>> we're saving anything by pushing it there.
>>>>>> That is not easy because until IO is fully submitted, you need some parts
>>>>>> of the context of the process which submits the IO (e.g. memory mappings,
>>>>>> but possibly also other credentials). So you would need to somehow transfer
>>>>>> this information to the workqueue.
>>>>> Outside of technical challenges, the API also needs to return EAGAIN or
>>>>> start blocking at some point. We can't expose a direct connection to
>>>>> queue work like that, and let any user potentially create millions of
>>>>> pending work items (and IOs).
>>>> You wouldn't expect more concurrent events than the maxevents parameter
>>>> that was supplied to io_setup syscall; it should have reserved any
>>>> resources needed.
>>> Doesn't matter what limit you apply, my point still stands - at some
>>> point you have to return EAGAIN, or block. Returning EAGAIN without
>>> the caller having flagged support for that change of behavior would
>>> be problematic.
>> Doesn't it already return EAGAIN (or some other error) if you exceed
>> maxevents?
> It's a setup thing. We check these limits when someone creates an IO
> context, and carve out the specified entries form our global pool. Then
> we free those "resources" when the io context is freed.
>
> Right now I can setup an IO context with 1000 entries on it, yet that
> number has NO bearing on when io_submit() would potentially block or
> return EAGAIN.
>
> We can have a huge gap on the intent signaled by io context setup, and
> the reality imposed by what actually happens on the IO submission side.
Isn't that a bug? Shouldn't that 1001st incomplete io_submit() return
EAGAIN?
Just tested it, and maxevents is not respected for this:
io_setup(1, [0x7fc64537f000]) = 0
io_submit(0x7fc64537f000, 10, [{pread, fildes=3, buf=0x1eb4000,
nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096,
offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
{pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread,
fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3,
buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000,
nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096,
offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
{pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}]) = 10
which is unexpected, to me.
^ permalink raw reply
* [PATCH 11/11] block: Fix oops scsi_disk_get()
From: Jan Kara @ 2017-03-06 16:34 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo, NeilBrown,
Omar Sandoval, Jan Kara
In-Reply-To: <20170306163404.1238-1-jack@suse.cz>
When device open races with device shutdown, we can get the following
oops in scsi_disk_get():
[11863.044351] general protection fault: 0000 [#1] SMP
[11863.045561] Modules linked in: scsi_debug xfs libcrc32c netconsole btrfs raid6_pq zlib_deflate lzo_compress xor [last unloaded: loop]
[11863.047853] CPU: 3 PID: 13042 Comm: hald-probe-stor Tainted: G W 4.10.0-rc2-xen+ #35
[11863.048030] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[11863.048030] task: ffff88007f438200 task.stack: ffffc90000fd0000
[11863.048030] RIP: 0010:scsi_disk_get+0x43/0x70
[11863.048030] RSP: 0018:ffffc90000fd3a08 EFLAGS: 00010202
[11863.048030] RAX: 6b6b6b6b6b6b6b6b RBX: ffff88007f56d000 RCX: 0000000000000000
[11863.048030] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffffffff81a8d880
[11863.048030] RBP: ffffc90000fd3a18 R08: 0000000000000000 R09: 0000000000000001
[11863.059217] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffffa
[11863.059217] R13: ffff880078872800 R14: ffff880070915540 R15: 000000000000001d
[11863.059217] FS: 00007f2611f71800(0000) GS:ffff88007f0c0000(0000) knlGS:0000000000000000
[11863.059217] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[11863.059217] CR2: 000000000060e048 CR3: 00000000778d4000 CR4: 00000000000006e0
[11863.059217] Call Trace:
[11863.059217] ? disk_get_part+0x22/0x1f0
[11863.059217] sd_open+0x39/0x130
[11863.059217] __blkdev_get+0x69/0x430
[11863.059217] ? bd_acquire+0x7f/0xc0
[11863.059217] ? bd_acquire+0x96/0xc0
[11863.059217] ? blkdev_get+0x350/0x350
[11863.059217] blkdev_get+0x126/0x350
[11863.059217] ? _raw_spin_unlock+0x2b/0x40
[11863.059217] ? bd_acquire+0x7f/0xc0
[11863.059217] ? blkdev_get+0x350/0x350
[11863.059217] blkdev_open+0x65/0x80
...
As you can see RAX value is already poisoned showing that gendisk we got
is already freed. The problem is that get_gendisk() looks up device
number in ext_devt_idr and then does get_disk() which does kobject_get()
on the disks kobject. However the disk gets removed from ext_devt_idr
only in disk_release() (through blk_free_devt()) at which moment it has
already 0 refcount and is already on its way to be freed. Indeed we've
got a warning from kobject_get() about 0 refcount shortly before the
oops.
We fix the problem by using kobject_get_unless_zero() in get_disk() so
that get_disk() cannot get reference on a disk that is already being
freed.
Tested-by: Lekshmi Pillai <lekshmicpillai@in.ibm.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
block/genhd.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/genhd.c b/block/genhd.c
index e8df37de03af..7fa59bc231dd 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1374,7 +1374,7 @@ struct kobject *get_disk(struct gendisk *disk)
owner = disk->fops->owner;
if (owner && !try_module_get(owner))
return NULL;
- kobj = kobject_get(&disk_to_dev(disk)->kobj);
+ kobj = kobject_get_unless_zero(&disk_to_dev(disk)->kobj);
if (kobj == NULL) {
module_put(owner);
return NULL;
--
2.10.2
^ permalink raw reply related
* [PATCH 10/11] kobject: Export kobject_get_unless_zero()
From: Jan Kara @ 2017-03-06 16:34 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo, NeilBrown,
Omar Sandoval, Jan Kara, Greg Kroah-Hartman
In-Reply-To: <20170306163404.1238-1-jack@suse.cz>
Make the function available for outside use and fortify it against NULL
kobject.
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
include/linux/kobject.h | 2 ++
lib/kobject.c | 5 ++++-
2 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index e6284591599e..ca85cb80e99a 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -108,6 +108,8 @@ extern int __must_check kobject_rename(struct kobject *, const char *new_name);
extern int __must_check kobject_move(struct kobject *, struct kobject *);
extern struct kobject *kobject_get(struct kobject *kobj);
+extern struct kobject * __must_check kobject_get_unless_zero(
+ struct kobject *kobj);
extern void kobject_put(struct kobject *kobj);
extern const void *kobject_namespace(struct kobject *kobj);
diff --git a/lib/kobject.c b/lib/kobject.c
index 445dcaeb0f56..763d70a18941 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -601,12 +601,15 @@ struct kobject *kobject_get(struct kobject *kobj)
}
EXPORT_SYMBOL(kobject_get);
-static struct kobject * __must_check kobject_get_unless_zero(struct kobject *kobj)
+struct kobject * __must_check kobject_get_unless_zero(struct kobject *kobj)
{
+ if (!kobj)
+ return NULL;
if (!kref_get_unless_zero(&kobj->kref))
kobj = NULL;
return kobj;
}
+EXPORT_SYMBOL(kobject_get_unless_zero);
/*
* kobject_cleanup - free kobject resources.
--
2.10.2
^ permalink raw reply related
* [PATCH 09/11] block: Fix oops in locked_inode_to_wb_and_lock_list()
From: Jan Kara @ 2017-03-06 16:34 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo, NeilBrown,
Omar Sandoval, Jan Kara
In-Reply-To: <20170306163404.1238-1-jack@suse.cz>
When block device is closed, we call inode_detach_wb() in __blkdev_put()
which sets inode->i_wb to NULL. That is contrary to expectations that
inode->i_wb stays valid once set during the whole inode's lifetime and
leads to oops in wb_get() in locked_inode_to_wb_and_lock_list() because
inode_to_wb() returned NULL.
The reason why we called inode_detach_wb() is not valid anymore though.
BDI is guaranteed to stay along until we call bdi_put() from
bdev_evict_inode() so we can postpone calling inode_detach_wb() to that
moment.
Also add a warning to catch if someone uses inode_detach_wb() in a
dangerous way.
Reported-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
fs/block_dev.c | 8 ++------
include/linux/writeback.h | 1 +
2 files changed, 3 insertions(+), 6 deletions(-)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 9e1993a2827f..4b8a0f871f2d 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -885,6 +885,8 @@ static void bdev_evict_inode(struct inode *inode)
spin_lock(&bdev_lock);
list_del_init(&bdev->bd_list);
spin_unlock(&bdev_lock);
+ /* Detach inode from wb early as bdi_put() may free bdi->wb */
+ inode_detach_wb(inode);
if (bdev->bd_bdi != &noop_backing_dev_info) {
bdi_put(bdev->bd_bdi);
bdev->bd_bdi = &noop_backing_dev_info;
@@ -1878,12 +1880,6 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
kill_bdev(bdev);
bdev_write_inode(bdev);
- /*
- * Detaching bdev inode from its wb in __destroy_inode()
- * is too late: the queue which embeds its bdi (along with
- * root wb) can be gone as soon as we put_disk() below.
- */
- inode_detach_wb(bdev->bd_inode);
}
if (bdev->bd_contains == bdev) {
if (disk->fops->release)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index a3c0cbd7c888..d5815794416c 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -237,6 +237,7 @@ static inline void inode_attach_wb(struct inode *inode, struct page *page)
static inline void inode_detach_wb(struct inode *inode)
{
if (inode->i_wb) {
+ WARN_ON_ONCE(!(inode->i_state & I_CLEAR));
wb_put(inode->i_wb);
inode->i_wb = NULL;
}
--
2.10.2
^ permalink raw reply related
* [PATCH 08/11] bdi: Rename cgwb_bdi_destroy() to cgwb_bdi_unregister()
From: Jan Kara @ 2017-03-06 16:34 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo, NeilBrown,
Omar Sandoval, Jan Kara
In-Reply-To: <20170306163404.1238-1-jack@suse.cz>
Rename cgwb_bdi_destroy() to cgwb_bdi_unregister() as it gets called
from bdi_unregister() which is not necessarily called from bdi_destroy()
and thus the name is somewhat misleading.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
mm/backing-dev.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 7592f4a36776..54c1ff39ea05 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -700,7 +700,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
return ret;
}
-static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
+static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
{
struct radix_tree_iter iter;
struct rb_node *rbn;
@@ -790,7 +790,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
return 0;
}
-static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
+static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
{
wb_congested_put(bdi->wb_congested);
}
@@ -903,7 +903,7 @@ void bdi_unregister(struct backing_dev_info *bdi)
/* make sure nobody finds us on the bdi_list anymore */
bdi_remove_from_list(bdi);
wb_shutdown(&bdi->wb);
- cgwb_bdi_destroy(bdi);
+ cgwb_bdi_unregister(bdi);
if (bdi->dev) {
bdi_debug_unregister(bdi);
--
2.10.2
^ permalink raw reply related
* [PATCH 07/11] bdi: Do not wait for cgwbs release in bdi_unregister()
From: Jan Kara @ 2017-03-06 16:34 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo, NeilBrown,
Omar Sandoval, Jan Kara
In-Reply-To: <20170306163404.1238-1-jack@suse.cz>
Currently we wait for all cgwbs to get released in cgwb_bdi_destroy()
(called from bdi_unregister()). That is however unnecessary now when
cgwb->bdi is a proper refcounted reference (thus bdi cannot get
released before all cgwbs are released) and when cgwb_bdi_destroy()
shuts down writeback directly.
Signed-off-by: Jan Kara <jack@suse.cz>
---
include/linux/backing-dev-defs.h | 1 -
mm/backing-dev.c | 18 +-----------------
2 files changed, 1 insertion(+), 18 deletions(-)
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 8af720f22a2d..e66d4722db8e 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -164,7 +164,6 @@ struct backing_dev_info {
#ifdef CONFIG_CGROUP_WRITEBACK
struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */
struct rb_root cgwb_congested_tree; /* their congested states */
- atomic_t usage_cnt; /* counts both cgwbs and cgwb_contested's */
#else
struct bdi_writeback_congested *wb_congested;
#endif
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index d084c89922f3..7592f4a36776 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -406,11 +406,9 @@ static void wb_exit(struct bdi_writeback *wb)
/*
* cgwb_lock protects bdi->cgwb_tree, bdi->cgwb_congested_tree,
* blkcg->cgwb_list, and memcg->cgwb_list. bdi->cgwb_tree is also RCU
- * protected. cgwb_release_wait is used to wait for the completion of cgwb
- * releases from bdi destruction path.
+ * protected.
*/
static DEFINE_SPINLOCK(cgwb_lock);
-static DECLARE_WAIT_QUEUE_HEAD(cgwb_release_wait);
/**
* wb_congested_get_create - get or create a wb_congested
@@ -505,7 +503,6 @@ static void cgwb_release_workfn(struct work_struct *work)
{
struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
release_work);
- struct backing_dev_info *bdi = wb->bdi;
wb_shutdown(wb);
@@ -516,9 +513,6 @@ static void cgwb_release_workfn(struct work_struct *work)
percpu_ref_exit(&wb->refcnt);
wb_exit(wb);
kfree_rcu(wb, rcu);
-
- if (atomic_dec_and_test(&bdi->usage_cnt))
- wake_up_all(&cgwb_release_wait);
}
static void cgwb_release(struct percpu_ref *refcnt)
@@ -608,7 +602,6 @@ static int cgwb_create(struct backing_dev_info *bdi,
/* we might have raced another instance of this function */
ret = radix_tree_insert(&bdi->cgwb_tree, memcg_css->id, wb);
if (!ret) {
- atomic_inc(&bdi->usage_cnt);
list_add_tail_rcu(&wb->bdi_node, &bdi->wb_list);
list_add(&wb->memcg_node, memcg_cgwb_list);
list_add(&wb->blkcg_node, blkcg_cgwb_list);
@@ -698,7 +691,6 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
INIT_RADIX_TREE(&bdi->cgwb_tree, GFP_ATOMIC);
bdi->cgwb_congested_tree = RB_ROOT;
- atomic_set(&bdi->usage_cnt, 1);
ret = wb_init(&bdi->wb, bdi, 1, GFP_KERNEL);
if (!ret) {
@@ -739,14 +731,6 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
}
spin_unlock_irq(&cgwb_lock);
-
- /*
- * All cgwb's and their congested states must be shutdown and
- * released before returning. Drain the usage counter to wait for
- * all cgwb's and cgwb_congested's ever created on @bdi.
- */
- atomic_dec(&bdi->usage_cnt);
- wait_event(cgwb_release_wait, !atomic_read(&bdi->usage_cnt));
}
/**
--
2.10.2
^ permalink raw reply related
* [PATCH 06/11] bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy()
From: Jan Kara @ 2017-03-06 16:33 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo, NeilBrown,
Omar Sandoval, Jan Kara
In-Reply-To: <20170306163404.1238-1-jack@suse.cz>
Currently we waited for all cgwbs to get freed in cgwb_bdi_destroy()
which also means that writeback has been shutdown on them. Since this
wait is going away, directly shutdown writeback on cgwbs from
cgwb_bdi_destroy() to avoid live writeback structures after
bdi_unregister() has finished. To make that safe with concurrent
shutdown from cgwb_release_workfn(), we also have to make sure
wb_shutdown() returns only after the bdi_writeback structure is really
shutdown.
Signed-off-by: Jan Kara <jack@suse.cz>
---
include/linux/backing-dev-defs.h | 1 +
mm/backing-dev.c | 22 ++++++++++++++++++++++
2 files changed, 23 insertions(+)
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 8fb3dcdebc80..8af720f22a2d 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -21,6 +21,7 @@ struct dentry;
*/
enum wb_state {
WB_registered, /* bdi_register() was done */
+ WB_shutting_down, /* wb_shutdown() in progress */
WB_writeback_running, /* Writeback is in progress */
WB_has_dirty_io, /* Dirty inodes on ->b_{dirty|io|more_io} */
};
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 4dffae6e2d7b..d084c89922f3 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -356,8 +356,15 @@ static void wb_shutdown(struct bdi_writeback *wb)
spin_lock_bh(&wb->work_lock);
if (!test_and_clear_bit(WB_registered, &wb->state)) {
spin_unlock_bh(&wb->work_lock);
+ /*
+ * Wait for wb shutdown to finish if someone else is just
+ * running wb_shutdown(). Otherwise we could proceed to wb /
+ * bdi destruction before wb_shutdown() is finished.
+ */
+ wait_on_bit(&wb->state, WB_shutting_down, TASK_UNINTERRUPTIBLE);
return;
}
+ set_bit(WB_shutting_down, &wb->state);
spin_unlock_bh(&wb->work_lock);
cgwb_remove_from_bdi_list(wb);
@@ -369,6 +376,12 @@ static void wb_shutdown(struct bdi_writeback *wb)
mod_delayed_work(bdi_wq, &wb->dwork, 0);
flush_delayed_work(&wb->dwork);
WARN_ON(!list_empty(&wb->work_list));
+ /*
+ * Make sure bit gets cleared after shutdown is finished. Matches with
+ * the barrier provided by test_and_clear_bit() above.
+ */
+ smp_wmb();
+ clear_bit(WB_shutting_down, &wb->state);
}
static void wb_exit(struct bdi_writeback *wb)
@@ -700,6 +713,7 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
struct radix_tree_iter iter;
struct rb_node *rbn;
void **slot;
+ struct bdi_writeback *wb;
WARN_ON(test_bit(WB_registered, &bdi->wb.state));
@@ -716,6 +730,14 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
congested->__bdi = NULL; /* mark @congested unlinked */
}
+ while (!list_empty(&bdi->wb_list)) {
+ wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
+ bdi_node);
+ spin_unlock_irq(&cgwb_lock);
+ wb_shutdown(wb);
+ spin_lock_irq(&cgwb_lock);
+ }
+
spin_unlock_irq(&cgwb_lock);
/*
--
2.10.2
^ permalink raw reply related
* [PATCH 05/11] bdi: Move removal from bdi->wb_list into wb_shutdown()
From: Jan Kara @ 2017-03-06 16:33 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo, NeilBrown,
Omar Sandoval, Jan Kara
In-Reply-To: <20170306163404.1238-1-jack@suse.cz>
Currently the removal from bdi->wb_list happens directly in
cgwb_release_workfn(). Move it to wb_shutdown() which is functionally
equivalent and more logical (the list gets only used for distributing
writeback works among bdi_writeback structures). It will also allow us
to simplify writeback shutdown in cgwb_bdi_destroy().
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
mm/backing-dev.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 37cd1adae979..4dffae6e2d7b 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -345,6 +345,8 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
return err;
}
+static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb);
+
/*
* Remove bdi from the global list and shutdown any threads we have running
*/
@@ -358,6 +360,7 @@ static void wb_shutdown(struct bdi_writeback *wb)
}
spin_unlock_bh(&wb->work_lock);
+ cgwb_remove_from_bdi_list(wb);
/*
* Drain work list and shutdown the delayed_work. !WB_registered
* tells wb_workfn() that @wb is dying and its work_list needs to
@@ -491,10 +494,6 @@ static void cgwb_release_workfn(struct work_struct *work)
release_work);
struct backing_dev_info *bdi = wb->bdi;
- spin_lock_irq(&cgwb_lock);
- list_del_rcu(&wb->bdi_node);
- spin_unlock_irq(&cgwb_lock);
-
wb_shutdown(wb);
css_put(wb->memcg_css);
@@ -526,6 +525,13 @@ static void cgwb_kill(struct bdi_writeback *wb)
percpu_ref_kill(&wb->refcnt);
}
+static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb)
+{
+ spin_lock_irq(&cgwb_lock);
+ list_del_rcu(&wb->bdi_node);
+ spin_unlock_irq(&cgwb_lock);
+}
+
static int cgwb_create(struct backing_dev_info *bdi,
struct cgroup_subsys_state *memcg_css, gfp_t gfp)
{
@@ -783,6 +789,8 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
wb_congested_put(bdi->wb_congested);
}
+static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb) { }
+
#endif /* CONFIG_CGROUP_WRITEBACK */
int bdi_init(struct backing_dev_info *bdi)
--
2.10.2
^ permalink raw reply related
* [PATCH 04/11] bdi: Make wb->bdi a proper reference
From: Jan Kara @ 2017-03-06 16:33 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo, NeilBrown,
Omar Sandoval, Jan Kara
In-Reply-To: <20170306163404.1238-1-jack@suse.cz>
Make wb->bdi a proper refcounted reference to bdi for all bdi_writeback
structures except for the one embedded inside struct backing_dev_info.
That will allow us to simplify bdi unregistration.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
mm/backing-dev.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index fec27da14aea..37cd1adae979 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -294,6 +294,8 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
memset(wb, 0, sizeof(*wb));
+ if (wb != &bdi->wb)
+ bdi_get(bdi);
wb->bdi = bdi;
wb->last_old_flush = jiffies;
INIT_LIST_HEAD(&wb->b_dirty);
@@ -314,8 +316,10 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
wb->dirty_sleep = jiffies;
wb->congested = wb_congested_get_create(bdi, blkcg_id, gfp);
- if (!wb->congested)
- return -ENOMEM;
+ if (!wb->congested) {
+ err = -ENOMEM;
+ goto out_put_bdi;
+ }
err = fprop_local_init_percpu(&wb->completions, gfp);
if (err)
@@ -335,6 +339,9 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
fprop_local_destroy_percpu(&wb->completions);
out_put_cong:
wb_congested_put(wb->congested);
+out_put_bdi:
+ if (wb != &bdi->wb)
+ bdi_put(bdi);
return err;
}
@@ -372,6 +379,8 @@ static void wb_exit(struct bdi_writeback *wb)
fprop_local_destroy_percpu(&wb->completions);
wb_congested_put(wb->congested);
+ if (wb != &wb->bdi->wb)
+ bdi_put(wb->bdi);
}
#ifdef CONFIG_CGROUP_WRITEBACK
--
2.10.2
^ permalink raw reply related
* [PATCH 03/11] bdi: Mark congested->bdi as internal
From: Jan Kara @ 2017-03-06 16:33 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo, NeilBrown,
Omar Sandoval, Jan Kara
In-Reply-To: <20170306163404.1238-1-jack@suse.cz>
congested->bdi pointer is used only to be able to remove congested
structure from bdi->cgwb_congested_tree on structure release. Moreover
the pointer can become NULL when we unregister the bdi. Rename the field
to __bdi and add a comment to make it more explicit this is internal
stuff of memcg writeback code and people should not use the field as
such use will be likely race prone.
We do not bother with converting congested->bdi to a proper refcounted
reference. It will be slightly ugly to special-case bdi->wb.congested to
avoid effectively a cyclic reference of bdi to itself and the reference
gets cleared from bdi_unregister() making it impossible to reference
a freed bdi.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
include/linux/backing-dev-defs.h | 4 +++-
mm/backing-dev.c | 10 +++++-----
2 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index ad955817916d..8fb3dcdebc80 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -54,7 +54,9 @@ struct bdi_writeback_congested {
atomic_t refcnt; /* nr of attached wb's and blkg */
#ifdef CONFIG_CGROUP_WRITEBACK
- struct backing_dev_info *bdi; /* the associated bdi */
+ struct backing_dev_info *__bdi; /* the associated bdi, set to NULL
+ * on bdi unregistration. For memcg-wb
+ * internal use only! */
int blkcg_id; /* ID of the associated blkcg */
struct rb_node rb_node; /* on bdi->cgwb_congestion_tree */
#endif
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 6d861d090e9f..fec27da14aea 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -438,7 +438,7 @@ wb_congested_get_create(struct backing_dev_info *bdi, int blkcg_id, gfp_t gfp)
return NULL;
atomic_set(&new_congested->refcnt, 0);
- new_congested->bdi = bdi;
+ new_congested->__bdi = bdi;
new_congested->blkcg_id = blkcg_id;
goto retry;
@@ -466,10 +466,10 @@ void wb_congested_put(struct bdi_writeback_congested *congested)
}
/* bdi might already have been destroyed leaving @congested unlinked */
- if (congested->bdi) {
+ if (congested->__bdi) {
rb_erase(&congested->rb_node,
- &congested->bdi->cgwb_congested_tree);
- congested->bdi = NULL;
+ &congested->__bdi->cgwb_congested_tree);
+ congested->__bdi = NULL;
}
spin_unlock_irqrestore(&cgwb_lock, flags);
@@ -698,7 +698,7 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
rb_entry(rbn, struct bdi_writeback_congested, rb_node);
rb_erase(rbn, &bdi->cgwb_congested_tree);
- congested->bdi = NULL; /* mark @congested unlinked */
+ congested->__bdi = NULL; /* mark @congested unlinked */
}
spin_unlock_irq(&cgwb_lock);
--
2.10.2
^ permalink raw reply related
* [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown
From: Jan Kara @ 2017-03-06 16:33 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo, NeilBrown,
Omar Sandoval, Jan Kara
In-Reply-To: <20170306163404.1238-1-jack@suse.cz>
blkdev_open() may race with gendisk shutdown in such a way that
del_gendisk() has already unhashed block device inode (and thus
bd_acquire() will end up creating new block device inode) however
gen_gendisk() will still return the gendisk that is being destroyed.
This will result in the new bdev inode being associated with bdi of the
request queue that is going away and when the device number gets
eventually reused, the block device will still be pointing to the stale
bdi.
Fix the problem by checking whether the gendisk is still alive when
associating bdev inode with it and its bdi. That way we are sure that
once we are unhashing block device inodes in del_gendisk(), newly
created bdev inodes cannot be associated with bdi of the deleted gendisk
anymore. We actually already do this check when opening a partition so
we need to add it only for the case when the whole device is opened.
Also add a warning that will tell us about unexpected inconsistencies
between bdi associated with the bdev inode and bdi associated with the
disk.
Signed-off-by: Jan Kara <jack@suse.cz>
---
block/genhd.c | 7 ++++++-
fs/block_dev.c | 5 ++++-
2 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/block/genhd.c b/block/genhd.c
index b26a5ea115d0..e8df37de03af 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -662,6 +662,12 @@ void del_gendisk(struct gendisk *disk)
struct disk_part_iter piter;
struct hd_struct *part;
+ disk->flags &= ~GENHD_FL_UP;
+ /*
+ * Make sure __blkdev_open() sees the disk is going away before
+ * starting to unhash bdev inodes.
+ */
+ smp_wmb();
blk_integrity_del(disk);
disk_del_events(disk);
@@ -678,7 +684,6 @@ void del_gendisk(struct gendisk *disk)
invalidate_partition(disk, 0);
bdev_unhash_inode(disk_devt(disk));
set_capacity(disk, 0);
- disk->flags &= ~GENHD_FL_UP;
sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
/*
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 53e2389ae4d4..9e1993a2827f 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1560,7 +1560,7 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
if (!partno) {
ret = -ENXIO;
bdev->bd_part = disk_get_part(disk, partno);
- if (!bdev->bd_part)
+ if (!(disk->flags & GENHD_FL_UP) || !bdev->bd_part)
goto out_clear;
ret = 0;
@@ -1623,6 +1623,9 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
if (bdev->bd_bdi == &noop_backing_dev_info)
bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
+ else
+ WARN_ON_ONCE(bdev->bd_bdi !=
+ disk->queue->backing_dev_info);
} else {
if (bdev->bd_contains == bdev) {
ret = 0;
--
2.10.2
^ permalink raw reply related
* [PATCH 01/11] block: Fix bdi assignment to bdev inode when racing with disk delete
From: Jan Kara @ 2017-03-06 16:33 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo, NeilBrown,
Omar Sandoval, Jan Kara
In-Reply-To: <20170306163404.1238-1-jack@suse.cz>
When disk->fops->open() in __blkdev_get() returns -ERESTARTSYS, we
restart the process of opening the block device. However we forget to
switch bdev->bd_bdi back to noop_backing_dev_info and as a result bdev
inode will be pointing to a stale bdi. Fix the problem by setting
bdev->bd_bdi later when __blkdev_get() is already guaranteed to succeed.
Signed-off-by: Jan Kara <jack@suse.cz>
---
fs/block_dev.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2eca00ec4370..53e2389ae4d4 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1556,8 +1556,6 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
bdev->bd_disk = disk;
bdev->bd_queue = disk->queue;
bdev->bd_contains = bdev;
- if (bdev->bd_bdi == &noop_backing_dev_info)
- bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
if (!partno) {
ret = -ENXIO;
@@ -1622,6 +1620,9 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
}
bd_set_size(bdev, (loff_t)bdev->bd_part->nr_sects << 9);
}
+
+ if (bdev->bd_bdi == &noop_backing_dev_info)
+ bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
} else {
if (bdev->bd_contains == bdev) {
ret = 0;
@@ -1653,8 +1654,6 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
bdev->bd_disk = NULL;
bdev->bd_part = NULL;
bdev->bd_queue = NULL;
- bdi_put(bdev->bd_bdi);
- bdev->bd_bdi = &noop_backing_dev_info;
if (bdev != bdev->bd_contains)
__blkdev_put(bdev->bd_contains, mode, 1);
bdev->bd_contains = NULL;
--
2.10.2
^ permalink raw reply related
* [PATCH 0/11 v3] block: Fix block device shutdown related races
From: Jan Kara @ 2017-03-06 16:33 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Dan Williams,
Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo, NeilBrown,
Omar Sandoval, Jan Kara
Hello,
this is a series with the remaining patches (on top of 3.11-rc1) to fix several
different races and issues I've found when testing device shutdown and reuse.
The first two patches fix possible (theoretical) problems when opening of a
block device races with shutdown of a gendisk structure. Patches 3-9 fix oops
that is triggered by __blkdev_put() calling inode_detach_wb() too early (the
problem reported by Thiago). Patches 10 and 11 fix oops due to a bug in gendisk
code where get_gendisk() can return already freed gendisk structure (again
triggered by Omar's stress test).
People, please have a look at patches. They are mostly simple however the
interactions are rather complex so I may have missed something. Also I'm
happy for any additional testing these patches can get - I've stressed them
with Omar's script, tested memcg writeback, tested static (not udev managed)
device inodes.
Changes since v2:
* Added Tejun's acks
* Rebased on top of 4.11-rc1
* Fixed two possible races between blkdev_open() and del_gendisk()
* Fixed possible race between concurrent shutdown of cgwb spotted by Tejun
Changes since v1:
* Added Acks and Tested-by tags for patches in areas that did not change
* Reworked inode_detach_wb() related fixes based on Tejun's feedback
Honza
^ permalink raw reply
* Re: [bdi_unregister] 165a5e22fa INFO: task swapper:1 blocked for more than 120 seconds.
From: Jan Kara @ 2017-03-06 16:13 UTC (permalink / raw)
To: James Bottomley
Cc: Jan Kara, Fengguang Wu, Jens Axboe, linux-block, linux-kernel,
LKP, Martin K. Petersen, linux-scsi
In-Reply-To: <1488815095.25848.7.camel@linux.vnet.ibm.com>
On Mon 06-03-17 07:44:55, James Bottomley wrote:
> On Mon, 2017-03-06 at 16:14 +0100, Jan Kara wrote:
> > On Mon 06-03-17 06:35:21, James Bottomley wrote:
> > > On Mon, 2017-03-06 at 13:01 +0100, Jan Kara wrote:
> > > > On Mon 06-03-17 11:27:33, Jan Kara wrote:
> > > > > Hi,
> > > > >
> > > > > On Sun 05-03-17 10:21:11, Wu Fengguang wrote:
> > > > > > FYI next-20170303 is good while mainline is bad with this
> > > > > > error.
> > > > > > The attached reproduce-* may help reproduce the issue.
> > > > >
> > > > > Thanks for report! So from the stacktrace we are in the path
> > > > > testing removal of a device immediately after it has been
> > > > > probed
> > > > > and for some reason bdi_unregister() hangs - likely waiting for
> > > > > cgroup-writeback references to drop. Given how early this
> > > > > happens
> > > > > my guess is we fail to initialize something but for now I don't
> > > > > see
> > > > > how my patch could make a difference. I'm trying to reproduce
> > > > > this
> > > > > to be able to debug more...
> > > >
> > > > OK, so after some debugging I think this is yet another problem
> > > > in
> > > > SCSI initialization / destruction code which my patch only makes
> > > > visible (added relevant maintainers).
> > > >
> > > > I can reproduce the problem reliably with enabling:
> > > >
> > > > CONFIG_DEBUG_TEST_DRIVER_REMOVE=y
> > > > CONFIG_SCSI_DEBUG=m
> > > > CONFIG_BLK_CGROUP=y
> > > > CONFIG_MEMCG=y
> > > > (and thus CONFIG_CGROUP_WRITEBACK=y)
> > > >
> > > > then 'modprobe scsi_debug' is all it takes to reproduce hang.
> > > > Relevant kernel messages with some of my debugging added
> > > > (attached is
> > > > a patch that adds those debug messages):
> > >
> > > This looks to be precisely the same problem Dan Williams was
> > > debugging
> > > for us.
> > >
> > > > [ 58.721765] scsi host0: scsi_debug: version 1.86 [20160430]
> > > > [ 58.721765] dev_size_mb=8, opts=0x0, submit_queues=1,
> > > > statistics=0
> > > > [ 58.728946] CGWB init ffff88007fbb2000
> > > > [ 58.730095] Created sdev ffff880078e1a000
> > > > [ 58.731611] scsi 0:0:0:0: Direct-Access Linux
> > > > scsi_debug
> > > > 0186 PQ : 0 ANSI: 7
> > > > [ 58.782246] sd 0:0:0:0: [sda] 16384 512-byte logical blocks:
> > > > (8.39
> > > > MB/8.00 MiB)
> > > > [ 58.789687] sd 0:0:0:0: [sda] Write Protect is off
> > > > [ 58.791140] sd 0:0:0:0: [sda] Mode Sense: 73 00 10 08
> > > > [ 58.800879] sd 0:0:0:0: [sda] Write cache: enabled, read
> > > > cache:
> > > > enabled, supports DPO and FUA
> > > > [ 58.893738] sd 0:0:0:0: [sda] Attached SCSI disk
> > > > [ 58.896808] Unreg1
> > > > [ 58.897960] Unreg2
> > > > [ 58.898637] Unreg3
> > > > [ 58.899100] CGWB ffff88007fbb2000 usage_cnt: 0
> > > > [ 58.900004] Unreg4
> > > > [ 58.904976] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> > >
> > > OK, can you put a WARN_ON trace in sd_shutdown and tell us where
> > > this
> > > is coming from. For the device to be reused after this we have to
> > > be
> > > calling sd_shutdown() without going into SDEV_DEL.
> >
> > Sure. The call trace is:
> >
> > [ 41.919244] ------------[ cut here ]------------
> > [ 41.919263] WARNING: CPU: 4 PID: 2335 at drivers/scsi/sd.c:3332
> > sd_shutdown+0x2f/0x100
> > [ 41.919268] Modules linked in: scsi_debug(+) netconsole loop btrfs
> > raid6_pq zlib_deflate lzo_compress xor
> > [ 41.919319] CPU: 4 PID: 2335 Comm: modprobe Not tainted 4.11.0-rc1
> > -xen+ #49
> > [ 41.919325] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> > [ 41.919331] Call Trace:
> > [ 41.919343] dump_stack+0x8e/0xf0
> > [ 41.919354] __warn+0x116/0x120
> > [ 41.919361] warn_slowpath_null+0x1d/0x20
> > [ 41.919368] sd_shutdown+0x2f/0x100
> > [ 41.919374] sd_remove+0x70/0xd0
> >
> > *** Here is the unexpected step I guess...
> >
> > [ 41.919383] driver_probe_device+0xe0/0x4c0
>
> Exactly. It's this, I think
>
> bool test_remove = IS_ENABLED(CONFIG_DEBUG_TEST_DRIVER_REMOVE) &&
> !drv->suppress_bind_attrs;
>
> You have that config option set.
Yes - or better said 0-day testing has it set. Maybe that is not a sane
default for 0-day tests? The option is explicitely marked as "unstable"...
Fengguang?
> So the drivers base layer is calling ->remove after probe and
> triggering the destruction of the queue.
>
> What to do about this (apart from nuke such a stupid option) is
> somewhat more problematic.
I guess this is between you and Greg :).
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply
* Re: [PATCH] block/sed: Fix opal user range check and unused variables
From: Scott Bauer @ 2017-03-06 16:13 UTC (permalink / raw)
To: Jon Derrick
Cc: linux-block, Rafael Antognolli, linux-nvme, Jens Axboe,
David Binderman, Keith Busch, Christoph Hellwig
In-Reply-To: <1488814864-28478-1-git-send-email-jonathan.derrick@intel.com>
On Mon, Mar 06, 2017 at 08:41:04AM -0700, Jon Derrick wrote:
> Fixes check that the opal user is within the range, and cleans up unused
> method variables.
>
> Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
Reviewed-by: Scott Bauer <scott.bauer@intel.com>
_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
^ permalink raw reply
* Re: [PATCH 0/8 v2] Non-blocking AIO
From: Jens Axboe @ 2017-03-06 16:08 UTC (permalink / raw)
To: Avi Kivity, Jan Kara
Cc: Goldwyn Rodrigues, jack, hch, linux-fsdevel, linux-block,
linux-btrfs, linux-ext4, linux-xfs
In-Reply-To: <c8c3cede-d69a-a47a-0b08-85072b386d77@scylladb.com>
On 03/06/2017 08:59 AM, Avi Kivity wrote:
> On 03/06/2017 05:38 PM, Jens Axboe wrote:
>> On 03/06/2017 08:29 AM, Avi Kivity wrote:
>>>
>>> On 03/06/2017 05:19 PM, Jens Axboe wrote:
>>>> On 03/06/2017 01:25 AM, Jan Kara wrote:
>>>>> On Sun 05-03-17 16:56:21, Avi Kivity wrote:
>>>>>>> The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
>>>>>>> any of these conditions are met. This way userspace can push most
>>>>>>> of the write()s to the kernel to the best of its ability to complete
>>>>>>> and if it returns -EAGAIN, can defer it to another thread.
>>>>>>>
>>>>>> Is it not possible to push the iocb to a workqueue? This will allow
>>>>>> existing userspace to work with the new functionality, unchanged. Any
>>>>>> userspace implementation would have to do the same thing, so it's not like
>>>>>> we're saving anything by pushing it there.
>>>>> That is not easy because until IO is fully submitted, you need some parts
>>>>> of the context of the process which submits the IO (e.g. memory mappings,
>>>>> but possibly also other credentials). So you would need to somehow transfer
>>>>> this information to the workqueue.
>>>> Outside of technical challenges, the API also needs to return EAGAIN or
>>>> start blocking at some point. We can't expose a direct connection to
>>>> queue work like that, and let any user potentially create millions of
>>>> pending work items (and IOs).
>>> You wouldn't expect more concurrent events than the maxevents parameter
>>> that was supplied to io_setup syscall; it should have reserved any
>>> resources needed.
>> Doesn't matter what limit you apply, my point still stands - at some
>> point you have to return EAGAIN, or block. Returning EAGAIN without
>> the caller having flagged support for that change of behavior would
>> be problematic.
>
> Doesn't it already return EAGAIN (or some other error) if you exceed
> maxevents?
It's a setup thing. We check these limits when someone creates an IO
context, and carve out the specified entries form our global pool. Then
we free those "resources" when the io context is freed.
Right now I can setup an IO context with 1000 entries on it, yet that
number has NO bearing on when io_submit() would potentially block or
return EAGAIN.
We can have a huge gap on the intent signaled by io context setup, and
the reality imposed by what actually happens on the IO submission side.
--
Jens Axboe
^ permalink raw reply
* Re: [PATCH 0/8 v2] Non-blocking AIO
From: Avi Kivity @ 2017-03-06 15:59 UTC (permalink / raw)
To: Jens Axboe, Jan Kara
Cc: Goldwyn Rodrigues, jack, hch, linux-fsdevel, linux-block,
linux-btrfs, linux-ext4, linux-xfs
In-Reply-To: <57c873b2-fed6-e717-fc4e-ed2e328173b6@kernel.dk>
On 03/06/2017 05:38 PM, Jens Axboe wrote:
> On 03/06/2017 08:29 AM, Avi Kivity wrote:
>>
>> On 03/06/2017 05:19 PM, Jens Axboe wrote:
>>> On 03/06/2017 01:25 AM, Jan Kara wrote:
>>>> On Sun 05-03-17 16:56:21, Avi Kivity wrote:
>>>>>> The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
>>>>>> any of these conditions are met. This way userspace can push most
>>>>>> of the write()s to the kernel to the best of its ability to complete
>>>>>> and if it returns -EAGAIN, can defer it to another thread.
>>>>>>
>>>>> Is it not possible to push the iocb to a workqueue? This will allow
>>>>> existing userspace to work with the new functionality, unchanged. Any
>>>>> userspace implementation would have to do the same thing, so it's not like
>>>>> we're saving anything by pushing it there.
>>>> That is not easy because until IO is fully submitted, you need some parts
>>>> of the context of the process which submits the IO (e.g. memory mappings,
>>>> but possibly also other credentials). So you would need to somehow transfer
>>>> this information to the workqueue.
>>> Outside of technical challenges, the API also needs to return EAGAIN or
>>> start blocking at some point. We can't expose a direct connection to
>>> queue work like that, and let any user potentially create millions of
>>> pending work items (and IOs).
>> You wouldn't expect more concurrent events than the maxevents parameter
>> that was supplied to io_setup syscall; it should have reserved any
>> resources needed.
> Doesn't matter what limit you apply, my point still stands - at some
> point you have to return EAGAIN, or block. Returning EAGAIN without
> the caller having flagged support for that change of behavior would
> be problematic.
Doesn't it already return EAGAIN (or some other error) if you exceed
maxevents?
> And for this to really work, aio would need some serious help in
> how it applies limits. It looks like a hot mess.
For sure. I think it would be a shame to create more user-facing
complexity.
^ permalink raw reply
* Re: [bdi_unregister] 165a5e22fa INFO: task swapper:1 blocked for more than 120 seconds.
From: James Bottomley @ 2017-03-06 15:44 UTC (permalink / raw)
To: Jan Kara
Cc: Fengguang Wu, Jens Axboe, linux-block, linux-kernel, LKP,
Martin K. Petersen, linux-scsi
In-Reply-To: <20170306151415.GA32207@quack2.suse.cz>
On Mon, 2017-03-06 at 16:14 +0100, Jan Kara wrote:
> On Mon 06-03-17 06:35:21, James Bottomley wrote:
> > On Mon, 2017-03-06 at 13:01 +0100, Jan Kara wrote:
> > > On Mon 06-03-17 11:27:33, Jan Kara wrote:
> > > > Hi,
> > > >
> > > > On Sun 05-03-17 10:21:11, Wu Fengguang wrote:
> > > > > FYI next-20170303 is good while mainline is bad with this
> > > > > error.
> > > > > The attached reproduce-* may help reproduce the issue.
> > > >
> > > > Thanks for report! So from the stacktrace we are in the path
> > > > testing removal of a device immediately after it has been
> > > > probed
> > > > and for some reason bdi_unregister() hangs - likely waiting for
> > > > cgroup-writeback references to drop. Given how early this
> > > > happens
> > > > my guess is we fail to initialize something but for now I don't
> > > > see
> > > > how my patch could make a difference. I'm trying to reproduce
> > > > this
> > > > to be able to debug more...
> > >
> > > OK, so after some debugging I think this is yet another problem
> > > in
> > > SCSI initialization / destruction code which my patch only makes
> > > visible (added relevant maintainers).
> > >
> > > I can reproduce the problem reliably with enabling:
> > >
> > > CONFIG_DEBUG_TEST_DRIVER_REMOVE=y
> > > CONFIG_SCSI_DEBUG=m
> > > CONFIG_BLK_CGROUP=y
> > > CONFIG_MEMCG=y
> > > (and thus CONFIG_CGROUP_WRITEBACK=y)
> > >
> > > then 'modprobe scsi_debug' is all it takes to reproduce hang.
> > > Relevant kernel messages with some of my debugging added
> > > (attached is
> > > a patch that adds those debug messages):
> >
> > This looks to be precisely the same problem Dan Williams was
> > debugging
> > for us.
> >
> > > [ 58.721765] scsi host0: scsi_debug: version 1.86 [20160430]
> > > [ 58.721765] dev_size_mb=8, opts=0x0, submit_queues=1,
> > > statistics=0
> > > [ 58.728946] CGWB init ffff88007fbb2000
> > > [ 58.730095] Created sdev ffff880078e1a000
> > > [ 58.731611] scsi 0:0:0:0: Direct-Access Linux
> > > scsi_debug
> > > 0186 PQ : 0 ANSI: 7
> > > [ 58.782246] sd 0:0:0:0: [sda] 16384 512-byte logical blocks:
> > > (8.39
> > > MB/8.00 MiB)
> > > [ 58.789687] sd 0:0:0:0: [sda] Write Protect is off
> > > [ 58.791140] sd 0:0:0:0: [sda] Mode Sense: 73 00 10 08
> > > [ 58.800879] sd 0:0:0:0: [sda] Write cache: enabled, read
> > > cache:
> > > enabled, supports DPO and FUA
> > > [ 58.893738] sd 0:0:0:0: [sda] Attached SCSI disk
> > > [ 58.896808] Unreg1
> > > [ 58.897960] Unreg2
> > > [ 58.898637] Unreg3
> > > [ 58.899100] CGWB ffff88007fbb2000 usage_cnt: 0
> > > [ 58.900004] Unreg4
> > > [ 58.904976] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> >
> > OK, can you put a WARN_ON trace in sd_shutdown and tell us where
> > this
> > is coming from. For the device to be reused after this we have to
> > be
> > calling sd_shutdown() without going into SDEV_DEL.
>
> Sure. The call trace is:
>
> [ 41.919244] ------------[ cut here ]------------
> [ 41.919263] WARNING: CPU: 4 PID: 2335 at drivers/scsi/sd.c:3332
> sd_shutdown+0x2f/0x100
> [ 41.919268] Modules linked in: scsi_debug(+) netconsole loop btrfs
> raid6_pq zlib_deflate lzo_compress xor
> [ 41.919319] CPU: 4 PID: 2335 Comm: modprobe Not tainted 4.11.0-rc1
> -xen+ #49
> [ 41.919325] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [ 41.919331] Call Trace:
> [ 41.919343] dump_stack+0x8e/0xf0
> [ 41.919354] __warn+0x116/0x120
> [ 41.919361] warn_slowpath_null+0x1d/0x20
> [ 41.919368] sd_shutdown+0x2f/0x100
> [ 41.919374] sd_remove+0x70/0xd0
>
> *** Here is the unexpected step I guess...
>
> [ 41.919383] driver_probe_device+0xe0/0x4c0
Exactly. It's this, I think
bool test_remove = IS_ENABLED(CONFIG_DEBUG_TEST_DRIVER_REMOVE) &&
!drv->suppress_bind_attrs;
You have that config option set.
So the drivers base layer is calling ->remove after probe and
triggering the destruction of the queue.
What to do about this (apart from nuke such a stupid option) is
somewhat more problematic.
James
^ permalink raw reply
* [PATCH] block/sed: Fix opal user range check and unused variables
From: Jon Derrick @ 2017-03-06 15:41 UTC (permalink / raw)
To: Jens Axboe
Cc: Jon Derrick, linux-block, linux-nvme, David Binderman,
Scott Bauer, Rafael Antognolli, Christoph Hellwig, Keith Busch
Fixes check that the opal user is within the range, and cleans up unused
method variables.
Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
---
block/sed-opal.c | 10 ++--------
1 file changed, 2 insertions(+), 8 deletions(-)
diff --git a/block/sed-opal.c b/block/sed-opal.c
index 1e18dca..14035f8 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -1023,7 +1023,6 @@ static int finalize_and_send(struct opal_dev *dev, cont_fn cont)
static int gen_key(struct opal_dev *dev, void *data)
{
- const u8 *method;
u8 uid[OPAL_UID_LENGTH];
int err = 0;
@@ -1031,7 +1030,6 @@ static int gen_key(struct opal_dev *dev, void *data)
set_comid(dev, dev->comid);
memcpy(uid, dev->prev_data, min(sizeof(uid), dev->prev_d_len));
- method = opalmethod[OPAL_GENKEY];
kfree(dev->prev_data);
dev->prev_data = NULL;
@@ -1669,7 +1667,6 @@ static int add_user_to_lr(struct opal_dev *dev, void *data)
static int lock_unlock_locking_range(struct opal_dev *dev, void *data)
{
u8 lr_buffer[OPAL_UID_LENGTH];
- const u8 *method;
struct opal_lock_unlock *lkul = data;
u8 read_locked = 1, write_locked = 1;
int err = 0;
@@ -1677,7 +1674,6 @@ static int lock_unlock_locking_range(struct opal_dev *dev, void *data)
clear_opal_cmd(dev);
set_comid(dev, dev->comid);
- method = opalmethod[OPAL_SET];
if (build_locking_range(lr_buffer, sizeof(lr_buffer),
lkul->session.opal_key.lr) < 0)
return -ERANGE;
@@ -1733,14 +1729,12 @@ static int lock_unlock_locking_range_sum(struct opal_dev *dev, void *data)
{
u8 lr_buffer[OPAL_UID_LENGTH];
u8 read_locked = 1, write_locked = 1;
- const u8 *method;
struct opal_lock_unlock *lkul = data;
int ret;
clear_opal_cmd(dev);
set_comid(dev, dev->comid);
- method = opalmethod[OPAL_SET];
if (build_locking_range(lr_buffer, sizeof(lr_buffer),
lkul->session.opal_key.lr) < 0)
return -ERANGE;
@@ -2133,7 +2127,7 @@ static int opal_add_user_to_lr(struct opal_dev *dev,
pr_err("Locking state was not RO or RW\n");
return -EINVAL;
}
- if (lk_unlk->session.who < OPAL_USER1 &&
+ if (lk_unlk->session.who < OPAL_USER1 ||
lk_unlk->session.who > OPAL_USER9) {
pr_err("Authority was not within the range of users: %d\n",
lk_unlk->session.who);
@@ -2316,7 +2310,7 @@ static int opal_activate_user(struct opal_dev *dev,
int ret;
/* We can't activate Admin1 it's active as manufactured */
- if (opal_session->who < OPAL_USER1 &&
+ if (opal_session->who < OPAL_USER1 ||
opal_session->who > OPAL_USER9) {
pr_err("Who was not a valid user: %d\n", opal_session->who);
return -EINVAL;
--
1.8.3.1
^ permalink raw reply related
* Re: [PATCH 0/8 v2] Non-blocking AIO
From: Jens Axboe @ 2017-03-06 15:38 UTC (permalink / raw)
To: Avi Kivity, Jan Kara
Cc: Goldwyn Rodrigues, jack, hch, linux-fsdevel, linux-block,
linux-btrfs, linux-ext4, linux-xfs
In-Reply-To: <9b64c78e-c984-cf29-8f79-c48332a4c450@scylladb.com>
On 03/06/2017 08:29 AM, Avi Kivity wrote:
>
>
> On 03/06/2017 05:19 PM, Jens Axboe wrote:
>> On 03/06/2017 01:25 AM, Jan Kara wrote:
>>> On Sun 05-03-17 16:56:21, Avi Kivity wrote:
>>>>> The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
>>>>> any of these conditions are met. This way userspace can push most
>>>>> of the write()s to the kernel to the best of its ability to complete
>>>>> and if it returns -EAGAIN, can defer it to another thread.
>>>>>
>>>> Is it not possible to push the iocb to a workqueue? This will allow
>>>> existing userspace to work with the new functionality, unchanged. Any
>>>> userspace implementation would have to do the same thing, so it's not like
>>>> we're saving anything by pushing it there.
>>> That is not easy because until IO is fully submitted, you need some parts
>>> of the context of the process which submits the IO (e.g. memory mappings,
>>> but possibly also other credentials). So you would need to somehow transfer
>>> this information to the workqueue.
>> Outside of technical challenges, the API also needs to return EAGAIN or
>> start blocking at some point. We can't expose a direct connection to
>> queue work like that, and let any user potentially create millions of
>> pending work items (and IOs).
>
> You wouldn't expect more concurrent events than the maxevents parameter
> that was supplied to io_setup syscall; it should have reserved any
> resources needed.
Doesn't matter what limit you apply, my point still stands - at some
point you have to return EAGAIN, or block. Returning EAGAIN without
the caller having flagged support for that change of behavior would
be problematic.
And for this to really work, aio would need some serious help in
how it applies limits. It looks like a hot mess.
--
Jens Axboe
^ permalink raw reply
* Re: [PATCH 0/8 v2] Non-blocking AIO
From: Avi Kivity @ 2017-03-06 15:29 UTC (permalink / raw)
To: Jens Axboe, Jan Kara
Cc: Goldwyn Rodrigues, jack, hch, linux-fsdevel, linux-block,
linux-btrfs, linux-ext4, linux-xfs
In-Reply-To: <d0789835-5bae-6f99-af4d-c0e60eadbab7@kernel.dk>
On 03/06/2017 05:19 PM, Jens Axboe wrote:
> On 03/06/2017 01:25 AM, Jan Kara wrote:
>> On Sun 05-03-17 16:56:21, Avi Kivity wrote:
>>>> The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
>>>> any of these conditions are met. This way userspace can push most
>>>> of the write()s to the kernel to the best of its ability to complete
>>>> and if it returns -EAGAIN, can defer it to another thread.
>>>>
>>> Is it not possible to push the iocb to a workqueue? This will allow
>>> existing userspace to work with the new functionality, unchanged. Any
>>> userspace implementation would have to do the same thing, so it's not like
>>> we're saving anything by pushing it there.
>> That is not easy because until IO is fully submitted, you need some parts
>> of the context of the process which submits the IO (e.g. memory mappings,
>> but possibly also other credentials). So you would need to somehow transfer
>> this information to the workqueue.
> Outside of technical challenges, the API also needs to return EAGAIN or
> start blocking at some point. We can't expose a direct connection to
> queue work like that, and let any user potentially create millions of
> pending work items (and IOs).
You wouldn't expect more concurrent events than the maxevents parameter
that was supplied to io_setup syscall; it should have reserved any
resources needed.
> That's why the current API is safe, even
> though it does suck that it block seemingly randomly for users.
^ permalink raw reply
* Re: [PATCH] zram: set physical queue limits to avoid array out of bounds accesses
From: Jens Axboe @ 2017-03-06 15:21 UTC (permalink / raw)
To: Johannes Thumshirn, Minchan Kim, Nitin Gupta
Cc: Christoph Hellwig, Sergey Senozhatsky, Hannes Reinecke, yizhan,
Linux Block Layer Mailinglist, Linux Kernel Mailinglist
In-Reply-To: <20170306102335.9180-1-jthumshirn@suse.de>
On 03/06/2017 03:23 AM, Johannes Thumshirn wrote:
> zram can handle at most SECTORS_PER_PAGE sectors in a bio's bvec. When using
> the NVMe over Fabrics loopback target which potentially sends a huge bulk of
> pages attached to the bio's bvec this results in a kernel panic because of
> array out of bounds accesses in zram_decompress_page().
Applied, thanks.
--
Jens Axboe
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox