* [PATCH 3/4] bfq: Remove superfluous conversion from RQ_BIC()
From: Jan Kara @ 2022-05-19 10:52 UTC (permalink / raw)
To: Paolo Valente; +Cc: linux-block, Jens Axboe, Jan Kara
In-Reply-To: <20220519104621.15456-1-jack@suse.cz>
We store struct bfq_io_cq pointer in rq->elv.priv[0] in bfq_init_rq().
Thus a call to icq_to_bic() in RQ_BIC() is wrong. Luckily it does no
harm currently because struct io_iq is the first one in struct
bfq_io_cq.
Signed-off-by: Jan Kara <jack@suse.cz>
---
block/bfq-iosched.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 1cc242c90fb6..904fe56495ce 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -374,7 +374,7 @@ static const unsigned long bfq_activation_stable_merging = 600;
*/
static const unsigned long bfq_late_stable_merging = 600;
-#define RQ_BIC(rq) icq_to_bic((rq)->elv.priv[0])
+#define RQ_BIC(rq) ((struct bfq_io_cq *)((rq)->elv.priv[0]))
#define RQ_BFQQ(rq) ((rq)->elv.priv[1])
struct bfq_queue *bic_to_bfqq(struct bfq_io_cq *bic, bool is_sync)
--
2.35.3
^ permalink raw reply related
* [PATCH 0/4] bfq: Improve waker detection
From: Jan Kara @ 2022-05-19 10:52 UTC (permalink / raw)
To: Paolo Valente; +Cc: linux-block, Jens Axboe, Jan Kara
Hello,
this patch series restores regression in dbench for large number of processes
that was introduced by c65e6fd460b4 ("bfq: Do not let waker requests skip
proper accounting"). The detailed explanation is in the first patch but the
culprit in the end was that with large number of dbench clients it often
happened that flush worker bfqq replaced jbd2 bfqq as a waker of the bfqq
shared by dbench clients and that resulted in lot of idling and reduced
throughput.
The first two patches in this series improve the waker detection, the other
two are just cleanups I've spotted when working on the code.
I've tested the patches and they don't seem to cause regression for other
workloads.
Honza
^ permalink raw reply
* Re: [PATCH -next v3 1/2] blk-throttle: fix that io throttle can only work for single bio
From: Ming Lei @ 2022-05-19 10:42 UTC (permalink / raw)
To: Yu Kuai; +Cc: tj, axboe, geert, cgroups, linux-block, linux-kernel, yi.zhang
In-Reply-To: <20220519085811.879097-2-yukuai3@huawei.com>
On Thu, May 19, 2022 at 04:58:10PM +0800, Yu Kuai wrote:
> commit 9f5ede3c01f9 ("block: throttle split bio in case of iops limit")
> introduce a new problem, for example:
>
> [root@localhost ~]# echo "8:0 1024" > /sys/fs/cgroup/blkio/blkio.throttle.write_bps_device
> [root@localhost ~]# echo $$ > /sys/fs/cgroup/blkio/cgroup.procs
> [root@localhost ~]# dd if=/dev/zero of=/dev/sda bs=10k count=1 oflag=direct &
> [1] 620
> [root@localhost ~]# dd if=/dev/zero of=/dev/sda bs=10k count=1 oflag=direct &
> [2] 626
> [root@localhost ~]# 1+0 records in
> 1+0 records out
> 10240 bytes (10 kB, 10 KiB) copied, 10.0038 s, 1.0 kB/s1+0 records in
> 1+0 records out
>
> 10240 bytes (10 kB, 10 KiB) copied, 9.23076 s, 1.1 kB/s
> -> the second bio is issued after 10s instead of 20s.
>
> This is because if some bios are already queued, current bio is queued
> directly and the flag 'BIO_THROTTLED' is set. And later, when former
> bios are dispatched, this bio will be dispatched without waiting at all,
> this is due to tg_with_in_bps_limit() return 0 for this bio.
>
> In order to fix the problem, don't skip flaged bio in
> tg_with_in_bps_limit(), and for the problem that split bio can be
> double accounted, compensate the over-accounting in __blk_throtl_bio().
>
> Fixes: 9f5ede3c01f9 ("block: throttle split bio in case of iops limit")
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Thanks,
Ming
^ permalink raw reply
* Re: [PATCH RFC v6 00/21] DEPT(Dependency Tracker)
From: Catalin Marinas @ 2022-05-19 10:11 UTC (permalink / raw)
To: Hyeonggon Yoo
Cc: Byungchul Park, torvalds, damien.lemoal, linux-ide,
adilger.kernel, linux-ext4, mingo, linux-kernel, peterz, will,
tglx, rostedt, joel, sashal, daniel.vetter, chris, duyuyang,
johannes.berg, tj, tytso, willy, david, amir73il, gregkh,
kernel-team, linux-mm, akpm, mhocko, minchan, hannes,
vdavydov.dev, sj, jglisse, dennis, cl, penberg, rientjes, vbabka,
ngupta, linux-block, paolo.valente, josef, linux-fsdevel, viro,
jack, jack, jlayton, dan.j.williams, hch, djwong, dri-devel,
airlied, rodrigosiqueiramelo, melissa.srw, hamohammed.sa
In-Reply-To: <YnuKQ9UIhk9WYoz7@hyeyoo>
On Wed, May 11, 2022 at 07:04:51PM +0900, Hyeonggon Yoo wrote:
> On Wed, May 11, 2022 at 08:39:29AM +0900, Byungchul Park wrote:
> > On Tue, May 10, 2022 at 08:18:12PM +0900, Hyeonggon Yoo wrote:
> > > On Mon, May 09, 2022 at 09:16:37AM +0900, Byungchul Park wrote:
> > > > CASE 1.
> > > >
> > > > lock L with depth n
> > > > lock_nested L' with depth n + 1
> > > > ...
> > > > unlock L'
> > > > unlock L
> > > >
> > > > This case is allowed by Lockdep.
> > > > This case is allowed by DEPT cuz it's not a deadlock.
> > > >
> > > > CASE 2.
> > > >
> > > > lock L with depth n
> > > > lock A
> > > > lock_nested L' with depth n + 1
> > > > ...
> > > > unlock L'
> > > > unlock A
> > > > unlock L
> > > >
> > > > This case is allowed by Lockdep.
> > > > This case is *NOT* allowed by DEPT cuz it's a *DEADLOCK*.
> > >
> > > Yeah, in previous threads we discussed this [1]
> > >
> > > And the case was:
> > > scan_mutex -> object_lock -> kmemleak_lock -> object_lock
> > > And dept reported:
> > > object_lock -> kmemleak_lock, kmemleak_lock -> object_lock as
> > > deadlock.
> > >
> > > But IIUC - What DEPT reported happens only under scan_mutex and it
> > > is not simple just not to take them because the object can be
> > > removed from the list and freed while scanning via kmemleak_free()
> > > without kmemleak_lock and object_lock.
The above kmemleak sequence shouldn't deadlock since those locks, even
if taken in a different order, are serialised by scan_mutex. For various
reasons, trying to reduce the latency, I ended up with some
fine-grained, per-object locking.
For object allocation (rbtree modification) and tree search, we use
kmemleak_lock. During scanning (which can take minutes under
scan_mutex), we want to prevent (a) long latencies and (b) freeing the
object being scanned. We release the locks regularly for (a) and hold
the object->lock for (b).
In another thread Byungchul mentioned:
| context X context Y
|
| lock mutex A lock mutex A
| lock B lock C
| lock C lock B
| unlock C unlock B
| unlock B unlock C
| unlock mutex A unlock mutex A
|
| In my opinion, lock B and lock C are unnecessary if they are always
| along with lock mutex A. Or we should keep correct lock order across all
| the code.
If these are the only two places, yes, locks B and C would be
unnecessary. But we have those locks acquired (not nested) on the
allocation path (kmemleak_lock) and freeing path (object->lock). We
don't want to block those paths while scan_mutex is held.
That said, we may be able to use a single kmemleak_lock for everything.
The object freeing path may be affected slightly during scanning but the
code does release it every MAX_SCAN_SIZE bytes. It may even get slightly
faster as we'd hammer a single lock (I'll do some benchmarks).
But from a correctness perspective, I think the DEPT tool should be
improved a bit to detect when such out of order locking is serialised by
an enclosing lock/mutex.
--
Catalin
^ permalink raw reply
* Re: [PATCH -next v3 2/2] blk-throttle: fix io hung due to configuration updates
From: Michal Koutný @ 2022-05-19 9:58 UTC (permalink / raw)
To: Yu Kuai
Cc: tj, axboe, ming.lei, geert, cgroups, linux-block, linux-kernel,
yi.zhang
In-Reply-To: <20220519085811.879097-3-yukuai3@huawei.com>
Hello Kuayi.
On Thu, May 19, 2022 at 04:58:11PM +0800, Yu Kuai <yukuai3@huawei.com> wrote:
> If new configuration is submitted while a bio is throttled, then new
> waiting time is recaculated regardless that the bio might aready wait
> for some time:
>
> tg_conf_updated
> throtl_start_new_slice
> tg_update_disptime
> throtl_schedule_next_dispatch
>
> Then io hung can be triggered by always submmiting new configuration
> before the throttled bio is dispatched.
O.K.
> - /*
> - * We're already holding queue_lock and know @tg is valid. Let's
> - * apply the new config directly.
> - *
> - * Restart the slices for both READ and WRITES. It might happen
> - * that a group's limit are dropped suddenly and we don't want to
> - * account recently dispatched IO with new low rate.
> - */
> - throtl_start_new_slice(tg, READ);
> - throtl_start_new_slice(tg, WRITE);
> + throtl_update_slice(tg, old_limits);
throtl_start_new_slice zeroes *_disp fields.
If for instance, new config allowed only 0.5 throughput, the *_disp
fields would be scaled to 0.5.
How that change helps (better) the previously throttled bio to be dispatched?
(Is it because you omit update of slice_{start,end}?)
Thanks,
Michal
^ permalink raw reply
* Re: [PATCH V2 0/1] ubd: add io_uring based userspace block driver
From: Stefan Hajnoczi @ 2022-05-19 9:46 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, linux-kernel, Harris James R, io-uring,
Gabriel Krisman Bertazi, ZiyangZhang, Xiaoguang Wang
In-Reply-To: <YoWujjFArHaXuqYS@T590>
[-- Attachment #1: Type: text/plain, Size: 3053 bytes --]
On Thu, May 19, 2022 at 10:42:22AM +0800, Ming Lei wrote:
> On Wed, May 18, 2022 at 04:49:03PM +0100, Stefan Hajnoczi wrote:
> > On Wed, May 18, 2022 at 08:53:54PM +0800, Ming Lei wrote:
> > > On Wed, May 18, 2022 at 11:45:32AM +0100, Stefan Hajnoczi wrote:
> > > > On Wed, May 18, 2022 at 03:09:46PM +0800, Ming Lei wrote:
> > > > > On Tue, May 17, 2022 at 03:06:34PM +0100, Stefan Hajnoczi wrote:
> > > > > > Here are some more thoughts on the ubd-control device:
> > > > > >
> > > > > > The current patch provides a ubd-control device for processes with
> > > > > > suitable permissions (i.e. root) to create, start, stop, and fetch
> > > > > > information about devices.
> > > > > >
> > > > > > There is no isolation between devices created by one process and those
> > > > >
> > > > > I understand linux hasn't device namespace yet, so can you share the
> > > > > rational behind the idea of device isolation, is it because ubd device
> > > > > is served by ubd daemon which belongs to one pid NS? Or the user creating
> > > > > /dev/ubdbN belongs to one user NS?
> > > >
> > > > With the current model a process with access to ubd-control has control
> > > > over all ubd devices. This is not desirable for most container use cases
> > > > because ubd-control usage within a container means that container could
> > > > stop any ubd device on the system.
> > > >
> > > > Even for non-container use cases it's problematic that two applications
> > > > that use ubd can interfere with each other. If an application passes the
> > > > wrong device ID they can stop the other application's device, for
> > > > example.
> > > >
> > > > I think it's worth supporting a model where there are multiple ubd
> > > > daemons that are not cooperating/aware of each other. They should be
> > > > isolated from each other.
> > >
> > > Maybe I didn't mention it clearly, I meant the following model in last email:
> > >
> > > 1) every user can send UBD_CMD_ADD_DEV to /dev/ubd-control
> > >
> > > 2) the created /dev/ubdcN & /dev/udcbN are owned by the user who creates
> > > it
> >
> > How does this work? Does userspace (udev) somehow get the uid/gid from
> > the uevent so it can set the device node permissions?
>
> We can let 'ubd list' export the owner info, then udev may override the default
> owner with exported info.
>
> Or it can be done inside devtmpfs_create_node() by passing ubd's uid/gid
> at default.
>
> For /dev/ubdcN, I think it is safe, since the driver is only
> communicating with the userspace daemon, and both belong to same owner.
> Also ubd driver is simple enough to get full audited.
>
> For /dev/ubdbN, even though FS isn't allowed to mount, there is still
> lots of kernel code path involved, and some code path may not be run
> with unprivileged user before, that needs careful audit.
>
> So the biggest problem is if it is safe to export block disk to unprivileged
> user, and that is the one which can't be bypassed for any approach.
Okay.
Stefan
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges
From: Amir Goldstein @ 2022-05-19 9:20 UTC (permalink / raw)
To: Dave Chinner
Cc: Luis Chamberlain, linux-fsdevel, linux-block, pankydev8,
Theodore Tso, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso,
Dan Williams, Jake Edge, Klaus Jensen, Zorro Lang, fstests
In-Reply-To: <20220519075805.GU2306852@dread.disaster.area>
On Thu, May 19, 2022 at 10:58 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, May 19, 2022 at 09:36:41AM +0300, Amir Goldstein wrote:
> > [adding fstests and Zorro]
> >
> > On Thu, May 19, 2022 at 6:07 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
> > >
> > > I've been promoting the idea that running fstests once is nice,
> > > but things get interesting if you try to run fstests multiple
> > > times until a failure is found. It turns out at least kdevops has
> > > found tests which fail with a failure rate of typically 1/2 to
> > > 1/30 average failure rate. That is 1/2 means a failure can happen
> > > 50% of the time, whereas 1/30 means it takes 30 runs to find the
> > > failure.
> > >
> > > I have tried my best to annotate failure rates when I know what
> > > they might be on the test expunge list, as an example:
> > >
> > > workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d
> > >
> > > The term "failure rate 1/15" is 16 characters long, so I'd like
> > > to propose to standardize a way to represent this. How about
> > >
> > > generic/530 # F:1/15
> > >
> >
> > I am not fond of the 1/15 annotation at all, because the only fact that you
> > are able to document is that the test failed after 15 runs.
> > Suggesting that this means failure rate of 1/15 is a very big step.
> >
> > > Then we could extend the definition. F being current estimate, and this
> > > can be just how long it took to find the first failure. A more valuable
> > > figure would be failure rate avarage, so running the test multiple
> > > times, say 10, to see what the failure rate is and then averaging the
> > > failure out. So this could be a more accurate representation. For this
> > > how about:
> > >
> > > generic/530 # FA:1/15
> > >
> > > This would mean on average there failure rate has been found to be about
> > > 1/15, and this was determined based on 10 runs.
>
> These tests are run on multiple different filesystems. What happens
> if you run xfs, ext4, btrfs, overlay in sequence? We now have 4
> tests results, and 1 failure.
>
> Does that make it FA: 1/4, or does it make it 1/1,0/1,0/1,0/1?
>
> What happens if we run, say, XFS w/ defaults, rmapbt=1, v4, quotas?
>
> Does that make it FA: 1/4, or does it make it 0/1,1/1,0/1,0/1?
>
> In each case above, 1/4 tells us nothing useful. OTOH, the 0/1 vs
> 1/1 breakdown is useful information, because it tells us whihc
> filesystem failed the test, or which specific config failed the
> test.
>
> Hence I think the ability for us to draw useful conclusions from a
> number like this is large dependent on the specific data set it is
> drawn from...
>
> > > We should also go extend check for fstests/blktests to run a test
> > > until a failure is found and report back the number of successes.
> > >
> > > Thoughts?
>
> Who is the expected consumer of this information?
>
> I'm not sure it will be meaningful for anyone developing new code
> and needing to run every test every time they run fstests.
>
> OTOH, for a QA environment where you have a fixed progression of the
> kernel releases you are testing, it's likely valuable and already
> being tracked in various distro QE management tools and
> dashboards....
>
> > I have had a discussion about those tests with Zorro.
> >
> > Those tests that some people refer to as "flaky" are valuable,
> > but they are not deterministic, they are stochastic.
>
> Extremely valuable. Worth their weight in gold to developers like
> me.
>
> The recoveryloop group tests are a good example of this. The name of
> the group indicates how we use it. I typically set it up to run with
> an loop iteration like "-I 100" knowing that is will likely fail a
> random test in the group within 10 iterations.
>
> Those one-off failures are almost always a real bug, and they are
> often unique and difficult to reproduce exactly. Post-mortem needs
> to be performed immediately because it may well be a unique on-off
> failure and running another test after the failure destroys the
> state needed to perform a post-mortem.
>
> Hence having a test farm running these multiple times and then
> reporting "failed once in 15 runs" isn't really useful to me as a
> developer - it doesn't tell us anything new, nor does it help us
> find the bugs that are being tripped over.
>
> Less obvious stochastic tests exist, too. There are many tests that
> use fstress as a workload that runs while some other operation is
> performed - freeze, grow, ENOSPC, error injections, etc. They will
> never be deterministic, any again any failure tends to be a real
> bug, too.
>
> However, I think these should be run by QE environments all the time
> as they require long term, frequent execution across different
> configs in different environments to find the deep dark corners
> where the bugs may lie dormant. These are the tests that find things
> like subtle timing races no other tests ever exercise.
>
> I suspect that tests that alter their behaviour via LOAD_FACTOR or
> TIME_FACTOR will fall into this category.
>
> > I think MTBF is the standard way to describe reliability
> > of such tests, but I am having a hard time imagining how
> > the community can manage to document accurate annotations
> > of this sort, so I would stick with documenting the facts
> > (i.e. the test fails after N runs).
>
> I'm unsure of what "reliablity of such tests" means in this context.
> The tests are trying to exercise and measure the reliability of the
> kernel code - if the *test is unreliable* then that says to me the
> test needs fixing. If the test is reliable, then any failures that
> occur indicate that the filesystem/kernel/fs tools are unreliable,
> not the test....
>
> "test reliability" and "reliability of filesystem under test" are
> different things with similar names. The latter is what I think we
> are talking about measuring and reporting here, right?
>
> > OTOH, we do have deterministic tests, maybe even the majority of
> > fstests are deterministic(?)
>
> Very likely. As a generalisation, I'd say that anything that has a
> fixed, single step at a time recipe and a very well defined golden
> output or exact output comparison match is likely deterministic.
>
> We use things like 'within tolerance' so that slight variations in
> test results don't cause spurious failures and hence make the test
> more deterministic. Hence any test that uses 'within_tolerance' is
> probably a test that is expecting deterministic behaviour....
>
> > Considering that every auto test loop takes ~2 hours on our rig and that
> > I have been running over 100 loops over the past two weeks, if half
> > of fstests are deterministic, that is a lot of wait time and a lot of carbon
> > emission gone to waste.
> >
> > It would have been nice if I was able to exclude a "deterministic" group.
> > The problem is - can a developer ever tag a test as being "deterministic"?
>
> fstests allows private exclude lists to be used - perhaps these
> could be used to start building such a group for your test
> environment. Building a list from the tests you never see fail in
> your environment could be a good way to seed such a group...
>
> Maybe you have all the raw results from those hundreds of tests
> sitting around - what does crunching that data look like? Who else
> has large sets of consistent historic data sitting around? I don't
> because I pollute my results archive by frequently running varied
> and badly broken kernels through fstests, but people who just run
> released or stable kernels may have data sets that could be used....
>
I have no historic data of that sort and I have never stayed on the
same test system long enough to collect this sort of data.
Josef has told us in LPC 2021 about his btrfs fstests dashboard
where he started to collect historical data a while ago.
Collaborating on expunge lists of different fs and different
kernel/config/distro
is one of the goals behind Luis's kdevops project.
For now, the expunge lists are curated in git:
https://github.com/linux-kdevops/kdevops/tree/master/workflows/fstests/expunges
Going forward, this cannot scale. If we want to collaborate and
collect results from
multiple testers and test labs we should consult with the KernelCI
project, who are
doing exactly that for other test suites.
You did not attend Luis' talk in LSFMM this year (he has already mentioned
kdevops back in LSFMM 2019), where some of these issues were discussed.
The video from LSFMM 2022 talk should be available in coming weeks.
I hear that Luis is also planning on giving a talk to a wider audience
in LPC 2022.
Thanks,
Amir.
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
^ permalink raw reply
* [PATCH -next v3 2/2] blk-throttle: fix io hung due to configuration updates
From: Yu Kuai @ 2022-05-19 8:58 UTC (permalink / raw)
To: tj, axboe, ming.lei, geert
Cc: cgroups, linux-block, linux-kernel, yukuai3, yi.zhang
In-Reply-To: <20220519085811.879097-1-yukuai3@huawei.com>
If new configuration is submitted while a bio is throttled, then new
waiting time is recaculated regardless that the bio might aready wait
for some time:
tg_conf_updated
throtl_start_new_slice
tg_update_disptime
throtl_schedule_next_dispatch
Then io hung can be triggered by always submmiting new configuration
before the throttled bio is dispatched.
Fix the problem by respecting the time that throttled bio aready waited.
In order to do that, instead of start new slice in tg_conf_updated(),
just update 'bytes_disp' and 'io_disp' based on the new configuration.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/blk-throttle.c | 80 +++++++++++++++++++++++++++++++++++++-------
1 file changed, 67 insertions(+), 13 deletions(-)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 0c37be08ff28..aca63148bb83 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1271,7 +1271,58 @@ static int tg_print_conf_uint(struct seq_file *sf, void *v)
return 0;
}
-static void tg_conf_updated(struct throtl_grp *tg, bool global)
+static u64 throtl_update_bytes_disp(u64 dispatched, u64 new_limit,
+ u64 old_limit)
+{
+ if (new_limit == old_limit)
+ return dispatched;
+
+ if (!dispatched)
+ return 0;
+
+ /*
+ * In the case that multiply will overflow, just return 0. It will only
+ * let bios to be dispatched earlier.
+ */
+ if (div64_u64(U64_MAX, dispatched) < new_limit)
+ return 0;
+
+ dispatched *= new_limit;
+ return div64_u64(dispatched, old_limit);
+}
+
+static u32 throtl_update_io_disp(u32 dispatched, u32 new_limit, u32 old_limit)
+{
+ if (new_limit == old_limit)
+ return dispatched;
+
+ if (!dispatched)
+ return 0;
+
+ /*
+ * In the case that multiply will overflow, just return 0. It will only
+ * let bios to be dispatched earlier.
+ */
+ if (UINT_MAX / dispatched < new_limit)
+ return 0;
+
+ dispatched *= new_limit;
+ return dispatched / old_limit;
+}
+
+static void throtl_update_slice(struct throtl_grp *tg, u64 *old_limits)
+{
+ tg->bytes_disp[READ] = throtl_update_bytes_disp(tg->bytes_disp[READ],
+ tg_bps_limit(tg, READ), old_limits[0]);
+ tg->bytes_disp[WRITE] = throtl_update_bytes_disp(tg->bytes_disp[WRITE],
+ tg_bps_limit(tg, WRITE), old_limits[1]);
+ tg->io_disp[READ] = throtl_update_io_disp(tg->io_disp[READ],
+ tg_iops_limit(tg, READ), (u32)old_limits[2]);
+ tg->io_disp[WRITE] = throtl_update_io_disp(tg->io_disp[WRITE],
+ tg_iops_limit(tg, WRITE), (u32)old_limits[3]);
+}
+
+static void tg_conf_updated(struct throtl_grp *tg, u64 *old_limits, bool global)
{
struct throtl_service_queue *sq = &tg->service_queue;
struct cgroup_subsys_state *pos_css;
@@ -1310,16 +1361,7 @@ static void tg_conf_updated(struct throtl_grp *tg, bool global)
parent_tg->latency_target);
}
- /*
- * We're already holding queue_lock and know @tg is valid. Let's
- * apply the new config directly.
- *
- * Restart the slices for both READ and WRITES. It might happen
- * that a group's limit are dropped suddenly and we don't want to
- * account recently dispatched IO with new low rate.
- */
- throtl_start_new_slice(tg, READ);
- throtl_start_new_slice(tg, WRITE);
+ throtl_update_slice(tg, old_limits);
if (tg->flags & THROTL_TG_PENDING) {
tg_update_disptime(tg);
@@ -1327,6 +1369,14 @@ static void tg_conf_updated(struct throtl_grp *tg, bool global)
}
}
+static void tg_get_limits(struct throtl_grp *tg, u64 *limits)
+{
+ limits[0] = tg_bps_limit(tg, READ);
+ limits[1] = tg_bps_limit(tg, WRITE);
+ limits[2] = tg_iops_limit(tg, READ);
+ limits[3] = tg_iops_limit(tg, WRITE);
+}
+
static ssize_t tg_set_conf(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off, bool is_u64)
{
@@ -1335,6 +1385,7 @@ static ssize_t tg_set_conf(struct kernfs_open_file *of,
struct throtl_grp *tg;
int ret;
u64 v;
+ u64 old_limits[4];
ret = blkg_conf_prep(blkcg, &blkcg_policy_throtl, buf, &ctx);
if (ret)
@@ -1347,13 +1398,14 @@ static ssize_t tg_set_conf(struct kernfs_open_file *of,
v = U64_MAX;
tg = blkg_to_tg(ctx.blkg);
+ tg_get_limits(tg, old_limits);
if (is_u64)
*(u64 *)((void *)tg + of_cft(of)->private) = v;
else
*(unsigned int *)((void *)tg + of_cft(of)->private) = v;
- tg_conf_updated(tg, false);
+ tg_conf_updated(tg, old_limits, false);
ret = 0;
out_finish:
blkg_conf_finish(&ctx);
@@ -1523,6 +1575,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
struct blkg_conf_ctx ctx;
struct throtl_grp *tg;
u64 v[4];
+ u64 old_limits[4];
unsigned long idle_time;
unsigned long latency_time;
int ret;
@@ -1533,6 +1586,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
return ret;
tg = blkg_to_tg(ctx.blkg);
+ tg_get_limits(tg, old_limits);
v[0] = tg->bps_conf[READ][index];
v[1] = tg->bps_conf[WRITE][index];
@@ -1624,7 +1678,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
tg->td->limit_index = LIMIT_LOW;
} else
tg->td->limit_index = LIMIT_MAX;
- tg_conf_updated(tg, index == LIMIT_LOW &&
+ tg_conf_updated(tg, old_limits, index == LIMIT_LOW &&
tg->td->limit_valid[LIMIT_LOW]);
ret = 0;
out_finish:
--
2.31.1
^ permalink raw reply related
* [PATCH -next v3 1/2] blk-throttle: fix that io throttle can only work for single bio
From: Yu Kuai @ 2022-05-19 8:58 UTC (permalink / raw)
To: tj, axboe, ming.lei, geert
Cc: cgroups, linux-block, linux-kernel, yukuai3, yi.zhang
In-Reply-To: <20220519085811.879097-1-yukuai3@huawei.com>
commit 9f5ede3c01f9 ("block: throttle split bio in case of iops limit")
introduce a new problem, for example:
[root@localhost ~]# echo "8:0 1024" > /sys/fs/cgroup/blkio/blkio.throttle.write_bps_device
[root@localhost ~]# echo $$ > /sys/fs/cgroup/blkio/cgroup.procs
[root@localhost ~]# dd if=/dev/zero of=/dev/sda bs=10k count=1 oflag=direct &
[1] 620
[root@localhost ~]# dd if=/dev/zero of=/dev/sda bs=10k count=1 oflag=direct &
[2] 626
[root@localhost ~]# 1+0 records in
1+0 records out
10240 bytes (10 kB, 10 KiB) copied, 10.0038 s, 1.0 kB/s1+0 records in
1+0 records out
10240 bytes (10 kB, 10 KiB) copied, 9.23076 s, 1.1 kB/s
-> the second bio is issued after 10s instead of 20s.
This is because if some bios are already queued, current bio is queued
directly and the flag 'BIO_THROTTLED' is set. And later, when former
bios are dispatched, this bio will be dispatched without waiting at all,
this is due to tg_with_in_bps_limit() return 0 for this bio.
In order to fix the problem, don't skip flaged bio in
tg_with_in_bps_limit(), and for the problem that split bio can be
double accounted, compensate the over-accounting in __blk_throtl_bio().
Fixes: 9f5ede3c01f9 ("block: throttle split bio in case of iops limit")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/blk-throttle.c | 24 ++++++++++++++++++------
1 file changed, 18 insertions(+), 6 deletions(-)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 447e1b8722f7..0c37be08ff28 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -811,7 +811,7 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
unsigned int bio_size = throtl_bio_data_size(bio);
/* no need to throttle if this bio's bytes have been accounted */
- if (bps_limit == U64_MAX || bio_flagged(bio, BIO_THROTTLED)) {
+ if (bps_limit == U64_MAX) {
if (wait)
*wait = 0;
return true;
@@ -921,11 +921,8 @@ static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
unsigned int bio_size = throtl_bio_data_size(bio);
/* Charge the bio to the group */
- if (!bio_flagged(bio, BIO_THROTTLED)) {
- tg->bytes_disp[rw] += bio_size;
- tg->last_bytes_disp[rw] += bio_size;
- }
-
+ tg->bytes_disp[rw] += bio_size;
+ tg->last_bytes_disp[rw] += bio_size;
tg->io_disp[rw]++;
tg->last_io_disp[rw]++;
@@ -2121,6 +2118,21 @@ bool __blk_throtl_bio(struct bio *bio)
tg->last_low_overflow_time[rw] = jiffies;
throtl_downgrade_check(tg);
throtl_upgrade_check(tg);
+
+ /*
+ * re-entered bio has accounted bytes already, so try to
+ * compensate previous over-accounting. However, if new
+ * slice is started, just forget it.
+ */
+ if (bio_flagged(bio, BIO_THROTTLED)) {
+ unsigned int bio_size = throtl_bio_data_size(bio);
+
+ if (tg->bytes_disp[rw] >= bio_size)
+ tg->bytes_disp[rw] -= bio_size;
+ if (tg->last_bytes_disp[rw] >= bio_size)
+ tg->last_bytes_disp[rw] -= bio_size;
+ }
+
/* throtl is FIFO - if bios are already queued, should queue */
if (sq->nr_queued[rw])
break;
--
2.31.1
^ permalink raw reply related
* [PATCH -next v3 0/2] bugfix for blk-throttle
From: Yu Kuai @ 2022-05-19 8:58 UTC (permalink / raw)
To: tj, axboe, ming.lei, geert
Cc: cgroups, linux-block, linux-kernel, yukuai3, yi.zhang
Changes in v3:
- fix a check in patch 1
- fix link err in patch 2 on 32-bit platform
- hanlde overflow in patch 2
Changes in v2:
- use a new solution suggested by Ming
- change the title of patch 1
- add patch 2
Patch 1 fix that blk-throttle can't work if multiple bios are throttle,
Patch 2 fix io hung due to configuration updates.
Previous version:
v1 : https://lore.kernel.org/all/20220517134909.2910251-1-yukuai3@huawei.com/
v2 : https://lore.kernel.org/all/20220518072751.1188163-1-yukuai3@huawei.com/
Yu Kuai (2):
blk-throttle: fix that io throttle can only work for single bio
blk-throttle: fix io hung due to configuration updates
block/blk-throttle.c | 104 +++++++++++++++++++++++++++++++++++--------
1 file changed, 85 insertions(+), 19 deletions(-)
--
2.31.1
^ permalink raw reply
* Re: [RFC 6/6] irqchip/gic-v3: Re-init GIC hardware upon hibernation restore
From: Marc Zyngier @ 2022-05-19 7:59 UTC (permalink / raw)
To: Vivek Kumar
Cc: corbet, catalin.marinas, will, tglx, axboe, rafael, akpm,
linux-doc, linux-kernel, linux-arm-kernel, linux-block, linux-pm,
linux-mm, len.brown, pavel, paulmck, bp, keescook, songmuchun,
rdunlap, damien.lemoal, pasha.tatashin, tabba, ardb, tsoni,
quic_psodagud, quic_svaddagi, Prasanna Kumar
In-Reply-To: <1652860121-24092-7-git-send-email-quic_vivekuma@quicinc.com>
Hi Vivek,
On Wed, 18 May 2022 08:48:41 +0100,
Vivek Kumar <quic_vivekuma@quicinc.com> wrote:
>
> Code added in this patch takes backup of different set of
> registers during hibernation suspend. On receiving hibernation
> restore callback, it restores register values from backup. This
> ensures state of hardware to be same just before hibernation and
> after restore.
>
> Signed-off-by: Vivek Kumar <quic_vivekuma@quicinc.com>
> Signed-off-by: Prasanna Kumar <quic_kprasan@quicinc.com>
> ---
> drivers/irqchip/irq-gic-v3.c | 138 ++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 136 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
> index 2be8dea..442d32f 100644
> --- a/drivers/irqchip/irq-gic-v3.c
> +++ b/drivers/irqchip/irq-gic-v3.c
> @@ -29,6 +29,10 @@
> #include <asm/smp_plat.h>
> #include <asm/virt.h>
>
> +#include <linux/syscore_ops.h>
> +#include <linux/suspend.h>
> +#include <linux/notifier.h>
> +
> #include "irq-gic-common.h"
>
> #define GICD_INT_NMI_PRI (GICD_INT_DEF_PRI & ~0x80)
> @@ -56,6 +60,14 @@ struct gic_chip_data {
> bool has_rss;
> unsigned int ppi_nr;
> struct partition_desc **ppi_descs;
> +#ifdef CONFIG_HIBERNATION
> + unsigned int enabled_irqs[32];
> + unsigned int active_irqs[32];
> + unsigned int irq_edg_lvl[64];
> + unsigned int ppi_edg_lvl;
> + unsigned int enabled_sgis;
> + unsigned int pending_sgis;
> +#endif
This is either way too much, or way too little. Just restoring these
registers is nowhere near enough, as you are completely ignoring the
ITS, so this will leave the machine broken for anything that requires
LPIs.
But If the bootloader is supposed to replace the kernel to put the HW
in a state where the GIC is usable again, why do we need any of this?
Hibernation relies on a basic promise: the secondary kernel is entered
with the HW in a reasonable state, and the basic infrastructure
(specially for stuff that can be only programmed once per boot such as
the RD tables) is already available. If the bootloader is going to do
the work of the initial kernel, then it must do it fully, and not
require this sort of random sprinkling all over the shop.
Effectively, there is an ABI between the primary kernel and the
secondary, and I don't see why this interface should change to paper
over what I see as the deficiencies of the bootloader.
Am I missing anything?
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
^ permalink raw reply
* Re: [PATCH v4 08/13] btrfs:zoned: make sb for npo2 zone devices align with sb log offsets
From: Naohiro Aota @ 2022-05-19 7:59 UTC (permalink / raw)
To: Pankaj Raghav
Cc: axboe@kernel.dk, damien.lemoal@opensource.wdc.com,
pankydev8@gmail.com, dsterba@suse.com, hch@lst.de,
linux-nvme@lists.infradead.org, linux-fsdevel@vger.kernel.org,
linux-btrfs@vger.kernel.org, jiangbo.365@bytedance.com,
linux-block@vger.kernel.org, gost.dev@samsung.com,
linux-kernel@vger.kernel.org, dm-devel@redhat.com
In-Reply-To: <20220516165416.171196-9-p.raghav@samsung.com>
On Mon, May 16, 2022 at 06:54:11PM +0200, Pankaj Raghav wrote:
> Superblocks for zoned devices are fixed as 2 zones at 0, 512GB and 4TB.
> These are fixed at these locations so that recovery tools can reliably
> retrieve the superblocks even if one of the mirror gets corrupted.
>
> power of 2 zone sizes align at these offsets irrespective of their
> value but non power of 2 zone sizes will not align.
>
> To make sure the first zone at mirror 1 and mirror 2 align, write zero
> operation is performed to move the write pointer of the first zone to
> the expected offset. This operation is performed only after a zone reset
> of the first zone, i.e., when the second zone that contains the sb is FULL.
>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
> fs/btrfs/zoned.c | 68 ++++++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 63 insertions(+), 5 deletions(-)
>
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index 3023c871e..805aeaa76 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -760,11 +760,44 @@ int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info)
> return 0;
> }
>
> +static int fill_sb_wp_offset(struct block_device *bdev, struct blk_zone *zone,
> + int mirror, u64 *wp_ret)
> +{
> + u64 offset = 0;
> + int ret = 0;
> +
> + ASSERT(!is_power_of_two_u64(zone->len));
> + ASSERT(zone->wp == zone->start);
> + ASSERT(mirror != 0);
> +
> + switch (mirror) {
> + case 1:
> + div64_u64_rem(BTRFS_SB_LOG_FIRST_OFFSET >> SECTOR_SHIFT,
> + zone->len, &offset);
> + break;
> + case 2:
> + div64_u64_rem(BTRFS_SB_LOG_SECOND_OFFSET >> SECTOR_SHIFT,
> + zone->len, &offset);
> + break;
> + }
> +
> + ret = blkdev_issue_zeroout(bdev, zone->start, offset, GFP_NOFS, 0);
> + if (ret)
> + return ret;
> +
> + zone->wp += offset;
> + zone->cond = BLK_ZONE_COND_IMP_OPEN;
> + *wp_ret = zone->wp << SECTOR_SHIFT;
> +
> + return 0;
> +}
> +
> static int sb_log_location(struct block_device *bdev, struct blk_zone *zones,
> - int rw, u64 *bytenr_ret)
> + int rw, int mirror, u64 *bytenr_ret)
> {
> u64 wp;
> int ret;
> + bool zones_empty = false;
>
> if (zones[0].type == BLK_ZONE_TYPE_CONVENTIONAL) {
> *bytenr_ret = zones[0].start << SECTOR_SHIFT;
> @@ -775,13 +808,31 @@ static int sb_log_location(struct block_device *bdev, struct blk_zone *zones,
> if (ret != -ENOENT && ret < 0)
> return ret;
>
> + if (ret == -ENOENT)
> + zones_empty = true;
> +
I think, we don't need this. We need to issue the zeroout when
zones[0]->cond == BLK_ZONE_COND_EMPTY && !is_power_of_2(...) after sending
ZONE_RESET if necessary. No?
> if (rw == WRITE) {
> struct blk_zone *reset = NULL;
> + bool is_sb_offset_write_req = false;
> + u32 reset_zone_nr = -1;
>
> - if (wp == zones[0].start << SECTOR_SHIFT)
> + if (wp == zones[0].start << SECTOR_SHIFT) {
> reset = &zones[0];
> - else if (wp == zones[1].start << SECTOR_SHIFT)
> + reset_zone_nr = 0;
> + } else if (wp == zones[1].start << SECTOR_SHIFT) {
> reset = &zones[1];
> + reset_zone_nr = 1;
> + }
> +
> + /*
> + * Non po2 zone sizes will not align naturally at
> + * mirror 1 (512GB) and mirror 2 (4TB). The wp of the
> + * 1st zone in those superblock mirrors need to be
> + * moved to align at those offsets.
> + */
> + is_sb_offset_write_req =
> + (zones_empty || (reset_zone_nr == 0)) && mirror &&
> + !is_power_of_2(zones[0].len);
>
> if (reset && reset->cond != BLK_ZONE_COND_EMPTY) {
> ASSERT(sb_zone_is_full(reset));
> @@ -795,6 +846,13 @@ static int sb_log_location(struct block_device *bdev, struct blk_zone *zones,
> reset->cond = BLK_ZONE_COND_EMPTY;
> reset->wp = reset->start;
> }
> +
> + if (is_sb_offset_write_req) {
> + ret = fill_sb_wp_offset(bdev, &zones[0], mirror, &wp);
> + if (ret)
> + return ret;
> + }
> +
> } else if (ret != -ENOENT) {
> /*
> * For READ, we want the previous one. Move write pointer to
> @@ -851,7 +909,7 @@ int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw,
> if (ret != BTRFS_NR_SB_LOG_ZONES)
> return -EIO;
>
> - return sb_log_location(bdev, zones, rw, bytenr_ret);
> + return sb_log_location(bdev, zones, rw, mirror, bytenr_ret);
> }
>
> int btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw,
> @@ -877,7 +935,7 @@ int btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw,
>
> return sb_log_location(device->bdev,
> &zinfo->sb_zones[BTRFS_NR_SB_LOG_ZONES * mirror],
> - rw, bytenr_ret);
> + rw, mirror, bytenr_ret);
> }
>
> static inline bool is_sb_log_zone(struct btrfs_zoned_device_info *zinfo,
> --
> 2.25.1
>
^ permalink raw reply
* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges
From: Dave Chinner @ 2022-05-19 7:58 UTC (permalink / raw)
To: Amir Goldstein
Cc: Luis Chamberlain, linux-fsdevel, linux-block, pankydev8,
Theodore Tso, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso,
Dan Williams, Jake Edge, Klaus Jensen, Zorro Lang, fstests
In-Reply-To: <CAOQ4uxhKHMjGq0QKKMPFAV6iJFwe1H5hBomCVVeT1EWJzo0eXg@mail.gmail.com>
On Thu, May 19, 2022 at 09:36:41AM +0300, Amir Goldstein wrote:
> [adding fstests and Zorro]
>
> On Thu, May 19, 2022 at 6:07 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
> >
> > I've been promoting the idea that running fstests once is nice,
> > but things get interesting if you try to run fstests multiple
> > times until a failure is found. It turns out at least kdevops has
> > found tests which fail with a failure rate of typically 1/2 to
> > 1/30 average failure rate. That is 1/2 means a failure can happen
> > 50% of the time, whereas 1/30 means it takes 30 runs to find the
> > failure.
> >
> > I have tried my best to annotate failure rates when I know what
> > they might be on the test expunge list, as an example:
> >
> > workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d
> >
> > The term "failure rate 1/15" is 16 characters long, so I'd like
> > to propose to standardize a way to represent this. How about
> >
> > generic/530 # F:1/15
> >
>
> I am not fond of the 1/15 annotation at all, because the only fact that you
> are able to document is that the test failed after 15 runs.
> Suggesting that this means failure rate of 1/15 is a very big step.
>
> > Then we could extend the definition. F being current estimate, and this
> > can be just how long it took to find the first failure. A more valuable
> > figure would be failure rate avarage, so running the test multiple
> > times, say 10, to see what the failure rate is and then averaging the
> > failure out. So this could be a more accurate representation. For this
> > how about:
> >
> > generic/530 # FA:1/15
> >
> > This would mean on average there failure rate has been found to be about
> > 1/15, and this was determined based on 10 runs.
These tests are run on multiple different filesystems. What happens
if you run xfs, ext4, btrfs, overlay in sequence? We now have 4
tests results, and 1 failure.
Does that make it FA: 1/4, or does it make it 1/1,0/1,0/1,0/1?
What happens if we run, say, XFS w/ defaults, rmapbt=1, v4, quotas?
Does that make it FA: 1/4, or does it make it 0/1,1/1,0/1,0/1?
In each case above, 1/4 tells us nothing useful. OTOH, the 0/1 vs
1/1 breakdown is useful information, because it tells us whihc
filesystem failed the test, or which specific config failed the
test.
Hence I think the ability for us to draw useful conclusions from a
number like this is large dependent on the specific data set it is
drawn from...
> > We should also go extend check for fstests/blktests to run a test
> > until a failure is found and report back the number of successes.
> >
> > Thoughts?
Who is the expected consumer of this information?
I'm not sure it will be meaningful for anyone developing new code
and needing to run every test every time they run fstests.
OTOH, for a QA environment where you have a fixed progression of the
kernel releases you are testing, it's likely valuable and already
being tracked in various distro QE management tools and
dashboards....
> I have had a discussion about those tests with Zorro.
>
> Those tests that some people refer to as "flaky" are valuable,
> but they are not deterministic, they are stochastic.
Extremely valuable. Worth their weight in gold to developers like
me.
The recoveryloop group tests are a good example of this. The name of
the group indicates how we use it. I typically set it up to run with
an loop iteration like "-I 100" knowing that is will likely fail a
random test in the group within 10 iterations.
Those one-off failures are almost always a real bug, and they are
often unique and difficult to reproduce exactly. Post-mortem needs
to be performed immediately because it may well be a unique on-off
failure and running another test after the failure destroys the
state needed to perform a post-mortem.
Hence having a test farm running these multiple times and then
reporting "failed once in 15 runs" isn't really useful to me as a
developer - it doesn't tell us anything new, nor does it help us
find the bugs that are being tripped over.
Less obvious stochastic tests exist, too. There are many tests that
use fstress as a workload that runs while some other operation is
performed - freeze, grow, ENOSPC, error injections, etc. They will
never be deterministic, any again any failure tends to be a real
bug, too.
However, I think these should be run by QE environments all the time
as they require long term, frequent execution across different
configs in different environments to find the deep dark corners
where the bugs may lie dormant. These are the tests that find things
like subtle timing races no other tests ever exercise.
I suspect that tests that alter their behaviour via LOAD_FACTOR or
TIME_FACTOR will fall into this category.
> I think MTBF is the standard way to describe reliability
> of such tests, but I am having a hard time imagining how
> the community can manage to document accurate annotations
> of this sort, so I would stick with documenting the facts
> (i.e. the test fails after N runs).
I'm unsure of what "reliablity of such tests" means in this context.
The tests are trying to exercise and measure the reliability of the
kernel code - if the *test is unreliable* then that says to me the
test needs fixing. If the test is reliable, then any failures that
occur indicate that the filesystem/kernel/fs tools are unreliable,
not the test....
"test reliability" and "reliability of filesystem under test" are
different things with similar names. The latter is what I think we
are talking about measuring and reporting here, right?
> OTOH, we do have deterministic tests, maybe even the majority of
> fstests are deterministic(?)
Very likely. As a generalisation, I'd say that anything that has a
fixed, single step at a time recipe and a very well defined golden
output or exact output comparison match is likely deterministic.
We use things like 'within tolerance' so that slight variations in
test results don't cause spurious failures and hence make the test
more deterministic. Hence any test that uses 'within_tolerance' is
probably a test that is expecting deterministic behaviour....
> Considering that every auto test loop takes ~2 hours on our rig and that
> I have been running over 100 loops over the past two weeks, if half
> of fstests are deterministic, that is a lot of wait time and a lot of carbon
> emission gone to waste.
>
> It would have been nice if I was able to exclude a "deterministic" group.
> The problem is - can a developer ever tag a test as being "deterministic"?
fstests allows private exclude lists to be used - perhaps these
could be used to start building such a group for your test
environment. Building a list from the tests you never see fail in
your environment could be a good way to seed such a group...
Maybe you have all the raw results from those hundreds of tests
sitting around - what does crunching that data look like? Who else
has large sets of consistent historic data sitting around? I don't
because I pollute my results archive by frequently running varied
and badly broken kernels through fstests, but people who just run
released or stable kernels may have data sets that could be used....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply
* Re: [PATCH v4 08/13] btrfs:zoned: make sb for npo2 zone devices align with sb log offsets
From: Johannes Thumshirn @ 2022-05-19 7:57 UTC (permalink / raw)
To: Pankaj Raghav, dsterba@suse.cz
Cc: axboe@kernel.dk, damien.lemoal@opensource.wdc.com,
pankydev8@gmail.com, dsterba@suse.com, hch@lst.de,
linux-nvme@lists.infradead.org, linux-fsdevel@vger.kernel.org,
linux-btrfs@vger.kernel.org, jiangbo.365@bytedance.com,
linux-block@vger.kernel.org, gost.dev@samsung.com,
linux-kernel@vger.kernel.org, dm-devel@redhat.com
In-Reply-To: <717a2c83-0678-9310-4c75-9ad5da0472f6@samsung.com>
On 18/05/2022 11:17, Pankaj Raghav wrote:
> On 2022-05-17 14:42, David Sterba wrote:
>> On Mon, May 16, 2022 at 06:54:11PM +0200, Pankaj Raghav wrote:
>>> Superblocks for zoned devices are fixed as 2 zones at 0, 512GB and 4TB.
>>> These are fixed at these locations so that recovery tools can reliably
>>> retrieve the superblocks even if one of the mirror gets corrupted.
>>>
>>> power of 2 zone sizes align at these offsets irrespective of their
>>> value but non power of 2 zone sizes will not align.
>>>
>>> To make sure the first zone at mirror 1 and mirror 2 align, write zero
>>> operation is performed to move the write pointer of the first zone to
>>> the expected offset. This operation is performed only after a zone reset
>>> of the first zone, i.e., when the second zone that contains the sb is FULL.
>> Is it a good idea to do the "write zeros", instead of a plain "set write
>> pointer"? I assume setting write pointer is instant, while writing
>> potentially hundreds of megabytes may take significiant time. As the
>> functions may be called from random contexts, the increased time may
>> become a problem.
>>
> Unfortunately it is not possible to just move the WP in zoned devices.
> The only alternative that I could use is to do write zeroes which are
> natively supported by some devices such as ZNS. It would be nice to know
> if someone had a better solution to this instead of doing write zeroes
> in zoned devices.
>
I have another question. In case we need to pad the sb zone with a write
zeros and have a power fail between the write-zeros and the regular
super-block write, what happens? I know this padding is only done for the
backup super blocks, never the less it can happen and it can happen when
the primary super block is also corrupted.
AFAIU we're then trying to reach out for a backup super block, look at the
write pointer and it only contains zeros but no super block, as only the
write-zeros has reached the device and not the super block write.
How is this situation handled?
Thanks,
Johannes
^ permalink raw reply
* Re: [PATCH v2] blk-mq: fix incorrect blk_status_t casts
From: Christoph Hellwig @ 2022-05-19 7:53 UTC (permalink / raw)
To: Vasily Averin
Cc: Christoph Hellwig, Jens Axboe, kernel, linux-kernel, linux-block
In-Reply-To: <55475ea9-5f6c-fa19-b52d-93fb89209850@openvz.org>
Still a whole lot of casts. Take a look at my "cleanup blk_execute_rq*"
for how think we can solve this in a much nicer way.
^ permalink raw reply
* [linux-next] build fails modpost: "blkcg_get_fc_appid" [drivers/nvme/host/nvme-fc.ko] undefined!
From: Tasmiya Nalatwad @ 2022-05-19 7:49 UTC (permalink / raw)
To: linux-block; +Cc: linux-kernel, abdhalee, sachinp, mputtash
Greetings,
linux-next build fails modpost: "blkcg_get_fc_appid"
[drivers/nvme/host/nvme-fc.ko] undefined!
Console Logs :
[console-expect]#make -j 17 -s && make modules && make modules_install
&& make install
make -j 17 -s && make modules && make modules_install && make install
ERROR: modpost: "blkcg_get_fc_appid" [drivers/nvme/host/nvme-fc.ko]
undefined!
make[1]: *** [scripts/Makefile.modpost:134: modules-only.symvers] Error 1
make: *** [Makefile:1914: modules] Error 2
make: *** Waiting for unfinished jobs....
--
Regards,
Tasmiya Nalatwad
IBM Linux Technology Center
^ permalink raw reply
* Re: [PATCHv2 0/3] direct io alignment relax
From: hch @ 2022-05-19 7:43 UTC (permalink / raw)
To: Eric Biggers
Cc: Chaitanya Kulkarni, Keith Busch, Keith Busch,
linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org,
axboe@kernel.dk, Kernel Team, hch@lst.de, bvanassche@acm.org,
damien.lemoal@opensource.wdc.com
In-Reply-To: <YoWlQxEjvlmACNRV@sol.localdomain>
On Wed, May 18, 2022 at 07:02:43PM -0700, Eric Biggers wrote:
> If you're thinking about about BLKDISCARD and BLKZEROOUT, those are block device
> ioctls, so statx() doesn't seem like a great fit for them. There is already a
> place to expose block device properties to userspace (/sys/block/$dev/). Direct
> I/O is different because direct I/O is not just a feature of block devices but
> also of regular files, and it is affected by filesystem-level constraints.
Agreed.
^ permalink raw reply
* Re: [PATCHv2 0/3] direct io alignment relax
From: Christoph Hellwig @ 2022-05-19 7:42 UTC (permalink / raw)
To: Jens Axboe
Cc: Keith Busch, linux-fsdevel, linux-block, Kernel Team, hch,
bvanassche, damien.lemoal, Keith Busch
In-Reply-To: <dc8e7b85-fba1-b45e-231e-9c8054aea505@kernel.dk>
On Wed, May 18, 2022 at 04:45:10PM -0600, Jens Axboe wrote:
> On 5/18/22 11:11 AM, Keith Busch wrote:
> > From: Keith Busch <kbusch@kernel.org>
> >
> > Including the fs list this time.
> >
> > I am still working on a better interface to report the dio alignment to
> > an application. The most recent suggestion of using statx is proving to
> > be less straight forward than I thought, but I don't want to hold this
> > series up for that.
>
> This looks good to me. Anyone object to queueing this one up?
Yes. I really do like this feature, but I don't think it is ready to
rush it in. In addition to the ongoing discussions in this thread
we absolutely need proper statx support for the alignments to avoid
userspace growing all kinds of sysfs growling crap to make use of it.
^ permalink raw reply
* Re: [PATCHv2 3/3] block: relax direct io memory alignment
From: Christoph Hellwig @ 2022-05-19 7:41 UTC (permalink / raw)
To: Keith Busch
Cc: Eric Biggers, Keith Busch, linux-fsdevel, linux-block, axboe,
Kernel Team, hch, bvanassche, damien.lemoal
In-Reply-To: <YoXN5CpSGGe7+OJs@kbusch-mbp.dhcp.thefacebook.com>
On Wed, May 18, 2022 at 10:56:04PM -0600, Keith Busch wrote:
> I'm surely missing something here. I know the bvecs are not necessarily lbs
> aligned, but why does that matter? Is there some driver that can only take
> exactly 1 bvec, but allows it to be unaligned? If so, we could take the segment
> queue limit into account, but I am not sure that we need to.
At least stacking drivers had all kinds of interesting limitations in
this area. How much testing did this series get with all kinds of
interesting dm targets and md pesonalities?
^ permalink raw reply
* Re: [PATCHv2 3/3] block: relax direct io memory alignment
From: Christoph Hellwig @ 2022-05-19 7:39 UTC (permalink / raw)
To: Keith Busch
Cc: Eric Biggers, Keith Busch, linux-fsdevel, linux-block, axboe,
Kernel Team, hch, bvanassche, damien.lemoal
In-Reply-To: <YoWWtwsiKGqoTbVU@kbusch-mbp.dhcp.thefacebook.com>
On Wed, May 18, 2022 at 07:00:39PM -0600, Keith Busch wrote:
> How? This patch ensures every segment is block size aligned. We can always
> split a bio in the middle of a bvec for any lower level hardware constraints,
> and I'm not finding any splitting criteria that would try to break a bio on a
> non-block aligned boundary.
Do you mean bio_vec with segment here? How do we ensure that?
^ permalink raw reply
* Re: [PATCHv2 3/3] block: relax direct io memory alignment
From: Christoph Hellwig @ 2022-05-19 7:38 UTC (permalink / raw)
To: Keith Busch
Cc: linux-fsdevel, linux-block, axboe, Kernel Team, hch, bvanassche,
damien.lemoal, Keith Busch
In-Reply-To: <20220518171131.3525293-4-kbusch@fb.com>
> @@ -1207,6 +1207,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> {
> unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
> unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
> + struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
> struct page **pages = (struct page **)bv;
> bool same_page = false;
> @@ -1223,6 +1224,8 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
>
> size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
> + if (size > 0)
> + size = ALIGN_DOWN(size, queue_logical_block_size(q));
So if we do get a size that is not logical block size alignment here,
we reduce it to the block size aligned one below. Why do we do that?
> + if ((pos | iov_iter_count(iter)) & (bdev_logical_block_size(bdev) - 1))
> + return -EINVAL;
> + if (iov_iter_alignment(iter) & bdev_dma_alignment(bdev))
> return -EINVAL;
Can we have a little inline helper for these checks instead of
duplicating them three times?
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 840752006f60..64cc176be60c 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -1131,7 +1131,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
> struct dio_submit sdio = { 0, };
> struct buffer_head map_bh = { 0, };
> struct blk_plug plug;
> - unsigned long align = offset | iov_iter_alignment(iter);
> + unsigned long align = iov_iter_alignment(iter);
I'd much prefer to not just relax this for random file systems,
and especially not the legacy direct I/O code. I think we can eventually
do iomap, but only after an audit and test of each file system, which
might require a new IOMAP_DIO_* flag at least initially.
> +static inline unsigned int bdev_dma_alignment(struct block_device *bdev)
> +{
> + return queue_dma_alignment(bdev_get_queue(bdev));
> +}
Plase do this in a separate patch.
^ permalink raw reply
* Re: [PATCH 3/3] blk-mq: remove the done argument to blk_execute_rq_nowait
From: Kanchan Joshi @ 2022-05-19 7:36 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Ming Lei, linux-block, linux-nvme, linux-scsi,
target-devel
In-Reply-To: <20220517064901.3059255-4-hch@lst.de>
On Tue, May 17, 2022 at 12:27 PM Christoph Hellwig <hch@lst.de> wrote:
>
> Let the caller set it together with the end_io_data instead of passing
> a pointless argument. Note the the target code did in fact already
> set it and then just overrode it again by calling blk_execute_rq_nowait.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> block/blk-mq.c | 5 +----
> drivers/block/sx8.c | 4 ++--
> drivers/nvme/host/core.c | 3 ++-
> drivers/nvme/host/ioctl.c | 3 ++-
> drivers/nvme/host/pci.c | 10 +++++++---
> drivers/nvme/target/passthru.c | 3 ++-
> drivers/scsi/scsi_error.c | 5 +++--
> drivers/scsi/sg.c | 3 ++-
> drivers/scsi/st.c | 3 ++-
> drivers/scsi/ufs/ufshpb.c | 6 ++++--
> drivers/target/target_core_pscsi.c | 3 +--
> include/linux/blk-mq.h | 3 +--
> 12 files changed, 29 insertions(+), 22 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 0169b624edda1..c832011bc90dd 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1189,7 +1189,6 @@ static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
> * blk_execute_rq_nowait - insert a request to I/O scheduler for execution
> * @rq: request to insert
> * @at_head: insert request at head or tail of queue
> - * @done: I/O completion handler
> *
> * Description:
> * Insert a fully prepared request at the back of the I/O scheduler queue
> @@ -1198,13 +1197,11 @@ static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
> * Note:
> * This function will invoke @done directly if the queue is dead.
> */
> -void blk_execute_rq_nowait(struct request *rq, bool at_head, rq_end_io_fn *done)
> +void blk_execute_rq_nowait(struct request *rq, bool at_head)
> {
> WARN_ON(irqs_disabled());
> WARN_ON(!blk_rq_is_passthrough(rq));
>
> - rq->end_io = done;
> -
> blk_account_io_start(rq);
> if (current->plug)
> blk_add_rq_to_plug(current->plug, rq);
> diff --git a/drivers/block/sx8.c b/drivers/block/sx8.c
> index b361583944b94..63b4f6431d2e6 100644
> --- a/drivers/block/sx8.c
> +++ b/drivers/block/sx8.c
> @@ -540,7 +540,7 @@ static int carm_array_info (struct carm_host *host, unsigned int array_idx)
> spin_unlock_irq(&host->lock);
>
> DPRINTK("blk_execute_rq_nowait, tag == %u\n", rq->tag);
> - blk_execute_rq_nowait(rq, true, NULL);
> + blk_execute_rq_nowait(rq, true);
>
> return 0;
>
> @@ -579,7 +579,7 @@ static int carm_send_special (struct carm_host *host, carm_sspc_t func)
> crq->msg_bucket = (u32) rc;
>
> DPRINTK("blk_execute_rq_nowait, tag == %u\n", rq->tag);
> - blk_execute_rq_nowait(rq, true, NULL);
> + blk_execute_rq_nowait(rq, true);
>
> return 0;
> }
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 510e3860358bb..22aa5780623da 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1206,8 +1206,9 @@ static void nvme_keep_alive_work(struct work_struct *work)
> nvme_init_request(rq, &ctrl->ka_cmd);
>
> rq->timeout = ctrl->kato * HZ;
> + rq->end_io = nvme_keep_alive_end_io;
> rq->end_io_data = ctrl;
> - blk_execute_rq_nowait(rq, false, nvme_keep_alive_end_io);
> + blk_execute_rq_nowait(rq, false);
> }
>
> static void nvme_start_keep_alive(struct nvme_ctrl *ctrl)
> diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
> index 7b0e2c9cdcae3..a92cc686ffbc0 100644
> --- a/drivers/nvme/host/ioctl.c
> +++ b/drivers/nvme/host/ioctl.c
> @@ -453,6 +453,7 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
> blk_flags);
> if (IS_ERR(req))
> return PTR_ERR(req);
> + req->end_io = nvme_uring_cmd_end_io;
> req->end_io_data = ioucmd;
>
> /* to free bio on completion, as req->bio will be null at that time */
> @@ -461,7 +462,7 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
> pdu->meta_buffer = nvme_to_user_ptr(d.metadata);
> pdu->meta_len = d.metadata_len;
>
> - blk_execute_rq_nowait(req, 0, nvme_uring_cmd_end_io);
> + blk_execute_rq_nowait(req, false);
> return -EIOCBQUEUED;
> }
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 3aacf1c0d5a5f..068dbb00c5ea9 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1438,8 +1438,9 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
> }
> nvme_init_request(abort_req, &cmd);
>
> + abort_req->end_io = abort_endio;
> abort_req->end_io_data = NULL;
> - blk_execute_rq_nowait(abort_req, false, abort_endio);
> + blk_execute_rq_nowait(abort_req, false);
>
> /*
> * The aborted req will be completed on receiving the abort req.
> @@ -2483,11 +2484,14 @@ static int nvme_delete_queue(struct nvme_queue *nvmeq, u8 opcode)
> return PTR_ERR(req);
> nvme_init_request(req, &cmd);
>
> + if (opcode == nvme_admin_delete_cq)
> + req->end_io = nvme_del_cq_end;
> + else
> + req->end_io = nvme_del_queue_end;
> req->end_io_data = nvmeq;
>
> init_completion(&nvmeq->delete_done);
> - blk_execute_rq_nowait(req, false, opcode == nvme_admin_delete_cq ?
> - nvme_del_cq_end : nvme_del_queue_end);
> + blk_execute_rq_nowait(req, false);
> return 0;
> }
>
> diff --git a/drivers/nvme/target/passthru.c b/drivers/nvme/target/passthru.c
> index 5247c24538eba..3cc4d6709c93c 100644
> --- a/drivers/nvme/target/passthru.c
> +++ b/drivers/nvme/target/passthru.c
> @@ -285,8 +285,9 @@ static void nvmet_passthru_execute_cmd(struct nvmet_req *req)
> req->p.rq = rq;
> queue_work(nvmet_wq, &req->p.work);
> } else {
> + rq->end_io = nvmet_passthru_req_done;
> rq->end_io_data = req;
> - blk_execute_rq_nowait(rq, false, nvmet_passthru_req_done);
> + blk_execute_rq_nowait(rq, false);
> }
>
> if (ns)
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index cdaca13ac1f1c..49ef864df5816 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -2039,12 +2039,13 @@ static void scsi_eh_lock_door(struct scsi_device *sdev)
> scmd->cmnd[4] = SCSI_REMOVAL_PREVENT;
> scmd->cmnd[5] = 0;
> scmd->cmd_len = COMMAND_SIZE(scmd->cmnd[0]);
> + scmd->allowed = 5;
>
> req->rq_flags |= RQF_QUIET;
> req->timeout = 10 * HZ;
> - scmd->allowed = 5;
> + req->end_io = eh_lock_door_done;
>
> - blk_execute_rq_nowait(req, true, eh_lock_door_done);
> + blk_execute_rq_nowait(req, true);
> }
>
> /**
> diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
> index cbffa712b9f3e..118c7b4a8af2c 100644
> --- a/drivers/scsi/sg.c
> +++ b/drivers/scsi/sg.c
> @@ -831,7 +831,8 @@ sg_common_write(Sg_fd * sfp, Sg_request * srp,
>
> srp->rq->timeout = timeout;
> kref_get(&sfp->f_ref); /* sg_rq_end_io() does kref_put(). */
> - blk_execute_rq_nowait(srp->rq, at_head, sg_rq_end_io);
> + srp->rq->end_io = sg_rq_end_io;
> + blk_execute_rq_nowait(srp->rq, at_head);
> return 0;
> }
>
> diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c
> index 56a093a90b922..850172a2b8f14 100644
> --- a/drivers/scsi/st.c
> +++ b/drivers/scsi/st.c
> @@ -579,9 +579,10 @@ static int st_scsi_execute(struct st_request *SRpnt, const unsigned char *cmd,
> memcpy(scmd->cmnd, cmd, scmd->cmd_len);
> req->timeout = timeout;
> scmd->allowed = retries;
> + req->end_io = st_scsi_execute_end;
> req->end_io_data = SRpnt;
>
> - blk_execute_rq_nowait(req, true, st_scsi_execute_end);
> + blk_execute_rq_nowait(req, true);
> return 0;
> }
>
> diff --git a/drivers/scsi/ufs/ufshpb.c b/drivers/scsi/ufs/ufshpb.c
> index 81099b68bbfbd..796a9773bf3de 100644
> --- a/drivers/scsi/ufs/ufshpb.c
> +++ b/drivers/scsi/ufs/ufshpb.c
> @@ -671,11 +671,12 @@ static void ufshpb_execute_umap_req(struct ufshpb_lu *hpb,
>
> req->timeout = 0;
> req->end_io_data = umap_req;
> + req->end_io = ufshpb_umap_req_compl_fn;
>
> ufshpb_set_unmap_cmd(scmd->cmnd, rgn);
> scmd->cmd_len = HPB_WRITE_BUFFER_CMD_LENGTH;
>
> - blk_execute_rq_nowait(req, true, ufshpb_umap_req_compl_fn);
> + blk_execute_rq_nowait(req, true);
>
> hpb->stats.umap_req_cnt++;
> }
> @@ -707,6 +708,7 @@ static int ufshpb_execute_map_req(struct ufshpb_lu *hpb,
> blk_rq_append_bio(req, map_req->bio);
>
> req->end_io_data = map_req;
> + req->end_io = ufshpb_map_req_compl_fn;
>
> if (unlikely(last))
> mem_size = hpb->last_srgn_entries * HPB_ENTRY_SIZE;
> @@ -716,7 +718,7 @@ static int ufshpb_execute_map_req(struct ufshpb_lu *hpb,
> map_req->rb.srgn_idx, mem_size);
> scmd->cmd_len = HPB_READ_BUFFER_CMD_LENGTH;
>
> - blk_execute_rq_nowait(req, true, ufshpb_map_req_compl_fn);
> + blk_execute_rq_nowait(req, true)
Missing semicolon here. Otherwise, looks good.
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
^ permalink raw reply
* Re: [dm-devel] [PATCH v4 00/13] support non power of 2 zoned devices
From: Johannes Thumshirn @ 2022-05-19 7:34 UTC (permalink / raw)
To: Damien Le Moal, Luis Chamberlain
Cc: Theodore Ts'o, Christoph Hellwig, Pankaj Raghav,
axboe@kernel.dk, pankydev8@gmail.com, gost.dev@samsung.com,
jiangbo.365@bytedance.com, linux-nvme@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
linux-fsdevel@vger.kernel.org, dm-devel@redhat.com,
dsterba@suse.com, linux-btrfs@vger.kernel.org
In-Reply-To: <69f06f90-d31b-620b-9009-188d1d641562@opensource.wdc.com>
On 19/05/2022 05:19, Damien Le Moal wrote:
> On 5/19/22 12:12, Luis Chamberlain wrote:
>> On Thu, May 19, 2022 at 12:08:26PM +0900, Damien Le Moal wrote:
>>> On 5/18/22 00:34, Theodore Ts'o wrote:
>>>> On Tue, May 17, 2022 at 10:10:48AM +0200, Christoph Hellwig wrote:
>>>>> I'm a little surprised about all this activity.
>>>>>
>>>>> I though the conclusion at LSF/MM was that for Linux itself there
>>>>> is very little benefit in supporting this scheme. It will massively
>>>>> fragment the supported based of devices and applications, while only
>>>>> having the benefit of supporting some Samsung legacy devices.
>>>>
>>>> FWIW,
>>>>
>>>> That wasn't my impression from that LSF/MM session, but once the
>>>> videos become available, folks can decide for themselves.
>>>
>>> There was no real discussion about zone size constraint on the zone
>>> storage BoF. Many discussions happened in the hallway track though.
>>
>> Right so no direct clear blockers mentioned at all during the BoF.
>
> Nor any clear OK.
So what about creating a device-mapper target, that's taking npo2 drives and
makes them po2 drives for the FS layers? It will be very similar code to
dm-linear.
After all zoned support for FSes started with a device-mapper (dm-zoned) and
as the need for a more integrated solution arose, it changed into natiive
support.
And all that is there is simple arithmetic and a bio_clone(), if this is the
slowest part of the stack involving a FS like f2fs or btrfs I'm throwing a
round of anyone's favorite beverage at next year's LSFMM.
Byte,
Johannes
^ permalink raw reply
* Re: [PATCHv2 2/3] block: export dma_alignment attribute
From: Christoph Hellwig @ 2022-05-19 7:33 UTC (permalink / raw)
To: Keith Busch
Cc: linux-fsdevel, linux-block, axboe, Kernel Team, hch, bvanassche,
damien.lemoal, Keith Busch
In-Reply-To: <20220518171131.3525293-3-kbusch@fb.com>
On Wed, May 18, 2022 at 10:11:30AM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> User space may want to know how to align their buffers to avoid
> bouncing. Export the queue attribute.
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply
* Re: [PATCHv2 1/3] block/bio: remove duplicate append pages code
From: Christoph Hellwig @ 2022-05-19 7:32 UTC (permalink / raw)
To: Keith Busch
Cc: linux-fsdevel, linux-block, axboe, Kernel Team, hch, bvanassche,
damien.lemoal, Keith Busch
In-Reply-To: <20220518171131.3525293-2-kbusch@fb.com>
On Wed, May 18, 2022 at 10:11:29AM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> The setup for getting pages are identical for zone append and normal IO.
> Use common code for each.
How about using even more common code and avoiding churn at the same
time? Something like:
diff --git a/block/bio.c b/block/bio.c
index a3893d80dccc9..15da722ed26d1 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1158,6 +1158,37 @@ static void bio_put_pages(struct page **pages, size_t size, size_t off)
put_page(pages[i]);
}
+static int bio_iov_add_page(struct bio *bio, struct page *page,
+ unsigned int len, unsigned int offset)
+{
+ bool same_page = false;
+
+ if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) {
+ if (WARN_ON_ONCE(bio_full(bio, len)))
+ return -EINVAL;
+ __bio_add_page(bio, page, len, offset);
+ return 0;
+ }
+
+ if (same_page)
+ put_page(page);
+ return 0;
+}
+
+static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
+ unsigned int len, unsigned int offset)
+{
+ struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+ bool same_page = false;
+
+ if (bio_add_hw_page(q, bio, page, len, offset,
+ queue_max_zone_append_sectors(q), &same_page) != len)
+ return -EINVAL;
+ if (same_page)
+ put_page(page);
+ return 0;
+}
+
#define PAGE_PTRS_PER_BVEC (sizeof(struct bio_vec) / sizeof(struct page *))
/**
@@ -1176,7 +1207,6 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
struct page **pages = (struct page **)bv;
- bool same_page = false;
ssize_t size, left;
unsigned len, i;
size_t offset;
@@ -1195,18 +1225,18 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
for (left = size, i = 0; left > 0; left -= len, i++) {
struct page *page = pages[i];
+ int ret;
len = min_t(size_t, PAGE_SIZE - offset, left);
+ if (bio_op(bio) == REQ_OP_ZONE_APPEND)
+ ret = bio_iov_add_zone_append_page(bio, page, len,
+ offset);
+ else
+ ret = bio_iov_add_page(bio, page, len, offset);
- if (__bio_try_merge_page(bio, page, len, offset, &same_page)) {
- if (same_page)
- put_page(page);
- } else {
- if (WARN_ON_ONCE(bio_full(bio, len))) {
- bio_put_pages(pages + i, left, offset);
- return -EINVAL;
- }
- __bio_add_page(bio, page, len, offset);
+ if (ret) {
+ bio_put_pages(pages + i, left, offset);
+ return ret;
}
offset = 0;
}
@@ -1215,54 +1245,6 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
return 0;
}
-static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
-{
- unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
- unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
- struct request_queue *q = bdev_get_queue(bio->bi_bdev);
- unsigned int max_append_sectors = queue_max_zone_append_sectors(q);
- struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
- struct page **pages = (struct page **)bv;
- ssize_t size, left;
- unsigned len, i;
- size_t offset;
- int ret = 0;
-
- if (WARN_ON_ONCE(!max_append_sectors))
- return 0;
-
- /*
- * Move page array up in the allocated memory for the bio vecs as far as
- * possible so that we can start filling biovecs from the beginning
- * without overwriting the temporary page array.
- */
- BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
- pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
-
- size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
- if (unlikely(size <= 0))
- return size ? size : -EFAULT;
-
- for (left = size, i = 0; left > 0; left -= len, i++) {
- struct page *page = pages[i];
- bool same_page = false;
-
- len = min_t(size_t, PAGE_SIZE - offset, left);
- if (bio_add_hw_page(q, bio, page, len, offset,
- max_append_sectors, &same_page) != len) {
- bio_put_pages(pages + i, left, offset);
- ret = -EINVAL;
- break;
- }
- if (same_page)
- put_page(page);
- offset = 0;
- }
-
- iov_iter_advance(iter, size - left);
- return ret;
-}
-
/**
* bio_iov_iter_get_pages - add user or kernel pages to a bio
* @bio: bio to add pages to
@@ -1297,10 +1279,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
}
do {
- if (bio_op(bio) == REQ_OP_ZONE_APPEND)
- ret = __bio_iov_append_get_pages(bio, iter);
- else
- ret = __bio_iov_iter_get_pages(bio, iter);
+ ret = __bio_iov_iter_get_pages(bio, iter);
} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
/* don't account direct I/O as memory stall */
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox