Linux block layer

Linux block layer
 help / color / mirror / Atom feed

* Re: [PATCHv2 3/3] block: relax direct io memory alignment
From: Keith Busch @ 2022-05-19 16:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Eric Biggers, Keith Busch, linux-fsdevel, linux-block, axboe,
	Kernel Team, bvanassche, damien.lemoal
In-Reply-To: <20220519074114.GG22301@lst.de>

On Thu, May 19, 2022 at 09:41:14AM +0200, Christoph Hellwig wrote:
> On Wed, May 18, 2022 at 10:56:04PM -0600, Keith Busch wrote:
> > I'm surely missing something here. I know the bvecs are not necessarily lbs
> > aligned, but why does that matter? Is there some driver that can only take
> > exactly 1 bvec, but allows it to be unaligned? If so, we could take the segment
> > queue limit into account, but I am not sure that we need to.
> 
> At least stacking drivers had all kinds of interesting limitations in
> this area.  How much testing did this series get with all kinds of
> interesting dm targets and md pesonalities?

The dma limit doesn't stack (should it?). It gets the default, sector size - 1,
so their constraints are effectively unchanged.

^ permalink raw reply

* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges
From: Zorro Lang @ 2022-05-19 16:18 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Amir Goldstein, Dave Chinner, Luis Chamberlain, linux-fsdevel,
	linux-block, pankydev8, Theodore Tso, jmeneghi, Jan Kara,
	Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests
In-Reply-To: <YoZj4nHX42AOn8+F@localhost.localdomain>

On Thu, May 19, 2022 at 11:36:02AM -0400, Josef Bacik wrote:
> On Thu, May 19, 2022 at 12:20:28PM +0300, Amir Goldstein wrote:
> > On Thu, May 19, 2022 at 10:58 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Thu, May 19, 2022 at 09:36:41AM +0300, Amir Goldstein wrote:
> > > > [adding fstests and Zorro]
> > > >
> > > > On Thu, May 19, 2022 at 6:07 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
> > > > >
> > > > > I've been promoting the idea that running fstests once is nice,
> > > > > but things get interesting if you try to run fstests multiple
> > > > > times until a failure is found. It turns out at least kdevops has
> > > > > found tests which fail with a failure rate of typically 1/2 to
> > > > > 1/30 average failure rate. That is 1/2 means a failure can happen
> > > > > 50% of the time, whereas 1/30 means it takes 30 runs to find the
> > > > > failure.
> > > > >
> > > > > I have tried my best to annotate failure rates when I know what
> > > > > they might be on the test expunge list, as an example:
> > > > >
> > > > > workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d
> > > > >
> > > > > The term "failure rate 1/15" is 16 characters long, so I'd like
> > > > > to propose to standardize a way to represent this. How about
> > > > >
> > > > > generic/530 # F:1/15
> > > > >
> > > >
> > > > I am not fond of the 1/15 annotation at all, because the only fact that you
> > > > are able to document is that the test failed after 15 runs.
> > > > Suggesting that this means failure rate of 1/15 is a very big step.
> > > >
> > > > > Then we could extend the definition. F being current estimate, and this
> > > > > can be just how long it took to find the first failure. A more valuable
> > > > > figure would be failure rate avarage, so running the test multiple
> > > > > times, say 10, to see what the failure rate is and then averaging the
> > > > > failure out. So this could be a more accurate representation. For this
> > > > > how about:
> > > > >
> > > > > generic/530 # FA:1/15
> > > > >
> > > > > This would mean on average there failure rate has been found to be about
> > > > > 1/15, and this was determined based on 10 runs.
> > >
> > > These tests are run on multiple different filesystems. What happens
> > > if you run xfs, ext4, btrfs, overlay in sequence? We now have 4
> > > tests results, and 1 failure.
> > >
> > > Does that make it FA: 1/4, or does it make it 1/1,0/1,0/1,0/1?
> > >
> > > What happens if we run, say, XFS w/ defaults, rmapbt=1, v4, quotas?
> > >
> > > Does that make it FA: 1/4, or does it make it 0/1,1/1,0/1,0/1?
> > >
> > > In each case above, 1/4 tells us nothing useful. OTOH, the 0/1 vs
> > > 1/1 breakdown is useful information, because it tells us whihc
> > > filesystem failed the test, or which specific config failed the
> > > test.
> > >
> > > Hence I think the ability for us to draw useful conclusions from a
> > > number like this is large dependent on the specific data set it is
> > > drawn from...
> > >
> > > > > We should also go extend check for fstests/blktests to run a test
> > > > > until a failure is found and report back the number of successes.
> > > > >
> > > > > Thoughts?
> > >
> > > Who is the expected consumer of this information?
> > >
> > > I'm not sure it will be meaningful for anyone developing new code
> > > and needing to run every test every time they run fstests.
> > >
> > > OTOH, for a QA environment where you have a fixed progression of the
> > > kernel releases you are testing, it's likely valuable and already
> > > being tracked in various distro QE management tools and
> > > dashboards....
> > >
> > > > I have had a discussion about those tests with Zorro.
> > > >
> > > > Those tests that some people refer to as "flaky" are valuable,
> > > > but they are not deterministic, they are stochastic.
> > >
> > > Extremely valuable. Worth their weight in gold to developers like
> > > me.
> > >
> > > The recoveryloop group tests are a good example of this. The name of
> > > the group indicates how we use it. I typically set it up to run with
> > > an loop iteration like "-I 100" knowing that is will likely fail a
> > > random test in the group within 10 iterations.
> > >
> > > Those one-off failures are almost always a real bug, and they are
> > > often unique and difficult to reproduce exactly. Post-mortem needs
> > > to be performed immediately because it may well be a unique on-off
> > > failure and running another test after the failure destroys the
> > > state needed to perform a post-mortem.
> > >
> > > Hence having a test farm running these multiple times and then
> > > reporting "failed once in 15 runs" isn't really useful to me as a
> > > developer - it doesn't tell us anything new, nor does it help us
> > > find the bugs that are being tripped over.
> > >
> > > Less obvious stochastic tests exist, too. There are many tests that
> > > use fstress as a workload that runs while some other operation is
> > > performed - freeze, grow, ENOSPC, error injections, etc. They will
> > > never be deterministic, any again any failure tends to be a real
> > > bug, too.
> > >
> > > However, I think these should be run by QE environments all the time
> > > as they require long term, frequent execution across different
> > > configs in different environments to find the deep dark corners
> > > where the bugs may lie dormant. These are the tests that find things
> > > like subtle timing races no other tests ever exercise.
> > >
> > > I suspect that tests that alter their behaviour via LOAD_FACTOR or
> > > TIME_FACTOR will fall into this category.
> > >
> > > > I think MTBF is the standard way to describe reliability
> > > > of such tests, but I am having a hard time imagining how
> > > > the community can manage to document accurate annotations
> > > > of this sort, so I would stick with documenting the facts
> > > > (i.e. the test fails after N runs).
> > >
> > > I'm unsure of what "reliablity of such tests" means in this context.
> > > The tests are trying to exercise and measure the reliability of the
> > > kernel code - if the *test is unreliable* then that says to me the
> > > test needs fixing. If the test is reliable, then any failures that
> > > occur indicate that the filesystem/kernel/fs tools are unreliable,
> > > not the test....
> > >
> > > "test reliability" and "reliability of filesystem under test" are
> > > different things with similar names. The latter is what I think we
> > > are talking about measuring and reporting here, right?
> > >
> > > > OTOH, we do have deterministic tests, maybe even the majority of
> > > > fstests are deterministic(?)
> > >
> > > Very likely. As a generalisation, I'd say that anything that has a
> > > fixed, single step at a time recipe and a very well defined golden
> > > output or exact output comparison match is likely deterministic.
> > >
> > > We use things like 'within tolerance' so that slight variations in
> > > test results don't cause spurious failures and hence make the test
> > > more deterministic.  Hence any test that uses 'within_tolerance' is
> > > probably a test that is expecting deterministic behaviour....
> > >
> > > > Considering that every auto test loop takes ~2 hours on our rig and that
> > > > I have been running over 100 loops over the past two weeks, if half
> > > > of fstests are deterministic, that is a lot of wait time and a lot of carbon
> > > > emission gone to waste.
> > > >
> > > > It would have been nice if I was able to exclude a "deterministic" group.
> > > > The problem is - can a developer ever tag a test as being "deterministic"?
> > >
> > > fstests allows private exclude lists to be used - perhaps these
> > > could be used to start building such a group for your test
> > > environment. Building a list from the tests you never see fail in
> > > your environment could be a good way to seed such a group...
> > >
> > > Maybe you have all the raw results from those hundreds of tests
> > > sitting around - what does crunching that data look like? Who else
> > > has large sets of consistent historic data sitting around? I don't
> > > because I pollute my results archive by frequently running varied
> > > and badly broken kernels through fstests, but people who just run
> > > released or stable kernels may have data sets that could be used....
> > >
> > 
> > I have no historic data of that sort and I have never stayed on the
> > same test system long enough to collect this sort of data.
> > 
> > Josef has told us in LPC 2021 about his btrfs fstests dashboard
> > where he started to collect historical data a while ago.
> > 
> 
> I'm clearly biased, but I think this is the best way to go for *developers*.  We
> want to know all the things, so we just need to have a clear way to see what's
> failing and have a historical view of what has failed.  If you look at our

I agree the "historical view" is needed, but it can't be provided by mainline
fstests, due to it might be used to test many different filesystems with different
sysytem software and hardware environment, and there're lots of downstream project,
they have their own variation, the "upstream mainline linux historical view"
isn't referential for all of them. Some of "downstream historical view" isn't
referential either.

The "historical view" is worthy for each project(or project group) itself, but
might be not universal for others. If someone would like to help to test someone
project, likes someone Ubuntu LTS version, or Debian, or CentOS, or someone LTS
kernel... that might be better to ask if related people have their "historical
view" data to help to get start, better than asking if fstests has that for all
different known/unknown projects.

I just replied Ted, I think his idea makes more sense. fstests can provide
some meaningful interfaces to help the testers use their historical data, or
help to summarize their historical data for each specific user/project. But
fstests doesn't store/provide those one-sided data directly.

Thanks,
Zorro

> dashboard at toxicpanda.com you can click on the tests and see their runs and
> failures on different configs.  This has been insanely valuable to me, and
> helped me narrow down test cases that needed to be adjusted for compression.
> 
> > Collaborating on expunge lists of different fs and different
> > kernel/config/distro
> > is one of the goals behind Luis's kdevops project.
> > 
> 
> I think this is also hugely valuable from the "Willy usecase" perspective.
> Willy doesn't care about failure rates or interpreting the tea leaves of what
> our format is, he wants to make sure he didn't break anything.  We should strive
> to have 0 failures for this use case, so having expunge lists in place to get
> rid of any flakey results are going to make it easier for non-experts to get a
> solid grasp on wether they introduced a regression or not.
> 
> There's room for both use cases.  I want the expunge lists for newbies, I want
> good reporting for the developers who know what they're doing.  We can provide
> documentation for both
> 
> - If Willy, run 'make fstests-clean'
> - If Josef, run 'mkame fstests'
> 
> Thanks,
> 
> Josef
> 


^ permalink raw reply

* Re: [PATCH -next v3 2/2] blk-throttle: fix io hung due to configuration updates
From: Michal Koutný @ 2022-05-19 16:10 UTC (permalink / raw)
  To: yukuai (C)
  Cc: tj, axboe, ming.lei, geert, cgroups, linux-block, linux-kernel,
	yi.zhang
In-Reply-To: <a8953189-af42-0225-3031-daf61347524a@huawei.com>

On Thu, May 19, 2022 at 08:14:28PM +0800, "yukuai (C)" <yukuai3@huawei.com> wrote:
> tg_with_in_bps_limit:
>  jiffy_elapsed_rnd = jiffies - tg->slice_start[rw];
>  tmp = bps_limit * jiffy_elapsed_rnd;
>  do_div(tmp, HZ);
>  bytes_allowed = tmp; -> how many bytes are allowed in this slice,
> 		         incluing dispatched.
>  if (tg->bytes_disp[rw] + bio_size <= bytes_allowed)
>   *wait = 0 -> no need to wait if this bio is within limit
> 
>  extra_bytes = tg->bytes_disp[rw] + bio_size - bytes_allowed;
>  -> extra_bytes is based on 'bytes_disp'
> 
> For example:
> 
> 1) bps_limit is 2k, we issue two io, (1k and 9k)
> 2) the first io(1k) will be dispatched, bytes_disp = 1k, slice_start = 0
>    the second io(9k) is waiting for (9 - (2 - 1)) / 2 = 4 s

The 2nd io arrived at 1s, the wait time is 4s, i.e. it can be dispatched
at 5s (i.e. 10k/*2kB/s = 5s).

> 3) after 3 s, we update bps_limit to 1k, then new waiting is caculated:
> 
> without this patch:  bytes_disp = 0, slict_start =3:
> bytes_allowed = 1k	                            <--- why 1k and not 0?
> extra_bytes = 9k - 1k = 8k
> wait = 8s

This looks like it was calculated at time 4s (1s after new config was
set).

> 
> whth this patch: bytes_disp = 0.5k, slice_start =  0,
> bytes_allowed = 1k * 3 + 1k = 4k
> extra_bytes =  0.5k + 9k - 4k = 5.5k
> wait = 5.5s

This looks like calculated at 4s, so the IO would be waiting till
4s+5.5s = 9.5s.

As I don't know why using time 4s, I'll shift this calculation to the
time 3s (when the config changes):

bytes_disp = 0.5k, slice_start =  0,
bytes_allowed = 1k * 3  = 3k
extra_bytes =  0.5k + 9k - 3k = 7.5k
wait = 7.5s

In absolute time, the IO would wait till 3s+7.5s = 10.5s

OK, either your 9.5s or my 10.5s looks weird (although earlier than
original 4s+8s=12s).
However, the IO should ideally only wait till

    3s + (9k -   (6k    -    1k)     ) / 1k/s =
         bio - (allowed - dispatched)  / new_limit

   =3s + 4k / 1k/s = 7s

   ('allowed' is based on old limit)

Or in another example, what if you change the config from 2k/s to ∞k/s
(unlimited, let's neglect the arithmetic overflow that you handle
explicitly, imagine a big number but not so big to be greater than
division result).

In such a case, the wait time should be zero, i.e. IO should be
dispatched right at the time of config change.
(With your patch that still calculates >0 wait time (and the original
behavior gives >0 wait too.)

> I hope I can expliain it clearly...

Yes, thanks for pointing me to relevant parts.
I hope I grasped them correctly.

IOW, your patch and formula make the wait time shorter but still IO can
be delayed indefinitely if you pass a sequence of new configs. (AFAIU)

Regards,
Michal

^ permalink raw reply

* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges
From: Matthew Wilcox @ 2022-05-19 16:06 UTC (permalink / raw)
  To: Zorro Lang
  Cc: Amir Goldstein, Luis Chamberlain, linux-fsdevel, linux-block,
	pankydev8, Theodore Tso, Josef Bacik, jmeneghi, Jan Kara,
	Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests
In-Reply-To: <20220519154419.ziy4esm4tgikejvj@zlang-mailbox>

On Thu, May 19, 2022 at 11:44:19PM +0800, Zorro Lang wrote:
> On Thu, May 19, 2022 at 03:58:31PM +0100, Matthew Wilcox wrote:
> > On Thu, May 19, 2022 at 07:24:50PM +0800, Zorro Lang wrote:
> > > Yes, we talked about this, but if I don't rememeber wrong, I recommended each
> > > downstream testers maintain their own "testing data/config", likes exclude
> > > list, failed ratio, known failures etc. I think they're not suitable to be
> > > fixed in the mainline fstests.
> > 
> > This assumes a certain level of expertise, which is a barrier to entry.
> > 
> > For someone who wants to check "Did my patch to filesystem Y that I have
> > never touched before break anything?", having non-deterministic tests
> > run by default is bad.
> > 
> > As an example, run xfstests against jfs.  Hundreds of failures, including
> > some very scary-looking assertion failures from the page allocator.
> > They're (mostly) harmless in fact, just being a memory leak, but it
> > makes xfstests useless for this scenario.
> > 
> > Even for well-maintained filesystems like xfs which is regularly tested,
> > I expect generic/270 and a few others to fail.  They just do, and they're
> > not an indication that *I* broke anything.
> > 
> > By all means, we want to keep tests around which have failures, but
> > they need to be restricted to people who have a level of expertise and
> > interest in fixing long-standing problems, not people who are looking
> > for regressions.
> 
> It's hard to make sure if a failure is a regression, if someone only run
> the test once. The testers need some experience, at least need some
> history test data.
> 
> If a tester find a case has 10% chance fail on his system, to make sure
> it's a regression or not, if he doesn't have history test data, at least
> he need to do the same test more times on old kernel version with his
> system. If it never fail on old kernel version, but can fail on new kernel.
> Then we suspect it's a regression.
> 
> Even if the tester isn't an expert of the fs he's testing, he can report
> this issue to that fs experts, to get more checking. For downstream kernel,
> he has to report to the maintainers of downstream, or check by himself.
> If a case pass on upstream, but fail on downstream, it might mean there's
> a patchset on upstream can be backported.
> 
> So, anyway, the testers need their own "experience" (include testing history
> data, known issue, etc) to judge if a failure is a suspected regression, or
> a known issue of downstream which hasn't been fixed (by backport).
> 
> That's my personal perspective :)

Right, but that's the personal perspective of an expert tester.  I don't
particularly want to build that expertise myself; I want to write patches
which touch dozens of filesystems, and I want to be able to smoke-test
those patches.  Maybe xfstests or kdevops doesn't want to solve that
problem, but that would seem like a waste of other peoples time.

^ permalink raw reply

* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges
From: Zorro Lang @ 2022-05-19 15:44 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Amir Goldstein, Luis Chamberlain, linux-fsdevel, linux-block,
	pankydev8, Theodore Tso, Josef Bacik, jmeneghi, Jan Kara,
	Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests
In-Reply-To: <YoZbF90qS+LlSDfS@casper.infradead.org>

On Thu, May 19, 2022 at 03:58:31PM +0100, Matthew Wilcox wrote:
> On Thu, May 19, 2022 at 07:24:50PM +0800, Zorro Lang wrote:
> > Yes, we talked about this, but if I don't rememeber wrong, I recommended each
> > downstream testers maintain their own "testing data/config", likes exclude
> > list, failed ratio, known failures etc. I think they're not suitable to be
> > fixed in the mainline fstests.
> 
> This assumes a certain level of expertise, which is a barrier to entry.
> 
> For someone who wants to check "Did my patch to filesystem Y that I have
> never touched before break anything?", having non-deterministic tests
> run by default is bad.
> 
> As an example, run xfstests against jfs.  Hundreds of failures, including
> some very scary-looking assertion failures from the page allocator.
> They're (mostly) harmless in fact, just being a memory leak, but it
> makes xfstests useless for this scenario.
> 
> Even for well-maintained filesystems like xfs which is regularly tested,
> I expect generic/270 and a few others to fail.  They just do, and they're
> not an indication that *I* broke anything.
> 
> By all means, we want to keep tests around which have failures, but
> they need to be restricted to people who have a level of expertise and
> interest in fixing long-standing problems, not people who are looking
> for regressions.

It's hard to make sure if a failure is a regression, if someone only run
the test once. The testers need some experience, at least need some
history test data.

If a tester find a case has 10% chance fail on his system, to make sure
it's a regression or not, if he doesn't have history test data, at least
he need to do the same test more times on old kernel version with his
system. If it never fail on old kernel version, but can fail on new kernel.
Then we suspect it's a regression.

Even if the tester isn't an expert of the fs he's testing, he can report
this issue to that fs experts, to get more checking. For downstream kernel,
he has to report to the maintainers of downstream, or check by himself.
If a case pass on upstream, but fail on downstream, it might mean there's
a patchset on upstream can be backported.

So, anyway, the testers need their own "experience" (include testing history
data, known issue, etc) to judge if a failure is a suspected regression, or
a known issue of downstream which hasn't been fixed (by backport).

That's my personal perspective :)

Thanks,
Zorro

> 

^ permalink raw reply

* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges
From: Josef Bacik @ 2022-05-19 15:36 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Dave Chinner, Luis Chamberlain, linux-fsdevel, linux-block,
	pankydev8, Theodore Tso, jmeneghi, Jan Kara, Davidlohr Bueso,
	Dan Williams, Jake Edge, Klaus Jensen, Zorro Lang, fstests
In-Reply-To: <CAOQ4uxi-A2iErkbBBaewmoKa8OGWXaUzaZqwygQxKzzEZcsCXQ@mail.gmail.com>

On Thu, May 19, 2022 at 12:20:28PM +0300, Amir Goldstein wrote:
> On Thu, May 19, 2022 at 10:58 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Thu, May 19, 2022 at 09:36:41AM +0300, Amir Goldstein wrote:
> > > [adding fstests and Zorro]
> > >
> > > On Thu, May 19, 2022 at 6:07 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
> > > >
> > > > I've been promoting the idea that running fstests once is nice,
> > > > but things get interesting if you try to run fstests multiple
> > > > times until a failure is found. It turns out at least kdevops has
> > > > found tests which fail with a failure rate of typically 1/2 to
> > > > 1/30 average failure rate. That is 1/2 means a failure can happen
> > > > 50% of the time, whereas 1/30 means it takes 30 runs to find the
> > > > failure.
> > > >
> > > > I have tried my best to annotate failure rates when I know what
> > > > they might be on the test expunge list, as an example:
> > > >
> > > > workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d
> > > >
> > > > The term "failure rate 1/15" is 16 characters long, so I'd like
> > > > to propose to standardize a way to represent this. How about
> > > >
> > > > generic/530 # F:1/15
> > > >
> > >
> > > I am not fond of the 1/15 annotation at all, because the only fact that you
> > > are able to document is that the test failed after 15 runs.
> > > Suggesting that this means failure rate of 1/15 is a very big step.
> > >
> > > > Then we could extend the definition. F being current estimate, and this
> > > > can be just how long it took to find the first failure. A more valuable
> > > > figure would be failure rate avarage, so running the test multiple
> > > > times, say 10, to see what the failure rate is and then averaging the
> > > > failure out. So this could be a more accurate representation. For this
> > > > how about:
> > > >
> > > > generic/530 # FA:1/15
> > > >
> > > > This would mean on average there failure rate has been found to be about
> > > > 1/15, and this was determined based on 10 runs.
> >
> > These tests are run on multiple different filesystems. What happens
> > if you run xfs, ext4, btrfs, overlay in sequence? We now have 4
> > tests results, and 1 failure.
> >
> > Does that make it FA: 1/4, or does it make it 1/1,0/1,0/1,0/1?
> >
> > What happens if we run, say, XFS w/ defaults, rmapbt=1, v4, quotas?
> >
> > Does that make it FA: 1/4, or does it make it 0/1,1/1,0/1,0/1?
> >
> > In each case above, 1/4 tells us nothing useful. OTOH, the 0/1 vs
> > 1/1 breakdown is useful information, because it tells us whihc
> > filesystem failed the test, or which specific config failed the
> > test.
> >
> > Hence I think the ability for us to draw useful conclusions from a
> > number like this is large dependent on the specific data set it is
> > drawn from...
> >
> > > > We should also go extend check for fstests/blktests to run a test
> > > > until a failure is found and report back the number of successes.
> > > >
> > > > Thoughts?
> >
> > Who is the expected consumer of this information?
> >
> > I'm not sure it will be meaningful for anyone developing new code
> > and needing to run every test every time they run fstests.
> >
> > OTOH, for a QA environment where you have a fixed progression of the
> > kernel releases you are testing, it's likely valuable and already
> > being tracked in various distro QE management tools and
> > dashboards....
> >
> > > I have had a discussion about those tests with Zorro.
> > >
> > > Those tests that some people refer to as "flaky" are valuable,
> > > but they are not deterministic, they are stochastic.
> >
> > Extremely valuable. Worth their weight in gold to developers like
> > me.
> >
> > The recoveryloop group tests are a good example of this. The name of
> > the group indicates how we use it. I typically set it up to run with
> > an loop iteration like "-I 100" knowing that is will likely fail a
> > random test in the group within 10 iterations.
> >
> > Those one-off failures are almost always a real bug, and they are
> > often unique and difficult to reproduce exactly. Post-mortem needs
> > to be performed immediately because it may well be a unique on-off
> > failure and running another test after the failure destroys the
> > state needed to perform a post-mortem.
> >
> > Hence having a test farm running these multiple times and then
> > reporting "failed once in 15 runs" isn't really useful to me as a
> > developer - it doesn't tell us anything new, nor does it help us
> > find the bugs that are being tripped over.
> >
> > Less obvious stochastic tests exist, too. There are many tests that
> > use fstress as a workload that runs while some other operation is
> > performed - freeze, grow, ENOSPC, error injections, etc. They will
> > never be deterministic, any again any failure tends to be a real
> > bug, too.
> >
> > However, I think these should be run by QE environments all the time
> > as they require long term, frequent execution across different
> > configs in different environments to find the deep dark corners
> > where the bugs may lie dormant. These are the tests that find things
> > like subtle timing races no other tests ever exercise.
> >
> > I suspect that tests that alter their behaviour via LOAD_FACTOR or
> > TIME_FACTOR will fall into this category.
> >
> > > I think MTBF is the standard way to describe reliability
> > > of such tests, but I am having a hard time imagining how
> > > the community can manage to document accurate annotations
> > > of this sort, so I would stick with documenting the facts
> > > (i.e. the test fails after N runs).
> >
> > I'm unsure of what "reliablity of such tests" means in this context.
> > The tests are trying to exercise and measure the reliability of the
> > kernel code - if the *test is unreliable* then that says to me the
> > test needs fixing. If the test is reliable, then any failures that
> > occur indicate that the filesystem/kernel/fs tools are unreliable,
> > not the test....
> >
> > "test reliability" and "reliability of filesystem under test" are
> > different things with similar names. The latter is what I think we
> > are talking about measuring and reporting here, right?
> >
> > > OTOH, we do have deterministic tests, maybe even the majority of
> > > fstests are deterministic(?)
> >
> > Very likely. As a generalisation, I'd say that anything that has a
> > fixed, single step at a time recipe and a very well defined golden
> > output or exact output comparison match is likely deterministic.
> >
> > We use things like 'within tolerance' so that slight variations in
> > test results don't cause spurious failures and hence make the test
> > more deterministic.  Hence any test that uses 'within_tolerance' is
> > probably a test that is expecting deterministic behaviour....
> >
> > > Considering that every auto test loop takes ~2 hours on our rig and that
> > > I have been running over 100 loops over the past two weeks, if half
> > > of fstests are deterministic, that is a lot of wait time and a lot of carbon
> > > emission gone to waste.
> > >
> > > It would have been nice if I was able to exclude a "deterministic" group.
> > > The problem is - can a developer ever tag a test as being "deterministic"?
> >
> > fstests allows private exclude lists to be used - perhaps these
> > could be used to start building such a group for your test
> > environment. Building a list from the tests you never see fail in
> > your environment could be a good way to seed such a group...
> >
> > Maybe you have all the raw results from those hundreds of tests
> > sitting around - what does crunching that data look like? Who else
> > has large sets of consistent historic data sitting around? I don't
> > because I pollute my results archive by frequently running varied
> > and badly broken kernels through fstests, but people who just run
> > released or stable kernels may have data sets that could be used....
> >
> 
> I have no historic data of that sort and I have never stayed on the
> same test system long enough to collect this sort of data.
> 
> Josef has told us in LPC 2021 about his btrfs fstests dashboard
> where he started to collect historical data a while ago.
> 

I'm clearly biased, but I think this is the best way to go for *developers*.  We
want to know all the things, so we just need to have a clear way to see what's
failing and have a historical view of what has failed.  If you look at our
dashboard at toxicpanda.com you can click on the tests and see their runs and
failures on different configs.  This has been insanely valuable to me, and
helped me narrow down test cases that needed to be adjusted for compression.

> Collaborating on expunge lists of different fs and different
> kernel/config/distro
> is one of the goals behind Luis's kdevops project.
> 

I think this is also hugely valuable from the "Willy usecase" perspective.
Willy doesn't care about failure rates or interpreting the tea leaves of what
our format is, he wants to make sure he didn't break anything.  We should strive
to have 0 failures for this use case, so having expunge lists in place to get
rid of any flakey results are going to make it easier for non-experts to get a
solid grasp on wether they introduced a regression or not.

There's room for both use cases.  I want the expunge lists for newbies, I want
good reporting for the developers who know what they're doing.  We can provide
documentation for both

- If Willy, run 'make fstests-clean'
- If Josef, run 'mkame fstests'

Thanks,

Josef

^ permalink raw reply

* Re: [RFC 1/6] arm64: hibernate: Introduce new entry point to kernel
From: Marc Zyngier @ 2022-05-19 15:27 UTC (permalink / raw)
  To: Vivek Kumar
  Cc: corbet, catalin.marinas, will, tglx, axboe, rafael, akpm,
	linux-doc, linux-kernel, linux-arm-kernel, linux-block, linux-pm,
	linux-mm, len.brown, pavel, paulmck, bp, keescook, songmuchun,
	rdunlap, damien.lemoal, pasha.tatashin, tabba, ardb, tsoni,
	quic_psodagud, quic_svaddagi, Prasanna Kumar
In-Reply-To: <1652860121-24092-2-git-send-email-quic_vivekuma@quicinc.com>

On 2022-05-18 08:48, Vivek Kumar wrote:
> Introduce a new entry point to hibernated kernel image.
> This is generally needed when bootloader restores the
> hibernated image from disc to ddr and passes control
> to it by turning off the mmu, also initialize this new
> entry point with cpu_resume which turns on the mmu and
> then proceeds with restore routines.
> 
> Signed-off-by: Vivek Kumar <quic_vivekuma@quicinc.com>
> Signed-off-by: Prasanna Kumar <quic_kprasan@quicinc.com>
> ---
>  arch/arm64/kernel/hibernate.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/arch/arm64/kernel/hibernate.c 
> b/arch/arm64/kernel/hibernate.c
> index 6328308..4e294b3 100644
> --- a/arch/arm64/kernel/hibernate.c
> +++ b/arch/arm64/kernel/hibernate.c
> @@ -74,6 +74,14 @@ static struct arch_hibernate_hdr {
>  	void		(*reenter_kernel)(void);
> 
>  	/*
> +	 * Another entry point if jump to kernel happens with mmu disabled,
> +	 * generally done when restoring hibernation image from bootloader
> +	 * context
> +	 */
> +
> +	phys_addr_t	phys_reenter_kernel;
> +
> +	/*
>  	 * We need to know where the __hyp_stub_vectors are after restore to
>  	 * re-configure el2.
>  	 */
> @@ -116,6 +124,7 @@ int arch_hibernation_header_save(void *addr,
> unsigned int max_size)
>  	arch_hdr_invariants(&hdr->invariants);
>  	hdr->ttbr1_el1		= __pa_symbol(swapper_pg_dir);
>  	hdr->reenter_kernel	= _cpu_resume;
> +	hdr->phys_reenter_kernel  = __pa(cpu_resume);
> 
>  	/* We can't use __hyp_get_vectors() because kvm may still be loaded 
> */
>  	if (el2_reset_needed())

So here, you are creating a new ABI with the bootloader, based on
a data structure that isn't mean't to be ABI. It means that we
wouldn't be allowed to ever change this data structure, as this
would mean having to update the bootloader in sync.

Clearly, this isn't acceptable.

         M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply

* Re: [PATCH v4 00/13] support non power of 2 zoned devices
From: Javier González @ 2022-05-19 15:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Javier González, Pankaj Raghav, axboe, damien.lemoal,
	pankydev8, dsterba, linux-nvme, linux-fsdevel, linux-btrfs,
	jiangbo.365, linux-block, gost.dev, linux-kernel, dm-devel
In-Reply-To: <20220518080020.GA3697@lst.de>

> On 18 May 2022, at 10.16, Christoph Hellwig <hch@lst.de> wrote:
> 
> On Tue, May 17, 2022 at 11:18:34AM +0200, Javier González wrote:
>> Does the above help you reconsidering your interest in supporting this
>> in NVMe?
> 
> Very little.  It just seems like a really bad idea.

I understand you don’t like this, but I still hope you see value in supporting it. We are getting close to a very minimal patchset, which is also helping to fix bugs in the zoned stack.

If you take a look at the last version abs give some feedback, I’m sure we can end up with a good solution. 

Can you help?

^ permalink raw reply

* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges
From: Zorro Lang @ 2022-05-19 15:10 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Amir Goldstein, Luis Chamberlain, linux-fsdevel, linux-block,
	pankydev8, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso,
	Dan Williams, Jake Edge, Klaus Jensen, fstests
In-Reply-To: <YoZRyGOwde+xkK1y@mit.edu>

On Thu, May 19, 2022 at 10:18:48AM -0400, Theodore Ts'o wrote:
> On Thu, May 19, 2022 at 07:24:50PM +0800, Zorro Lang wrote:
> > 
> > Yes, we talked about this, but if I don't rememeber wrong, I recommended each
> > downstream testers maintain their own "testing data/config", likes exclude
> > list, failed ratio, known failures etc. I think they're not suitable to be
> > fixed in the mainline fstests.
> 
> Failure ratios are the sort of thing that are only applicable for
> 
> * A specific filesystem
> * A specific configuration
> * A specific storage device / storage device class
> * A specific CPU architecture / CPU speed
> * A specific amount of memory available

And a specific bug I suppose :)

> 
> Put another way, there are problems that fail so close to rarely as to
> be "hever" on, say, an x86_64 class server with gobs and gobs of
> memory, but which can more reliably fail on, say, a Rasberry PI using
> eMMC flash.
> 
> I don't think that Luis was suggesting that this kind of failure
> annotation would go in upstream fstests.  I suspect he just wants to
> use it in kdevops, and hope that other people would use it as well in
> other contexts.  But even in the context of test runners like kdevops
> and {kvm,gce,android}-xfstests, it's going to be very specific to a
> particular test environment, and for the global list of excludes for a
> particular file system.  So in the gce-xfstests context, this is the
> difference between the excludes in the files:
> 
> 	fs/ext4/excludes
> vs
> 	fs/ext4/cfg/bigalloc.exclude
> 
> even if I only cared about, say, how things ran on GCE using
> SSD-backed Persistent Disk (never mind that I can only run
> gce-xfstests on Local SSD, and PD Extreme, etc.), failure percentages
> would never make sense for fs/ext4/excludes, since that covers
> multiple file system configs.  And my infrastructure supports kvm,
> gce, and Android, as well as some people (such as at $WORK for our
> data center kernels) who run the test appliacce directly on bare
> metal, so I wouldn't use the failure percentages in these files, etc.
> 
> Now, what I *do* is to track this sort of thing in my own notes, e.g:
> 
> generic/051	ext4/adv	Failure percentage: 16% (4/25)
>     "Basic log recovery stress test - do lots of stuff, shut down in
>     the middle of it and check that recovery runs to completion and
>     everything can be successfully removed afterwards."
> 
> generic/410 nojournal	Couldn't reproduce after running 25 times
>      "Test mount shared subtrees, verify the state transitions..."
> 
> generic/68[12]	encrypt   Failure percentage: 100%
>     The directory does grow, but blocks aren't charged to either root or
>     the non-privileged users' quota.  So this appears to be a real bug.
> 
> 
> There is one thing that I'd like to add to upstream fstests, and that
> is some kind of option so that "check --retry-failures NN" would cause
> fstests to automatically, upon finding a test failure, will rerun that
> failing test NN aditional times.

That makes more sense for me :) I'd like to help the testers to retry the
(randomly) failed cases, to help them to get their testing statistics. That's
better than recording these statistics in fstests itself.

> Another potential related feature
> which we currently have in our daily spinner infrastructure at $WORK
> would be to on a test failure, rerun a test up to M times (typically a
> small number, such as 3), and if it passes on a retry attempt, declare
> the test result as "flaky", and stop running the retries.  If the test
> repeatedly fails after M attempts, then the test result is "fail".
> 
> These results would be reported in the junit XML file, and would allow
> the test runners to annotate their test summaries appropriately.
> 
> I'm thinking about trying to implement something like this in my
> copious spare time; but before I do, does the general idea seem
> acceptable?

After a "./check ..." done, generally fstests shows 3 list:
  Ran: ...
  Not run: ...
  Failures: ...

So you mean if the "--retry-failures N" is specified. we can have one more list
named "Flaky", which is part of "Failures" list, likes:
  Ran: ...
  Not run: ...
  Failures: generic/388 generic/475 xfs/104 xfs/442
  Flaky: generic/388 [2/N] xfs/104 [1/N]

If I understand this correctly, it's acceptable for me. And it might be helpful
for Amir's situation. But let's hear more voice from other developers, if there
is not big objection from other fs maintainers, let's do it :)

BTW, about the new group name to mark cases with random load/operations/env.,
what do you think? Any suggestions or good names for that?

Thanks,
Zorro

> 
> Thanks,
> 
> 					- Ted
> 


^ permalink raw reply

* Re: [PATCH] nvme-fc: mask out blkcg_get_fc_appid() if BLK_CGROUP_FC_APPID is not set
From: Jens Axboe @ 2022-05-19 15:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Hannes Reinecke, Keith Busch, linux-nvme, linux-block, Muneendra,
	James Smart
In-Reply-To: <20220519145931.GA25382@lst.de>

On 5/19/22 8:59 AM, Christoph Hellwig wrote:
> On Thu, May 19, 2022 at 08:57:53AM -0600, Jens Axboe wrote:
>> I'm assuming that commit is from the scsi tree? It's certainly not in
>> mine. So your commit message may be correct, but since it was sent to me,
>> I was assuming it's breakage from my tree. Which doesn't appear to be the
>> case, and I don't see any of the SCSI maintainers on the to/cc.
> 
> Yes, all this went in through the scsi tree.

OK, Hannes please send it to the SCSI list/folks then.

-- 
Jens Axboe


^ permalink raw reply

* Re: [PATCH] nvme-fc: mask out blkcg_get_fc_appid() if BLK_CGROUP_FC_APPID is not set
From: Christoph Hellwig @ 2022-05-19 14:59 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Hannes Reinecke, Christoph Hellwig, Keith Busch, linux-nvme,
	linux-block, Muneendra, James Smart
In-Reply-To: <be88c0b4-ddc6-c851-c160-a929adc1e433@kernel.dk>

On Thu, May 19, 2022 at 08:57:53AM -0600, Jens Axboe wrote:
> I'm assuming that commit is from the scsi tree? It's certainly not in
> mine. So your commit message may be correct, but since it was sent to me,
> I was assuming it's breakage from my tree. Which doesn't appear to be the
> case, and I don't see any of the SCSI maintainers on the to/cc.

Yes, all this went in through the scsi tree.

^ permalink raw reply

* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges
From: Matthew Wilcox @ 2022-05-19 14:58 UTC (permalink / raw)
  To: Zorro Lang
  Cc: Amir Goldstein, Luis Chamberlain, linux-fsdevel, linux-block,
	pankydev8, Theodore Tso, Josef Bacik, jmeneghi, Jan Kara,
	Davidlohr Bueso, Dan Williams, Jake Edge, Klaus Jensen, fstests
In-Reply-To: <20220519112450.zbje64mrh65pifnz@zlang-mailbox>

On Thu, May 19, 2022 at 07:24:50PM +0800, Zorro Lang wrote:
> Yes, we talked about this, but if I don't rememeber wrong, I recommended each
> downstream testers maintain their own "testing data/config", likes exclude
> list, failed ratio, known failures etc. I think they're not suitable to be
> fixed in the mainline fstests.

This assumes a certain level of expertise, which is a barrier to entry.

For someone who wants to check "Did my patch to filesystem Y that I have
never touched before break anything?", having non-deterministic tests
run by default is bad.

As an example, run xfstests against jfs.  Hundreds of failures, including
some very scary-looking assertion failures from the page allocator.
They're (mostly) harmless in fact, just being a memory leak, but it
makes xfstests useless for this scenario.

Even for well-maintained filesystems like xfs which is regularly tested,
I expect generic/270 and a few others to fail.  They just do, and they're
not an indication that *I* broke anything.

By all means, we want to keep tests around which have failures, but
they need to be restricted to people who have a level of expertise and
interest in fixing long-standing problems, not people who are looking
for regressions.

^ permalink raw reply

* Re: [PATCH] nvme-fc: mask out blkcg_get_fc_appid() if BLK_CGROUP_FC_APPID is not set
From: Jens Axboe @ 2022-05-19 14:57 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, Keith Busch, linux-nvme, linux-block,
	Muneendra, James Smart
In-Reply-To: <0b33e180-9e23-f737-3c93-5b5b13a7ded2@suse.de>

On 5/19/22 8:55 AM, Hannes Reinecke wrote:
> On 5/19/22 07:50, Jens Axboe wrote:
>> On 5/19/22 8:45 AM, Hannes Reinecke wrote:
>>> If BLK_CGROUP_FC_APPID is not configured the symbol blkcg_get_fc_appid()
>>> is undefined, so it needs to be masked out.
>>>
>>> This patch is just a diff to the v2 patchset, as the original version has
>>> already been merged.
>>>
>>> Fixes: 980a0e068d14 ("scsi: nvme-fc: Add new routine nvme_fc_io_getuuid()")
>>
>> Either this sha is wrong too, or it's not in my tree yet the breakage is.
>>
> Picked up from linux-next.
> Which tree should I look at?

I'm assuming that commit is from the scsi tree? It's certainly not in
mine. So your commit message may be correct, but since it was sent to me,
I was assuming it's breakage from my tree. Which doesn't appear to be the
case, and I don't see any of the SCSI maintainers on the to/cc.

-- 
Jens Axboe


^ permalink raw reply

* Re: [PATCH] nvme-fc: mask out blkcg_get_fc_appid() if BLK_CGROUP_FC_APPID is not set
From: Hannes Reinecke @ 2022-05-19 14:55 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Keith Busch, linux-nvme, linux-block,
	Muneendra, James Smart
In-Reply-To: <84a7c290-5e75-3e24-0674-7b51dcafa2eb@kernel.dk>

On 5/19/22 07:50, Jens Axboe wrote:
> On 5/19/22 8:45 AM, Hannes Reinecke wrote:
>> If BLK_CGROUP_FC_APPID is not configured the symbol blkcg_get_fc_appid()
>> is undefined, so it needs to be masked out.
>>
>> This patch is just a diff to the v2 patchset, as the original version has
>> already been merged.
>>
>> Fixes: 980a0e068d14 ("scsi: nvme-fc: Add new routine nvme_fc_io_getuuid()")
> 
> Either this sha is wrong too, or it's not in my tree yet the breakage is.
> 
Picked up from linux-next.
Which tree should I look at?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman

^ permalink raw reply

* Re: [PATCH] nvme-fc: mask out blkcg_get_fc_appid() if BLK_CGROUP_FC_APPID is not set
From: Jens Axboe @ 2022-05-19 14:50 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, Keith Busch, linux-nvme, linux-block,
	Muneendra, James Smart
In-Reply-To: <20220519144555.22197-1-hare@suse.de>

On 5/19/22 8:45 AM, Hannes Reinecke wrote:
> If BLK_CGROUP_FC_APPID is not configured the symbol blkcg_get_fc_appid()
> is undefined, so it needs to be masked out.
> 
> This patch is just a diff to the v2 patchset, as the original version has
> already been merged.
> 
> Fixes: 980a0e068d14 ("scsi: nvme-fc: Add new routine nvme_fc_io_getuuid()")

Either this sha is wrong too, or it's not in my tree yet the breakage is.

-- 
Jens Axboe


^ permalink raw reply

* [PATCH] nvme-fc: mask out blkcg_get_fc_appid() if BLK_CGROUP_FC_APPID is not set
From: Hannes Reinecke @ 2022-05-19 14:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Keith Busch, linux-nvme, linux-block,
	Hannes Reinecke, Muneendra, James Smart

If BLK_CGROUP_FC_APPID is not configured the symbol blkcg_get_fc_appid()
is undefined, so it needs to be masked out.

This patch is just a diff to the v2 patchset, as the original version has
already been merged.

Fixes: 980a0e068d14 ("scsi: nvme-fc: Add new routine nvme_fc_io_getuuid()")
Cc: Muneendra <muneendra.kumar@broadcom.com>
Cc: James Smart <jsmart2021@gmail.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/nvme/host/fc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index a11c69e68118..3c778bb0c294 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -1911,7 +1911,9 @@ char *nvme_fc_io_getuuid(struct nvmefc_fcp_req *req)
 	struct nvme_fc_fcp_op *op = fcp_req_to_fcp_op(req);
 	struct request *rq = op->rq;
 
-	return rq->bio ? blkcg_get_fc_appid(rq->bio) : NULL;
+	if (!IS_ENABLED(CONFIG_BLK_CGROUP_FC_APPID) || !rq->bio)
+		return NULL;
+	return blkcg_get_fc_appid(rq->bio);
 }
 EXPORT_SYMBOL_GPL(nvme_fc_io_getuuid);
 
-- 
2.29.2


^ permalink raw reply related

* Re: [PATCH] blk-cgroup: provide stubs for blkcg_get_fc_appid()
From: Christoph Hellwig @ 2022-05-19 14:34 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, Jens Axboe, Keith Busch, linux-block,
	James Smart, linux-scsi
In-Reply-To: <b34a5081-dcb5-dd33-46f4-283e9d31fc0f@suse.de>

On Thu, May 19, 2022 at 04:20:40PM +0200, Hannes Reinecke wrote:
>>
>> No, it does not fix that commit, which is perfectly fine.  It fixes
>> the recently added second caller of blkcg_get_fc_appid, and James
>> has just resent a new version of that which fixes this properly.
>
> Really? blk-cgroup.h provides the function declaration
> blkcg_get_fc_appid() unconditionally, but the implementation
> for blkcg_get_fc_appid() depends on CONFIG_CGROUP.

Take a look at the IS_ENABLED macro that we have for a few years now.

> Neither of which is changed by James patchset.
>
> And besides, the first version is already merged.
>
> Am I missing something?

https://lore.kernel.org/linux-scsi/20220519123110.17361-1-jsmart2021@gmail.com/T/#t


^ permalink raw reply

* Re: [PATCH] blk-cgroup: provide stubs for blkcg_get_fc_appid()
From: Jens Axboe @ 2022-05-19 14:25 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, Keith Busch, linux-block, James Smart,
	linux-scsi
In-Reply-To: <20220519140021.6905-1-hare@suse.de>

On 5/19/22 8:00 AM, Hannes Reinecke wrote:
> Provide stubs for blkcg_set_fc_appid() and  blkcg_get_fc_appid() to allow
> for compilation with cgroups disabled.
> 
> Fixes: db05628435aa ("blk-cgroup: move blkcg_{get,set}_fc_appid out of line")

This isn't a valid sha.

-- 
Jens Axboe


^ permalink raw reply

* Re: [PATCH] blk-cgroup: provide stubs for blkcg_get_fc_appid()
From: Hannes Reinecke @ 2022-05-19 14:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Keith Busch, linux-block, James Smart, linux-scsi
In-Reply-To: <20220519140816.GA21378@lst.de>

On 5/19/22 07:08, Christoph Hellwig wrote:
> On Thu, May 19, 2022 at 04:00:21PM +0200, Hannes Reinecke wrote:
>> Provide stubs for blkcg_set_fc_appid() and  blkcg_get_fc_appid() to allow
>> for compilation with cgroups disabled.
>>
>> Fixes: db05628435aa ("blk-cgroup: move blkcg_{get,set}_fc_appid out of line")
>> Signed-off-by: Hannes Reinecke <hare@suse.de>
> 
> No, it does not fix that commit, which is perfectly fine.  It fixes
> the recently added second caller of blkcg_get_fc_appid, and James
> has just resent a new version of that which fixes this properly.

Really? blk-cgroup.h provides the function declaration
blkcg_get_fc_appid() unconditionally, but the implementation
for blkcg_get_fc_appid() depends on CONFIG_CGROUP.
Neither of which is changed by James patchset.

And besides, the first version is already merged.

Am I missing something?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman

^ permalink raw reply

* Re: [PATCHv2 1/3] block/bio: remove duplicate append pages code
From: Keith Busch @ 2022-05-19 14:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, linux-fsdevel, linux-block, axboe, Kernel Team,
	bvanassche, damien.lemoal
In-Reply-To: <20220519073256.GC22301@lst.de>

On Thu, May 19, 2022 at 09:32:56AM +0200, Christoph Hellwig wrote:
> On Wed, May 18, 2022 at 10:11:29AM -0700, Keith Busch wrote:
> > From: Keith Busch <kbusch@kernel.org>
> > 
> > The setup for getting pages are identical for zone append and normal IO.
> > Use common code for each.
> 
> How about using even more common code and avoiding churn at the same
> time?  Something like:

Yes, I'll fold this in.
 
> diff --git a/block/bio.c b/block/bio.c
> index a3893d80dccc9..15da722ed26d1 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1158,6 +1158,37 @@ static void bio_put_pages(struct page **pages, size_t size, size_t off)
>  		put_page(pages[i]);
>  }
>  
> +static int bio_iov_add_page(struct bio *bio, struct page *page,
> +		unsigned int len, unsigned int offset)
> +{
> +	bool same_page = false;
> +
> +	if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) {
> +		if (WARN_ON_ONCE(bio_full(bio, len)))
> +			return -EINVAL;
> +		__bio_add_page(bio, page, len, offset);
> +		return 0;
> +	}
> +
> +	if (same_page)
> +		put_page(page);
> +	return 0;
> +}
> +
> +static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
> +		unsigned int len, unsigned int offset)
> +{
> +	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> +	bool same_page = false;
> +
> +	if (bio_add_hw_page(q, bio, page, len, offset,
> +			queue_max_zone_append_sectors(q), &same_page) != len)
> +		return -EINVAL;
> +	if (same_page)
> +		put_page(page);
> +	return 0;
> +}
> +
>  #define PAGE_PTRS_PER_BVEC     (sizeof(struct bio_vec) / sizeof(struct page *))
>  
>  /**
> @@ -1176,7 +1207,6 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  	unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
>  	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
>  	struct page **pages = (struct page **)bv;
> -	bool same_page = false;
>  	ssize_t size, left;
>  	unsigned len, i;
>  	size_t offset;
> @@ -1195,18 +1225,18 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  
>  	for (left = size, i = 0; left > 0; left -= len, i++) {
>  		struct page *page = pages[i];
> +		int ret;
>  
>  		len = min_t(size_t, PAGE_SIZE - offset, left);
> +		if (bio_op(bio) == REQ_OP_ZONE_APPEND)	
> +			ret = bio_iov_add_zone_append_page(bio, page, len,
> +					offset);
> +		else
> +			ret = bio_iov_add_page(bio, page, len, offset);
>  
> -		if (__bio_try_merge_page(bio, page, len, offset, &same_page)) {
> -			if (same_page)
> -				put_page(page);
> -		} else {
> -			if (WARN_ON_ONCE(bio_full(bio, len))) {
> -				bio_put_pages(pages + i, left, offset);
> -				return -EINVAL;
> -			}
> -			__bio_add_page(bio, page, len, offset);
> +		if (ret) {
> +			bio_put_pages(pages + i, left, offset);
> +			return ret;
>  		}
>  		offset = 0;
>  	}
> @@ -1215,54 +1245,6 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  	return 0;
>  }
>  
> -static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
> -{
> -	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
> -	unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
> -	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> -	unsigned int max_append_sectors = queue_max_zone_append_sectors(q);
> -	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
> -	struct page **pages = (struct page **)bv;
> -	ssize_t size, left;
> -	unsigned len, i;
> -	size_t offset;
> -	int ret = 0;
> -
> -	if (WARN_ON_ONCE(!max_append_sectors))
> -		return 0;
> -
> -	/*
> -	 * Move page array up in the allocated memory for the bio vecs as far as
> -	 * possible so that we can start filling biovecs from the beginning
> -	 * without overwriting the temporary page array.
> -	 */
> -	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
> -	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
> -
> -	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
> -	if (unlikely(size <= 0))
> -		return size ? size : -EFAULT;
> -
> -	for (left = size, i = 0; left > 0; left -= len, i++) {
> -		struct page *page = pages[i];
> -		bool same_page = false;
> -
> -		len = min_t(size_t, PAGE_SIZE - offset, left);
> -		if (bio_add_hw_page(q, bio, page, len, offset,
> -				max_append_sectors, &same_page) != len) {
> -			bio_put_pages(pages + i, left, offset);
> -			ret = -EINVAL;
> -			break;
> -		}
> -		if (same_page)
> -			put_page(page);
> -		offset = 0;
> -	}
> -
> -	iov_iter_advance(iter, size - left);
> -	return ret;
> -}
> -
>  /**
>   * bio_iov_iter_get_pages - add user or kernel pages to a bio
>   * @bio: bio to add pages to
> @@ -1297,10 +1279,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  	}
>  
>  	do {
> -		if (bio_op(bio) == REQ_OP_ZONE_APPEND)
> -			ret = __bio_iov_append_get_pages(bio, iter);
> -		else
> -			ret = __bio_iov_iter_get_pages(bio, iter);
> +		ret = __bio_iov_iter_get_pages(bio, iter);
>  	} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
>  
>  	/* don't account direct I/O as memory stall */

^ permalink raw reply

* Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges
From: Theodore Ts'o @ 2022-05-19 14:18 UTC (permalink / raw)
  To: Zorro Lang
  Cc: Amir Goldstein, Luis Chamberlain, linux-fsdevel, linux-block,
	pankydev8, Josef Bacik, jmeneghi, Jan Kara, Davidlohr Bueso,
	Dan Williams, Jake Edge, Klaus Jensen, fstests
In-Reply-To: <20220519112450.zbje64mrh65pifnz@zlang-mailbox>

On Thu, May 19, 2022 at 07:24:50PM +0800, Zorro Lang wrote:
> 
> Yes, we talked about this, but if I don't rememeber wrong, I recommended each
> downstream testers maintain their own "testing data/config", likes exclude
> list, failed ratio, known failures etc. I think they're not suitable to be
> fixed in the mainline fstests.

Failure ratios are the sort of thing that are only applicable for

* A specific filesystem
* A specific configuration
* A specific storage device / storage device class
* A specific CPU architecture / CPU speed
* A specific amount of memory available

Put another way, there are problems that fail so close to rarely as to
be "hever" on, say, an x86_64 class server with gobs and gobs of
memory, but which can more reliably fail on, say, a Rasberry PI using
eMMC flash.

I don't think that Luis was suggesting that this kind of failure
annotation would go in upstream fstests.  I suspect he just wants to
use it in kdevops, and hope that other people would use it as well in
other contexts.  But even in the context of test runners like kdevops
and {kvm,gce,android}-xfstests, it's going to be very specific to a
particular test environment, and for the global list of excludes for a
particular file system.  So in the gce-xfstests context, this is the
difference between the excludes in the files:

	fs/ext4/excludes
vs
	fs/ext4/cfg/bigalloc.exclude

even if I only cared about, say, how things ran on GCE using
SSD-backed Persistent Disk (never mind that I can only run
gce-xfstests on Local SSD, and PD Extreme, etc.), failure percentages
would never make sense for fs/ext4/excludes, since that covers
multiple file system configs.  And my infrastructure supports kvm,
gce, and Android, as well as some people (such as at $WORK for our
data center kernels) who run the test appliacce directly on bare
metal, so I wouldn't use the failure percentages in these files, etc.

Now, what I *do* is to track this sort of thing in my own notes, e.g:

generic/051	ext4/adv	Failure percentage: 16% (4/25)
    "Basic log recovery stress test - do lots of stuff, shut down in
    the middle of it and check that recovery runs to completion and
    everything can be successfully removed afterwards."

generic/410 nojournal	Couldn't reproduce after running 25 times
     "Test mount shared subtrees, verify the state transitions..."

generic/68[12]	encrypt   Failure percentage: 100%
    The directory does grow, but blocks aren't charged to either root or
    the non-privileged users' quota.  So this appears to be a real bug.

There is one thing that I'd like to add to upstream fstests, and that
is some kind of option so that "check --retry-failures NN" would cause
fstests to automatically, upon finding a test failure, will rerun that
failing test NN aditional times.  Another potential related feature
which we currently have in our daily spinner infrastructure at $WORK
would be to on a test failure, rerun a test up to M times (typically a
small number, such as 3), and if it passes on a retry attempt, declare
the test result as "flaky", and stop running the retries.  If the test
repeatedly fails after M attempts, then the test result is "fail".

These results would be reported in the junit XML file, and would allow
the test runners to annotate their test summaries appropriately.

I'm thinking about trying to implement something like this in my
copious spare time; but before I do, does the general idea seem
acceptable?

Thanks,

					- Ted

^ permalink raw reply

* Re: [PATCHv2 3/3] block: relax direct io memory alignment
From: Keith Busch @ 2022-05-19 14:08 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, linux-fsdevel, linux-block, axboe, Kernel Team,
	bvanassche, damien.lemoal
In-Reply-To: <20220519073811.GE22301@lst.de>

On Thu, May 19, 2022 at 09:38:11AM +0200, Christoph Hellwig wrote:
> > @@ -1207,6 +1207,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> >  {
> >  	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
> >  	unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
> > +	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> >  	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
> >  	struct page **pages = (struct page **)bv;
> >  	bool same_page = false;
> > @@ -1223,6 +1224,8 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> >  	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
> >  
> >  	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
> > +	if (size > 0)
> > +		size = ALIGN_DOWN(size, queue_logical_block_size(q));
> 
> So if we do get a size that is not logical block size alignment here,
> we reduce it to the block size aligned one below.  Why do we do that?

There are two possibilities:

In the first case, the number of pages in this iteration exceeds bi_max_vecs.
Rounding down completes the bio with a block aligned size, and the remainder
will be picked up for the next bio, or possibly even the current bio if the
pages are sufficiently physically contiguous.

The other case is a bad iov. If we're doing __blkdev_direct_IO(), it will error
out immediately if the rounded size is 0, or the next iteration when the next
size is rounded to 0. If we're doing the __blkdev_direct_IO_simple(), it will
error out when it sees the iov hasn't advanced to the end.

And ... I just noticed I missed the size check __blkdev_direct_IO_async().
 
> > +	if ((pos | iov_iter_count(iter)) & (bdev_logical_block_size(bdev) - 1))
> > +		return -EINVAL;
> > +	if (iov_iter_alignment(iter) & bdev_dma_alignment(bdev))
> >  		return -EINVAL;
> 
> Can we have a little inline helper for these checks instead of
> duplicating them three times?

Absolutely.

> > diff --git a/fs/direct-io.c b/fs/direct-io.c
> > index 840752006f60..64cc176be60c 100644
> > --- a/fs/direct-io.c
> > +++ b/fs/direct-io.c
> > @@ -1131,7 +1131,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
> >  	struct dio_submit sdio = { 0, };
> >  	struct buffer_head map_bh = { 0, };
> >  	struct blk_plug plug;
> > -	unsigned long align = offset | iov_iter_alignment(iter);
> > +	unsigned long align = iov_iter_alignment(iter);
> 
> I'd much prefer to not just relax this for random file systems,
> and especially not the legacy direct I/O code.  I think we can eventually
> do iomap, but only after an audit and test of each file system, which
> might require a new IOMAP_DIO_* flag at least initially.

I did some testing with xfs, but I can certainly run more a lot more tests. I
do think filesystem support for this capability is important, so I hope we
eventually get there.

^ permalink raw reply

* Re: [PATCH] blk-cgroup: provide stubs for blkcg_get_fc_appid()
From: Christoph Hellwig @ 2022-05-19 14:08 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch, linux-block,
	James Smart, linux-scsi
In-Reply-To: <20220519140021.6905-1-hare@suse.de>

On Thu, May 19, 2022 at 04:00:21PM +0200, Hannes Reinecke wrote:
> Provide stubs for blkcg_set_fc_appid() and  blkcg_get_fc_appid() to allow
> for compilation with cgroups disabled.
> 
> Fixes: db05628435aa ("blk-cgroup: move blkcg_{get,set}_fc_appid out of line")
> Signed-off-by: Hannes Reinecke <hare@suse.de>

No, it does not fix that commit, which is perfectly fine.  It fixes
the recently added second caller of blkcg_get_fc_appid, and James
has just resent a new version of that which fixes this properly.

^ permalink raw reply

* [PATCH] blk-cgroup: provide stubs for blkcg_get_fc_appid()
From: Hannes Reinecke @ 2022-05-19 14:00 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Keith Busch, linux-block, James Smart,
	linux-scsi, Hannes Reinecke

Provide stubs for blkcg_set_fc_appid() and  blkcg_get_fc_appid() to allow
for compilation with cgroups disabled.

Fixes: db05628435aa ("blk-cgroup: move blkcg_{get,set}_fc_appid out of line")
Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 include/linux/blk-cgroup.h | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 9f40dbc65f82..4756f4d2b8e7 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -33,6 +33,9 @@ void blkcg_unpin_online(struct cgroup_subsys_state *blkcg_css);
 struct list_head *blkcg_get_cgwb_list(struct cgroup_subsys_state *css);
 struct cgroup_subsys_state *bio_blkcg_css(struct bio *bio);
 
+int blkcg_set_fc_appid(char *app_id, u64 cgrp_id, size_t app_id_len);
+char *blkcg_get_fc_appid(struct bio *bio);
+
 #else	/* CONFIG_BLK_CGROUP */
 
 #define blkcg_root_css	((struct cgroup_subsys_state *)ERR_PTR(-EINVAL))
@@ -44,9 +47,15 @@ static inline struct cgroup_subsys_state *bio_blkcg_css(struct bio *bio)
 {
 	return NULL;
 }
+static inline int blkcg_set_fc_appid(char *app_id, u64 cgrp_id,
+				     size_t app_id_len)
+{
+	return -EINVAL;
+}
+static inline char *blkcg_get_fc_appid(struct bio *bio)
+{
+	return NULL;
+}
 #endif	/* CONFIG_BLK_CGROUP */
 
-int blkcg_set_fc_appid(char *app_id, u64 cgrp_id, size_t app_id_len);
-char *blkcg_get_fc_appid(struct bio *bio);
-
 #endif	/* _BLK_CGROUP_H */
-- 
2.29.2


^ permalink raw reply related

* Re: [PATCH] blk-cgroup: provide stubs for blkcg_get_fc_appid()
From: Jens Axboe @ 2022-05-19 13:36 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Christoph Hellwig, Keith Busch, linux-block
In-Reply-To: <20220519130207.6492-1-hare@suse.de>

On 5/19/22 7:02 AM, Hannes Reinecke wrote:
> Signed-off-by: Hannes Reinecke <hare@suse.de>

This needs both an actual commit message and a Fixes tag.

-- 
Jens Axboe


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox