From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Wengang Wang <wen.gang.wang@oracle.com>,
"linux-xfs@vger.kernel.org" <linux-xfs@vger.kernel.org>,
Srikanth C S <srikanth.c.s@oracle.com>
Subject: Re: Question: reserve log space at IO time for recover
Date: Wed, 26 Jul 2023 08:23:20 -0700
Message-ID: <20230726152320.GA11352@frogsfrogsfrogs>
In-Reply-To: <ZMCcJSLiWIi3KBOl@dread.disaster.area>
On Wed, Jul 26, 2023 at 02:08:05PM +1000, Dave Chinner wrote:
> On Mon, Jul 24, 2023 at 06:03:02PM +0000, Wengang Wang wrote:
> > > On Jul 23, 2023, at 5:57 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > On Fri, Jul 21, 2023 at 07:36:03PM +0000, Wengang Wang wrote:
> > >> FYI:
> > >>
> > >> I am able to reproduce the XFS mount hang issue with hacked kernels based on
> > >> both the 4.14.35 and 6.4.0 kernels.
> > >> Reproduce steps:
> > >>
> > >> 1. create an XFS filesystem with a 10 MiB log (small, so it is easier to
> > >> reproduce). The following steps all aim at this XFS volume.
> > >
> > > Actually, make that a few milliseconds.... :)
> >
> > :)
> >
> > > mkfs/xfs_info output would be appreciated.
> >
> > sure,
> > # xfs_info 20GB.bk2
> > meta-data=20GB.bk2 isize=256 agcount=4, agsize=1310720 blks
> > = sectsz=512 attr=2, projid32bit=1
> > = crc=0 finobt=0, sparse=0, rmapbt=0
> > = reflink=0
>
> Hmmmm. Why are you only testing v4 filesystems? They are deprecated
> and support is largely due to be dropped from upstream in 2025...
>
> Does the same problem occur with a v5 filesystem?
>
> > >> 5. Checking the free log space left on disk, it’s 181760 bytes for both the
> > >> 4.14.35 kernel and the 6.4.0 kernel.
> > >
> > > Which is clearly wrong. It should be at least 360416 bytes (i.e.
> > > tr_itrunc), because that's what the EFI being processed that pins
> > > the tail of the log is supposed to have reserved when it was
> > > stalled.
> >
> > Yep, exactly.
> >
> > > So where has the ~180kB of leaked space come from?
> > >
> > > Have you traced the grant head reservations to find out
> > > what the runtime log space and grant head reservations actually are?
> > I have the numbers in vmcore (ignore the WARNs),
>
> That's not what I'm asking. You've dumped the values at the time of
> the hang, not traced the runtime reservations that have been made.
>
> > > i.e. we have full tracing of the log reservation accounting via
> > > tracepoints in the kernel. If there is a leak occurring, you need to
> > > capture a trace of all the reservation accounting operations and
> > > post process the output to find out what operation is leaking
> > > reserved space. e.g.
> > >
> > > # trace-cmd record -e xfs_log\* -e xlog\* -e printk touch /mnt/scratch/foo
> > > ....
> > > # trace-cmd report > s.t
> > > # head -3 s.t
> > > cpus=16
> > > touch-289000 [008] 430907.633820: xfs_log_reserve: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 1024 grant_write_cycle 1 grant_write_bytes 1024 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2
> > > touch-289000 [008] 430907.633829: xfs_log_reserve_exit: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 482800 grant_write_cycle 1 grant_write_bytes 482800 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2
> > >
> > > #
> > >
> > > So this tells us the transaction reservation unit size, the count of
> > > reservations, the current reserve and grant head locations, and the
> > > current head and tail of the log at the time the transaction
> > > reservation is started and then after it completes.
> >
> > Will do that and report back. Do you want the full log or only some
> > typical entries? The full log would be big - how shall I share it?
>
> I don't want to see the log. It'll be huge - I regularly generate
> gigabytes of log accounting traces like this from a single
> workload.
>
> What I'm asking you to do is run the tracing and then post process
> the values from the trace to determine what operation is using more
> space than is being freed back to the log.
>
> I generally do this with grep, awk and sed. Some people use python
> or perl. But either way it's a *lot* of work - in the past I have
> spent _weeks_ on trace analysis to find a 4 byte leak in the log
> space accounting. Doing things like graphing the head, tail and grant
> spaces over time tends to show whether this is a gradual leak versus a
> sudden step change. If it's a sudden step change, then you can
> isolate it in the trace and work out what happened. If it's a
> gradual change, then you need to start looking for accounting
> discrepancies...
Any chance you'd be willing to share that pipeline? It'd be useful to
stash that kind of debugging program in xfsprogs/tools to save time and
eliminate an entire class of "Hey Dave, did I capture this correctly?"
questions.
(At least until someone changes the tracepoints :P)
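In the meantime, here's roughly the shape of thing I had in mind for
xfsprogs/tools - a sketch only, built against the trace-cmd report line
format from your example above. The heuristic of charging each movement
of the reserve grant head to the task in the current trace record is
mine, not anything xfsprogs ships today, and it deliberately ignores
cycle wraps:

```python
#!/usr/bin/env python3
# Rough sketch of a trace-cmd post-processor for the XFS log
# reservation tracepoints.  The line format is taken from the example
# earlier in this thread; attributing reserve grant head movement to
# the task in the current record is a heuristic, not established
# xfsprogs tooling.
import re
from collections import defaultdict

# e.g. "touch-289000 [008] 430907.633820: xfs_log_reserve: dev 253:32 ..."
LINE_RE = re.compile(r'^\s*(\S+)\s+\[\d+\]\s+([\d.]+):\s+(\w+):\s+(.*)$')

def kv_fields(payload):
    """Pull 'name value' pairs with integer values out of the payload."""
    return {m.group(1): int(m.group(2))
            for m in re.finditer(r'(\w+) (-?\d+)', payload)}

def process(lines):
    """Net movement of grant_reserve_bytes charged to each task.

    A task whose total keeps growing is consuming reserve-head space
    that never comes back - a candidate for the leak.  Caveat: this
    ignores cycle wraps (grant_reserve_cycle), which a real tool
    would have to handle.
    """
    net = defaultdict(int)
    last_grant = None
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        task, _ts, event, payload = m.groups()
        if not event.startswith(('xfs_log', 'xlog')):
            continue
        fields = kv_fields(payload)
        if 'grant_reserve_bytes' not in fields:
            continue
        grant = fields['grant_reserve_bytes']
        if last_grant is not None:
            # Charge the reserve head movement since the previous
            # event to the task that caused the current event.
            net[task] += grant - last_grant
        last_grant = grant
    return dict(net)
```

Wire process() up to `trace-cmd report` output (stdin or a file) and
sort the totals; tasks whose net consumption never returns toward zero
are where I'd start digging.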
--D
> e.g. a transaction records 32 bytes used in the item, so it releases
> t_unit - 32 bytes at commit. However, the CIL may then only track 28
> bytes of space for the item in the journal and we leak 4 bytes of
> reservation on every one of those items committed.
>
> These sorts of leaks typically only add up to something
> significant in situations where the log is flooded with tiny inode
> timestamp changes - 4 bytes per item doesn't really matter when you
> only have a few thousand items in the log, but when you have
> hundreds of thousands of tiny items in the log...
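Tangent: the arithmetic on that example is worth spelling out, because
it shows why the 10 MiB log in the reproducer hurts so much. A toy
model with the illustrative numbers from the scenario above (not
measured from any real trace):

```python
# Toy model of the hypothetical 4-byte-per-item accounting leak
# described above.  All numbers are illustrative, not measured.
ITEM_SPACE_CHARGED = 32   # bytes the transaction accounts to the item
ITEM_SPACE_TRACKED = 28   # bytes the CIL actually tracks for it
LEAK_PER_ITEM = ITEM_SPACE_CHARGED - ITEM_SPACE_TRACKED   # 4 bytes

def leaked_bytes(items_in_log):
    """Reservation that is never handed back to the grant heads."""
    return items_in_log * LEAK_PER_ITEM

# A few thousand items in the log is noise; hundreds of thousands of
# tiny timestamp updates is real money against a 10 MiB log.
for n in (5_000, 500_000):
    print(f'{n:>7} items -> {leaked_bytes(n):>9} bytes leaked')
```

Half a million four-byte leaks is ~2 MB of reservation gone - about a
fifth of the whole log in Wengang's reproducer.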
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com