Re: xlog_write: reservation ran out

Linux XFS filesystem development
 help / color / mirror / Atom feed

From: Brian Foster <bfoster@redhat.com>
To: Ming Lin <mlin@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>,
	Christoph Hellwig <hch@lst.de>,
	linux-xfs@vger.kernel.org,
	Ceph Development <ceph-devel@vger.kernel.org>,
	"LIU, Fei" <james.liu@alibaba-inc.com>,
	xiongwei.jiang@alibaba-inc.com, boqian.zy@alibaba-inc.com,
	sheng.qiu@alibaba-inc.com, Sage Weil <sweil@redhat.com>
Subject: Re: xlog_write: reservation ran out
Date: Mon, 1 May 2017 14:48:24 -0400	[thread overview]
Message-ID: <20170501184822.GB3775@bfoster.bfoster> (raw)
In-Reply-To: <70def595-91a9-970a-d227-547055da5bb5@kernel.org>

On Mon, May 01, 2017 at 10:41:06AM -0700, Ming Lin wrote:
> On 5/1/2017 6:12 AM, Brian Foster wrote:
> > On Sun, Apr 30, 2017 at 11:10:15PM -0700, Ming Lin wrote:
> >>
> >> On 4/28/2017 1:56 PM, Ming Lin wrote:
> >>> I'm new to xfs code.
> >>>
> >>> Search  XFS_TRANS_INACTIVE and the usage is like below,
> >>>
> >>> xfs_trans_alloc(mp, XFS_TRANS_INACTIVE);
> >>> xfs_trans_reserve(tp, &M_RES(mp)->tr_itruncate, 0, 0);
> >>>
> >>> xfs_trans_alloc(mp, XFS_TRANS_INACTIVE);
> >>> xfs_trans_reserve(tp, &M_RES(mp)->tr_ifree, XFS_IFREE_SPACE_RES(mp), 0);
> >>>
> >>> seems tr_remove is not related.
> >>> I'll just try to enlarge the reservation for tr_itruncate and tr_ifree.
> >>
> >> Now things are a little bit more clear. I tried below debug patch.
> >> The t_decrease[] array was used to track where the space was used.
> >>
> >>  fs/xfs/libxfs/xfs_trans_resv.c |  4 ++--
> >>  fs/xfs/xfs_log.c               | 23 ++++++++++++++++++++---
> >>  fs/xfs/xfs_log_cil.c           |  8 ++++++++
> >>  fs/xfs/xfs_log_priv.h          |  3 +++
> >>  fs/xfs/xfs_super.c             |  1 +
> >>  5 files changed, 34 insertions(+), 5 deletions(-)
> >>
> > ...
> >>  277 static void
> >>  278 xlog_cil_insert_items(
> >>  279         struct xlog             *log,
> >>  280         struct xfs_trans        *tp)
> >>  281 {
> >>
> >> ....
> >>
> >>  340         /* do we need space for more log record headers? */
> >>  341         iclog_space = log->l_iclog_size - log->l_iclog_hsize;
> >>  342         if (len > 0 && (ctx->space_used / iclog_space !=
> >>  343                                 (ctx->space_used + len) / iclog_space)) {
> >>  344                 int hdrs;
> >>  345
> >>  346                 hdrs = (len + iclog_space - 1) / iclog_space;
> >>  347                 /* need to take into account split region headers, too */
> >>  348                 hdrs *= log->l_iclog_hsize + sizeof(struct xlog_op_header);
> >>  349                 ctx->ticket->t_unit_res += hdrs;
> >>  350                 ctx->ticket->t_curr_res += hdrs;
> >>  351                 tp->t_ticket->t_curr_res -= hdrs;
> >>  352                 tp->t_ticket->t_decrease[6] = hdrs;
> >>  353                 ASSERT(tp->t_ticket->t_curr_res >= len);
> >>  354         }
> >>  355         tp->t_ticket->t_curr_res -= len;
> >>  356         tp->t_ticket->t_decrease[7] = len;
> >>  357         ctx->space_used += len;
> >>  358
> >>  359         spin_unlock(&cil->xc_cil_lock);
> >>  360 }
> >>
> >> Any idea why it used so many reservation space here?
> >>
> > 
> > Nothing really rings a bell for me atm. Perhaps others might have ideas.
> > That does appear to be a sizable overrun, as opposed to a few bytes that
> > could more likely be attributed to rounding, header accounting issues or
> > something of that nature.
> 
> FYI, here are some numbers.
> 
> The original "unit res" is 83024. I made it x2 larger, so now it's 166048
> "unit res" - "current res" = the reservation space already used
> 
> XFS (nvme10n1p1): xlog_write: reservation summary:
>   trans type  = INACTIVE (3)
>   unit res    = 166048 bytes
>   current res = 77088 bytes
>   total reg   = 0 bytes (o/flow = 0 bytes)
>   ophdrs      = 0 (ophdr space = 0 bytes)
>   ophdr + reg = 0 bytes
>   num regions = 0
> 
> "already used" = 166048 - 77088 = 88960
> overrun = 88960 - 83024 = 5936
> 
> 
> XFS (nvme7n1p1): xlog_write: reservation summary:
>   trans type  = INACTIVE (3)
>   unit res    = 166048 bytes
>   current res = 53444 bytes
>   total reg   = 0 bytes (o/flow = 0 bytes)
>   ophdrs      = 0 (ophdr space = 0 bytes)
>   ophdr + reg = 0 bytes
>   num regions = 0
> 
> "already used" = 166048 - 53444 = 112604
> overrun = 112604 - 83024 = 29580
> 
> The overrun bytes seems a lot to me.
> 

Yes.

> > 
> > The debug code doesn't really tell us much beyond that the transaction
> > required logging more data than it had reserved. In the snippet above,
> > len essentially refers to a byte total of what is logged across all of
> > the various items (inode, buffers, etc.) in the transaction.
> > 
> > I'm assuming you can reproduce this often enough if you can capture
> 
> It takes about 10 hours to reproduce the problem.
> 

Out of curiosity, is that 10 hours of removing files or 10 hours of
repopulating and removing until the problem happens to occur? If the
latter, roughly how many fill/remove cycles does that entail (tens,
hundreds, thousands)?

> > debug information. Have you tried to reproduce the actual transaction
> > overrun without using Ceph (i.e., create the fs using ceph as normal,
> > but run the object removal directly)? If you can do that, you could
> 
> Not exactly same.
> But I did try just write the xfs fs with fio(64 threads) to 80% full,
> then remove the files, but can't reproduce it.
> 

You could try to populate the fs using Ceph as with your current
reproducer, particularly since it may use patterns or features that
could affect this problem (xattrs?) that fio may not induce, and then
try to directly reproduce the overrun via manual file removals. This
would be sufficient for debugging because if you can share a metadump
image of the original fs and appropriate steps to reproduce, we don't
particularly need to care about how the fs was constructed in the first
place.

For example, if you have a test that currently populates and depopulates
the fs through Ceph, something I might try is to update the test to
generate a metadump image of the fs every time your test cycles from
populating to depopulating. Once the problem reproduces, you now have a
metadump image of the original fs that you can restore and use to try to
reproduce the overrun manually (repeatedly, if nec.).

> > create an xfs_metadump of the populated fs, run a more simple reproducer
> > on that and that might make it easier to 1.) try newer distro and/or
> > upstream kernels to try and isolate where the problem exists and/or 2.)
> > share it so we can try to reproduce and narrow down where the overrun
> > seems to occur (particularly if this hasn't already been fixed
> > somewhere).
> 
> I'll try to find a more simple reproducer.
> 

Thanks.

Brian

> Thanks,
> Ming
> 
> > 
> > Brian
> > 
> >> Thanks,
> >> Ming
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

next prev parent reply	other threads:[~2017-05-01 18:48 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <CAF1ivSZeYhnzh6CPZViYUNFXdm33hhTA_CzP7V745QociEEnKg@mail.gmail.com>
2017-04-28 20:24 ` xlog_write: reservation ran out Ming Lin
2017-04-28 20:56   ` Ming Lin
2017-05-01  6:10     ` Ming Lin
2017-05-01 13:12       ` Brian Foster
2017-05-01 17:41         ` Ming Lin
2017-05-01 18:48           ` Brian Foster [this message]
2017-05-01 20:18             ` Ming Lin
2017-05-02 23:51           ` Dave Chinner
2017-05-03 14:10             ` Brian Foster
2017-05-03 15:53               ` Darrick J. Wong
2017-05-03 16:15                 ` Brian Foster
2017-05-04  0:37                   ` Dave Chinner
2017-05-04  0:34               ` Dave Chinner
2017-05-04 11:51                 ` Brian Foster
2017-05-05  3:16                   ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170501184822.GB3775@bfoster.bfoster \
    --to=bfoster@redhat.com \
    --cc=boqian.zy@alibaba-inc.com \
    --cc=ceph-devel@vger.kernel.org \
    --cc=david@fromorbit.com \
    --cc=hch@lst.de \
    --cc=james.liu@alibaba-inc.com \
    --cc=linux-xfs@vger.kernel.org \
    --cc=mlin@kernel.org \
    --cc=sheng.qiu@alibaba-inc.com \
    --cc=sweil@redhat.com \
    --cc=xiongwei.jiang@alibaba-inc.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox