* [RFC] Delayed logging
@ 2010-03-15 4:30 Dave Chinner
2010-03-15 16:01 ` Christoph Hellwig
` (3 more replies)
0 siblings, 4 replies; 11+ messages in thread
From: Dave Chinner @ 2010-03-15 4:30 UTC (permalink / raw)
To: xfs
Hi Folks,
You've all heard me talking about delayed logging, but there hasn't
been any code yet. Well, here's the first code drop - see the git
tree reference at the end of the email to get it.
If you want to know what delayed logging is and how it works, pull
the tree and read the documentation in:
Documentation/filesystems/xfs-delayed-logging-design.txt
or navigate to it via gitweb from here:
http://git.kernel.org/?p=linux/kernel/git/dgc/xfs.git
The delayed-logging branch that the code lives in may be rebased at
any time, hence I'm not going to point you at commits because they
won't be stable. It also means any time you want to update, you need
to need to pull into a clean new branch.
Overall, it's not a huge change:
19 files changed, 2594 insertions(+), 580 deletions(-)
Especially when you take away the 819 lines of design documentation.
It's still a large change, though, when you consider how critical
this code is. :/
Now you know what it is, the code in the tree implements the
documented design. While the code passes XFSQA on my test systems,
there are still occassional failures that have not been resolved and
there has been almost no stress testing of the code been
done. Hence:
*** USE THIS CODE AT YOUR OWN RISK ***
At present, I have done no performance testing on production kernel
configurations - all my testing has been done with CONFIG_XFS_DEBUG
enabled along with various other kernel debugging features as well.
Hence I've really only been looking at significant deviations in
performance (up or down) to determine whether the code is meeting
design goals or not.
The following results are from a synthetic test designed to show
just the impact of delayed logging on the amount of metadata
written to the log.
load: Sequential create 100k zero-length files in a directory per
thread, no fsync between create and unlink.
(./fs_mark -S0 -n 100000 -s 0 -d ....)
measurement: via PCP. XFS specific metrics:
xfs.log.blocks
xfs.log.writes
xfs.log.noiclogs
xfs.log.force
xfs.transactions.*
xfs.dir_ops.create
xfs.dir_ops.remove
machine:
2GHz Dual core opteron, 3GB RAM
single 36GB 15krpm scsi drive w/ CTQ depth=32
mkfs.xfs -f -l size=128m /dev/sdb2
Current code:
mount -o "logbsize=262144" /dev/sdb2 /mnt/scratch
threads: fs_mark CPU create log unlink log
throughput bandwidth bandwidth
1 2900/s 75% 34MB/s 34MB/s
2 2850/s 75% 33MB/s 33MB/s
4 2800/s 80% 30MB/s 30MB/s
Delayed logging:
mount -o "delaylog,logbsize=262144" /dev/sdb2 /mnt/scratch
threads: fs_mark CPU create log unlink log
throughput bandwidth bandwidth
1 4300/s 110% 1.5MB/s <1MB/s
2 7900/s 195% <4MB/s <1MB/s
4 7500/s 200% <5MB/s <1.5MB/s
I think it pretty clear that the design goal of "an order of
magnitude less log IO bandwidth" is being met here. Scalability is
looking promising, but a 2p machine is not large enough to make any
definitive statements about that. Hence from these results the
implementation is at or exceeding design levels.
Known issues that need to be resolved:
- xfslogd can effectively lock up spinning for 10s of
seconds at a time under heavy load. Cause unknown,
needs analysis and fixing.
- leaks memory in some error paths.
- occasional recovery failure with recovery reading an inode
buffer that does not contain inodes. Cause unknown, tends
to be reproduced by xfsqa test 121 semi-reliably. Needs
further analysis and fixing. May already be fixed with a
recent fix to commit record synchronisation.
- Checkpoint log ticket allocation is less than ideal - can
also trigger lockdep warnings if we re-enter the FS. =>
needs KM_NOFS and a cleanup.
- stress will probably break it. Need to run a variety of
workloads/benchmarks and sort out issues that are
uncovered.
- scalability, while improved, is still largely an unknown.
Will need to run tests on big machines to find where new
contention points have been introduced.
- impact on sync/fsync heavy workloads largely unknown. It
should not be significant, but needs testing and analysis.
- determine if the current checkpoint sizing is appropriate,
or whether further dynamic sizing (e.g. based on log size)
needs investigation.
Further algorithmic optimisations:
- busy extent tracking is still not ideal - we can get lots
(thousands) of adjacent single extents in the same
transaction so combining them at transaction commit would
be advantageous.
- Don't need barriers on every single log IO. Indeed,
funnelling 8MB of IO through 8x256k buffers is not really
ideal. Only really need barrier on first IO of checkpoint
(to ensure all the changes we are about to overwrite are on
disk already) and last IO (to ensure commit record hits
the disk).
- commit record synchronisation is simplistic and can cause
too many wakeups. needs to be smarter about finding
previous sequences to wait on.
- AIL pushing can trigger far too many log forces in a short
period of time.
- start looking at areas where CPU usage is excessive and
try to trim it.
There's still a lot of work to do before this is production ready,
but I think it's stable enough now that the code is not going to
change significantly as a result of trying to fix bugs that are
lurking. Currently I'm aiming for experimental inclusion into
mainline for 2.6.35, with the aim for it to be production ready by
2.6.37 and the default for 2.6.39.
Anyway, here's the details of the tree. Note that this branch
includes a merge of the trans-cleanup branch as it is dependent on
those changes.
The following changes since commit 5077f72749e6a78eb57211caf337cda8297bf882:
Dave Chinner (1):
xfs: don't warn about page discards on shutdown
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfs.git delayed-logging
Dave Chinner (19):
xfs: introduce new internal log vector structure
xfs: factor xlog_write and make use of new log vector structure
xfs: Delayed logging design documentation
xfs: introduce delayed logging mount option
xfs: extend the log item to support delayed logging
xfs: Introduce the Committed Item List
xfs: Add delayed logging checkpoint context infrastructure
xfs: introduce new chained log vector transaction formatting code
xfs: format and insert log vectors into the CIL
xfs: attach transactions to the checkpoint context
xfs: checkpoint transaction infrastructure
xfs: Allow multiple in-flight checkpoints
xfs: forced unmounts need to push the CIL
xfs: enable background pushing of the CIL
xfs: modify buffer item reference counting for delayed logging
XFS: replace fixed size busy extent array with an rbtree
XFS: Don't use log forces when busy extents are allocated
XFS: Simplify transaction busy extent tracking
xfs: cluster fsync transaction
.../filesystems/xfs-delayed-logging-design.txt | 819 ++++++++++++++++++++
fs/xfs/Makefile | 1 +
fs/xfs/linux-2.6/xfs_buf.c | 9 +
fs/xfs/linux-2.6/xfs_file.c | 65 ++-
fs/xfs/linux-2.6/xfs_super.c | 9 +
fs/xfs/linux-2.6/xfs_trace.h | 80 ++-
fs/xfs/xfs_ag.h | 21 +-
fs/xfs/xfs_alloc.c | 257 ++++---
fs/xfs/xfs_alloc.h | 5 +-
fs/xfs/xfs_buf_item.c | 33 +-
fs/xfs/xfs_log.c | 679 +++++++++++------
fs/xfs/xfs_log.h | 15 +-
fs/xfs/xfs_log_cil.c | 698 +++++++++++++++++
fs/xfs/xfs_log_priv.h | 117 +++-
fs/xfs/xfs_mount.h | 1 +
fs/xfs/xfs_trans.c | 193 ++++-
fs/xfs/xfs_trans.h | 53 +-
fs/xfs/xfs_trans_item.c | 109 ---
fs/xfs/xfs_trans_priv.h | 10 +-
19 files changed, 2594 insertions(+), 580 deletions(-)
create mode 100644 Documentation/filesystems/xfs-delayed-logging-design.txt
create mode 100644 fs/xfs/xfs_log_cil.c
Anyway that's it for now - comments, thoughts, bug fixes, etc are
welcome. :)
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC] Delayed logging
2010-03-15 4:30 [RFC] Delayed logging Dave Chinner
@ 2010-03-15 16:01 ` Christoph Hellwig
2010-03-15 16:33 ` Christoph Hellwig
2010-03-15 16:08 ` Christoph Hellwig
` (2 subsequent siblings)
3 siblings, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2010-03-15 16:01 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
Running the series on my KVM testbox trips out this assert in
xfstests 013, which suggests we might be calling iput one too
much somewhere (e.g. IOP_UNLOCK)):
[ 132.174780] Assertion failed: atomic_read(&VFS_I(dp)->i_count) > 0, file: fs/xfs/xfs_vnodeops.c, line: 1430
[ 132.177814] ------------[ cut here ]------------
[ 132.179219] kernel BUG at fs/xfs/support/debug.c:109!
[ 132.180818] invalid opcode: 0000 [#1] SMP
[ 132.181734] last sysfs file: /sys/devices/virtio-pci/virtio1/block/vdb/removable
[ 132.181734] Modules linked in:
[ 132.181734]
[ 132.181734] Pid: 7752, comm: fsstress Not tainted 2.6.33-xfs #488 /Bochs
[ 132.181734] EIP: 0060:[<c04ef4fe>] EFLAGS: 00010286 CPU: 0
[ 132.181734] EIP is at assfail+0x1e/0x30
[ 132.181734] EAX: 00000072 EBX: f5aba270 ECX: f5551200 EDX: 01704000
[ 132.181734] ESI: 00000000 EDI: f5f701f0 EBP: f5307e48 ESP: f5307e38
[ 132.181734] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 132.181734] Process fsstress (pid: 7752, ti=f5306000 task=f5551200 task.ti=f5306000)
[ 132.181734] Stack:
[ 132.181734] c0b75db8 c0b74e40 c0b30914 00000596 f5307ec0 c04ded8e 00000001 00000000
[ 132.181734] <0> 00000000 00000000 00000001 f5307eb0 f5307ea8 00000000 00002124 f5307ee0
[ 132.181734] <0> 000268b8 00000002 00000000 00000000 00000029 00000000 00000000 00000000
[ 132.181734] Call Trace:
[ 132.181734] [<c04ded8e>] ? xfs_create+0x42e/0x640
[ 132.181734] [<c04eb768>] ? xfs_vn_mknod+0xa8/0x180
[ 132.181734] [<c04eb6c0>] ? xfs_vn_mknod+0x0/0x180
[ 132.181734] [<c020eaaf>] ? vfs_mknod+0xcf/0x160
[ 132.181734] [<c020ed1b>] ? sys_mknodat+0x1db/0x230
[ 132.181734] [<c020ed97>] ? sys_mknod+0x27/0x30
[ 132.181734] [<c012ea9c>] ? sysenter_do_call+0x12/0x3c
[ 132.181734] Code: 00 e8 f7 57 18 00 c9 c3 90 8d 74 26 00 55 89 e5 83
ec 10 89 4c 24 0c 89 54 24 08 89 44 24 04 c7 04 24 b8 5d b7 c0 e8 59 a2
3f 00 <0f> 0b eb fe 8d b4 26 00 00 00 00 8d bc 27 00 00 00 00 55 89 e5
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC] Delayed logging
2010-03-15 4:30 [RFC] Delayed logging Dave Chinner
2010-03-15 16:01 ` Christoph Hellwig
@ 2010-03-15 16:08 ` Christoph Hellwig
2010-03-16 1:14 ` Dave Chinner
2010-03-15 16:38 ` Christoph Hellwig
2010-03-15 18:08 ` Christoph Hellwig
3 siblings, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2010-03-15 16:08 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
Some comments on the patches:
- "xfs: cluster fsync transaction" seems like something that can stand
on it's own and go into the tree now. Despite the comments it
currently only clusters in fsync and not write_inode which might
lead to higher benefits, btw.
- the busy extent tracking might be worth to be reordered before the
delayed logging series. In itself it might also want some reordering
as there's a lot churn in the patches there. Making
"XFS: Simplify transaction busy extent tracking" first in that
subseries might help quite a bit to reduce that churn
- The actual CIL implementation seems to be split into too small
patches IMHO. E.g. "xfs: extend the log item to support delayed
logging" and "xfs: Introduce the Committed Item List" are two
sides of the same coin and splitting it might not make too much
sense.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC] Delayed logging
2010-03-15 16:01 ` Christoph Hellwig
@ 2010-03-15 16:33 ` Christoph Hellwig
2010-03-16 1:06 ` Dave Chinner
0 siblings, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2010-03-15 16:33 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On Mon, Mar 15, 2010 at 12:01:54PM -0400, Christoph Hellwig wrote:
> Running the series on my KVM testbox trips out this assert in
> xfstests 013, which suggests we might be calling iput one too
> much somewhere (e.g. IOP_UNLOCK)):
>
> [ 132.174780] Assertion failed: atomic_read(&VFS_I(dp)->i_count) > 0, file: fs/xfs/xfs_vnodeops.c, line: 1430
Turns out this was due to the fsync clustering patch, which adds inodes
to a transaction without first grabbing a reference to them.
Unofrtunately grabbing the reference is not a easy as it sounds, as
inodes in the cluster might be on the unused list or even not have
valid VFS state anymore, so we can't simply call IHOLD here. I've
reverted the patch in my local tree for now.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC] Delayed logging
2010-03-15 4:30 [RFC] Delayed logging Dave Chinner
2010-03-15 16:01 ` Christoph Hellwig
2010-03-15 16:08 ` Christoph Hellwig
@ 2010-03-15 16:38 ` Christoph Hellwig
2010-03-16 1:15 ` Dave Chinner
2010-03-15 18:08 ` Christoph Hellwig
3 siblings, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2010-03-15 16:38 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
Next xfstests failure in 014 (note that this is all without even
enabling the delaylog mount option):
[ 392.576013] Assertion failed: xfs_alloc_busy_search(mp, busyp->agno,
busyp->bno, busyp->length) == busyp, file: fs/xfs/xfs_alloc.c, line:
2653
[ 392.586038] ------------[ cut here ]------------
[ 392.589862] kernel BUG at fs/xfs/support/debug.c:109!
[ 392.589862] invalid opcode: 0000 [#1] SMP
[ 392.589862] last sysfs file:
/sys/devices/virtio-pci/virtio1/block/vdb/removable
[ 392.589862] Modules linked in:
[ 392.589862]
[ 392.589862] Pid: 328, comm: xfslogd/0 Not tainted 2.6.33-xfs #491
/Bochs
[ 392.589862] EIP: 0060:[<c04ef33e>] EFLAGS: 00010282 CPU: 0
[ 392.589862] EIP is at assfail+0x1e/0x30
[ 392.589862] EAX: 00000095 EBX: f6aad3f0 ECX: f7119280 EDX: 01704000
[ 392.589862] ESI: f5ac0a88 EDI: 0000b4b5 EBP: f724fe2c ESP: f724fe1c
[ 392.589862] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[ 392.589862] Process xfslogd/0 (pid: 328, ti=f724e000 task=f7119280 task.ti=f724e000)
[ 392.589862] Stack:
[ 392.589862] c0b75db8 c0b6a3b4 c0b2db8f 00000a5d f724fe54 c047c086 00000003 f56f5600
[ 392.589862] <0> 00000fb4 00000003 00000000 f575a59c f575a5a8 f575a5a8 f724fe6c c04d67d8
[ 392.589862] <0> f5ac0a88 00000000 f575a438 00000000 f724fe84 c04d6eb3 00000000 f575a43c
[ 392.589862] Call Trace:
[ 392.589862] [<c047c086>] ? xfs_alloc_clear_busy+0x56/0xe0
[ 392.589862] [<c04d67d8>] ? xfs_trans_free_busy+0x38/0x50
[ 392.589862] [<c04d6eb3>] ? xfs_trans_committed+0x73/0x90
[ 392.589862] [<c04c5a12>] ? xlog_state_do_callback+0x282/0x3c0
[ 392.589862] [<c04c5fe4>] ? xlog_state_done_syncing+0xc4/0xe0
[ 392.589862] [<c04c60f1>] ? xlog_iodone+0xf1/0x160
[ 392.589862] [<c04e53b0>] ? xfs_buf_iodone_work+0x0/0xd0
[ 392.589862] [<c04e53df>] ? xfs_buf_iodone_work+0x2f/0xd0
[ 392.589862] [<c04e53b0>] ? xfs_buf_iodone_work+0x0/0xd0
[ 392.589862] [<c017a7c4>] ? worker_thread+0x154/0x270
[ 392.589862] [<c017a763>] ? worker_thread+0xf3/0x270
[ 392.589862] [<c017e1f0>] ? autoremove_wake_function+0x0/0x40
[ 392.589862] [<c017a670>] ? worker_thread+0x0/0x270
[ 392.589862] [<c017de3c>] ? kthread+0x6c/0x80
[ 392.589862] [<c017ddd0>] ? kthread+0x0/0x80
[ 392.589862] [<c012effa>] ? kernel_thread_helper+0x6/0x1c
[ 392.589862] Code: 00 e8 f7 57 18 00 c9 c3 90 8d 74 26 00 55 89 e5 83 ec 10 89 4c 24 0c 89 54 24 08 89 44 24 04 c7 04 24 b8 5d b7 c0 e8 59 a2 3f 00 <0f> 0b eb fe 8d b4 26 00 00 00 00 8d bc 27 00 00 00 00 55 89 e5
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC] Delayed logging
2010-03-15 4:30 [RFC] Delayed logging Dave Chinner
` (2 preceding siblings ...)
2010-03-15 16:38 ` Christoph Hellwig
@ 2010-03-15 18:08 ` Christoph Hellwig
2010-03-16 1:20 ` Dave Chinner
2010-03-31 13:26 ` Dave Chinner
3 siblings, 2 replies; 11+ messages in thread
From: Christoph Hellwig @ 2010-03-15 18:08 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On Mon, Mar 15, 2010 at 03:30:00PM +1100, Dave Chinner wrote:
> - occasional recovery failure with recovery reading an inode
> buffer that does not contain inodes. Cause unknown, tends
> to be reproduced by xfsqa test 121 semi-reliably. Needs
> further analysis and fixing. May already be fixed with a
> recent fix to commit record synchronisation.
I just saw a corrupted inode buffer during log recovery in 087, even
without the delaylog mount option. It's however not easily
reproducible.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC] Delayed logging
2010-03-15 16:33 ` Christoph Hellwig
@ 2010-03-16 1:06 ` Dave Chinner
0 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2010-03-16 1:06 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs
On Mon, Mar 15, 2010 at 12:33:24PM -0400, Christoph Hellwig wrote:
> On Mon, Mar 15, 2010 at 12:01:54PM -0400, Christoph Hellwig wrote:
> > Running the series on my KVM testbox trips out this assert in
> > xfstests 013, which suggests we might be calling iput one too
> > much somewhere (e.g. IOP_UNLOCK)):
> >
> > [ 132.174780] Assertion failed: atomic_read(&VFS_I(dp)->i_count) > 0, file: fs/xfs/xfs_vnodeops.c, line: 1430
>
> Turns out this was due to the fsync clustering patch, which adds inodes
> to a transaction without first grabbing a reference to them.
> Unofrtunately grabbing the reference is not a easy as it sounds, as
> inodes in the cluster might be on the unused list or even not have
> valid VFS state anymore, so we can't simply call IHOLD here. I've
> reverted the patch in my local tree for now.
Yeah, that change was only half-baked - I was in two minds as to
whether to include it. I'll drop it from the series until I've had
more time to consider it.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC] Delayed logging
2010-03-15 16:08 ` Christoph Hellwig
@ 2010-03-16 1:14 ` Dave Chinner
0 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2010-03-16 1:14 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs
On Mon, Mar 15, 2010 at 12:08:01PM -0400, Christoph Hellwig wrote:
> Some comments on the patches:
>
> - "xfs: cluster fsync transaction" seems like something that can stand
> on it's own and go into the tree now. Despite the comments it
> currently only clusters in fsync and not write_inode which might
> lead to higher benefits, btw.
As I said, I'll have a bit more of a think about this one...
> - the busy extent tracking might be worth to be reordered before the
> delayed logging series. In itself it might also want some reordering
> as there's a lot churn in the patches there. Making
> "XFS: Simplify transaction busy extent tracking" first in that
> subseries might help quite a bit to reduce that churn
Agreed. They were ordered this way because I only added the extent
tracking after doing the delayed logging and tracking down one of
the sources of log forces that was limiting performance was
overflowing the per-ag busy extent array. I'll rework it....
> - The actual CIL implementation seems to be split into too small
> patches IMHO. E.g. "xfs: extend the log item to support delayed
> logging" and "xfs: Introduce the Committed Item List" are two
> sides of the same coin and splitting it might not make too much
> sense.
True, it might be a bit fine grained. Really, what I wanted to do is
split the changes that affected the non-delayed logging from those
that are only executed when delayed logging is active. That way I
could confirm that the non-delayed path was still operating
correctly before adding all the delayed logging code that used
it. I can combine them together again if you want.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC] Delayed logging
2010-03-15 16:38 ` Christoph Hellwig
@ 2010-03-16 1:15 ` Dave Chinner
0 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2010-03-16 1:15 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs
On Mon, Mar 15, 2010 at 12:38:18PM -0400, Christoph Hellwig wrote:
> Next xfstests failure in 014 (note that this is all without even
> enabling the delaylog mount option):
>
> [ 392.576013] Assertion failed: xfs_alloc_busy_search(mp, busyp->agno,
> busyp->bno, busyp->length) == busyp, file: fs/xfs/xfs_alloc.c, line:
> 2653
Ok, that implies a corrupted rbtree. I didn't actually run xfsqa on
thœse patches without the "-o delaylog" option, so I'll see if I can
reproduce it here.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC] Delayed logging
2010-03-15 18:08 ` Christoph Hellwig
@ 2010-03-16 1:20 ` Dave Chinner
2010-03-31 13:26 ` Dave Chinner
1 sibling, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2010-03-16 1:20 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs
On Mon, Mar 15, 2010 at 02:08:41PM -0400, Christoph Hellwig wrote:
> On Mon, Mar 15, 2010 at 03:30:00PM +1100, Dave Chinner wrote:
> > - occasional recovery failure with recovery reading an inode
> > buffer that does not contain inodes. Cause unknown, tends
> > to be reproduced by xfsqa test 121 semi-reliably. Needs
> > further analysis and fixing. May already be fixed with a
> > recent fix to commit record synchronisation.
>
> I just saw a corrupted inode buffer during log recovery in 087, even
> without the delaylog mount option. It's however not easily
> reproducible.
Yeah, that's the sort of thing I'm seeing 087, 110 (IIRC) and 121
are the tests that I've been seeing occasional, random failures like
this. I hadn't noticed it without the delaylog option, so I'll see
if I can get that to occur as well...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC] Delayed logging
2010-03-15 18:08 ` Christoph Hellwig
2010-03-16 1:20 ` Dave Chinner
@ 2010-03-31 13:26 ` Dave Chinner
1 sibling, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2010-03-31 13:26 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs
On Mon, Mar 15, 2010 at 02:08:41PM -0400, Christoph Hellwig wrote:
> On Mon, Mar 15, 2010 at 03:30:00PM +1100, Dave Chinner wrote:
> > - occasional recovery failure with recovery reading an inode
> > buffer that does not contain inodes. Cause unknown, tends
> > to be reproduced by xfsqa test 121 semi-reliably. Needs
> > further analysis and fixing. May already be fixed with a
> > recent fix to commit record synchronisation.
>
> I just saw a corrupted inode buffer during log recovery in 087, even
> without the delaylog mount option. It's however not easily
> reproducible.
Ok, I got reliable reproduction of this problem with delayed logging
on a single CPU VM, and I've been able to understand the problem
that delayed logging it triggering..
It appears to be related to the way recover deals with logging inode
buffers. We log inode changes in buffers in two different places -
inode chunk allocation (i.e. allocation of chunks of 64 inodes) and
again when recording changes to the inode unlinked lists.
When we allocate inodes, the buffer is marked in memory as an inode
alloc buffer, but this is not recorded in the buf log format item
that is recorded in the log.
When we update the unlinked list pointer for an inode, we mark the
buffer as an inode buffer, and this is recorded in the buf log
format item and written to the log. The list pointer is only updated
if there is more than one inode in the unlinked list bucket, hence
we need a certain number of allocations and unlinks to occur in
the same delayed logging transaction window to trigger this.
Unsurprisingly, the reproducer (test 121) is specifically setting up
the unlinked lists to have more than one inode in them and then
triggering shutdown and recovery. i.e. it is pretty much
specifically exercising this problem case.
In the inode alloc case, recovery of all the newly stamped inodes in
the buffer takes place via the regular buffer recovery code.
However, buffers that are marked as "inode buffers" and contain
unlinked list pointer updates go through a different buffer recovery
path that only updates the unlinked list pointers.
Hence if we get the situation where a buffer in a transaction
contains both inode allocation data and unlinked list manipulation,
the inode allocation portion of the buffer does not get replayed.
Because the buffer has not been cancelled, inode changes are then
replayed, which requires the buffer to be read in. At this point,
the buffer contains nothing but:
xfs_db> daddr 0x60
xfs_db> p
000: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
020: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
040: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
060: ffffffff 00000000 00000000 00000000 00000000 00000000 00000000 00000000
080: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
0a0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
0c0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
0e0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
A NULL unlinked list pointer, so the magic number check fails and
recovery aborts.
Now, this should only be a problem for delayed logging because there
is no way the current code can put an unlink operation in the same
transaction as an allocation operation. However, it sounds like it
is happening for current code as well. I've got some wild-ass
theories about how it might occur, but nothing concrete yet. I need
to be able to reproduce it without delayed logging, but I haven't
been able to do that yet....
I think the problem is easy to fix by pushing the inode alloc buffer
status into the log and only doing the inode buffer specific replay
if (INODE_BUF | INODE_ALLOC_BUF) == INODE_BUF. Once written into the
log *once*, the INODE_ALLOC_BUF flag needs to be removed from the
blf so that subsequent inode unlinked list pointer updates do not
overwrite inode state. I'll code this up and test it out
tomorrow...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2010-03-31 13:25 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-03-15 4:30 [RFC] Delayed logging Dave Chinner
2010-03-15 16:01 ` Christoph Hellwig
2010-03-15 16:33 ` Christoph Hellwig
2010-03-16 1:06 ` Dave Chinner
2010-03-15 16:08 ` Christoph Hellwig
2010-03-16 1:14 ` Dave Chinner
2010-03-15 16:38 ` Christoph Hellwig
2010-03-16 1:15 ` Dave Chinner
2010-03-15 18:08 ` Christoph Hellwig
2010-03-16 1:20 ` Dave Chinner
2010-03-31 13:26 ` Dave Chinner
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox