From: Eric Sandeen <sandeen@sandeen.net>
To: Matthias Schniedermeyer <ms@citd.de>
Cc: Lin Li <sdeber@gmail.com>, xfs@oss.sgi.com
Subject: Re: XFS write cache flush policy
Date: Mon, 10 Dec 2012 14:54:47 -0600
Message-ID: <50C64C17.9080206@sandeen.net>
In-Reply-To: <20121210091239.GA21114@citd.de>
On 12/10/12 3:12 AM, Matthias Schniedermeyer wrote:
> On 10.12.2012 11:58, Dave Chinner wrote:
>> On Sat, Dec 08, 2012 at 08:29:27PM +0100, Matthias Schniedermeyer wrote:
>>> On 06.12.2012 09:51, Lin Li wrote:
>>>> Hi, guys. I recently suffered a huge data loss on an XFS partition
>>>> after a power cut. I had copied a lot of files (roughly 20GB) to the
>>>> XFS partition, and 10 hours later there was an unexpected power cut. As
>>>> a result, all of the newly copied files disappeared as if they had never
>>>> been copied. I checked the partition and tried to repair it, but
>>>> xfs_check reports no errors at all. So I guess the problem is that the
>>>> metadata for these files was kept in the cache (64MB) and never
>>>> committed to the hard disk.
>>>>
>>>> What is the cache flush policy for XFS? Does it always reserve some
>>>> fixed space in the cache for metadata? I ask because, having copied such
>>>> a huge amount of data, I thought at least some of these files must have
>>>> been fully committed to the hard disk, since the cache is only 64MB
>>>> anyway. But in reality all of them were lost. The only explanation I can
>>>> think of is that some part of the cache is reserved for metadata, so
>>>> even when the cache is full, that part is not written to the disk. Am I
>>>> right?
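For reference, on Linux data written by an ordinary buffered copy sits in the page cache until writeback pushes it out. An application that needs the copy to survive a power cut calls fsync(2) on it.

The following is a minimal Python sketch, using a hypothetical durable_copy helper. It is not XFS code and not a description of XFS's own flush policy. It copies one file, then fsyncs both the file and its parent directory, so that the new name is made durable along with the data. It assumes the drive, and any layers below the filesystem, honour cache-flush requests.

```python
import os
import shutil


def durable_copy(src: str, dst: str) -> None:
    """Copy src to dst and force the file data and the new directory
    entry to stable storage before returning (hypothetical helper).

    A plain copy only guarantees the data has reached the page cache;
    until writeback runs (or fsync/sync is called), a power cut can
    lose it.
    """
    # Copy contents and permission bits (this lands in the page cache).
    shutil.copyfile(src, dst)
    shutil.copymode(src, dst)

    # Flush the file's data and inode metadata.
    fd = os.open(dst, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

    # Flush the parent directory so the new name itself is durable
    # (fsync on a directory fd works on Linux).
    parent = os.path.dirname(os.path.abspath(dst))
    dfd = os.open(parent, os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```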
>>>
>>> I have had the same problem several times.
>>>
>>> The latest was just an hour ago.
>>> I'm copying one HDD onto another with a plain "rsync -a /src/ /tgt/".
>>> Both HDDs are 3TB SATA drives in a USB3 enclosure with a dm-crypt layer
>>> in between.
>>> About 45 minutes into the copy, the target HDD disconnected for a moment.
>>> 45 minutes means something over 200GB had been copied; each file is
>>> about 900MB.
>>> After remounting the filesystems there were exactly 0 files.
>>
>> This sounds like an entirely different problem to what the OP
>> reported.
>
> To me it sounds like only the timing is different.
> Otherwise I don't see much difference between files vanishing after a few
> hours (of inactivity) and after a few minutes (while still being active).
>
>> Did the filesystem have an error returned?
>
> No.
>
>> i.e. did it shut down (what's in dmesg)?
>
> There's not much XFS could have done after the block-device vanished.
except to shut down...
> A disappearing and reappearing block device gets a new name because the
> old name is still "in use"; the old block device only gets cleaned up
> after unmounting the filesystem and closing the dm-crypt device.
>
> When the USB3 HDD disconnected, it reappeared a moment later under a new
> name; it bounced between sdc and sdf.
>
> In syslog it's a plain "USB disconnect, device number XX" message,
> followed by the standard bombardment of new-device-found messages. In
> between there are some error messages, but as it's practically a
> yanked-out and replugged cable, a little complaining by the kernel is to
> be expected.
Sure, but Dave asked if the filesystem shut down. XFS messages would
tell you that; *were* there messages from XFS in the log from the event?
Sometimes "a little complaining" can be quite informative. :)
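When checking a log for what happened to the filesystem, the relevant lines carry the "XFS (<device>): " prefix. The mount messages quoted in this thread use it too (e.g. "XFS (dm-3): Mounting Filesystem").

The following is a minimal Python sketch, with a hypothetical xfs_messages helper, that pulls only those lines out of a saved dmesg or syslog dump, optionally restricted to one device. It matches only on that prefix and assumes nothing about the wording of any particular XFS error or shutdown message.

```python
import re
from typing import Iterable, List, Optional

# Matches the "XFS (<device>): <message>" form seen in kernel log lines
# such as "XFS (dm-3): Mounting Filesystem"; any timestamp or syslog
# prefix before it is ignored because re.search scans the whole line.
XFS_LINE = re.compile(r"XFS \((?P<dev>[^)]+)\): (?P<msg>.*)$")


def xfs_messages(lines: Iterable[str], device: Optional[str] = None) -> List[str]:
    """Return the message text of XFS kernel log lines, optionally only
    for one device name such as "dm-3" (hypothetical helper)."""
    found = []
    for line in lines:
        m = XFS_LINE.search(line.rstrip("\n"))
        if m is None:
            continue  # not an XFS message (USB, SCSI, dm-crypt, ...)
        if device is not None and m.group("dev") != device:
            continue
        found.append(m.group("msg"))
    return found
```

For example, passing the lines of a saved log with device="dm-3" returns just the text after the prefix for that device.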
>> Did you run repair in between the shutdown and remount?
>
> No.
>
> XFS (dm-3): Mounting Filesystem
> XFS (dm-3): Starting recovery (logdev: internal)
> XFS (dm-3): Ending recovery (logdev: internal)
>
>> How many files in that 200GB of data?
>
> At 0.9GB/file at least 220.
>
>> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
>>
>> Basically, you have an IO error situation, and you have dm-crypt
>> in between buffering an unknown amount of changes. In my experience,
>> data loss events are rarely filesystem problems when USB drives or
>> dm-crypt are involved...
>
> I don't know the inner workings of dm-*, but shouldn't it behave
> transparently and rely on the block layer for buffering?
I think that's partly why Dave asked you to test it, to check
that theory ;)
>>> After that I started a "while true; do sync; done" loop in the
>>> background.
>>> And just while I was writing this email the HDD disconnected a second
>>> time. But this time the files up until the last 'sync' were retained.
>>
>> Exactly as I'd expect.
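The "while true; do sync; done" loop helps because sync(2) flushes all dirty data system-wide. How much is lost on a disconnect then depends on how recently the last sync completed.

A more targeted alternative is for the copying program to fsync each file, and its directory, as soon as that file has been written. Then at most the file in progress should be lost.

The following is a minimal Python sketch of that idea, using a hypothetical copy_tree_synced helper rather than rsync. It assumes the drive, the USB bridge and the dm-crypt layer all pass cache-flush requests through to stable storage.

```python
import os
import shutil


def _fsync_path(path: str) -> None:
    """fsync a file or directory by path (directory fsync works on Linux)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def copy_tree_synced(src_root: str, dst_root: str) -> int:
    """Copy regular files from src_root into dst_root, flushing each file
    and its directory entry right after it is copied (hypothetical helper).

    Returns the number of files copied. Symlinks and special files are
    skipped to keep the sketch short.
    """
    copied = 0
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel))
        if not os.path.isdir(target_dir):
            os.makedirs(target_dir)
            # Make the new directory's own name durable in its parent.
            _fsync_path(os.path.dirname(os.path.abspath(target_dir)))
        for name in sorted(filenames):
            src = os.path.join(dirpath, name)
            if os.path.islink(src) or not os.path.isfile(src):
                continue
            dst = os.path.join(target_dir, name)
            shutil.copy2(src, dst)
            _fsync_path(dst)         # file data and inode
            _fsync_path(target_dir)  # directory entry for the new file
            copied += 1
    return copied
```

The trade-off is extra flush requests for every file, which costs more with many small files than with large ones.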
>>
>>> And something like this has happened to me at least half a dozen times
>>> in the last few months. I think the first time was with kernel 3.5.X,
>>> when I was booting into 3.6 with a plain "reboot" (the filesystem might
>>> not have been unmounted cleanly). After the reboot, the changes from
>>> about the last half hour were gone. E.g. I had renamed a directory about
>>> 15 minutes before I rebooted, and after the reboot the directory had its
>>> old name back.
>>>
>>> The kernel in all but (maybe) one case was between 3.6 and 3.6.2
>>> (currently); the first time MIGHT have been something around 3.5.8, but
>>> I'm not sure. The HDDs were connected either by plain SATA (AHCI) or by
>>> USB3 enclosure. All affected filesystems were/are on top of a dm-crypt
>>> layer.
>>
>> Given that dm-crypt is the common factor here, I'd start by ruling
>> that out. i.e. reproduce the problem without dm-crypt being used.
>
> That's a slight problem for me; practically everything I have is
> encrypted.
But this is an external drive; you could run a similar test with unencrypted
data on a different hard drive, to try to get to the bottom of this
problem, right?
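For the suggested test on an unencrypted drive, one way to see exactly which files survived a disconnect and remount is to compare the target tree against the source.

The following is a minimal Python sketch with a hypothetical compare_trees helper. It hashes file contents with SHA-256, which is slow over hundreds of GB but does not rely on file sizes or timestamps.

```python
import hashlib
import os
from typing import List, Tuple


def _sha256(path: str) -> str:
    """Hash a file in 1 MiB chunks so large files are not read at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_trees(src_root: str, dst_root: str) -> Tuple[List[str], List[str]]:
    """Return (missing, differing): relative paths of files under src_root
    that are absent from dst_root, or whose contents differ
    (hypothetical helper)."""
    missing, differing = [], []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, src_root)
            dst = os.path.join(dst_root, rel)
            if not os.path.isfile(dst):
                missing.append(rel)
            elif _sha256(src) != _sha256(dst):
                differing.append(rel)
    return sorted(missing), sorted(differing)
```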
Thanks,
-Eric
> Now that I think about it, maybe dm-crypt really is to blame. Up until a
> few months ago I was using loop-AES. After dm-crypt gained the ability to
> emulate it, I moved over to dm-crypt because the loop-AES support in
> Debian got worse over time. I didn't have any problems until after I
> moved to dm-crypt, but OTOH I'm not the only one using dm-crypt. But
> OTOOH maybe not so many people use the loop-AES compatibility mode.
>
>
>
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
Thread overview: 14+ messages
2012-12-06 8:51 XFS write cache flush policy Lin Li
2012-12-08 19:29 ` Matthias Schniedermeyer
2012-12-08 19:40 ` Michael Monnerie
2012-12-08 19:51 ` Joe Landman
2012-12-08 19:53 ` Matthias Schniedermeyer
2012-12-09 7:19 ` Lin Li
2012-12-10 1:01 ` Dave Chinner
2012-12-10 20:14 ` Michael Monnerie
2012-12-10 0:58 ` Dave Chinner
2012-12-10 9:12 ` Matthias Schniedermeyer
2012-12-10 20:54 ` Eric Sandeen [this message]
2012-12-10 21:45 ` Matthias Schniedermeyer
2012-12-11 0:25 ` Dave Chinner
2012-12-10 0:45 ` Dave Chinner