From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.3 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1 autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 63E22C2BA83 for ; Wed, 12 Feb 2020 04:01:57 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 3BB332086A for ; Wed, 12 Feb 2020 04:01:57 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727897AbgBLEB4 (ORCPT ); Tue, 11 Feb 2020 23:01:56 -0500 Received: from james.kirk.hungrycats.org ([174.142.39.145]:40460 "EHLO james.kirk.hungrycats.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727826AbgBLEB4 (ORCPT ); Tue, 11 Feb 2020 23:01:56 -0500 Received: by james.kirk.hungrycats.org (Postfix, from userid 1002) id C63175BCC87; Tue, 11 Feb 2020 23:01:54 -0500 (EST) Date: Tue, 11 Feb 2020 23:01:54 -0500 From: Zygo Blaxell To: linux-btrfs@vger.kernel.org Subject: some more information on multi-hour data rollbacks, fsync, and superblock updates or lack thereof Message-ID: <20200212040154.GL13306@hungrycats.org> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="yruHGtsrg8HLSUpe" Content-Disposition: inline User-Agent: Mutt/1.10.1 (2018-07-13) Sender: linux-btrfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org --yruHGtsrg8HLSUpe Content-Type: text/plain; charset=us-ascii Content-Disposition: inline While trying to understand how btrfs rolls back data several hours on a crash, I deployed some observation tools: one to measure fsync latency, and one to measure the interval between updates to the filesystem superblock's generation (transid) field. I wanted to see if this phenomenon can be detected so I can figure out how to replicate it. I put these tools on some test and production machines. The results so far are somewhat disturbing: 5.0 can do multi-hour transactions too, in fact it's even easier to do on 5.0 than on 5.4. Not that I've figured out how to replicate this behavior on demand yet, just that the data I've collected so far has more and longer long transactions occurring "naturally" on 5.0 than on 5.4. There is no obvious correlation between the running times of fsync, superblock updates, and other metadata operations. Without looking directly at the superblock, there's no obvious sign of a problem: applications are reading and writing and even calling fsync() normally, but the superblock's generation field stubbornly does not update. A machine that I previously thought had been healthy and performing excellently turned out to be my example case of a trivially reproduced, worst-case scenario. Everything seems fine as long as a host _never_ runs fsync. I had one host that was up for days without incident with transactions updating at reasonable intervals, until I started running the fsync test, and had an immediate 45-minute transaction stall. The 5-hour data loss event I reported a few days ago and the 14-hour transaction event before that also started almost immediately following fsync() calls. Maybe not a coincidence? fsync returns ... well, not immediately, but almost always within 10 minutes or less, even at times when the filesystem's superblock generation number is not updated for an hour. The good news is that fsync does work. I caught a machine in the middle of one of these multi-hour transactions, made two files, fsynced one, and forced a crash, with the result: # cat /proc/version Linux version 5.0.21-zb64-079f865c6d4d+ (root@waya) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP Wed Jan 22 12:06:32 EST 2020 # cp /boot/vmlinuz test-no-fsync # cp /boot/vmlinuz test-fsync # sync test-fsync # ls -l total 16812 -rw-r--r-- 1 root root 102458 Feb 11 22:09 fsync.log -rw------- 1 root root 8550048 Feb 11 22:09 test-fsync -rw------- 1 root root 8550048 Feb 11 22:09 test-no-fsync -rw-r--r-- 1 root root 3710 Feb 11 21:22 transid.log # echo b > /proc/sysrq-trigger One reboot later: # ls -l total 8448 -rw-r--r-- 1 root root 92256 Feb 11 20:56 fsync.log -rw------- 1 root root 8550048 Feb 11 22:09 test-fsync -rw-r--r-- 1 root root 3652 Feb 11 19:57 transid.log # cmp test-fsync /boot/vmlinuz # So the file that was fsynced is complete and correct, but the file that was not fsynced--and 73 minutes of other writes throughout the filesystem--neatly disappeared. This was the third of three consecutive hour-plus transactions on this host: end of transaction timestamp seconds with same transid 2020-02-11-19-57-39 1581469059 4315.585493582 2020-02-11-21-22-06 1581474126 5067.863739094 [third transaction aborted by reboot 47 minutes later] The timestamps don't quite line up here--there is data on the filesystem after the last superblock update (at 21:22), but still far from current (73 minutes lost). Maybe btrfs is updating one superblock at a time, so I only see one of every 3 transid updates with this method? But the transid increment is always 1, and I'd expect to see increments 3 at a time if I was missing two thirds of them. If you would like to play at home, the fsync latency testing tool is: #define _GNU_SOURCE 1 #include #include #include #include #include #include #include #include #include #include static char buf[128 * 1024 * 1024]; void fillrandom(void *buf, size_t size, unsigned flags) { while (size) { const ssize_t rv = getrandom(buf, size, flags); if (rv < 0) { perror("getrandom"); exit(1); } if ((size_t)rv > size) { fprintf(stderr, "WTF: rv %zd > size %zu\n", rv, size); exit(1); } buf += rv; size -= rv; } } int main(int argc, char **argv) { double sleep_time = 1; if (argc == 2) { sleep_time = atof(argv[1]); } const int fd = open(".", O_TMPFILE | O_RDWR, 0666); if (fd < 0) { perror("open"); exit(1); } const struct timespec sleep_ts = { .tv_sec = sleep_time, .tv_nsec = 1000000000 * (sleep_time - floor(sleep_time)), }; fprintf(stderr, "Waiting for random..."); fflush(stderr); fillrandom(buf, sizeof(buf), 0); fprintf(stderr, "\n"); fflush(stderr); struct timespec next_time; clock_gettime(CLOCK_MONOTONIC, &next_time); while (true) { struct timespec start; clock_gettime(CLOCK_MONOTONIC, &start); write(2, "w\b", 2); unsigned wr_pos = 0, wr_len = 256 * 1024; fillrandom(&wr_pos, sizeof(wr_pos), 0); fillrandom(&wr_len, sizeof(wr_len), 0); wr_pos %= sizeof(buf) * 8; wr_len %= sizeof(buf); const int rvw = pwrite(fd, buf, wr_len, wr_pos); if (rvw < 0) { perror("pwrite"); exit(1); } write(2, "f\b", 2); const int rvf = fsync(fd); if (rvf) { perror("fsync"); exit(1); } write(2, "s\b", 2); struct timespec end; clock_gettime(CLOCK_MONOTONIC, &end); char ft[1024]; struct timespec now; clock_gettime(CLOCK_REALTIME, &now); const struct tm *const tm = localtime(&now.tv_sec); strftime(ft, sizeof(ft), "%Y-%m-%d-%H-%M-%S", tm); const double start_ts = start.tv_sec + (start.tv_nsec / 1000000000.0); const double end_ts = end.tv_sec + (end.tv_nsec / 1000000000.0); printf("%s %ld.%09ld %.3g\n", ft, now.tv_sec, now.tv_nsec, end_ts - start_ts); fflush(stdout); next_time.tv_sec += sleep_ts.tv_sec; next_time.tv_nsec += sleep_ts.tv_nsec; if (next_time.tv_nsec >= 1000000000) { next_time.tv_nsec -= 1000000000; ++next_time.tv_sec; } clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next_time, NULL); } } Run it in any directory on the target filesystem, it will create a temporary file, write data to it, fsync, repeat. The first argument is the number of seconds between write/fsync, e.g. to fsync every 1.5 seconds: fsync-log 1.5 The superblock generation monitoring tool is: #!/bin/sh devPath="$1" shift getTransid () { btrfs ins dump-super "$devPath" | sed -n 's/^generation *//p' } now () { date +%s.%N } lastTransTime="$(now)" lastTransid="$(getTransid)" while sleep 1; do nextTransid="$(getTransid)" nextTransTime="$(now)" if [ "$nextTransid" != "$lastTransid" ]; then echo "$(date +%Y-%m-%d-%H-%M-%S) $nextTransTime $(echo $nextTransTime - $lastTransTime | bc -l) $((nextTransid-lastTransid))" lastTransid="$nextTransid" lastTransTime="$nextTransTime" fi done This one needs the name of a device in the btrfs filesystem as an argument, and will call btrfs-progs to extract the generation field. e.g. transid-log /dev/mapper/testfs Both write a log like this (without the headers): timestamp seconds-since-epoch latency in seconds number of txns 2020-02-11-18-38-17 1581464297.795520161 53.726077219 1 2020-02-11-18-39-11 1581464351.590758110 53.795237949 1 2020-02-11-18-40-15 1581464415.425477144 63.834719034 1 2020-02-11-18-45-43 1581464743.422229067 327.996751923 1 2020-02-11-19-57-39 1581469059.007722649 4315.585493582 1 2020-02-11-21-22-06 1581474126.871461743 5067.863739094 1 which can be fed to your favorite plotting program to make a horrifying graph on which the area under the curve is the amount of data you'll lose in a crash or power failure. --yruHGtsrg8HLSUpe Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iF0EABECAB0WIQSnOVjcfGcC/+em7H2B+YsaVrMbnAUCXkN4rwAKCRCB+YsaVrMb nMQpAJ9uqbQy3Olnc8STWTXv0Ph6d9mhFQCdGHXvdRTHY5fNbspE6rGTGUUuFmk= =BsAr -----END PGP SIGNATURE----- --yruHGtsrg8HLSUpe--