Subject: Re: BTRFS for OLTP Databases
To: Martin Raiber, Peter Zaitsev, linux-btrfs@vger.kernel.org
References: <20170207140058.GA4249@carfax.org.uk> <0102015a1da5be24-3fd02799-c4e0-461b-92d2-82131016432e-000000@eu-west-1.amazonses.com> <0102015a1de76a82-da5513d7-1cd8-4eff-9e0a-e34aac752e1f-000000@eu-west-1.amazonses.com>
From: "Austin S. Hemmelgarn"
Date: Wed, 8 Feb 2017 08:32:11 -0500
In-Reply-To: <0102015a1de76a82-da5513d7-1cd8-4eff-9e0a-e34aac752e1f-000000@eu-west-1.amazonses.com>

On 2017-02-08 08:26, Martin Raiber wrote:
> On 08.02.2017 14:08 Austin S. Hemmelgarn wrote:
>> On 2017-02-08 07:14, Martin Raiber wrote:
>>> Hi,
>>>
>>> On 08.02.2017 03:11 Peter Zaitsev wrote:
>>>> Out of curiosity, I see one problem here:
>>>> If you're doing snapshots of the live database, each snapshot leaves
>>>> the database files like killing the database in-flight. Like shutting
>>>> the system down in the middle of writing data.
>>>>
>>>> This is because I think there's no API for user space to subscribe to
>>>> events like a snapshot - unlike e.g. the VSS API (volume snapshot
>>>> service) in Windows. You should put the database into frozen state to
>>>> prepare it for a hotcopy before creating the snapshot, then ensure all
>>>> data is flushed before continuing.
>>>>
>>>> I think I've read that btrfs snapshots do not guarantee single point in
>>>> time snapshots - the snapshot may be smeared across a longer period of
>>>> time while the kernel is still writing data.
>>>> So parts of your writes
>>>> may still end up in the snapshot after issuing the snapshot command,
>>>> instead of in the working copy as expected.
>>>>
>>>> How is this going to be addressed? Is there some snapshot aware API to
>>>> let user space subscribe to such events and do proper preparation? Is
>>>> this planned? LVM could be a user of such an API, too. I think this
>>>> could have nice enterprise-grade value for Linux.
>>>>
>>>> XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots. But
>>>> still, also this needs to be integrated with MySQL to properly work. I
>>>> once (years ago) researched on this but gave up on my plans when I
>>>> planned database backups for our web server infrastructure. We moved to
>>>> creating SQL dumps instead, although there're binlogs which can be used
>>>> to recover to a clean and stable transactional state after taking
>>>> snapshots. But I simply didn't want to fiddle around with properly
>>>> cleaning up binlogs which accumulate horribly much space usage over
>>>> time. The cleanup process requires to create a cold copy or dump of the
>>>> complete database from time to time, only then it's safe to remove all
>>>> binlogs up to that point in time.
>>>
>>> little bit off topic, but I for one would be on board with such an
>>> effort. It "just" needs coordination between the backup
>>> software/snapshot tools, the backed up software and the various snapshot
>>> providers. If you look at the Windows VSS API, this would be a
>>> relatively large undertaking if all the corner cases are taken into
>>> account, like e.g. a database having the database log on a separate
>>> volume from the data, dependencies between different components etc.
>>>
>>> You'll know more about this, but databases usually fsync quite often in
>>> their default configuration, so btrfs snapshots shouldn't be much behind
>>> the properly snapshotted state, so I see the advantages more with
>>> usability and taking care of corner cases automatically.
>> Just my perspective, but BTRFS (and XFS, and OCFS2) already provide
>> reflinking to userspace, and therefore it's fully possible to
>> implement this in userspace. Having a version of the fsfreeze (the
>> generic form of xfs_freeze) stuff that worked on individual sub-trees
>> would be nice from a practical perspective, but implementing it would
>> not be easy by any means, and would be essentially necessary for a
>> VSS-like API. In the meantime though, it is fully possible for the
>> application software to implement this itself without needing anything
>> more from the kernel.
>
> VSS snapshots whole volumes, not individual files (so comparable to an
> LVM snapshot). The sub-folder freeze would be something useful in some
> situations, but duplicating the files+extents might also take too long
> in a lot of situations. You are correct that the kernel features are
> there and what is missing is a user-space daemon, plus a protocol that
> facilitates/coordinates the backups/snapshots.
>
> Sending a FIFREEZE ioctl, taking a snapshot and then thawing it does not
> really help in some situations as e.g. MySQL InnoDB uses O_DIRECT and
> manages its own buffer pool which won't get the FIFREEZE and flush, but
> as said, the default configuration is to flush/fsync on every commit.

OK, there's part of the misunderstanding. You can't FIFREEZE a BTRFS
filesystem and then take a snapshot in it, because the snapshot requires
writing to the filesystem (which the FIFREEZE would prevent, so a script
that tried to do this would deadlock). A new version of the FIFREEZE
ioctl would be needed that operates on subvolumes.
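
To make the userspace reflink approach discussed in this thread a bit more
concrete, here is a rough Python sketch, not a tested backup tool. It
assumes Linux and uses the FICLONE ioctl (the value below is the one
defined in <linux/fs.h>). The names clone_file, snapshot_files, quiesce
and resume are hypothetical and purely for illustration: quiesce() stands
for whatever the application does to flush its own buffers and stop
writing, and resume() lets it continue. No FIFREEZE is involved, so the
deadlock described above cannot occur.

```python
import fcntl
import os
import shutil

# FICLONE from <linux/fs.h>: _IOW(0x94, 9, int)
FICLONE = 0x40049409


def clone_file(src_path, dst_path):
    """Copy src_path to dst_path, sharing extents where the filesystem allows.

    Returns True if the copy is a reflink, False if it fell back to a
    plain byte-for-byte copy.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        try:
            # Ask the filesystem to make dst share all of src's extents.
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return True
        except OSError:
            # No reflink support here (or src and dst are on different
            # filesystems): fall back to an ordinary copy.
            shutil.copyfileobj(src, dst)
            return False


def snapshot_files(paths, dest_dir, quiesce, resume):
    """Clone each file in paths into dest_dir while the application is paused.

    quiesce and resume are hypothetical application hooks (assumptions,
    not an existing API): quiesce() must flush and stop writes, resume()
    re-enables them. Returns a dict mapping each source path to the
    result of clone_file().
    """
    os.makedirs(dest_dir, exist_ok=True)
    quiesce()
    try:
        results = {}
        for path in paths:
            target = os.path.join(dest_dir, os.path.basename(path))
            results[path] = clone_file(path, target)
        return results
    finally:
        # Always let the application continue, even if a clone failed.
        resume()
```

On a filesystem with reflink support the clones share extents with the
originals, so the application only needs to stay paused for the duration
of the ioctl calls. On anything else the fallback is a full copy, which
is exactly the "might take too long" case raised earlier in the thread.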