To: linux-btrfs@vger.kernel.org
From: Lionel Bouton
Subject: btrfs fi defrag interfering (maybe) with Ceph OSD operation
Message-ID: <56080C9A.6030102@bouton.name>
Date: Sun, 27 Sep 2015 17:34:50 +0200

Hi,

we use BTRFS for Ceph filestores (after much tuning and testing over more than a year). One of the problems we've had to face was the slow decrease in performance caused by fragmentation. Here's a small recap of the history for context.

Initially we used internal journals on the few OSDs where we tested BTRFS, which meant constantly overwriting 10GB files (obviously bad for CoW). Before switching to NoCoW and eventually moving the journals to raw SSD partitions, we realized autodefrag was not effective: the initial performance on a fresh, recently populated OSD was great and slowly degraded over time, without the access patterns or filesystem sizes changing significantly.

My idea was that autodefrag might focus its efforts on files that are not useful to defragment in the long term. The obvious one was the journal (constant writes, but only read again when restarting an OSD), but I couldn't find any description of the algorithms/heuristics used by autodefrag, so I decided to disable it and develop our own defragmentation scheduler. It is based on both a slow walk through the filesystem (which acts as a safety net over a one-week period) and a fatrace pipe (used to detect recent fragmentation). Fragmentation is computed from filefrag's detailed output, and the scheduler learns how much it can actually defragment files by calling filefrag again after defragmentation (we learned that compressed and uncompressed files don't behave the same way in the process, so we ended up treating them separately). Simply excluding the journal from defragmentation and using some basic heuristics (don't defragment recently written files but keep them in a pool and queue them later, and don't defragment files below a given fragmentation "cost" where defragmentation becomes ineffective) gave us usable performance in the long run. A simplified sketch of this cost check is included further down.

Then we successively moved the journal first to NoCoW files and then to SSDs, and disabled Ceph's use of BTRFS snapshots, which were too costly (removing snapshots generated 120MB of writes to the disks, and this was done every 30s in our configuration). In the end we had a very successful experience: we migrated everything to BTRFS filestores that were noticeably faster than XFS (according to Ceph metrics), detected silent corruption, and compressed data.

Everything worked well until this morning. I woke up to a text message signalling VM freezes all over our platform. Two Ceph OSDs died nearly at the same time (20 seconds apart) on two of our servers, which for durability reasons freezes writes to the data chunks shared by these two OSDs. The errors in the OSD logs seem to point to an IO error (at least IIRC we got a similar crash on an OSD where the kernel logged invalid csum errors), but we couldn't find any kernel error, and btrfs scrubs finished on the filesystems without finding any corruption.
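To give an idea of the heuristic described above, here is a heavily simplified sketch of the cost check (written for this message, not our production code; the cost formula, the fixed thresholds and the helper names are purely illustrative):

    import os
    import re
    import subprocess
    import time

    # Illustrative values only; the real scheduler learns its targets per file
    # (compressed and uncompressed files behave differently, e.g. btrfs caps
    # compressed extents at 128KiB, so raw extent counts mean different things).
    COST_THRESHOLD = 1.5       # below this, defragmentation is considered not worth the IO
    RECENT_WRITE_DELAY = 300   # seconds a file must be write-idle before we touch it

    def extent_count(path):
        """Count the extents reported by `filefrag -v` (detailed output)."""
        out = subprocess.run(["filefrag", "-v", path],
                             capture_output=True, text=True, check=True).stdout
        # Extent rows look like "   0:      0..      31:  123456..  123487:     32:"
        return len(re.findall(r"^\s*\d+:", out, flags=re.MULTILINE))

    def fragmentation_cost(path):
        """Rough cost: observed extents versus a naive 'one extent per 8MiB' ideal."""
        ideal = max(1, os.path.getsize(path) // (8 * 1024 * 1024))
        return extent_count(path) / ideal

    def maybe_defragment(path, last_write_time):
        """Defragment only files that are no longer hot and fragmented enough to matter."""
        if time.time() - last_write_time < RECENT_WRITE_DELAY:
            return False   # recently written: keep it in the pending pool for later
        if fragmentation_cost(path) < COST_THRESHOLD:
            return False   # defragmentation would be ineffective
        subprocess.run(["btrfs", "fi", "defrag", path], check=False)
        return True

In the real scheduler the threshold is not fixed: it is adjusted from what filefrag reports after each defragmentation, separately for compressed and uncompressed files.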
I have yet to get an answer on the possible contexts and the exact IO errors. If people familiar with Ceph read this, here is the error on Ceph 0.80.9 (more logs available on demand):

2015-09-27 06:30:57.373841 7f05d92cf700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, bool)' thread 7f05d92cf700 time 2015-09-27 06:30:57.260978
os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)

Given that the defragmentation scheduler treats file accesses the same way on all replicas when deciding to trigger a call to "btrfs fi defrag <file>", I suspect this manual defragmentation call could have happened on the 2 affected OSDs for the same file at nearly the same time and caused the near-simultaneous crashes.

It's not clear to me that "btrfs fi defrag <file>" can't interfere with another process trying to use the file. I assume basic reading and writing is OK, but there might be restrictions on unlinking/locking/using other ioctls... Are there any I should be aware of and should look for in Ceph OSDs?

This is on a 3.8.19 kernel (with Gentoo patches which don't touch the BTRFS sources) and btrfs-progs 4.0.1. We have 5 servers on our storage network: 2 are running a 4.0.5 kernel and 3 are running 3.8.19. The 3.8.19 servers are waiting for an opportunity to reboot on 4.0.5 (or better, if we have the time to test a more recent kernel before rebooting: 4.1.8 and 4.2.1 are our candidates for testing right now).

Best regards,

Lionel Bouton
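P.S.: to make the simultaneity hypothesis above more concrete, here is roughly the shape of the fatrace-driven trigger (again a simplified sketch written for this message, not the real scheduler; the naive fatrace parsing and the fixed quiet period are assumptions, and the cost threshold from the earlier sketch would normally be applied as well):

    import subprocess
    import time

    QUIET_PERIOD = 300   # illustrative: seconds without writes before a file is defragmented

    def watch_writes(mountpoint):
        """Yield (timestamp, path) for each write event fatrace reports on the mountpoint.

        fatrace (run as root) prints one event per line, roughly
        "process(pid): TYPES /path"; the parsing here is deliberately naive.
        """
        proc = subprocess.Popen(["fatrace", "-c"], cwd=mountpoint,
                                stdout=subprocess.PIPE, text=True)
        for line in proc.stdout:
            try:
                _, rest = line.rstrip("\n").split(": ", 1)
                events, path = rest.split(" ", 1)
            except ValueError:
                continue
            if "W" in events:
                yield time.time(), path

    def run(mountpoint, journal_path):
        """Queue written files and defragment them once they have gone write-idle.

        Every replica of a Ceph object receives the same client writes, so this
        purely access-driven trigger fires on all OSDs holding a given file at
        nearly the same moment, hence my suspicion about the twin crashes.
        """
        pending = {}
        for now, path in watch_writes(mountpoint):
            if path != journal_path:            # the journal is explicitly excluded
                pending[path] = now
            for candidate, last_write in list(pending.items()):
                if now - last_write >= QUIET_PERIOD:
                    del pending[candidate]
                    subprocess.run(["btrfs", "fi", "defrag", candidate], check=False)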