To: linux-btrfs@vger.kernel.org
From: Duncan <1i5t5.duncan@cox.net>
Subject: Re: invalid opcode: 0000 [#1] SMP
Date: Wed, 13 Nov 2013 15:03:37 +0000 (UTC)

Franziska Näpelt posted on Tue, 12 Nov 2013 08:49:12 +0100 as excerpted:

> We are using a btrfs RAID 1 with four 2TB hard drives (WD Caviar
> Green) on Debian 7.2 with kernel 3.11.6.
>
> Now we had an 'invalid opcode: 0000 [#1] SMP' in the messages log when
> a sector failed.
> After that, access over smb and nfs wasn't possible.
> A restart solved the inaccessibility problem.

A couple of notes from a fellow btrfs-using sysadmin...

1) invalid opcode 0000:

As I understand it, this is relatively generic and doesn't identify the
error by itself.  The 0000 opcode can be viewed as a zero-dereference of
sorts: it's an indication of a bug that happened earlier, such that an
expected valid opcode ends up being zero.  The error itself occurred
earlier -- this is just where it ends up being trapped.  As to what that
error is in this case...

2) btrfs raid1:

Unlike, for example, md/raid1, btrfs raid1 is not at this point run-time
tolerant of device failure.  At present, a btrfs raid1 device failure
seems to make the entire system basically unusable and require a reboot,
after which device/data recovery can be initiated if necessary: for
example, mount degraded, add a replacement device, delete the failed one,
and rebalance; or, if it was only a temporary dropout, simply run btrfs
scrub to find and fix the checksum mismatches from the valid copy.

When the sector failed, it apparently triggered the kernel btrfs to drop
the entire device from active use, which, as I said, isn't well supported
at runtime at present, causing various btrfs worker threads to go
unresponsive and requiring a reboot to get back a normally functioning
system.  As the device failure was actually just that single sector
failure, on reboot the device was available once again, and functionality
was restored.

However, if you haven't already done so, I'd strongly recommend running a
btrfs scrub on the affected filesystem.  That allows btrfs to find the
bad data copy via the checksum mismatch and to recover from the good copy
it should have, thanks to the raid1 redundancy, rewriting a new, valid
second copy and thereby restoring the data redundancy that protects
against the now single valid copy getting corrupted as well.
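For reference, a minimal sketch of both recovery paths.  I'm assuming
the filesystem is mounted at /mnt, and /dev/sdb and /dev/sde are
placeholder device names -- substitute your own, and check the btrfs(8)
manpage for the exact syntax your btrfs-progs version ships.

    # Temporary dropout (device came back, one copy stale/corrupt):
    btrfs scrub start /mnt      # rewrites bad copies from the good one
    btrfs scrub status /mnt     # watch progress and error counts

    # Device actually dead: mount degraded, swap in a replacement.
    mount -o degraded /dev/sdb /mnt
    btrfs device add /dev/sde /mnt    # placeholder replacement device
    btrfs device delete missing /mnt  # migrates data off the dead device
                                      # (or name the failed device if it's
                                      # still attached)
    btrfs balance start /mnt          # optionally respread the data

Note that the delete itself relocates the failed device's data, so the
explicit balance afterward is optional housekeeping, not part of the
recovery proper.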
Meanwhile, if you require runtime stability and failover, I'd suggest
md/raid1 or a similar, more mature and stable option designed to provide
exactly that.  btrfs will hopefully get there at some point, but as the
kernel btrfs config option notes, btrfs is still experimental; features
are still being added and improved, and runtime failover is one feature
btrfs simply doesn't support well just yet.

-- 
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman