Re: [dm-crypt] LUKS header recovery attempt from apparently healthy SSD

From: Arno Wagner <arno@wagner.name>
To: dm-crypt@saout.de
Subject: Re: [dm-crypt] LUKS header recovery attempt from apparently healthy SSD
Date: Sat, 22 Apr 2017 02:25:48 +0200
Message-ID: <20170422002548.GA23882@tansi.org>
In-Reply-To: <f11f16b9-bbcd-4484-bc4b-403d25dc00b5@depressiverobots.com>

Hi Protagonist,

this is an impressive analysis and I basically agree with 
all of it.

Personally, I strongly suspect your option "I". This design 
here is 5 years old and MLC. MLC requires the firmware to do 
regular scanning, error correction and rewrites in order to 
be reliable. 5 years ago the state of the firmware for that 
was more "experimental" than "stable". 

For example, I have one old SSD from back then (OCZ trash), 
that has silent single-bit errors on average in one out of 5 full 
reads. If such a bit-error happens on scrubbing or 
garbage-collection or regular writes to a partial internal 
(very large) sector, parts of the LUKS header may get rewritten 
with a permanent bit-error, even if the LUKS header itself was 
not written from outside at all. 

Such corruption can of course also be due to a failing SSD
controller, bad RAM in the SSD, bus-problems, etc. In 
particular, single-bit errors in an MLC-design will not
result from corrupted FLASH, but from other problems.

Now, are there any recovery options?

Assume 1 bit has been corrupted in a random place.
A key-slot is 256kB, i.e. 2Mbit. That means trying it 
out (flip one bit, do an unlock attempt) would take 
2 million seconds on the original PC, i.e. 23 days.
This can maybe be brought down by a factor of 5 or so 
with the fastest available CPU (the iteration count of 
150k is pretty low), i.e. still roughly 5 days. 

This may be worth giving it a try, but it requires some
serious coding with libcryptsetup and it will only
help on a single bit-error. 
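
To make the arithmetic and the search loop above concrete, here is a 
minimal Python sketch. It is not libcryptsetup code: `try_unlock` is a 
hypothetical callback that would have to do the real unlock attempt on 
a modified keyslot, and the 1 second per attempt is the assumption from 
the estimate above. The 256000-byte keyslot size follows from the quoted 
header (512-bit master key, i.e. 64 bytes, times 4000 AF stripes).

```python
from typing import Callable, Iterator, Optional, Tuple

KEYSLOT_BYTES = 64 * 4000  # 64-byte master key x 4000 AF stripes = 256000


def single_bit_flips(data: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (bit_index, candidate) for every single-bit flip of data."""
    buf = bytearray(data)
    for bit in range(len(buf) * 8):
        byte, mask = bit // 8, 1 << (bit % 8)
        buf[byte] ^= mask          # flip one bit
        yield bit, bytes(buf)
        buf[byte] ^= mask          # restore it before the next candidate


def find_single_bit_error(keyslot: bytes,
                          try_unlock: Callable[[bytes], bool]) -> Optional[int]:
    """Return the index of the flipped bit that unlocks, or None.

    try_unlock is a hypothetical stand-in for a real unlock attempt.
    """
    for bit, candidate in single_bit_flips(keyslot):
        if try_unlock(candidate):
            return bit
    return None


def estimate_days(keyslot_bytes: int, seconds_per_attempt: float) -> float:
    """Worst-case duration of the full search, in days."""
    return keyslot_bytes * 8 * seconds_per_attempt / 86400


if __name__ == "__main__":
    # Assumed 1 s per attempt on the original PC, 0.2 s on a fast CPU.
    print(round(estimate_days(KEYSLOT_BYTES, 1.0), 1))
    print(round(estimate_days(KEYSLOT_BYTES, 0.2), 1))
```

With a real `try_unlock`, `find_single_bit_error` returns the position 
of the bad bit, or None if no single flip helps, which would point to a 
more complex error.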

It may of course be a more complex error, especially
when ECC in the disk has corrected an error to the
wrong value, because the original was too corrupted.
A sane design prevents this by using a second, 
independent checksum on the ECC result, but as I said, 
5 years ago SSD design was pretty experimental and 
beginner's mistakes were made. 

The keyslot checker is no help here; it is intended
to find gross localized corruption, for example a 
new MBR written right into a keyslot. Checksums
on LUKS-level were not implemented because they are 
not really needed as classical HDDs are very good at 
detecting read-errors. Unless you go to ZFS or the like, 
filesystems do not do this either, for the same reasons. 
There is one global "checksum" in LUKS though, exactly 
the one that now tells you that there is no matching
keyslot, and on entry of a good passphrase that means
the keyslot is corrupted.
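
For reference, that global "checksum" is the MK digest: in LUKS1 it is 
PBKDF2 over the candidate master key, using the MK salt and MK 
iteration count from the header. A minimal Python sketch with the 
values from the quoted luksDump follows. The function name is made up 
for illustration, and the candidate master key would have to come from 
decrypting and AF-merging the keyslot, which is not shown here.

```python
import hashlib
import hmac

# Values copied from the quoted luksDump output (hash spec: sha1).
MK_SALT = bytes.fromhex(
    "04e3048c51fd07eed1f34a5ec18cb988"
    "ab0dcfdc557cfabcca1ab7025a55ac2c"
)
MK_DIGEST = bytes.fromhex("ff5c6448bc1fb2f26623d3663841c9608a7ede0a")
MK_ITERATIONS = 35125


def mk_digest_matches(candidate_mk: bytes) -> bool:
    """Check a candidate master key against the header's MK digest."""
    digest = hashlib.pbkdf2_hmac(
        "sha1", candidate_mk, MK_SALT, MK_ITERATIONS, dklen=len(MK_DIGEST)
    )
    return hmac.compare_digest(digest, MK_DIGEST)
```

A corrupted keyslot yields a wrong candidate key, so this check fails 
exactly as it does for a wrong passphrase; the two cases cannot be told 
apart from the result alone.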

My take is that apart from making absolutely sure 
the passphrase is correct (it sounds very much like it 
is though) and running the manufacturer's diagnostic 
tools on the SSD, there is not much more you can do. 

Regards,
Arno

On Fri, Apr 21, 2017 at 16:26:30 CEST, protagonist wrote:
> Hello all,
> someone found his way into our local hackerspace looking for help and
> advice with recovering his OS partition from a LUKS-encrypted INTEL SSD
> (SSDSC2CT240A4), and I've decided to get onto the case. Obviously,
> there is no backup, and he's aware of the consequences of this basic
> mistake by now.
> 
> The disk refused to unlock on boot in the original machine from one day
> to the other. Opening it from any other of several machines with
> different versions of Ubuntu/Debian, including Debian Stretch with a
> recent version of cryptsetup, has been completely unsuccessful,
> indicating a MK digest mismatch and therefore "wrong password". The
> password is fairly simple and contains no special characters or
> locale-sensitive characters and had been written down. Therefore I
> assume it is known correctly and the header must be partially faulty.
> 
> After reading the header specification, the FAQs, relevant recovery
> threads on here as well as going through the header with a hex editor
> and deducing some of its contents by hand, it is obvious to me that
> losing any significant portion (more than a few bytes) of the relevant
> LUKS header sections, either the critical parts of the meta-area or the
> actual key slot, would make the device contents provably irrecoverable,
> as even brute forcing becomes exponentially hard with the number of
> missing pseudo-randomly distributed bits.
> 
> Normally, one would move directly to grief stage number five -
> "Acceptance" - if the storage device in question was known to have data
> loss.
> 
> However, upon closer inspection, I can detect no obvious signs of
> multiple-byte data loss. There had been no intentional changes to the
> LUKS header, linux system upgrade or any other (known) relevant event to
> the system between it booting one day and refusing to unlock the day
> after. I realize that for *some* reasoning related to anti-forensics,
> the LUKS header specification contains no checksum over actual raw byte
> fields at all, making it very hard to detect the presence of minor
> defects in the header or providing any help in pinpointing their location.
> 
> Looking for major defects with the keyslot_checker reveals no obvious
> problems:
> 
> parameters (commandline and LUKS header):
>   sector size: 512
>   threshold:   0.900000
> 
> - processing keyslot 0:  keyslot not in use
> - processing keyslot 1:  start: 0x040000   end: 0x07e800
> - processing keyslot 2:  keyslot not in use
> - processing keyslot 3:  keyslot not in use
> - processing keyslot 4:  keyslot not in use
> - processing keyslot 5:  keyslot not in use
> - processing keyslot 6:  keyslot not in use
> - processing keyslot 7:  keyslot not in use
> 
> this is also the case if we increase the desired entropy to -t 0.935:
> 
> parameters (commandline and LUKS header):
>   sector size: 512
>   threshold:   0.935000
> 
> - processing keyslot 0:  keyslot not in use
> - processing keyslot 1:  start: 0x040000   end: 0x07e800
> - processing keyslot 2:  keyslot not in use
> [...]
> 
> Going through the sectors reported with -v at a higher -t value, I'm
> unable to find any suspicious groupings, for example unusual numbers of
> 00 00 or FF FF. Multi-byte substitution with a non-randomized pattern
> seems unlikely.
> 
> ------------------
> 
> The luksDump header information looks sane as well. The encryption had
> been created by the Mint 17.1 installation in the second half of 2014 on
> a fairly weak laptop and its password later changed to a better one,
> which accounts for the use of keyslot #1 and fairly low iteration counts.
> 
> LUKS header information for /dev/sda5
> 
> Version:       	1
> Cipher name:   	aes
> Cipher mode:   	xts-plain64
> Hash spec:     	sha1
> Payload offset:	4096
> MK bits:       	512
> MK digest:     	ff 5c 64 48 bc 1f b2 f2 66 23 d3 66 38 41 c9 60 8a 7e de 0a
> MK salt:       	04 e3 04 8c 51 fd 07 ee d1 f3 4a 5e c1 8c b9 88
>                	ab 0d cf dc 55 7c fa bc ca 1a b7 02 5a 55 ac 2c
> MK iterations: 	35125
> UUID:          	24e05704-f8ed-4391-9a3d-a59330a919d2
> 
> Key Slot 0: DISABLED
> Key Slot 1: ENABLED
> 	Iterations:         	144306
> 	Salt:               	b8 6f 20 a7 fe 8b 6a 9a 21 58 92 13 ce 1a 43 12
> 	                    	9c 4e a0 bf 7c 51 5e a1 78 47 05 ca b6 32 da a4
> 	Key material offset:	512
> 	AF stripes:            	4000
> Key Slot 2: DISABLED
> Key Slot 3: DISABLED
> Key Slot 4: DISABLED
> Key Slot 5: DISABLED
> Key Slot 6: DISABLED
> Key Slot 7: DISABLED
> 
> The disabled key slot #0 salt is correctly filled up with nulls, making
> it unusable for any recovery attempt. All magic bytes of the key slots,
> including 2 to 7 look good. The uuid is "version: 4 (random data based)"
> according to uuid -d output and therefore not of much help.
> ------------------
> 
> smartctl indicates fairly standard use for a 240GB desktop ssd, with
> about 3.7TB written at 2650h runtime, 1 reallocated sector and 0
> "Reported Uncorrectable Errors". The firmware version 335u seems to be
> the latest available, from what I've read. Smartctl tests with "-t
> short", "-t offline" and "-t long" show no errors:
> # 1  Extended offline    Completed without error       00%      2648         -
> # 2  Offline             Completed without error       00%      2646         -
> # 3  Short offline       Completed without error       00%      2572         -
> The device also shows no issues during idle or read states hinting at
> physical problems.
> 
> Checksumming the 240GB of data read blockwise from the device by dd with
> sha512sum led to identical results on three runs, so the device isn't
> mixing sectors or lying about their content differently each time we
> ask for data.
> 
> All in all, the failure mode is still a mystery to me. I can think of
> mainly three explanations:
> 
> I. silent data corruption events that have gone undetected by the
> SSD-internal sector-wide checksumming, namely bit/byte level changes on
>  * MK salt / digest
>  * key slot #1 iterations count / salt
>  * key slot #1 AF stripe data
> 
> II. actual passphrase mistakes
>  * "constant" mistake or layout mismatch
> This seems quite unlikely, as none of the characters change between a US
> layout and the DE layout that was used. There are also no characters
> that can be easily confused such as O/0.
> 
> III. some failure I've overlooked, like an OS-level bug or devilish
> malware causing "intentional" writes to the first 2M of the drive.
> 
> Failure case #I is still the most likely, but from my understanding, a
> four-digit number of system bootups and associated read events over the
> lifetime of the header shouldn't be able to cause any kind of flash
> wearout, let alone silent data corruption, unless the firmware is broken
> in a subtle way. Assuming it is, what to do besides bruteforcing the
> AF section for bit flips?
> 
> I would be delighted about any advice or idea for further tests to
> narrow down whatever happened to this header.
> Regards,
> protagonist
> _______________________________________________
> dm-crypt mailing list
> dm-crypt@saout.de
> http://www.saout.de/mailman/listinfo/dm-crypt

-- 
Arno Wagner,     Dr. sc. techn., Dipl. Inform.,    Email: arno@wagner.name
GnuPG: ID: CB5D9718  FP: 12D6 C03B 1B30 33BB 13CF  B774 E35C 5FA1 CB5D 9718
----
A good decision is based on knowledge and not on numbers. -- Plato

If it's in the news, don't worry about it.  The very definition of 
"news" is "something that hardly ever happens." -- Bruce Schneier