From: Phil Turmel <philip@turmel.org>
To: Mathias Mueller <raidfail@gmx.de>
Cc: Linux raid <linux-raid@vger.kernel.org>,
linux-raid-owner@vger.kernel.org
Subject: Re: broken raid level 5 array caused by user error
Date: Tue, 19 Jan 2016 12:51:43 -0500
Message-ID: <569E77AF.6040906@turmel.org>
In-Reply-To: <bb6cce6b0d5bb6c653ef94e4a58388cf@pingofdeath.de>
[-- Attachment #1: Type: text/plain, Size: 1389 bytes --]
Hi Mathias,
On 01/19/2016 09:35 AM, Mathias Mueller wrote:
> Hi Phil,
>
> I forgot to add some information: when I was creating the bytestrings
> from my jpg file, I did not start from 0k but from 100k of the jpg file
> (to skip the jpg header).
Ok. But I'm still not confident of chunk boundaries.
>> Very interesting. You could go one step further and compare the jpeg
>> file contents in the first 1M against the locations found to determine
>> where the chunks actually start and end on each device. The final
>> offset will be a chunk multiple before these boundaries. Or do md5 sums
>> of 4k blocks to reduce the amount to inspect.
>
> How exactly can I do this? Should I create more bytestrings and do more
> brep with them on my physical devices? I already have results from
> searching for bytestrings taken at 64k intervals (from 100k to 612k
> of my jpeg file, so 9 bytestrings in all). Should I provide a table of
> the results?
Sigh. I couldn't help myself. New utility attached. Curse you Mathias
for an interesting problem! ;-)
Call it with your jpeg and the devices to search, like so:
    findHash.py /path/to/picture.jpeg /dev/sd[bcde]
It'll make a map of hashes of each 4k block in the jpeg and then search
the listed devices for those hashes, building a map of the file
fragments. This will clearly show chunk boundaries.
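To illustrate why runs of matching 4k hashes expose chunk boundaries, here is a minimal Python 3 sketch on in-memory data. It is not the attached utility: the helper names (`block_hashes`, `find_runs`, `fake_block`), the 8k chunk size, and the toy layout (one member of a two-device stripe, parity ignored) are all assumptions made for the example, and it follows only one candidate per run because every block in the toy file is distinct.

```python
import hashlib

BLOCK = 4096

def block_hashes(data):
    """Map md5 digest -> list of offsets for every full 4k block of data."""
    table = {}
    for off in range(0, len(data) - BLOCK + 1, BLOCK):
        digest = hashlib.md5(data[off:off + BLOCK]).digest()
        table.setdefault(digest, []).append(off)
    return table

def find_runs(table, device):
    """Return (device_offset, file_offset, length) for each run of
    consecutive device blocks that matches consecutive file blocks."""
    runs = []
    current = None  # [file_offset, device_offset, length]
    for pos in range(0, len(device) - BLOCK + 1, BLOCK):
        digest = hashlib.md5(device[pos:pos + BLOCK]).digest()
        offsets = table.get(digest, [])
        if current and current[0] + current[2] in offsets:
            current[2] += BLOCK          # the run continues
            continue
        if current:                      # the run just ended: record it
            runs.append((current[1], current[0], current[2]))
        current = [offsets[0], pos, BLOCK] if offsets else None
    if current:                          # run reaching the end of the device
        runs.append((current[1], current[0], current[2]))
    return runs

def fake_block(n):
    """Deterministic, distinct 4k block (hypothetical test data)."""
    return hashlib.sha256(str(n).encode()).digest() * (BLOCK // 32)

# Hypothetical 32k "picture" and an assumed 8k chunk size.
subject = b"".join(fake_block(n) for n in range(8))
CHUNK = 2 * BLOCK
pad = b"\0" * (3 * BLOCK)
# Toy member device holding file chunks 0 and 2 back to back.
device = pad + subject[0:CHUNK] + subject[2 * CHUNK:3 * CHUNK] + pad

runs = find_runs(block_hashes(subject), device)
for dev_off, file_off, length in runs:
    print("device %#x:%#x ~= file %#x:%#x" % (
        dev_off, dev_off + length - 1, file_off, file_off + length - 1))
```

In this toy layout the file offset jumps from 0x1fff to 0x4000 while the device offset stays contiguous, so the break between the two runs (device offset 0x5000) marks a chunk boundary.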
Please show the output.
Phil
[-- Attachment #2: findHash.py --]
[-- Type: text/x-python, Size: 2243 bytes --]
#! /usr/bin/python2
#
# Locate 4k fragments of a subject file in one or more other files or
# devices.  Only reports two or more consecutive matches.
#
# Usage:
#   findHash.py /path/to/subject/file /dev/sdx|/path/to/image/file [/dev/sdy ...]

import hashlib, sys, datetime

# Read the known file 4k at a time, building a dictionary of
# md5 hashes vs. offset.  Use a large buffer for speed.
# Drops any partial block at the end of the file.
d = {}
pos = long(0)
f = open(sys.argv[1], 'r', 1<<20)
b = f.read(4096)
while len(b)==4096:
    md5 = hashlib.md5()
    md5.update(b)
    h = md5.digest()
    hlist = d.get(h)
    if not hlist:
        hlist = []
        d[h] = hlist
#       print "New hash %s at %8.8x" % (h.encode('hex'), pos)
    hlist.append(pos)
    pos += 4096
    b = f.read(4096)
f.close()

print "%d Unique hashes in %s" % (len(d), sys.argv[1])

def checkAndPrint(match):
    if match[2]>4096:
        print "%20s @ %12.12x:%12.12x ~= %8.8x:%8.8x" % (fname, match[1], match[1]+match[2]-1, match[0], match[0]+match[2]-1)

# Read the candidate files/devices, looking for possible matches.  Match
# entries are vectors of known file offset, candidate file offset, and
# length.
for fname in sys.argv[2:]:
    print "\nSearching for pieces of %s in %s:..." % (sys.argv[1], fname)
    pos = long(0)
    f = open(fname, 'r', 1<<24)
    matches = []
    b = f.read(4096)
    lastts = None
    while len(b)==4096:
        if not (pos & 0x7ffffff):
            ts = datetime.datetime.now()
            if lastts:
                print "@ %12.12x %.1fMB/s \r" % (pos, 128.0/((ts-lastts).total_seconds())),
            else:
                print "@ %12.12x...\r" % pos,
            sys.stdout.flush()
            lastts = ts
        md5 = hashlib.md5()
        md5.update(b)
        h = md5.digest()
        if h in d:
            i = 0
            while i<len(matches):
                match = matches[i]
                target = match[0]+match[2]
                continuations = [x for x in d[h] if x==target]
                if continuations:
                    match[2] += 4096
                    i += 1
                else:
                    del matches[i]
                    checkAndPrint(match)
            if not matches:
                matches = [[x, pos, 4096] for x in d[h]]
        else:
            for match in matches:
                checkAndPrint(match)
            matches = []
        pos += 4096
        b = f.read(4096)
    print "End of %s at %12.12x" % (fname, pos)
    # show matches that continue to the end of the candidate file/device.
    for match in matches:
        checkAndPrint(match)
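
Once the report shows where matched fragments break on a device, the quoted advice that "the final offset will be a chunk multiple before these boundaries" can be turned into arithmetic. The following Python 3 sketch is only an illustration under stated assumptions: the function names are hypothetical, the boundary values are made-up numbers rather than output from this thread, and "final offset" is taken to mean the start of the data area on a member device. The gcd of the gaps between boundaries is a multiple of the chunk size, so it gives an upper bound on the chunk size rather than a guaranteed value.

```python
from functools import reduce
from math import gcd

def alignment_from_boundaries(boundaries):
    """Given device offsets where matched fragments break, return
    (stride, residue): the gcd of the gaps between boundaries and the
    common remainder of every boundary modulo that gcd."""
    ordered = sorted(set(boundaries))
    if len(ordered) < 2:
        raise ValueError("need at least two distinct boundaries")
    stride = reduce(gcd, (b - a for a, b in zip(ordered, ordered[1:])))
    return stride, ordered[0] % stride

def candidate_offsets(boundary, chunk):
    """Offsets lying a whole number of chunks (including zero) before
    the given boundary, nearest first."""
    return list(range(boundary, -1, -chunk))

# Hypothetical boundaries, as if read from the device column of the report.
boundaries = [0x52000, 0x62000, 0x92000]
stride, residue = alignment_from_boundaries(boundaries)
print("stride %#x, residue %#x" % (stride, residue))
for off in candidate_offsets(boundaries[0], stride):
    print("candidate data offset %#x" % off)
```

With these invented numbers the gaps are 0x10000 and 0x30000, so the stride is 0x10000 (64k) and every candidate offset is congruent to 0x2000 modulo 64k. Real output would still need checking against plausible data offsets and chunk sizes for the array in question.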