From: Phil Turmel <philip@turmel.org>
To: Mathias Mueller <raidfail@gmx.de>
Cc: Linux raid <linux-raid@vger.kernel.org>,
linux-raid-owner@vger.kernel.org
Subject: Re: broken raid level 5 array caused by user error
Date: Tue, 19 Jan 2016 12:51:43 -0500
Message-ID: <569E77AF.6040906@turmel.org>
In-Reply-To: <bb6cce6b0d5bb6c653ef94e4a58388cf@pingofdeath.de>
[-- Attachment #1: Type: text/plain, Size: 1389 bytes --]
Hi Mathias,
On 01/19/2016 09:35 AM, Mathias Mueller wrote:
> Hi Phil,
>
> I forgot to add some information: when I was creating the bytestrings
> from my jpg file, I did not start from 0k but from 100k of the jpg file
> (to skip the jpg header).
Ok. But I'm still not confident of chunk boundaries.
>> Very interesting. You could go one step further and compare the jpeg
>> file contents in the first 1M against the locations found to determine
>> where the chunks actually start and end on each device. The final
>> offset will be a chunk multiple before these boundaries. Or do md5 sums
>> of 4k blocks to reduce the amount to inspect.
>
> How exactly can I do this? Should I create more bytestrings and do more
> brep with them on my physical devices? I already have results from
> searching bytestrings with an offset of 64k (starting from 100k to 612k
> of my jpeg file, so 9 bytestrings in all). Should I provide a table of
> the results?
Sigh. I couldn't help myself. New utility attached. Curse you Mathias
for an interesting problem! ;-)

Call it with your jpeg and the devices to search, like so:

    findHash.py /path/to/picture.jpeg /dev/sd[bcde]

It'll make a map of hashes of each 4k block in the jpeg and then search
the listed devices for those hashes, building a map of the file
fragments. This will clearly show chunk boundaries.

Please show the output.

Phil
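
The hash-map search described above can be tried on synthetic data before
pointing it at real disks. The following is a minimal Python 3 sketch, not
the attached Python 2 script: the names block_hashes and find_runs, the
8-block "picture" and the 7-block "device" are all made up for
illustration, and unlike the attachment it follows only the first
candidate file offset for a repeated hash.

```python
import hashlib

BLOCK = 4096  # same block size the attached script uses


def block_hashes(data):
    """Map md5 digest -> list of offsets of each whole 4k block in data."""
    table = {}
    for off in range(0, len(data) - BLOCK + 1, BLOCK):
        digest = hashlib.md5(data[off:off + BLOCK]).digest()
        table.setdefault(digest, []).append(off)
    return table


def find_runs(table, device):
    """Return (device_offset, file_offset, length) for runs of 2+ blocks."""
    runs = []
    current = None  # [file_offset, device_offset, length] of the open run
    for pos in range(0, len(device) - BLOCK + 1, BLOCK):
        digest = hashlib.md5(device[pos:pos + BLOCK]).digest()
        offsets = table.get(digest, [])
        if current and current[0] + current[2] in offsets:
            current[2] += BLOCK  # this block continues the open run
            continue
        if current and current[2] > BLOCK:
            runs.append((current[1], current[0], current[2]))
        # Simplification (assumption): follow only the first candidate.
        current = [offsets[0], pos, BLOCK] if offsets else None
    if current and current[2] > BLOCK:
        runs.append((current[1], current[0], current[2]))
    return runs


def block(tag):
    """One 4k block filled with a single byte value."""
    return bytes([tag]) * BLOCK


# Hypothetical subject file: 8 distinct blocks.
picture = b"".join(block(i) for i in range(1, 9))
filler = block(0)
# Hypothetical device with 8k chunks: filler, file blocks 0-1,
# two filler blocks (another member's chunk), then file blocks 4-5.
device = (filler + picture[0:2 * BLOCK] + filler * 2
          + picture[4 * BLOCK:6 * BLOCK])

runs = find_runs(block_hashes(picture), device)
for dev_off, file_off, length in runs:
    print("device %#x:%#x ~= file %#x:%#x" % (
        dev_off, dev_off + length - 1, file_off, file_off + length - 1))
```

Each reported run stops where the simulated chunk ends, which is the
property the real search relies on to expose chunk boundaries.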
[-- Attachment #2: findHash.py --]
[-- Type: text/x-python, Size: 2243 bytes --]
#! /usr/bin/python2
#
# Locate 4k fragments of a subject file in one or more other files or
# devices. Only reports two or more consecutive matches.
#
# Usage:
#   findHash.py /path/to/subject/file /dev/sdx|/path/to/image/file [/dev/sdy ...]

import hashlib, sys, datetime

# Read the known file 4k at a time, building a dictionary of
# md5 hashes vs. offset. Use a large buffer for speed.
# Drops any partial block at the end of the file.
d = {}
pos = long(0)
# Binary mode: the subject file is raw data, not text.
f = open(sys.argv[1], 'rb', 1<<20)
b = f.read(4096)
while len(b)==4096:
    md5 = hashlib.md5()
    md5.update(b)
    h = md5.digest()
    hlist = d.get(h)
    if not hlist:
        hlist = []
        d[h] = hlist
        # print "New hash %s at %8.8x" % (h.encode('hex'), pos)
    # The same block content may occur at several file offsets.
    hlist.append(pos)
    pos += 4096
    b = f.read(4096)
f.close()

print "%d Unique hashes in %s" % (len(d), sys.argv[1])

# Report a match only if it spans more than one 4k block. Uses the
# global fname set by the search loop below.
def checkAndPrint(match):
    if match[2]>4096:
        print "%20s @ %12.12x:%12.12x ~= %8.8x:%8.8x" % (fname, match[1], match[1]+match[2]-1, match[0], match[0]+match[2]-1)

# Read the candidate files/devices, looking for possible matches. Match
# entries are vectors of known file offset, candidate file offset, and
# length.
for fname in sys.argv[2:]:
    print "\nSearching for pieces of %s in %s:..." % (sys.argv[1], fname)
    pos = long(0)
    # Binary mode: devices and images are raw data, not text.
    f = open(fname, 'rb', 1<<24)
    matches = []
    b = f.read(4096)
    lastts = None
    while len(b)==4096:
        # Progress report every 128MB (pos is a multiple of 0x8000000).
        if not (pos & 0x7ffffff):
            ts = datetime.datetime.now()
            if lastts:
                print "@ %12.12x %.1fMB/s \r" % (pos, 128.0/((ts-lastts).total_seconds())),
            else:
                print "@ %12.12x...\r" % pos,
            sys.stdout.flush()
            lastts = ts
        md5 = hashlib.md5()
        md5.update(b)
        h = md5.digest()
        if h in d:
            # Extend every open match that this block continues; close
            # and report the ones it does not.
            i = 0
            while i<len(matches):
                match = matches[i]
                target = match[0]+match[2]
                continuations = [x for x in d[h] if x==target]
                if continuations:
                    match[2] += 4096
                    i += 1
                else:
                    del matches[i]
                    checkAndPrint(match)
            # Nothing left open: start a new match for each file offset
            # where this block occurs.
            if not matches:
                matches = [[x, pos, 4096] for x in d[h]]
        else:
            # Unknown block: close and report all open matches.
            for match in matches:
                checkAndPrint(match)
            matches = []
        pos += 4096
        b = f.read(4096)
    f.close()
    print "End of %s at %12.12x" % (fname, pos)
    # show matches that continue to the end of the candidate file/device.
    for match in matches:
        checkAndPrint(match)
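
Once the script has produced its match lines, the chunk size and the
position of chunk boundaries on each device can be read off the run
lengths. The following is a rough Python 3 sketch of that step. It assumes
the output format of checkAndPrint above; the helper names parse and
guess_chunk are hypothetical, and the sample lines are invented for
illustration (they are not output from the array discussed in this
thread). It also assumes the file is stored contiguously, so that runs
covering a whole chunk outnumber the partial runs at the start and end of
the file.

```python
import re
from collections import Counter

# Matches lines printed by checkAndPrint:
#   "<name> @ <dev_start>:<dev_end> ~= <file_start>:<file_end>" (hex)
LINE = re.compile(
    r"\s*(\S+) @ ([0-9a-f]+):([0-9a-f]+) ~= ([0-9a-f]+):([0-9a-f]+)")


def parse(lines):
    """Return (device, dev_start, dev_end, file_start, file_end) tuples."""
    records = []
    for line in lines:
        m = LINE.match(line)
        if m:
            numbers = tuple(int(x, 16) for x in m.groups()[1:])
            records.append((m.group(1),) + numbers)
    return records


def guess_chunk(records):
    """Most common run length; whole-chunk runs should dominate."""
    lengths = Counter(end - start + 1 for _, start, end, _, _ in records)
    return lengths.most_common(1)[0][0]


# Invented sample lines, for illustration only.
sample = [
    "            /dev/sdb @ 000008105000:00000810ffff ~= 00000000:0000afff",
    "            /dev/sdc @ 000008100000:00000810ffff ~= 0000b000:0001afff",
    "            /dev/sdd @ 000008100000:00000810ffff ~= 0001b000:0002afff",
    "            /dev/sdb @ 000008110000:00000811ffff ~= 0002b000:0003afff",
]

records = parse(sample)
chunk = guess_chunk(records)
# Where whole-chunk runs start, modulo the chunk size. A single value
# here means the data offset is congruent to it modulo the chunk size.
phases = {start % chunk for _, start, end, _, _ in records
          if end - start + 1 == chunk}
print("chunk size guess: %d bytes, boundary phase(s): %s"
      % (chunk, sorted(phases)))
```

With real output, a run shorter than the guessed chunk size normally marks
the start or end of the file, and the device offset where such a run ends
is still a chunk boundary.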