linux-raid.vger.kernel.org archive mirror
From: Phil Turmel <philip@turmel.org>
To: Mathias Mueller <raidfail@gmx.de>
Cc: Linux raid <linux-raid@vger.kernel.org>,
	linux-raid-owner@vger.kernel.org
Subject: Re: broken raid level 5 array caused by user error
Date: Tue, 19 Jan 2016 12:51:43 -0500
Message-ID: <569E77AF.6040906@turmel.org>
In-Reply-To: <bb6cce6b0d5bb6c653ef94e4a58388cf@pingofdeath.de>

Hi Mathias,

On 01/19/2016 09:35 AM, Mathias Mueller wrote:
> Hi Phil,
> 
> I forgot to add some information: when I was creating the bytestrings
> from my jpg file, I did not start from 0k but from 100k of the jpg file
> (to skip the jpg header).

OK, but I'm still not confident about the chunk boundaries.

>> Very interesting.  You could go one step further and compare the jpeg
>> file contents in the first 1M against the locations found to determine
>> where the chunks actually start and end on each device.  The final
>> offset will be a chunk multiple before these boundaries.  Or do md5 sums
>> of 4k blocks to reduce the amount to inspect.
> 
> How exactly can I do this? Should I create more Bytestrings and do more
> brep with them on my physical devices? I have already results from
> searching bytestrings with an offset of 64k (starting from 100k to 612k
> of my jpeg file, so 9 bytestrings at all). Should I provide a table of
> the results?

Sigh.  I couldn't help myself.  New utility attached.  Curse you Mathias
for an interesting problem! ;-)

Call it with your jpeg and the devices to search, like so:

findHash.py /path/to/picture.jpeg /dev/sd[bcde]

It builds a map of MD5 hashes of every 4k block in the jpeg, then scans
the listed devices for those hashes and reports each run of two or more
consecutive matching blocks.  The points where those runs start and stop
should clearly show the chunk boundaries.
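
For readers who want the idea without the device I/O, here is a minimal
sketch of the same technique on in-memory byte strings.  It is written for
Python 3 (the attached script targets Python 2), the function names are
made up for illustration, and it is simplified to follow only one candidate
run at a time, so it is not a replacement for the attachment.

```python
import hashlib

BLOCK = 4096  # same 4k block size the utility uses


def block_hashes(data, block=BLOCK):
    """Map MD5 digest -> list of offsets of each full block in data."""
    table = {}
    for off in range(0, len(data) - block + 1, block):
        digest = hashlib.md5(data[off:off + block]).digest()
        table.setdefault(digest, []).append(off)
    return table


def find_runs(table, haystack, block=BLOCK):
    """Return [subject_offset, haystack_offset, length] for runs of 2+ blocks.

    Simplification: when a block could start several runs, only the first
    known offset is followed.
    """
    runs = []
    current = None
    for pos in range(0, len(haystack) - block + 1, block):
        digest = hashlib.md5(haystack[pos:pos + block]).digest()
        offsets = table.get(digest, [])
        if current and current[0] + current[2] in offsets:
            current[2] += block      # block continues the current run
            continue
        if current and current[2] > block:
            runs.append(current)     # run ended; keep it if 2+ blocks long
        current = [offsets[0], pos, block] if offsets else None
    if current and current[2] > block:
        runs.append(current)         # run reached the end of the haystack
    return runs


if __name__ == "__main__":
    # Synthetic data: eight distinct 4k blocks, three of them buried in filler.
    subject = b"".join(bytes([i + 1]) * BLOCK for i in range(8))
    filler = b"\xff" * BLOCK
    haystack = filler * 3 + subject[2 * BLOCK:5 * BLOCK] + filler * 2
    for s_off, h_off, length in find_runs(block_hashes(subject), haystack):
        print("haystack @ %x:%x ~= subject %x:%x"
              % (h_off, h_off + length - 1, s_off, s_off + length - 1))
```

Hashing fixed 4k blocks only works because file data lands on the devices
aligned to 4k; a fragment shifted by some other amount would not be found.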

Please show the output.

Phil

[-- Attachment #2: findHash.py --]
[-- Type: text/x-python, Size: 2243 bytes --]

#! /usr/bin/python2
#
# Locate 4k fragments of a subject file in one or more other files or
# devices.  Only reports two or more consecutive matches.
#
# Usage:
#   findHash.py /path/to/subject/file /dev/sdx|/path/to/image/file [/dev/sdy ...]

import hashlib, sys, datetime

# Read the known file 4k at a time, building a dictionary of
# md5 hashes vs. offset.  Use a large buffer for speed.
# Drops any partial block at the end of the file.
# Opened in binary mode so the bytes are hashed exactly as stored.
d = {}
pos = long(0)
f = open(sys.argv[1], 'rb', 1<<20)
b = f.read(4096)
while len(b)==4096:
	md5 = hashlib.md5()
	md5.update(b)
	h = md5.digest()
	hlist = d.get(h)
	if not hlist:
		hlist = []
		d[h] = hlist
#		print "New hash %s at %8.8x" % (h.encode('hex'), pos)
	hlist.append(pos)
	pos += 4096
	b = f.read(4096)
f.close()

print "%d Unique hashes in %s" % (len(d), sys.argv[1])

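# Print a match only if it spans two or more consecutive 4k blocks.
# Output: candidate name @ candidate start:end ~= subject start:end (hex).
# Uses the global fname set by the search loop below.
def checkAndPrint(match):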
	if match[2]>4096:
		print "%20s @ %12.12x:%12.12x ~= %8.8x:%8.8x" % (fname, match[1], match[1]+match[2]-1, match[0], match[0]+match[2]-1)

# Read the candidate files/devices, looking for possible matches.  Match
# entries are vectors of known file offset, candidate file offset, and
# length.
for fname in sys.argv[2:]:
	print "\nSearching for pieces of %s in %s:..." % (sys.argv[1], fname)
	pos = long(0)
	f = open(fname, 'rb', 1<<24)
	matches = []
	b = f.read(4096)
	lastts = None
	while len(b)==4096:
		if not (pos & 0x7ffffff):
			ts = datetime.datetime.now()
			if lastts:
				print "@ %12.12x %.1fMB/s   \r" % (pos, 128.0/((ts-lastts).total_seconds())),
			else:
				print "@ %12.12x...\r" % pos,
			sys.stdout.flush()
			lastts = ts
		md5 = hashlib.md5()
		md5.update(b)
		h = md5.digest()
		if h in d:
			i = 0
			while i<len(matches):
				match = matches[i]
				target = match[0]+match[2]
				continuations = [x for x in d[h] if x==target]
				if continuations:
					match[2] += 4096
					i += 1
				else:
					del matches[i]
					checkAndPrint(match)
			if not matches:
				matches = [[x, pos, 4096] for x in d[h]]
		else:
			for match in matches:
				checkAndPrint(match)
			matches = []
		pos += 4096
		b = f.read(4096)
	f.close()
	print "End of %s at %12.12x" % (fname, pos)
	# show matches that continue to the end of the candidate file/device.
	for match in matches:
		checkAndPrint(match)
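
A note on reading the output, tied to the earlier remark that the final
offset will be a chunk multiple before the boundaries: if a fragment on a
device breaks off while the jpeg itself continues, that device offset is
taken to be a chunk boundary, and the data offset must share its remainder
modulo the chunk size.  The sketch below (Python 3, hypothetical helper
names, a chunk size that is purely an assumption, and made-up offsets) shows
the arithmetic only.

```python
def data_offset_residue(boundaries, chunk_size):
    """Return the shared value of boundary % chunk_size, or None if they disagree."""
    residues = {b % chunk_size for b in boundaries}
    return residues.pop() if len(residues) == 1 else None


def candidate_data_offsets(boundary, chunk_size, limit):
    """List offsets up to limit that lie a whole number of chunks before boundary."""
    first = boundary % chunk_size
    return list(range(first, min(boundary, limit) + 1, chunk_size))


if __name__ == "__main__":
    chunk = 512 * 1024                 # assumed chunk size, not taken from the array
    observed = [1052672, 1576960]      # hypothetical boundary offsets on one device
    print(data_offset_residue(observed, chunk))
    print(candidate_data_offsets(observed[0], chunk, 2 * 1024 * 1024))
```

If the boundaries seen on one device do not agree modulo the assumed chunk
size, the assumed chunk size is probably wrong.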
