PICList Thread
'[OT] batch hex file similarity comparison tool?'
2006\01\13@141417 by William Couture

Hi all,

The company I work for is suing a Taiwanese company for making an illegal
"clone" of one of our products.  I'm trying to help prove that the code in the
ROM is actually ours.  This started before I joined the company, so I'm not
familiar with all the details.

There are a lot of versions of code for this, and I need to find the one they
copied.

Today, however, my boss suddenly told me that they made some changes,
it's not an exact copy.

So, I need to do a "similarity search" over several hundred hex files scattered
throughout a file tree to find the one that they derived their code from.

Has anyone ever heard of such a tool?

Thanks,
  Bill

--
Psst...  Hey, you... Buddy...  Want a kitten?  straycatblues.petfinder.org

2006\01\13@143031 by Alex Harford

On 1/13/06, William Couture wrote:
>
> Today, however, my boss suddenly told me that they made some changes,
> it's not an exact copy.
>
> So, I need to do a "similarity search" over several hundred hex files scattered
> throughout a file tree to find the one that they derived their code from.
>
> Has anyone ever heard of such a tool?

No but I've been interested in this as well because I've been working
on reverse engineering the GM ECMs and I'd like to have a nice way of
automating the search for large chunks of identical (or very similar)
binary data.  I'm sure GM reused a large amount of code across the
various ECMs but I just haven't gotten a round tuit yet.

AFAIK the unix 'diff' program only works on text, and I haven't had
any luck with the binary diff tools I've found on the net.

Since I'm not a computer scientist I'm sure I'm missing out on a more
efficient algorithm, but a simple program could probably be written to
read a chunk of binary data (4 bytes lets say) and search for the same
pattern in another file.  If that pattern is found, check the next
byte in both files until they don't match.  Do this for every 4 byte
chunk in the original file and report the large chunks that were
found.
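
A minimal sketch of that seed-and-extend idea in Python, assuming both files are already loaded as raw bytes. The names `common_chunks`, `seed` and `min_len` are made up for illustration; this is not an existing tool.

```python
def common_chunks(a: bytes, b: bytes, seed: int = 4, min_len: int = 16):
    """Return (offset_in_a, offset_in_b, length) for long runs shared by a and b."""
    # Index every seed-sized window of b by its content.
    index = {}
    for j in range(len(b) - seed + 1):
        index.setdefault(b[j:j + seed], []).append(j)

    found = []
    i = 0
    while i <= len(a) - seed:
        best_len, best_j = 0, -1
        for j in index.get(a[i:i + seed], ()):
            n = seed
            # Extend the match one byte at a time until the files diverge.
            while i + n < len(a) and j + n < len(b) and a[i + n] == b[j + n]:
                n += 1
            if n > best_len:
                best_len, best_j = n, j
        if best_len >= min_len:
            found.append((i, best_j, best_len))
            i += best_len  # skip past the chunk just reported
        else:
            i += 1
    return found
```

Long runs of filler (0xFF in unprogrammed ROM, for example) produce a huge number of seed hits and slow this down a lot, so it is probably worth trimming those before comparing.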

Alex

2006\01\13@143537 by Robert Rolf

I would convert the hex files to a binary image to make the
similarities more visible.

Then use Symantec 'filecompare' in forced ASCII mode.
That way identical sequences will be detected and highlighted.

The other way would be to convert the binary files back into
source using an appropriate disassembler (they are everywhere on
the web) and again do a file compare.

Trying to do this with the hex files is a pointless exercise in
futility since the hex encoding process will make changes less visible.
i.e. referenced addresses will be different, but with source, you can
see the intent is identical.

Robert


2006\01\13@144533 by Herbert Graf

On Fri, 2006-01-13 at 14:14 -0500, William Couture wrote:
> Hi all,
>
> The company I work for is suing a Taiwanese company for making an illegal
> "clone" of one of our products.  I'm trying to help prove that the code in the
> ROM is actually ours.  This started before I joined the company, so I'm not
> familiar with all the details.
>
> There are a lot of versions of code for this, and I need to find the one they
> copied.
>
> Today, however, my boss suddenly told me that they made some changes,
> it's not an exact copy.
>
> So, I need to do a "similarity search" over several hundred hex files scattered
> throughout a file tree to find the one that they derived their code from.
>
> Has anyone ever heard of such a tool?

Don't know what OS you run, but diff under linux/unix will do that very
simply. I'm sure there's something similar for the OS you're interested
in.

TTYL

-----------------------------
Herbert's PIC Stuff:
http://repatch.dyndns.org:8383/pic_stuff/

2006\01\13@151443 by Wouter van Ooijen

> So, I need to do a "similarity search" over several hundred
> hex files scattered
> throughout a file tree to find the one that they derived
> their code from.
>
> Has anyone ever heard of such a tool?

Maybe concatenate the candidate .hex file with the one from the pirate,
and zip the combination. A copy or partial copy will compress better
than a different application. You might need some experimenting to
interpret the results. But at least you can automate the process.
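
A minimal sketch of this compression trick in Python. It swaps zip for the standard-library `lzma` module and reports a normalized compression distance instead of a raw zipped size; both are choices made for this example, not part of the suggestion above. The function names are illustrative.

```python
import lzma

def compressed_size(data: bytes) -> int:
    return len(lzma.compress(data, preset=9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: small for near-copies, close to 1 for unrelated data."""
    cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def rank_candidates(pirate: bytes, candidates: dict) -> list:
    """Sort candidate names so the ones most similar to the pirate image come first."""
    return sorted(candidates, key=lambda name: ncd(pirate, candidates[name]))
```

Lower scores mean more shared content. Deflate (what zip normally uses) can only refer back 32 KB, so with two 64 KB ROM images the second file could not point back at the matching bytes of the first; that is the reason for picking a compressor with a larger window here.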

Wouter van Ooijen

-- -------------------------------------------
Van Ooijen Technische Informatica: http://www.voti.nl
consultancy, development, PICmicro products
docent Hogeschool van Utrecht: http://www.voti.nl/hvu


2006\01\13@154236 by William Chops Westfield
>> Today, however, my boss suddenly told me that they made some changes,
>> it's not an exact copy.
>>
>> So, I need to do a "similarity search" over several hundred hex
>> files scattered throughout a file tree to find the one that they
>> derived their code from.

For something like PIC code, where you HAVE the original source
code, I think you'd be best off disassembling the code, running
it through some sort of symbol assignment based on your source,
and comparing the results...

BillW

2006\01\13@164145 by Michael Dipperstein

> From: piclist-bounces@mit.edu On Behalf Of William Couture
>
> So, I need to do a "similarity search" over several hundred hex files
> scattered throughout a file tree to find the one that they derived
> their code from.
>
> Has anyone ever heard of such a tool?

The people who do DNA sequence alignment have tools for calculating
similarity between DNA sequences.  One of the scoring schemes that they
use is called PAM (Point Accepted Mutation).  I don't remember much
about it, but I've heard of people adapting it for spell checkers.  It
might be useful for strings of data bytes too.
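
PAM itself is a scoring matrix for biological sequences, so the sketch below does not implement it. It uses plain edit distance instead, which belongs to the same family of dynamic-programming alignment methods and is what spell checkers commonly use. The function name is illustrative.

```python
def edit_distance(a: bytes, b: bytes) -> int:
    """Levenshtein distance: fewest single-byte inserts, deletes or substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute, or free if equal
        prev = cur
    return prev[-1]
```

The running time grows with len(a) * len(b), so two full 64 KB images would need billions of steps in pure Python. It is more practical on smaller regions, or on a sequence of opcodes rather than raw bytes.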

You also need to be careful about the hex files you compare.  There's
nothing magical about the order that the lines appear in or the number
of bytes on a line.  Unprogrammed locations don't have to be included
either.

My first thought at handling hex file variations is to reformat them
using a program that reads the hex file data into an array the size of the
PIC's memory, then writes out the sorted data bytes to a file to be used
for the comparison.
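
A minimal sketch of that normalization step in Python, assuming the files are Intel HEX (the usual output of PIC tools). The function name, the default image size and the 0xFF fill value are assumptions for illustration.

```python
def load_intel_hex(text: str, size: int = 0x10000, fill: int = 0xFF) -> bytes:
    """Place every data record of an Intel HEX file into a flat, fill-padded image."""
    image = bytearray([fill]) * size
    base = 0  # upper address bits, set by type 02/04 records
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith(":"):
            continue
        rec = bytes.fromhex(line[1:])
        if sum(rec) & 0xFF:  # all bytes including the checksum must sum to 0 mod 256
            raise ValueError(f"bad checksum: {line}")
        count, addr, rtype = rec[0], int.from_bytes(rec[1:3], "big"), rec[3]
        data = rec[4:4 + count]
        if rtype == 0x00:                    # data record
            start = base + addr
            if start + count <= size:        # skip data that falls outside the image
                image[start:start + count] = data
        elif rtype == 0x01:                  # end of file
            break
        elif rtype == 0x02:                  # extended segment address
            base = int.from_bytes(data, "big") << 4
        elif rtype == 0x04:                  # extended linear address
            base = int.from_bytes(data, "big") << 16
    return bytes(image)
```

Two hex files that hold the same bytes but differ in record order or line length load to identical images, so the images can then be compared directly.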

-Mike

2006\01\13@172127 by andrew kelley

Or, as an alternative to all those, write a utility that reads in the
opcodes and searches for pattern matches between the two files while
ignoring the operands.  The operands will most likely differ, at least
for jumps (if they shuffled the code around) and for memory references
(if they moved those around too).  Basically, compare the code
sequences.  As a start I can send you some code that I have for
processing PIC opcodes (in C).
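
A minimal sketch of the operand-masking idea in Python (not the C code mentioned above), assuming 14-bit PIC mid-range (PIC16) instruction words stored low byte first, as in a typical PIC hex file. The masks are a coarse classification chosen for this example, and the function names are illustrative.

```python
def words_from_bytes(data: bytes) -> list:
    """Pair up bytes into instruction words, low byte first."""
    return [data[i] | (data[i + 1] << 8) for i in range(0, len(data) - 1, 2)]

def opcode_class(word: int) -> int:
    """Zero the operand bits of one 14-bit instruction word."""
    top = (word >> 12) & 0x3
    if top == 0b01:      # bit ops (bcf/bsf/btfsc/btfss): drop bit number and register
        return word & 0x3C00
    if top == 0b10:      # call/goto: drop the 11-bit target address
        return word & 0x3800
    if top == 0b11:      # literal ops (movlw, addlw, ...): drop the 8-bit literal
        return word & 0x3F00
    # byte-oriented ops (movwf, addwf, ...): drop the register, keep the destination bit
    return word & 0x3F80

def opcode_signature(words) -> list:
    return [opcode_class(w) for w in words]
```

Two images whose signatures match over long stretches contain the same instruction sequence even if the jump targets and register addresses were changed. The no-operand instructions (nop, return, sleep and so on) all collapse into one class with these masks, which is usually good enough for spotting copied code.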

Email me offlist if you are interested.. ( I could even write the
pattern matching code...)

andrew

2006\01\13@182254 by James Newtons Massmind


Resync Hex file comparison is available in:

Hex Workshop from BreakPoint Software

010 Editor from SweetScape Software

---
James.




2006\01\13@195226 by Rolf

A while ago there was a big deal between PearPC and CherryOS (Mac
Emulators).

Google search for that because people used some very fancy mechanisms to
identify common code patterns between them (lots).

The "drunken-blog" was the best reference, but it is no longer
available, although it is still in the google cache.

Rolf


2006\01\13@211045 by Jose Da Silva

On January 13, 2006 11:14 am, William Couture wrote:
> The company I work for is suing a Taiwanese company for making an
> illegal "clone" of one of our products.  I'm trying to help prove
> that the code in the ROM is actually ours.  This started before I
> joined the company, so I'm not familiar with all the details.
>
> There are a lot of versions of code for this, and I need to find the
> one they copied.

1. Use a disassembler to reverse the code into source code.
2. Use just the command sequence to quickly find-out which one they
copied by using the unix "diff" command:
xxxx        movlw        xxx        <-(the xxx stuff throws off a diff too quick)
xxxx        addwf        xxx
gets stripped-down to this:
movlw
addwf
etc...

This should probably give you the quickest answer to find the version
since you really don't care too much if they swapped RAM file registers
(this affects the byte values in the hex code) which is going to
throw-off your diff command quite quickly.
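
A minimal sketch of steps 1 and 2 in Python, assuming the disassembler writes lines of the form "address mnemonic operands" as in the example above. It uses the standard-library `difflib` in place of the unix diff command; the function names are illustrative.

```python
import difflib

def mnemonics(listing: str) -> list:
    """Keep only the instruction mnemonic from each 'address mnemonic operands' line."""
    out = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) >= 2:  # skip blank lines and lone labels
            out.append(fields[1].lower())
    return out

def listing_similarity(a: str, b: str) -> float:
    """Ratio in [0, 1]; 1.0 means the two mnemonic sequences are identical."""
    matcher = difflib.SequenceMatcher(None, mnemonics(a), mnemonics(b), autojunk=False)
    return matcher.ratio()
```

Running `listing_similarity` between the pirate listing and each candidate listing, then sorting by the score, should put the version they started from at or near the top.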

Let me know off-list if you want help.
http://www.JoesCat.com/micro/picchip.htm

2006\01\14@083709 by William Couture

On 1/13/06, Jose Da Silva wrote:

> On January 13, 2006 11:14 am, William Couture wrote:
> > There are a lot of versions of code for this, and I need to find the
> > one they copied.
>
> 1. Use a disassembler to reverse the code into source code.
> 2. Use just the command sequence to quickly find-out which one they
> copied by using the unix "diff" command:
> xxxx    movlw   xxx     <-(the xxx stuff throws off a diff too quick)
> xxxx    addwf   xxx
> gets stripped-down to this:
> movlw
> addwf
> etc...
>
> This should probably give you the quickest answer to find the version
> since you really don't care too much if they swapped RAM file registers
> (this affects the byte values in the hex code) which is going to
> throw-off your diff command quite quickly.

Some more info (some of it learned yesterday afternoon, after my
initial question):

The original source code is C for a 68HC11.  They did their changes
from a ROM.  So, I have to figure out if they just patched the ROM
image, or did they disassemble, change, and re-assemble.  Then,
maybe I can find the version they pirated...  Bleh...

Bill

--
Psst...  Hey, you... Buddy...  Want a kitten?  straycatblues.petfinder.org

2006\01\14@144612 by Peter


One way to compute the similarity of data is as follows:

1. turn each file into a binary image (a .bin file should work)
2. compute the DFT (via FFT) of the file using a fixed 'window'
3. compute the CoG (center of gravity) of each normalized DFT, using
frequency and amplitude as 2d space
4. sort the results from 3 by their distance from the result for the
reference file (the one being compared against).

Maybe I am not very clear in my explanation; ask for more. The algorithm
is used in image processing among other things (e.g. image recognition).
Also speech recognition and general pattern matching. Such an algorithm
should exist somewhere anyway. Step 3 can be replaced by other types of
calculations.
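
A rough sketch of one possible reading of steps 2 to 4 in Python. The steps leave a lot open, so the details here are assumptions: a plain DFT computed directly (slower than an FFT, same values up to rounding), magnitudes averaged over fixed windows, and the center of gravity taken over points (bin, amplitude) weighted by amplitude. All names are illustrative.

```python
import cmath
import math

def mean_spectrum(data: bytes, window: int = 64) -> list:
    """Average DFT magnitude for bins 1..window/2 over consecutive fixed-size windows."""
    half = window // 2
    # Precompute the DFT factors for each bin and sample position.
    tw = [[cmath.exp(-2j * math.pi * k * n / window) for n in range(window)]
          for k in range(1, half + 1)]
    acc = [0.0] * half
    count = 0
    for start in range(0, len(data) - window + 1, window):
        chunk = data[start:start + window]
        for k in range(half):
            acc[k] += abs(sum(x * w for x, w in zip(chunk, tw[k])))
        count += 1
    return [a / count for a in acc] if count else acc

def center_of_gravity(data: bytes, window: int = 64) -> tuple:
    """CoG of the normalized spectrum as (frequency coordinate, amplitude coordinate)."""
    spec = mean_spectrum(data, window)
    total = sum(spec)
    if total == 0:
        return (0.0, 0.0)
    norm = [a / total for a in spec]
    freq = sum((k + 1) * a for k, a in enumerate(norm))
    amp = sum(a * a for a in norm)
    return (freq, amp)

def rank_by_cog(reference: bytes, candidates: dict, window: int = 64) -> list:
    """Sort candidate names by how close their CoG lies to the reference CoG."""
    ref = center_of_gravity(reference, window)
    return sorted(candidates,
                  key=lambda name: math.dist(ref, center_of_gravity(candidates[name], window)))
```

This boils each file down to two numbers, so it can only serve as a coarse first filter; the files that rank closest still need a byte-level comparison.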

Peter

2006\01\14@160935 by Jose Da Silva

On January 14, 2006 05:37 am, William Couture wrote:

Bleh is correct since you are dealing with 1,2,3 or 4 byte codes.

2006\01\23@111031 by William Couture

I thought I'd follow up on this.

I've found the pirated file -- it turns out that I was searching the
wrong files: the .HEX files in the project directory, despite their
names, had nothing to do with the code (I still don't know why
they are there, the programmer is long gone).

The correct object code was stored in a Motorola S file.  Once I
found that out, it was fairly easy to search on them and find the
"correct" file by a simple similarity comparison.

The pirated version had 54 bytes out of 65536 changed, all of them
constant data.  I used the linkmap to figure out exactly what they
had changed and to what, and have given all the pertinent information
to my boss.
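
The exact comparison used isn't spelled out here. A minimal sketch of one simple possibility in Python: load each Motorola S-record file into a flat image and count the byte positions that differ from the pirated image. The function names, the 64 KB default size and the 0xFF fill are assumptions for illustration.

```python
def load_srec(text: str, size: int = 0x10000, fill: int = 0xFF) -> bytes:
    """Place the S1/S2/S3 data records of a Motorola S-record file into a flat image."""
    addr_len = {"1": 2, "2": 3, "3": 4}  # address bytes per data record type
    image = bytearray([fill]) * size
    for line in text.splitlines():
        line = line.strip()
        if len(line) < 4 or line[0] != "S" or line[1] not in addr_len:
            continue  # skip header, count and termination records
        rec = bytes.fromhex(line[2:])
        # First byte counts the bytes that follow; all bytes sum to 0xFF mod 256.
        if rec[0] != len(rec) - 1 or (sum(rec) & 0xFF) != 0xFF:
            raise ValueError(f"bad record: {line}")
        n = addr_len[line[1]]
        addr = int.from_bytes(rec[1:1 + n], "big")
        data = rec[1 + n:-1]
        if addr + len(data) <= size:  # skip data that falls outside the image
            image[addr:addr + len(data)] = data
    return bytes(image)

def differing_bytes(a: bytes, b: bytes) -> int:
    """Count positions where two equal-length images differ."""
    return sum(x != y for x, y in zip(a, b))

def closest(pirate: bytes, candidates: dict) -> str:
    """Name of the candidate image with the fewest bytes differing from the pirate image."""
    return min(candidates, key=lambda name: differing_bytes(pirate, candidates[name]))
```

A count of 54 out of 65536 for one candidate, against thousands for every other, would single out the source version in the way described above. This only works when the code was patched in place; inserted or deleted bytes shift everything after them and make the count meaningless.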

Thanks for the suggestions everyone!

Bill

--
Psst...  Hey, you... Buddy...  Want a kitten?  straycatblues.petfinder.org

2006\01\23@114755 by Alan B. Pearce

>The pirated version had 54 bytes out of 65536 changed,

The copyright information ???

2006\01\23@115905 by William Couture

On 1/23/06, Alan B. Pearce wrote:
> >The pirated version had 54 bytes out of 65536 changed,
>
> The copyright information ???

Copyright, unit name, software version, a couple of error
messages, and a few numeric constants (so it did not
"act like" our controller by default).

Bill

--
Psst...  Hey, you... Buddy...  Want a kitten?  straycatblues.petfinder.org
