Steve Krenzel - 16. Algorithms: MD5 Cycles

MD5 Cycles

I recently submitted a python solution to the Kember Identity Project. I wasn't striving for speed, but rather clarity, simply because the first implementation that was up was fairly convoluted (with all due respect to the author). You'll see I also used a for loop instead of "while True". If you wanted to run this script for real and never have it stop, you'd just replace line 8 with "while True:".

Effectively we're just trying to find a solution for MD5(x) = x by trying random x's.

On my machine I was able to test around 300,000 hashes a second. One of the ways this speed was achieved was by assuming that the MD5 function is a sufficiently random function. Once we have this assumption we can use the output of the MD5 function as it's input the next time since we are just trying random strings until we find a solution. If we used python's random library to generate a new random string on every iteration, the performance drops almost five fold.

I started wondering if we couldn't make that assumption. What if there existed a case where we got into an infinite loop like:

MD5(A) = B

MD5(B) = C

MD5(C) = D

MD5(D) = E

MD5(E) = C

We'd have a cycle of length three, effectively causing the code to keep looping through the same values and never realizing it. The Kember Identity is a special case of the above scenario, where the cycle is of length 1.

So I asked myself, "Suppose we wanted to find a cycle of any length, rather than of just length 1."

The first immediate problem you might realize is that the MD5 function space has lots of values. The cycle could be 50 trillion numbers long. This large size immediately knocks out any possibility of storing all of the hashes we compute. So what can we do?

We can use the same trick that is used to find loops in linked lists. If you need to find a loop in a linked list you can do so with just two pointers. You simply have one pointer move forward two nodes every time the other moves forward one. Because the one pointer is moving forward twice as fast as the other pointer, if there are any loops the faster pointer will eventually pass the slower pointer. We just need to check if that happens, and if it does, we know that the list has a loop.

We can apply this same concept to finding cycles in the MD5 function while needing only two variables. We have the faster variable move forward two hashes on every iteration, and then we move the slower variable forward one hash at a time. Now our code will find cycles of any length and stop when it finds one.

Here is the code :

#!/usr/bin/python
from random import randint
import md5

a = b = md5.new("0").hexdigest()
while True:
    for i in xrange(2):
        a = md5.new(a).hexdigest()
        if a == b:
            print a
            break
    b = md5.new(b).hexdigest()

About the author
I'm Steve Krenzel, a software engineer and co-founder of Thinkfuse. Contact me at steve@thinkfuse.com.