In [1]:
%%HTML
<style> code {background-color : lightgrey !important;} </style>

this text was written by bruno cuconato, and was originally published at my blog. it is licensed as a Creative Commons Attribution-ShareAlike 4.0 International License.

how could we stop SPAM without SPAM filters?

this notebook explains how an anti-SPAM schema would work using a variant of the hashcash algorithm.

what is a hash function?

first, we need to know how a cryptographic hash function works. we don't need to know exactly how it is calculated to know what it does: it takes an input such as a sentence or a number* and outputs a seemingly random combination of numbers and letters.

I say seemingly random because the result is actually deterministic: every time you input the word cat to the hash function md5, for example, you get d077f244def8a70e5ea758bd8352fcd8. it doesn't matter which computer you use, the result is always the same.**

if you input cate, though, the answer is 8540d0d7e1d8111279b63983697e2e37, which is totally unrelated to the previous output, even if the input is almost the same. these are nice properties of hash functions: we can't really guess what an output to a given input will be, and we can't try to guess the answer by trying similar inputs.

numeral systems

if you're curious about what is the combination of numbers (0-9) and letters (a-f) outputed by a hash function, keep on reading this section, else jump to the next section.

something like 8540d0d7e1d8111279b63983697e2e37 is actually a number. "how come?" you may ask, "it's not made up of other numbers!". well, yes and no. do you remember how we were taught at school that the babylonians used a different numeral system? they had 60 (!) instead of ten (0-9) numerical digits, so theirs was a base-60 number system. you may also have heard that computers store information in only two numerical digits: 0 and 1. that's the binary system. same as with any unit of measurement, all of these are arbitrary and equivalent (although having ten fingers may make the decimal system easier to learn, if you're taught any of these systems at an early age all of them will seem pretty easy to you).

our beloved decimal system works this way: any number is actually a sum of bases of ten. the first (from right to left) digit gets multiplied by $10^0$, the second by $10^1$, and so on. the number 1789 is actually

$$1789 \equiv 9 \times 10^0 + 8 \times 10^1 + 7 \times 10^2 + 1 \times 10^3 \equiv 9 + 80 + 700 + 1000 \equiv (1789)_{10}$$

binary works the same way, but with bases of 2 instead of 10: $(10110)_2$ is

$$(10110)_2 \equiv 0 \times 2^0 + 1 \times 2^1 + 1 \times 2^2 + 0 \times 2^3 + 1 \times 2^4 \equiv 0 + 2 + 4 + 0 + 16 \equiv (22)_{10}$$

we write $(10110)_2$ so as not confuse the reader with the number $(10110)_{10}$, i.e., ten thousand one hundred and ten, usually written as simple 10110, as the decimal system is the conventional system.

now you probably see where this is going: hashes are written as hexadecimal numbers, that is, numbers of base sixteen. as we don't have numerical digits for numbers 10-15 as we do for 0-9, we use the first six letters of the alphabet (a-f). thus, the number $(1f40)_{16}$ is written in hexadecimal as

$$ (1f40)_{16} \equiv 0 \times 16^0 + 4 \times 16^1 + 15 \times 16^2 + 1 \times 16^3 \equiv 0 + 64 + 3840 + 4096 \equiv (8000)_{10}$$

* in computers, everything is actually a number, so it's all the same.

** you can run an md5 on linux by typing echo -n cat | md5sum on the terminal. source: the -n flag is necessary or the echo command will print cat\n instead of cat (\n means a newline).


what a hash function does: what a hash function does source: original work for Wikipedia

what are hash functions used for?

hash functions are used to create a digestsa kind of unique identifier, such as fingerprint for humans) of a long stream of data. suppose you have written a very long text on a public computer and forgot to save it on your pendrive, but you did save its hash, which was bf9d8e46b4edcd49fd5993b8a4105b54. on the following day you come back to the public computer to save your text, and thankfully it is still there. but how can you know if no one has changed it? if it were a very small change, you would have to skim through every letter to make sure everything is the same. but you can also hash the text again: if the output is the same, it is very likely that the text is unchanged.

before internet connection speeds and reliability improved, it was common to download a corrupted file from the internet, specially if what you were trying to download was a big file. to make sure you knew when this happened, the page which offered the download usually offered its hash too. that way you could hash whatever you had downloaded, and if the hash matched with the one on the webpage, it meant that your file was likely to be uncorrupted.

you may have noticed that I did not say that when the output of two hash functions are the same it means that their input is the same. this is because there are infinite possible inputs to a hash function, but its output will always be a number below a certain value. this implies that it is impossible for a hash function to garantee that every input will get a unique hash.

the difference between a cryptographic and a non-cryptographic hash function is that it is easier to find hash collisions on the latter than on the former. a hash collision, as you may have guessed, is when two inputs have the same hash/digest. the md5 hash function is not considered to be collision resistant anymore, so its cryptographic use is not recommended.

if we take another hash function, such as sha256, we can see why it is overwhelmingly unlikely that two different inputs will map to the same output. the sha256 takes any input and outputs a number between zero and $2^{256}$. that is so because the output of the sha256 hash function is a 64-digit hexadecimal number. as each digit in a hexadecimal number can take 16 values, we have $16^{64}$ possibilities, which is the same as $2^{256}$ possibilities. $2^{256}$ is huge number, so we would have to try a lot of inputs before finding two different ones with the same output.

this is so because of the property of cryotographic hash functions that we have discussed before: there is no (known) way of choosing an input that matches a given hash, so the best someone can do if they want a particular hash, say cbdc2cb757153cb0da7501baf45c076112501f88c35d99395c93acdbcae83a8b, is to randomly try several values until a suitable one is found. because there are $2^{256}$ possible values in a sha256 function, no one is likely to succeed at doing so in a reasonable amount of time (less than several thousand years with current hardware).

In [2]:
import hashlib

message = b"nobody expects the Spanish inquisition" #http://stackoverflow.com/questions/6269765/what-does-the-b-character-do-in-front-of-a-string-literal

hash_digest = hashlib.sha256(message).hexdigest()

print(hash_digest)
cbdc2cb757153cb0da7501baf45c076112501f88c35d99395c93acdbcae83a8b

how can we stop SPAM with hash functions?

by using a schema called proof-of-work, invented by adam back back in 1997. (the version employed here is a simplified variant).

although we can't find an input that will map exactly to a certain hash, it is relatively easy to find an input that hashes to a similar digest --- depending on how you define similar. this is what adam back called a partial hash collision. we could define a similar digest as a hash that shares a certain number of digits with the desired hash, or as a hash which is not far from the desired output, as in $| hash\_candidate - desired\_hash| < 10^5$.

in a proof-of-work schema like the one adam back imagined or the one used in the Bitcoin protocol, though, the partial hash collision sought is defined as finding a hash (remember, a hash is a hexadecimal number) below a certain target. in the case of sha256, finding one target among $2^{256}$ possibilities is expected to take centuries long using today's computer resources, but finding a digest below a certain target is much more feasible. this is called a proof-of-work because if you have a hash matching this requirement, it means that on average you had to perform a certain amount of work.

imagine a simple hash functions that takes any input and maps it to a number between 1 and 100. if I need to find some input that hashes to exactly 42, it will take me longer than finding an input that hashes to any number below 10. it may happen that I find a hash below 10 on my first try, but on average this will take me 10 tries. so if I present you an input that hashes to a number below 10, you know that I have probably performed work corresponding to 10 tries.

now imagine the email protocol worked as follows, such that every mail was composed of these parts: its content (from, to, subject, body, timestamp, nonce) and its digest.

In [3]:
email = {
    "content": {"from": "alien@at.ed",
                "to": "mother@earth.tr",
                "subject": "help NEEDED",
                "body": "hello, world. do you need help not destroying yourself? sincerely, concerned aliens.",
                "timestamp": 1103031175, #in unix format, as this is easier to update than other formats
                "nonce": 0},
    "digest": None #initialize with empty value
}

in this new email protocol, an email server would only accept a message as valid if:

  • its content, when hashed, matched its digest;

  • its digest is below a certain threshold;

  • its timestamp is not too far in the future or in the past.

a valid message is forwarded to its recipient normally, an invalid message is rejected.

how is a valid message produced?

the sender writes his email normally. when he clicks the button send, his computer creates an email object such as the one shown above, and hashes its content:

In [4]:
import json

def hash_email_content(email):
    content = json.dumps(email['content'], sort_keys=True) #http://stackoverflow.com/questions/17412304/hashing-an-array-or-object-in-python-3
    digest = hashlib.sha256(content.encode('utf-8')).hexdigest()
    return digest

print(hash_email_content(email))
fa092a8600610b55b00c770f327367cef0d23809aa459f23035d39e8e3f33514

let's say the threshold is 0x0000ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff, so that any digest with at least four zeroes as starting digits is considered valid.

the digest found in the above step is then evaluated against the chosen threshold and it does not pass.

the sender's computer then increments the nonce email['content']['nonce'], and hashes the email content again, and checks if the digest is valid. if it is not, the computer tries it all over again until a valid digest is found.

In [5]:
def find_valid_digest(email, target=4):
    target_digest = int('0'*target + 'f'*(64-target), 16) #create target value in hexadecimal representation
    digest = int(hash_email_content(email), 16) # get initialization digest value and make it decimal int
    while digest > target_digest:
        email['content']['nonce'] += 1
        digest = int(hash_email_content(email), 16)
    digest = hex(digest)[2:] #[2:] is to remove python's 0x prefix
    email['digest'] = '{:0>64}'.format(digest) #pads the leading zeroes if hexadecimal is less than 64 characters
    return email
In [6]:
%time find_valid_digest(email)
CPU times: user 1.18 s, sys: 0 ns, total: 1.18 s
Wall time: 1.18 s
Out[6]:
{'content': {'body': 'hello, world. do you need help not destroying yourself? sincerely, concerned aliens.',
  'from': 'alien@at.ed',
  'nonce': 18258,
  'subject': 'help NEEDED',
  'timestamp': 1103031175,
  'to': 'mother@earth.tr'},
 'digest': '0000c789e915a37b4d6c1ebee5b7aa666fc5ac959f9e83d1177a4ad243e6b74d'}

after 2 seconds (in this case; it might take more or less time), a valid digest is found, and the email is sent.

how does this stop SPAM?

if you are a regular user, sending a few emails won't take long and won't cost you much electricity. but if you are a spammer, sending thousands of emails per second will now cost you a lot more money in electricity and in time, decreasing your expected profit. that way, you are unlikely to continue engaging in such a questionable activity.

this schema works because if a spammer ignores the hash requirements (trying not to spend computational resources), his emails will end up in the server's bin. if his emails never reach the intended recipients, his expected profit goes to zero, and he has no incentive to continue spamming.

if, however, the spammer decides to continue spamming for a profit, he must send valid emails. in order to do so, he now has to spend computational resources (electricity and hardware) to send his emails. this puts a price tag in the previously virtually free emails he could send. now he has to factor in this cost, which is likely to reduce his profit to a loss, or at least to reduce his output to a few hundred thousand emails a day*** using only his personal computer.

why do you include a timestamp in the email content?

so that the same email cannot be sent several times to the same person without calculating a valid digest every single time. with the timestamp included, an email server can deem an email invalid if its timestamp is too late or not soon enough.**** with the timestamp included in the email digest, a spammer can not simply find a valid hash and change the timestamp to a more suitable one, as doing so would change the digest to one unlikely to be valid.

*** this may seem like a lot of email, but without the proof-of-work schema, a spammer can send this many emails in minutes.

**** we can imagine an email client choosing a timestamp for an email a little bit far in the future, already knowing that calculating a valid digest will take a few seconds.