In [1]:

```
%%HTML
<style> code {background-color : lightgrey !important;} </style>
```

**bruno cuconato**, and was originally published at my blog. it is licensed as a Creative Commons Attribution-ShareAlike 4.0 International License.

this notebook explains how an anti-SPAM schema would work using a variant of the hashcash algorithm.

first, we need to know how a cryptographic hash function works. we don't need to know exactly *how* it is calculated to know *what* it does: it takes an input such as a sentence or a number* and outputs a seemingly random combination of numbers and letters.

I say seemingly random because the result is actually **deterministic**: every time you input the word cat to the hash function `md5`

, for example, you get `d077f244def8a70e5ea758bd8352fcd8`

. it doesn't matter which computer you use, the result is always the same.**

if you input `cate`

, though, the answer is `8540d0d7e1d8111279b63983697e2e37`

, which is totally unrelated to the previous output, even if the input is almost the same. these are nice properties of hash functions: we can't really guess what an output to a given input will be, and we can't try to guess the answer by trying similar inputs.

if you're curious about what is the combination of numbers (0-9) and letters (a-f) outputed by a hash function, keep on reading this section, else jump to the next section.

something like `8540d0d7e1d8111279b63983697e2e37`

is actually a number. "how come?" you may ask, "it's not made up of other numbers!". well, yes and no. do you remember how we were taught at school that the babylonians used a different numeral system? they had 60 (!) instead of ten (0-9) numerical digits, so theirs was a base-60 number system. you may also have heard that computers store information in only two numerical digits: 0 and 1. that's the binary system. same as with any unit of measurement, all of these are arbitrary and equivalent (although having ten fingers may make the decimal system easier to learn, if you're taught any of these systems at an early age all of them will seem pretty easy to you).

our beloved decimal system works this way: any number is actually a sum of bases of ten. the first (from right to left) digit gets multiplied by $10^0$, the second by $10^1$, and so on. the number 1789 is actually

$$1789 \equiv 9 \times 10^0 + 8 \times 10^1 + 7 \times 10^2 + 1 \times 10^3 \equiv 9 + 80 + 700 + 1000 \equiv (1789)_{10}$$binary works the same way, but with bases of 2 instead of 10: $(10110)_2$ is

$$(10110)_2 \equiv 0 \times 2^0 + 1 \times 2^1 + 1 \times 2^2 + 0 \times 2^3 + 1 \times 2^4 \equiv 0 + 2 + 4 + 0 + 16 \equiv (22)_{10}$$we write $(10110)_2$ so as not confuse the reader with the number $(10110)_{10}$, i.e., ten thousand one hundred and ten, usually written as simple 10110, as the decimal system is the conventional system.

now you probably see where this is going: hashes are written as hexadecimal numbers, that is, numbers of base sixteen. as we don't have numerical digits for numbers 10-15 as we do for 0-9, we use the first six letters of the alphabet (a-f). thus, the number $(1f40)_{16}$ is written in hexadecimal as

$$ (1f40)_{16} \equiv 0 \times 16^0 + 4 \times 16^1 + 15 \times 16^2 + 1 \times 16^3 \equiv 0 + 64 + 3840 + 4096 \equiv (8000)_{10}$$* in computers, everything is actually a number, so it's all the same.

** you can run an `md5`

on linux by typing `echo -n cat | md5sum`

on the terminal. source: the `-n`

flag is necessary or the `echo`

command will print `cat\n`

instead of `cat`

(`\n`

means a newline).

what a hash function does: source: original work for Wikipedia

hash functions are used to create a digestsa kind of unique identifier, such as fingerprint for humans) of a long stream of data. suppose you have written a very long text on a public computer and forgot to save it on your pendrive, but you did save its hash, which was `bf9d8e46b4edcd49fd5993b8a4105b54`

. on the following day you come back to the public computer to save your text, and thankfully it is still there. but how can you know if no one has changed it? if it were a very small change, you would have to skim through every letter to make sure everything is the same. but you can also hash the text again: if the output is the same, it is very likely that the text is unchanged.

before internet connection speeds and reliability improved, it was common to download a corrupted file from the internet, specially if what you were trying to download was a big file. to make sure you knew when this happened, the page which offered the download usually offered its hash too. that way you could hash whatever you had downloaded, and if the hash matched with the one on the webpage, it meant that your file was likely to be uncorrupted.

you may have noticed that I did not say that when the output of two hash functions are the same it means that their input is the same. this is because there are **infinite** possible inputs to a hash function, but its output will always be a number below a certain value. this implies that it is impossible for a hash function to garantee that every input will get a unique hash.

the difference between a cryptographic and a non-cryptographic hash function is that it is easier to find hash collisions on the latter than on the former. a hash collision, as you may have guessed, is when two inputs have the same hash/digest. the `md5`

hash function is not considered to be collision resistant anymore, so its cryptographic use is not recommended.

if we take another hash function, such as `sha256`

, we can see why it is overwhelmingly unlikely that two different inputs will map to the same output. the `sha256`

takes any input and outputs a number between zero and $2^{256}$. that is so because the output of the `sha256`

hash function is a 64-digit hexadecimal number. as each digit in a hexadecimal number can take 16 values, we have $16^{64}$ possibilities, which is the same as $2^{256}$ possibilities. $2^{256}$ is **huge** number, so we would have to try a **lot** of inputs before finding two different ones with the same output.

this is so because of the property of cryotographic hash functions that we have discussed before: there is no (known) way of choosing an input that matches a given hash, so the best someone can do if they want a particular hash, say `cbdc2cb757153cb0da7501baf45c076112501f88c35d99395c93acdbcae83a8b`

, is to randomly try **several** values until a suitable one is found. because there are $2^{256}$ possible values in a `sha256`

function, no one is likely to succeed at doing so in a reasonable amount of time (less than several thousand years with current hardware).

In [2]:

```
import hashlib
message = b"nobody expects the Spanish inquisition" #http://stackoverflow.com/questions/6269765/what-does-the-b-character-do-in-front-of-a-string-literal
hash_digest = hashlib.sha256(message).hexdigest()
print(hash_digest)
```

by using a schema called proof-of-work, invented by adam back back in 1997. (the version employed here is a simplified variant).

although we can't find an input that will map exactly to a certain hash, it is relatively easy to find an input that hashes to a *similar* digest --- depending on how you define similar. this is what adam back called a *partial hash collision*. we could define a similar digest as a hash that shares a certain number of digits with the desired hash, or as a hash which is not far from the desired output, as in $| hash\_candidate - desired\_hash| < 10^5$.

in a proof-of-work schema like the one adam back imagined or the one used in the Bitcoin protocol, though, the partial hash collision sought is defined as finding a hash (remember, a hash is a hexadecimal number) below a certain target. in the case of `sha256`

, finding one target among $2^{256}$ possibilities is expected to take centuries long using today's computer resources, but finding a digest below a certain target is much more feasible. this is called a proof-of-work because if you have a hash matching this requirement, it means that on average you had to perform a certain amount of work.

imagine a simple hash functions that takes any input and maps it to a number between 1 and 100. if I need to find some input that hashes to exactly 42, it will take me longer than finding an input that hashes to any number below 10. it may happen that I find a hash below 10 on my first try, but on average this will take me 10 tries. so if I present you an input that hashes to a number below 10, you know that I have probably performed work corresponding to 10 tries.

In [3]:

```
email = {
"content": {"from": "alien@at.ed",
"to": "mother@earth.tr",
"subject": "help NEEDED",
"body": "hello, world. do you need help not destroying yourself? sincerely, concerned aliens.",
"timestamp": 1103031175, #in unix format, as this is easier to update than other formats
"nonce": 0},
"digest": None #initialize with empty value
}
```

in this new email protocol, an email server would only accept a message as valid if:

its content, when hashed, matched its digest;

its digest is below a certain threshold;

its timestamp is not too far in the future or in the past.

a valid message is forwarded to its recipient normally, an invalid message is rejected.

the sender writes his email normally. when he clicks the button send, his computer creates an email object such as the one shown above, and hashes its content:

In [4]:

```
import json
def hash_email_content(email):
content = json.dumps(email['content'], sort_keys=True) #http://stackoverflow.com/questions/17412304/hashing-an-array-or-object-in-python-3
digest = hashlib.sha256(content.encode('utf-8')).hexdigest()
return digest
print(hash_email_content(email))
```

let's say the threshold is `0x0000ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff`

, so that any digest with at least four zeroes as starting digits is considered valid.

the digest found in the above step is then evaluated against the chosen threshold and it does not pass.

the sender's computer then increments the nonce `email['content']['nonce']`

, and hashes the email content again, and checks if the digest is valid. if it is not, the computer tries it all over again until a valid digest is found.

In [5]:

```
def find_valid_digest(email, target=4):
target_digest = int('0'*target + 'f'*(64-target), 16) #create target value in hexadecimal representation
digest = int(hash_email_content(email), 16) # get initialization digest value and make it decimal int
while digest > target_digest:
email['content']['nonce'] += 1
digest = int(hash_email_content(email), 16)
digest = hex(digest)[2:] #[2:] is to remove python's 0x prefix
email['digest'] = '{:0>64}'.format(digest) #pads the leading zeroes if hexadecimal is less than 64 characters
return email
```

In [6]:

```
%time find_valid_digest(email)
```

Out[6]:

after 2 seconds (in this case; it might take more or less time), a valid digest is found, and the email is sent.

if you are a regular user, sending a few emails won't take long and won't cost you much electricity. but if you are a spammer, sending thousands of emails per second will now cost you a **lot** more money in electricity and in time, decreasing your expected profit. that way, you are unlikely to continue engaging in such a questionable activity.

this schema works because if a spammer ignores the hash requirements (trying not to spend computational resources), his emails will end up in the server's bin. if his emails never reach the intended recipients, his expected profit goes to zero, and he has no incentive to continue spamming.

if, however, the spammer decides to continue spamming for a profit, he must send valid emails. in order to do so, he now has to spend computational resources (electricity and hardware) to send his emails. this puts a price tag in the previously virtually free emails he could send. now he has to factor in this cost, which is likely to reduce his profit to a loss, or at least to reduce his output to a few hundred thousand emails a day*** using only his personal computer.

so that the same email cannot be sent several times to the same person without calculating a valid digest every single time. with the timestamp included, an email server can deem an email invalid if its timestamp is too late or not soon enough.**** with the timestamp included in the email digest, a spammer can not simply find a valid hash and change the timestamp to a more suitable one, as doing so would change the digest to one unlikely to be valid.

*** this may seem like a lot of email, but without the proof-of-work schema, a spammer can send this many emails in minutes.

**** we can imagine an email client choosing a timestamp for an email a little bit far in the future, already knowing that calculating a valid digest will take a few seconds.