In many real-world datasets, the first digit of a value is far more likely to be 1 than any other digit, and 9 is the least likely. This interesting occurrence is well known as Benford's law.

Data rot is an essential thing to be aware of if you have a hard disk full of memories and mixes chucked away somewhere. Take it out, refresh the data, add checksums to it, and take redundant backups. Repeat the process by comparing checksums once in a while. :) As someone who lost data (not to data rot, though), I know the pain!
To verify checksums on Windows, there is a handy program called `certUtil` that you can use from the command line. With the `-hashfile` parameter you can use MD5 or SHA256 to get the hash. Use it like `certUtil -hashfile path/myarchive.zip md5`.

On Linux, it's `md5sum path/myarchive.zip`.
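If you would rather script the "compare checksums once in a while" step, here is a minimal sketch in Python using the standard `hashlib` module. The function names `file_checksum` and `verify_checksum` are made up for this example, and the default of SHA-256 is my assumption; pass `"md5"` to match the commands above.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks.

    Reading in chunks keeps memory use flat even for large archives.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected_hex, algorithm="sha256"):
    """Return True if the file's digest matches a previously stored one."""
    return file_checksum(path, algorithm) == expected_hex.strip().lower()
```

The idea is to store the output of `file_checksum("path/myarchive.zip")` next to the backup once, and later call `verify_checksum` with that stored value. A `False` result means the file no longer matches what you originally saved.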
Coming back to Benford's law, the probability that the first digit is d, for d from 1 to 9, is:

$$ P(d)=\log_{10}(d+1)-\log_{10}(d) $$
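To see what the formula gives for each digit, here is a small Python sketch that evaluates it directly. The function name `benford_probability` is just a label chosen for this example.

```python
import math


def benford_probability(d):
    """Probability that the first digit is d, per the formula above."""
    if not 1 <= d <= 9:
        raise ValueError("first digit must be between 1 and 9")
    return math.log10(d + 1) - math.log10(d)


if __name__ == "__main__":
    for d in range(1, 10):
        print(f"{d}: {100 * benford_probability(d):.1f}%")
    # Prints 30.1%, 17.6%, 12.5%, 9.7%, 7.9%, 6.7%, 5.8%, 5.1%, 4.6%
    # for the digits 1 through 9, one per line.
```

The nine probabilities add up to 1, because the sum of the differences collapses to log10(10) - log10(1).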
The intuition behind this observation is that a steadily growing value in a real-world dataset spends a long time with its first digit as 1, and the time shortens for every subsequent digit up to 9, where it is the shortest. For example, for a value to grow from 100 to 200, it needs to double. But once it reaches 900, the transition to 1000 is very quick, as it is just an 11% growth, compared to the 100% growth from 100 to 200.
In other words, at a steady growth rate, the first digit changes slowly at the beginning (when it is 1) and progressively faster as it approaches 9.
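This intuition is easy to check with a toy simulation. The sketch below grows a value by a fixed percentage per step and counts how often each first digit shows up. The 1% rate, the 10,000 steps, and the function names are arbitrary choices for illustration, not anything special to Benford's law.

```python
from collections import Counter


def leading_digit(value):
    """First significant digit of a positive number.

    Scientific notation always starts with a digit from 1 to 9,
    e.g. 900 -> '9.000000e+02'.
    """
    return int(f"{value:e}"[0])


def simulate_growth(rate=0.01, steps=10_000):
    """Grow a value by `rate` per step and return how often each
    first digit appeared, as fractions of all steps."""
    counts = Counter()
    value = 1.0
    for _ in range(steps):
        counts[leading_digit(value)] += 1
        value *= 1 + rate
    return {d: counts[d] / steps for d in range(1, 10)}


if __name__ == "__main__":
    for digit, share in simulate_growth().items():
        print(f"{digit}: {share:.3f}")
```

With these settings the share for digit 1 should come out close to 0.30 and fall steadily towards digit 9, which is what the formula predicts for a value growing at a constant rate.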