Kolmogorov complexity - a short, informal introduction


My aim here is to give you an intuition about Kolmogorov complexity, without being very formal.

Basically, Kolmogorov complexity measures how much a given string of characters can be compressed. You can think of it as measuring how small you can make a file (a string of characters) by compressing it with ZIP or a similar tool. A file storing very redundant information (like "aaaaaaaaaaaaaaaa....") can be compressed to a much smaller size than a random one.
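
To make this concrete, here is a minimal sketch in Python (the language choice is mine, just for illustration) that feeds a redundant string and random bytes to the standard zlib compressor:

    import os
    import zlib

    redundant = b"a" * 100_000   # very redundant data
    noise = os.urandom(100_000)  # random data

    print(len(zlib.compress(redundant)))  # a few hundred bytes at most
    print(len(zlib.compress(noise)))      # roughly 100_000, possibly slightly more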

For Kolmogorov complexity, the "compressed" form is something like a self-extracting archive: the shortest program that can output the initial string of characters (the "file/string to be compressed"). The length of that shortest program is the string's Kolmogorov complexity.
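
As a toy example of such a "self-extracting" form (a sketch, with Python standing in for the reference machine), a string of a million 'a' characters is fully described by this one-line program, which is vastly shorter than its output:

    print("a" * 1_000_000)  # outputs the original million-character string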

Kolmogorov complexity is more general than classical compression algorithms, and subsumes them. For example, a Huffman code cannot do much with a string like "123456789101112131415161718192021222324..." (consecutive numbers concatenated), while a very short program can generate an arbitrarily long string of this form. On the other hand, any traditional compressed archive can be embedded as a constant in a program that extracts it. So the Kolmogorov complexity of a string/file is at most the size of any archive you manage to produce plus the size of the decompressing program.
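
Here is a sketch of that first point (Python again, and the sizes are indicative, not exact): the generating program stays a few lines no matter how long the output grows, while a general-purpose compressor like zlib only shrinks the string by some factor:

    import zlib

    def consecutive_numbers(n):
        # Concatenate the decimal forms of 1, 2, ..., n into one string.
        return "".join(str(i) for i in range(1, n + 1))

    s = consecutive_numbers(100_000)
    print(len(s))                          # hundreds of thousands of characters
    print(len(zlib.compress(s.encode())))  # smaller, but still grows with the input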



Kolmogorov complexity cannot be much larger than the string to be compressed, because you can always store the string as a constant S="......" and just execute "print S" after. On the other hand, not all strings can be compressed below their initial size. If every "compressed" program were shorter than the string it encodes, there would be fewer possible compressed forms than strings, so two different strings would have to share the same compressed form and you could not tell them apart. Therefore, some strings can be compressed below their initial size only at the expense of other strings whose "compressed" form is slightly longer than the original.
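
The counting argument can be checked directly (a small sketch; I use binary strings for simplicity):

    n = 20
    strings_of_length_n = 2 ** n                          # 1_048_576 distinct strings
    shorter_descriptions = sum(2 ** i for i in range(n))  # 2**n - 1 of them
    # So at least one string of length n has no shorter description:
    print(strings_of_length_n - shorter_descriptions)     # prints 1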

Such a measure is specific to a particular computing machine (usually a universal Turing machine); however, it has been proven that the measures given by two different machines differ by at most a constant, which does not depend on the string being compressed. The idea is that, for a very long string of characters, the Kolmogorov complexity will not differ much in relative terms, because the constant counts for less and less as the string grows.
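
Stated compactly (my notation: K_U and K_V are the complexities measured on machines U and V), the claim is that there is a constant c, depending only on the pair of machines and not on the string, such that

    |K_U(s) - K_V(s)| <= c    for every string s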

Unfortunately, it is proven that computing the Kolmogorov complexity is impossible in general, even in theory (although for some very particular small strings it might be feasible). Basically, this is because you cannot simply try all the smaller candidate programs: some of them may run indefinitely, and there is no way to know when one has run "long enough" to be sure it will never produce the string we want to compress, since the halting problem for Turing machines is undecidable in the general case.

However, you can know that the Kolmogorov complexity of a string S is "at most X" as soon as you find a program of length X that outputs S. A smaller program may still exist, though, and you generally cannot know when you have found the smallest one, except in some very particular cases where you have actually tried all the shorter programs and proved that each of them either halts with a different output or cycles forever.
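
In that spirit, any off-the-shelf compressor gives you an "at most X" bound: the compressed data embedded in a small fixed decompressing program is itself a program that outputs S. A rough sketch (the overhead constant below is an assumption for illustration, not a measured value):

    import zlib

    def k_upper_bound(s: bytes) -> int:
        # Size of the compressed payload plus an assumed fixed allowance
        # for a "decompress and print" stub wrapped around it.
        STUB_OVERHEAD = 100  # hypothetical constant; depends on the machine
        return len(zlib.compress(s, 9)) + STUB_OVERHEAD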

Some very compressible sequences are very tricky to identify without a hint. For example, the string "41421356237309504880168872420969807856967187537694807317667973..." looks random, yet it is simply the decimal digits of the square root of 2. The same short program can generate as many of these digits as we decide to include in the "string to be compressed".
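
A sketch of such a generating program (Python once more; math.isqrt computes an exact integer square root, so the digits are exact):

    from math import isqrt

    def sqrt2_digits(n):
        # floor(sqrt(2) * 10**n) is a 1 followed by the first n decimals of sqrt(2)
        return str(isqrt(2 * 10 ** (2 * n)))[1:]

    print(sqrt2_digits(62))  # 41421356237309504880168872420969807856967187537694807317667973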

This "informational" complexity have philosophical implications. For example, given an apparently very complex set of observations, we can never be sure there is no simple theory that explains them all.


Dear reader, please leave a message if you exist! ;) Also, please share this article if you find it interesting. Thank you.
