Analysis of the Message

joha2 edited this page Aug 15, 2016 · 8 revisions

Some Analysis of the Message from the Receiver's Point of View, Using a Simple Python Decoder

Graphical Representation

For a message from ET it is useful to first obtain a graphical representation. This is done by translating the numbers into floats and filling in the missing numbers with NaNs. The decomposition into different lines is of course arbitrary, but it helps to reveal the gate pictures or other regular structures, which provide a first orientation for at least a human receiver ;-).
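Such a representation can be produced in a few lines of Python (a minimal sketch, not the actual decoder; the function name, the row width, and the toy message below are illustrative):

```python
# Minimal sketch: map the message symbols to floats and arrange them into
# rows of a chosen width, padding the last row with NaNs so that missing
# pixels stay blank when shown with e.g. matplotlib's imshow().
import numpy as np

def message_to_image(symbols, width):
    """Reshape a flat symbol sequence into a (rows, width) float array."""
    vals = np.array([float(s) for s in symbols], dtype=float)
    rows = -(-len(vals) // width)            # ceiling division
    img = np.full(rows * width, np.nan)      # NaN = missing pixel
    img[:len(vals)] = vals
    return img.reshape(rows, width)
```

Rendering `message_to_image(msg, width)` for different widths shows which choice makes the regular structures line up.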

message picture

TODO: Insert picture with pixel lines only up to delimiter symbol 2233

Zipf's Law

Another type of analysis to check whether the message contains a language (natural or constructed) is to test whether it obeys Zipf's law (at least approximately): if one counts the words (treating 2, 3, and combinations of them as separators) and ranks the words by frequency, one should obtain a power law (an approximately straight line in a log-log plot). This analysis has a few drawbacks:

  • Zipf's law in this context is not fully understood.
  • According to Wikipedia, the power law is only an approximation of the Yule–Simon distribution.
  • Since the description of Zipf's law is not very precise, it is not certain that the decoder implements it correctly.
  • What does the picture mean? Probably not more than: the most frequently used word occurs roughly twice as often as the second most frequently used word.
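Despite these caveats, the counting itself is straightforward. A minimal sketch, under two assumptions the page does not spell out: words are maximal runs between the separator symbols 2 and 3, and a, b are the parameters of a least-squares line log f = log a + b log r fitted in log-log space:

```python
# Sketch of a rank-frequency count and a log-log power-law fit
# f(r) ~ a * r**b (assumed form of the fit; not taken from the page).
import math
import re
from collections import Counter

def rank_frequency(text, separators="23"):
    """Counts of each word, sorted from most to least frequent."""
    words = [w for w in re.split("[%s]+" % separators, text) if w]
    return sorted(Counter(words).values(), reverse=True)

def loglog_fit(counts):
    """Least-squares fit of log10(freq) = log10(a) + b * log10(rank)."""
    xs = [math.log10(r) for r in range(1, len(counts) + 1)]
    total = sum(counts)
    ys = [math.log10(c / total) for c in counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = 10 ** (my - b * mx)
    return a, b
```

Plotting the ranked relative frequencies on log-log axes against `a * r**b` then gives pictures like the one below.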

Content of the picture:

  • Red dots indicate the ranked word-frequency content of CosmicOS (taken from index.txt); fit parameters a = 0.032796, b = -1.687575
  • Green dots indicate the same for a random text over the symbols 0 to 3: it shows a step-function-like behaviour rather than a line; a = 0.197912, b = -1.742364
  • Yellow dots show the distribution for the Moby Dick text taken from http://www.gutenberg.org, which is only approximately a line; a = 0.043911, b = -1.130787
  • According to the literature, b for Moby Dick is nearly equal to -1, which is typical for Zipf's law in its original formulation.

zipfs law

Entropy Analysis

An additional analysis, to decide whether the message is a purely random signal or contains information (i.e. dependencies between the symbols), is to compute the Shannon-Boltzmann entropy $S = -\sum_{i=1}^{N} p_i \log_N p_i$, where N is the number of characters in the alphabet and p_i is the probability (i.e. the relative frequency) of the i-th letter.

  • For a purely random signal all p_i are equal and S = 1.
  • For a signal containing only one character of an alphabet with N > 1, S = 0.
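These two limiting cases can be checked directly with a minimal implementation of the formula (function name illustrative):

```python
# Normalized Shannon entropy S = -sum_i p_i * log_N p_i of a symbol string,
# where N is the alphabet size, so that S lies between 0 and 1.
import math
from collections import Counter

def entropy(signal, alphabet_size):
    counts = Counter(signal)
    total = len(signal)
    return -sum((c / total) * math.log(c / total, alphabet_size)
                for c in counts.values())
```

For example, `entropy("0123", 4)` gives 1 (all four symbols equally frequent), while `entropy("1111", 4)` gives 0 (only one symbol used).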

For the following pictures, n-grams (strings of n characters) were also analysed for their frequencies. We plotted the entropy against the n-gram length for several 'codes': Moby Dick (blue), CosmicOS (green), the METI message (magenta) in different encodings, and a random signal (red) containing only the characters 0, 1, 2, 3.
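Curves of this kind could be reproduced along the following lines. One caveat: the page does not state which normalization was used for n-grams; normalizing by the number of possible n-grams, alphabet_size ** n, is my assumption.

```python
# Entropy of overlapping n-grams: every length-n substring counts as one
# "character"; normalizing by log(alphabet_size ** n) keeps S in [0, 1]
# (assumed normalization -- the page does not specify it).
import math
from collections import Counter

def ngram_entropy(signal, n, alphabet_size):
    grams = [signal[i:i + n] for i in range(len(signal) - n + 1)]
    counts = Counter(grams)
    total = len(grams)
    base = alphabet_size ** n
    return -sum((c / total) * math.log(c / total, base)
                for c in counts.values())
```

Evaluating `ngram_entropy` for n = 1, 2, 3, ... on each signal and plotting the results gives a curve per 'code', as in the picture below.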

entropy

As you can see:

  • The random signal always has an entropy near 1 for all n.
  • Another random signal with Poisson-distributed characters 0123 (p = 0.1) also has a very distinct shape in its entropy plot.
  • The Moby Dick signal has a local minimum and quickly goes up to 1, which means that very long n-grams (>20 characters) are nearly equally distributed and carry no information anymore; further, it decays very fast to zero a little after n = 20, which means that there were simply no words of this length consisting only of the characters `0123456789abcde...'.
  • The CosmicOS signal has a broad local minimum (indicating information-carrying blocks of varying length) and grows slowly up to one, which means that even very long n-grams carry some information (think of the gate graphics). There are also some small dips.
  • The METI message shows nearly the same behaviour as the CosmicOS code, but there are spikes in the entropy. My guess is that these reflect the block length of the signal (the METI message is encoded as 8-character blocks containing the numbers 0 to 7); therefore the deepest spike appears at n = 8.
  • Since no signal like 111111... appears, no curve reaches zero entropy.