Introduction to Data Engineering and Information Theory
This presentation covers the fundamentals of data engineering and information theory: data communication, sharing, storage, and archiving. It explores key concepts such as minimal data representation, corruption prevention, historical milestones in communication technology, Claude Shannon's contributions, entropy, source coding, run-length encoding, Huffman coding, Lempel-Ziv compression, arithmetic coding, and reliable communication over noisy channels.
Presentation Transcript
Information Theory
NDBI046 - Introduction to Data Engineering, 2023/2024
Petr Škoda
https://github.com/skodapetr
https://www.ksi.mff.cuni.cz
We have data, we have problems ...
- Data communication
- Data sharing
- Data storage
- Data archiving
Questions:
- What is the size of the minimal representation of the data?
- How to prevent data corruption?
History
1840 - Samuel Finley Breese Morse, U.S. Patent No. 1,647A: Improvement in the Mode of Communicating Information by Signals by the Application of Electromagnetism
1858 - Transatlantic cable
1876 - Bell: Improvement in Telegraphy
1937 - Shannon, MIT, Master's thesis: A symbolic analysis of relay and switching circuits
1948 - Shannon: A mathematical theory of communication
Claude Shannon (April 30, 1916 - February 24, 2001)
- Source Coding Theorem
- Noisy-Channel Coding Theorem
Source: https://en.wikipedia.org/wiki/File:ClaudeShannon_MFO3807.jpg
Effective and Reliable Communication
Source -> Transmitter -> Channel -> Receiver -> Destination, with Noise entering the Channel.
Shannon Entropy
H(X) = Σ_i p_i * log2(1 / p_i) = - Σ_i p_i * log2(p_i)
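The entropy formula above can be evaluated directly from symbol frequencies; a minimal Python sketch:

```python
import math

def shannon_entropy(text: str) -> float:
    """H(X) = -sum_i p_i * log2(p_i), in bits per symbol."""
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A source with two equally likely symbols carries exactly 1 bit per symbol:
print(shannon_entropy("HT" * 50))  # → 1.0
```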
Motivation
Input: Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Integer lacinia. Curabitur ligula amet, pulvinar a vestibulum quis, facilisis vel sapien. Nullam feugiat, turpis at pulvinar vulputate, erat libero tristique tellus, nec bibendum odio risus sit amet ante.
Output: ???
Shannon's source coding theorem
N outcomes from a source X can be compressed into roughly N * H(X) bits.
Run-Length Encoding
Input: CCCCCBBBBBDDDDBBBBBAAABBBBBDDDDEEEEEDDDDDDDCCCCCAA
Output: 5C5B4D5B3A5B4D5E7D5C2A
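Run-length encoding is simple enough to sketch in a few lines of Python; this reproduces the slide's output format (count followed by symbol):

```python
def rle_encode(s: str) -> str:
    """Replace each run of identical characters with '<count><char>'."""
    out = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1  # extend the current run
        out.append(f"{j - i}{s[i]}")
        i = j
    return "".join(out)

print(rle_encode("CCCCCBBBBBDDDDBBBBBAAABBBBBDDDDEEEEEDDDDDDDCCCCCAA"))
# → 5C5B4D5B3A5B4D5E7D5C2A
```

Note that RLE only pays off when the input actually contains long runs; for text without runs it expands the data.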
Huffman Coding
Input: CCCCCBBBBBDDDDBBBBBAAABBBBBDDDDEEEEEDDDDDDDCCCCCAA
Symbol  Frequency      Code
B       15/50 = 0.3    00
D       15/50 = 0.3    01
C       10/50 = 0.2    10
A       5/50 = 0.1     110
E       5/50 = 0.1     111
Tree construction: repeatedly merge the two lowest-probability nodes (0.1 + 0.1 = 0.2; 0.2 + 0.2 = 0.4; 0.3 + 0.3 = 0.6; 0.4 + 0.6 = 1.0).
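A compact heap-based sketch of Huffman code construction. The exact bit patterns depend on how ties are broken, so they may differ from the slide's table, but the code lengths (2 bits for B, D, C and 3 bits for A, E) come out the same:

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict:
    """Build a Huffman code table; the unique index breaks frequency ties."""
    heap = [(count, i, {ch: ""})
            for i, (ch, count) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    idx = len(heap)
    while len(heap) > 1:
        c1, _, t1 = heapq.heappop(heap)  # two least frequent subtrees
        c2, _, t2 = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in t1.items()}
        merged.update({ch: "1" + code for ch, code in t2.items()})
        heapq.heappush(heap, (c1 + c2, idx, merged))
        idx += 1
    return heap[0][2]

text = "CCCCCBBBBBDDDDBBBBBAAABBBBBDDDDEEEEEDDDDDDDCCCCCAA"
codes = huffman_codes(text)
# Average code length: (15*2 + 15*2 + 10*2 + 5*3 + 5*3) / 50 = 2.2 bits/symbol
```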
Lempel-Ziv 1977 : Idea
Input: Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Integer lacinia. Curabitur ligula amet, pulvinar a vestibulum quis, facilisis vel sapien. Nullam feugiat, turpis at pulvinar vulputate, erat libero tristique tellus, nec bibendum odio risus sit amet ante.
Output: Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Integer lacinia. Curabitur ligula (71,5) pulvinar a vestibulum quis, facilisis vel sapien. Nullam feugiat, turpis at pulvinar vulputate, erat libero tristique tellus, nec bibendum odio risus sit amet ante.
Source: https://doi.org/10.1109/TIT.1977.1055714
Lempel-Ziv 1977 : Algorithm
Input (with spaces for readability): 001 010 210 210 212 021 021 200...
Buffer (initially all zeros): 000 000 000
Step 1: buffer 000 000 000, next 001 010 210 -> P1 = 1, I1 = 3, C1 ~ 021
Step 2: buffer 000 000 001, next 010 210 210 -> P2 = 8, I2 = 4, C2 ~ 732
Step 3: P3 = 7, I3 = 8, C3 ~ 672
Step 4: P4 = 3, I4 = 9, C4 ~ 280
Remaining input: 210 210 212 021 021 200
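The core LZ77 idea can be sketched as a naive search for the longest match inside a sliding window, emitting (offset, length, next character) triples. This is an illustrative sketch, not the exact encoding from the slide; real implementations replace the quadratic scan with hash chains:

```python
def lz77_encode(data: str, window: int = 64):
    """Emit (offset, length, next_char) triples; offset 0 means a plain literal."""
    out, pos = [], 0
    while pos < len(data):
        best_len, best_off = 0, 0
        for cand in range(max(0, pos - window), pos):
            length = 0
            # Matches may overlap the current position (classic LZ77 trick);
            # stop one short of the end so a next_char always remains.
            while (pos + length < len(data) - 1
                   and data[cand + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, pos - cand
        out.append((best_off, best_len, data[pos + best_len]))
        pos += best_len + 1
    return out

def lz77_decode(triples):
    """Rebuild the text by copying back-references and appending literals."""
    out = []
    for off, length, ch in triples:
        for _ in range(length):
            out.append(out[-off])
        out.append(ch)
    return "".join(out)
```

For example, `lz77_encode("aaab")` yields `[(0, 0, 'a'), (1, 2, 'b')]`: the second triple copies two characters from one position back, then appends the literal `b`.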
Lempel-Ziv-Welch : Idea
- Universal dictionary-based method.
- Start with a core dictionary ~ basic characters (ASCII).
- Update the dictionary as a part of the compression process.
Lempel-Ziv-Welch : Algorithm (compression)
Input (with spaces for readability): abc abc abc abc abc ab
Read  Buffer  Output  Dictionary
a     a       -       -
b     ab      a       ab = 256
c     bc      b       bc = 257
a     ca      c       ca = 258
b     ab      -       -
c     abc     ab      abc = 259
a     ca      -       -
b     cab     ca      cab = 260
c     bc      -       -
a     bca     bc      bca = 261
b     ab      -       -
c     abc     -       -
a     abca    abc     abca = 262
b     ab      -       -
c     abc     -       -
a     abca    -       -
b     abcab   abca    abcab = 263
-     b       b       -
Lempel-Ziv-Welch : Algorithm (decompression)
Input  Output  Dictionary
a      a       -
b      b       ab = 256
c      c       bc = 257
ab     ab      ca = 258
ca     ca      abc = 259
bc     bc      cab = 260
abc    abc     bca = 261
abca   abca    abca = 262
b      b       abcab = 263
Note: code 262 (abca) arrives before the decoder has added it; this special case is resolved as previous output + first character of previous output.
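The compression side of LZW can be sketched in a few lines of Python; for the slide's input it emits codes for a, b, c, ab, ca, bc, abc, abca and a final b:

```python
def lzw_encode(data: str):
    """LZW: start from a single-character (ASCII) dictionary, grow as we go."""
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    buf, out = "", []
    for ch in data:
        if buf + ch in dictionary:
            buf += ch                      # keep extending the current phrase
        else:
            out.append(dictionary[buf])    # emit the longest known phrase
            dictionary[buf + ch] = next_code
            next_code += 1
            buf = ch
    if buf:
        out.append(dictionary[buf])        # flush the final phrase
    return out

print(lzw_encode("abcabcabcabcabcab"))
# → [97, 98, 99, 256, 258, 257, 259, 262, 98]
```

The output codes 97, 98, 99 are the ASCII values of a, b, c; the later codes refer to dictionary entries created during encoding, exactly as in the table above.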
Arithmetic coding
Input (with spaces for readability): aab aba cba ...
(Diagram: the interval [0, 1) is divided into subintervals for a, b, c in proportion to their probabilities; the subinterval of each read symbol is subdivided again, so the whole message narrows down to a unique interval.)
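The interval-narrowing step can be sketched as follows; the probability distribution here is an assumed example, since the slide does not give one:

```python
def arithmetic_interval(text: str, probs: dict):
    """Narrow [low, high) by each symbol's probability slice of the interval."""
    # Cumulative interval starts, in a fixed symbol order.
    cum, acc = {}, 0.0
    for s in sorted(probs):
        cum[s] = acc
        acc += probs[s]
    low, high = 0.0, 1.0
    for ch in text:
        width = high - low
        high = low + width * (cum[ch] + probs[ch])
        low = low + width * cum[ch]
    return low, high

# Assumed distribution: P(a) = 0.5, P(b) = 0.3, P(c) = 0.2.
low, high = arithmetic_interval("aab", {"a": 0.5, "b": 0.3, "c": 0.2})
print(low, high)  # the interval [0.125, 0.2)
```

The final interval width equals the product of the symbol probabilities (0.5 * 0.5 * 0.3 = 0.075), so roughly -log2(0.075) ≈ 3.7 bits suffice to identify the message, matching the source coding bound. A full coder would additionally emit a binary fraction inside the interval and handle precision renormalization.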
Effective and Reliable Communication
Source -> Transmitter -> Channel -> Receiver -> Destination, with Noise entering the Channel.
Motivation
We use a 1 GB HDD with P(a single bit flip) = f = 0.1, and we store 1 GB every day for 5 years. How to make this reliable?
In total we transfer 5 * 365 * 8 * 10^9 bits ~ 10^4 * 10^9 = 10^13 bits.
If we want a 1% failure chance over the whole period, we need P(flip) < 10^-15.
We can try to add redundancy using repetition: 0 -> 000, 1 -> 111.
P(error) = P(sent != decoded) = P(3 flips) + P(2 flips) = f^3 + 3 * f^2 * (1 - f) = 3f^2 - 2f^3 ~ 3f^2
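The 3f^2 - 2f^3 estimate for the 3-fold repetition code can be checked empirically with a quick majority-vote simulation:

```python
import random

def encode_repeat3(bits):
    """0 -> 000, 1 -> 111."""
    return [b for b in bits for _ in range(3)]

def decode_repeat3(bits):
    """Majority vote over each group of three received bits."""
    return [1 if sum(bits[i:i + 3]) >= 2 else 0
            for i in range(0, len(bits), 3)]

f = 0.1                      # per-bit flip probability
rng = random.Random(0)       # fixed seed for reproducibility
n = 100_000
msg = [rng.randint(0, 1) for _ in range(n)]
noisy = [b ^ (1 if rng.random() < f else 0) for b in encode_repeat3(msg)]
errors = sum(a != b for a, b in zip(msg, decode_repeat3(noisy)))
print(errors / n)  # close to 3*0.1**2 - 2*0.1**3 = 0.028
```

The simulated error rate lands near 0.028, confirming the analysis: better than the raw 0.1, but nowhere near the 10^-15 target, and at the cost of tripling the data.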
Motivation
We use a 1 GB HDD with P(a single bit flip) = f = 0.1, and we store 1 GB every day for 5 years. How to make this reliable?
(Plot: P(error) versus rate. No coding (R1): P(error) = f = 10^-1 at rate 1. Repetition (R3): P(error) ~ 3f^2 = 3 * 10^-2 at rate 1/3. The target is P(error) = 10^-15.)
Shannon's Noisy-Channel Coding Theorem
It is possible to communicate over a noisy channel with an arbitrarily small chance of error, as long as the rate of communication is kept below the channel capacity C.
Idea
For the Binary Symmetric Channel (BSC) with flip probability f:
C = 1 - H2(f) = 1 - [ f * log2(1/f) + (1 - f) * log2(1/(1 - f)) ]
Shannon's proof is non-constructive: we know what is possible, but we do not know how to achieve it.
Codes: Hamming codes, Reed-Solomon codes, convolutional codes, ...
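The BSC capacity formula is easy to evaluate; for f = 0.1 it gives C ≈ 0.53, i.e. at best about half a bit of information per transmitted bit:

```python
import math

def bsc_capacity(f: float) -> float:
    """C = 1 - H2(f) for the binary symmetric channel with flip probability f."""
    if f in (0.0, 1.0):
        return 1.0  # a deterministic channel carries 1 bit per use
    h2 = f * math.log2(1 / f) + (1 - f) * math.log2(1 / (1 - f))
    return 1 - h2

print(bsc_capacity(0.1))   # ≈ 0.531
print(bsc_capacity(0.5))   # → 0.0 (pure noise: no information gets through)
```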
Code idea
We can use the channel multiple times (N) ~ extended channel.
For large N we can define typical sets: |typical set| ~ 2^(N * H2(f)) for the BSC.
We can therefore use 2^(N * (1 - H2(f))) code words for the BSC.
Similar idea to the sphere-packing bound, or the volume bound.
Practical issues
Apache Commons Compress (Java):
- compressors: bzip2, gzip, pack200, lzma, xz, Snappy, traditional Unix Compress, DEFLATE, DEFLATE64, LZ4, Brotli, Zstandard
- archivers: ar, cpio, jar, tar, zip, dump, 7z, arj
Compression for active documents
Service configuration
SN ISO/IEC 12042, K > 0