To get the file size down, we need to look for redundancies. Ignoring the difference between capital and lower-case letters, roughly half of the phrase is redundant. Nine words -- ask, not, what, your, country, can, do, for, you -- give us almost everything we need for the entire quote.
To construct the second half of the phrase, we just point to the words in the first half and fill in the spaces and punctuation. We'll look at how file-compression systems deal with redundancy in more detail in the next section.
Most compression programs use a variation of the LZ adaptive dictionary-based algorithm to shrink files. The system for arranging dictionaries varies, but it could be as simple as a numbered list. When we go through Kennedy's famous words, we pick out the words that are repeated and put them into the numbered index.
Then, we simply write the number instead of writing out the whole word. If you knew the system, you could easily reconstruct the original phrase using only this dictionary and number pattern. This is what the expansion program on your computer does when it expands a downloaded file. You might also have encountered compressed files that open themselves up. To create this sort of file, the programmer includes a simple expansion program with the compressed file.
It automatically reconstructs the original file once it's downloaded. But how much space have we actually saved with this system? In an actual compression scheme, figuring out the various file requirements would be fairly complicated; but for our purposes, let's go back to the idea that every character and every space takes up one unit of memory.
We already saw that the full phrase takes up 79 units. Our compressed sentence, including spaces, takes up 37 units, and the dictionary (words and numbers) also takes up 37 units.

This gives us a file size of 74, so we haven't reduced the file size by very much. But this is only one sentence! You can imagine that if the compression program worked through the rest of Kennedy's speech, it would find these words and others repeated many more times. And, as we'll see in the next section, it would also be rewriting the dictionary to get the most efficient organization possible.
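To make the idea concrete, here is a minimal sketch in Python (an illustration of the toy scheme described above, not the format any real compressor uses): every word that repeats goes into a numbered dictionary, and each occurrence is written as its number.

```python
from collections import Counter

def compress_with_word_dictionary(text):
    """Toy word-level coder: repeated words go into a numbered dictionary,
    and every occurrence is replaced by its number."""
    words = text.lower().split()
    repeated = [w for w, c in Counter(words).items() if c > 1]
    index = {w: i + 1 for i, w in enumerate(repeated)}            # 1-based numbering
    encoded = [str(index[w]) if w in index else w for w in words]
    return index, " ".join(encoded)

quote = ("ask not what your country can do for you "
         "ask what you can do for your country")
dictionary, encoded = compress_with_word_dictionary(quote)
print(dictionary)   # e.g. {'ask': 1, 'what': 2, 'your': 3, ...}
print(encoded)      # the quote written mostly as numbers
```

Reversing the process needs nothing more than the dictionary and the number stream, which is essentially what the expansion program described above does.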
In our previous example, we picked out all the repeated words and put those in a dictionary. To us, this is the most obvious way to write a dictionary.
But a compression program sees it quite differently: It doesn't have any concept of separate words -- it only looks for patterns. And in order to reduce the file size as much as possible, it carefully selects which patterns to include in the dictionary. If we approach the phrase from this perspective, we end up with a completely different dictionary. If the compression program scanned Kennedy's phrase, the first redundancy it would come across would be only a couple of letters long.
In "ask not what your," there is a repeated pattern of the letter "t" followed by a space -- in "not" and "what. But in this short phrase, this pattern doesn't occur enough to make it a worthwhile entry, so the program would eventually overwrite it. The next thing the program might notice is "ou," which appears in both "your" and "country.
But as the compression program worked through this sentence, it would quickly discover a better choice for a dictionary entry: Not only is "ou" repeated, but the entire words "your" and "country" are both repeated, and they are actually repeated together, as the phrase "your country.
The phrase "can do for" is also repeated, one time followed by "your" and one time followed by "you," giving us a repeated pattern of "can do for you. This ability to rewrite the dictionary is the "adaptive" part of LZ adaptive dictionary-based algorithm. The way a program actually does this is fairly complicated, as you can see by the discussions on Data-Compression. No matter what specific method you use, this in-depth searching system lets you compress the file much more efficiently than you could by just picking out words.
The sentence now takes up 18 units of memory, and our dictionary takes up 41 units. So we've compressed the total file size from 79 units to 59 units! Data compression has been one branch of computer science that made this digital revolution possible. By compressing the input data, for example, to 1/n of its original size (the compressed file size is 1/n-th of the original file size), it is possible to send the same information n times faster, to send the file at once on a transmission channel that has 1/n of the capacity, or even to send n files in parallel on a channel with the original capacity.
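As a rough illustration of this relationship (the sample text and the general-purpose zlib compressor below are chosen only for the sake of the example), the compression ratio translates directly into a transmission speed-up on a fixed channel:

```python
import zlib

sample = ("ask not what your country can do for you - "
          "ask what you can do for your country\n") * 200    # any redundant text will do
compressed = zlib.compress(sample.encode(), 9)

n = len(sample) / len(compressed)        # compression ratio: the original is n times larger
print(f"compression ratio: {n:.1f}")
print(f"same file, same channel: about {n:.1f} times faster to transmit")
print(f"or about {int(n)} such files in parallel in the time one uncompressed file would take")
```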
In this digital revolution, however, communication has a price: we have to accept that the digital message we are sending may be intercepted and read on its way to the destination. Cryptography might be a solution to this issue: if the sender encrypts the message, assuming that only the destination has a way to decrypt it, then privacy will be ensured.
Therefore digital communication should be based on a workflow that combines two operations, compression and encryption, which are heterogeneous and in some cases conflicting, but which must both be applied to the original file in order to achieve efficiency and security. The aim of this work is to study the combination of compression and encryption techniques on digital documents.
In this paper we test the state-of-the-art compression and cryptography techniques on various kinds of digital data. The next section shall present an introduction to the most commonly used methods in data compression and cryptography, together with a short survey of past work. Sections 3, 4, and 5 will show the experimental results obtained on one-dimensional, two-dimensional, and three-dimensional data.
Today the way we communicate has dramatically changed. We communicate digitally and we aim to have efficient and secure communication.
The research in this field is devoted to improving the way we communicate so as to meet ever stronger requirements of efficiency and security, where efficiency is given by data compression and security by encryption. Data compression is today essential for digital communication. Without data compression we would not have digital television, smart-phones, satellite communications, the Internet, and so on.
Information theory tells us that there is a limit to the amount of compression we can gain from a digital file and that this limit is the Shannon entropy [1]: the higher the entropy, the lower the possibility of compressing the data. Therefore the enemy of compression is randomness. But, on the other hand, encryption needs to bring randomness into the digital data in order to bring security.
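As an illustration, the order-0 Shannon entropy of a byte sequence can be estimated directly from the symbol frequencies; the following minimal sketch (not the estimator of any particular compressor) shows that highly redundant text sits far below 8 bits per byte, while random bytes sit very close to it:

```python
import math
import os
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Order-0 Shannon entropy in bits per byte, a lower bound for symbol-by-symbol coding."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

text = b"ask not what your country can do for you " * 100
noise = os.urandom(len(text))

print(f"redundant text: {shannon_entropy(text):.2f} bits/byte")   # well below 8: compressible
print(f"random bytes  : {shannon_entropy(noise):.2f} bits/byte")  # close to 8: nearly incompressible
```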
This is why, when we have to perform both compression and encryption, we will always compress first and then encrypt, as shown in the workflow of Figure 1.
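A small experiment makes the point concrete. The sketch below is only an illustration: it uses zlib for compression and a random XOR keystream as a stand-in for a real cipher, but it shows that compressing after encryption gains essentially nothing, because the encrypted stream is indistinguishable from random data:

```python
import os
import zlib

def toy_encrypt(data: bytes) -> bytes:
    """Stand-in for a real cipher: XOR with a random keystream (for the demo only)."""
    keystream = os.urandom(len(data))
    return bytes(a ^ b for a, b in zip(data, keystream))

message = ("ask not what your country can do for you - "
           "ask what you can do for your country\n").encode() * 200

compress_then_encrypt = toy_encrypt(zlib.compress(message, 9))
encrypt_then_compress = zlib.compress(toy_encrypt(message), 9)

print(f"original              : {len(message)} bytes")
print(f"compress then encrypt : {len(compress_then_encrypt)} bytes")  # small
print(f"encrypt then compress : {len(encrypt_then_compress)} bytes")  # roughly as large as the original
```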
There are therefore two interesting questions to be posed: What is the cost of encryption, in terms of file size, after performing compression? How bad is performing first encryption and then compression? Data Compression can be defined as the coding of data to minimize its representation.
The compression process is called lossless if the original data can be exactly reconstructed from the compressed copy; otherwise it is called lossy. The field dates back to the seminal work of Shannon who, more than half a century ago, gave precise limits on the performance of any lossless compression algorithm: this limit is the entropy of the source we want to compress.
Data compression techniques are specifically dependent on the type of data that has to be compressed and on the desired performance. Run Length Encoding (RLE) [1] is one of the simplest compression algorithms: any repeated run of characters is coded using only two elements, where the first is a counter that records how long the run is and the second contains the repeated element that constitutes the run. If a data item d occurs n consecutive times in the input stream, RLE replaces the n occurrences with the single pair (n, d).
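A minimal RLE encoder following this description might look as follows (a sketch for illustration only; practical RLE formats add escape mechanisms so that non-repetitive data does not expand):

```python
def rle_encode(data: bytes):
    """Replace every run of n identical bytes d with the pair (n, d)."""
    pairs = []
    i = 0
    while i < len(data):
        d = data[i]
        n = 1
        while i + n < len(data) and data[i + n] == d:   # extend the current run
            n += 1
        pairs.append((n, d))
        i += n
    return pairs

print(rle_encode(b"aaaaabbbccccccd"))   # [(5, 97), (3, 98), (6, 99), (1, 100)]
```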
Huffman Coding was introduced in 1952 by Huffman [2]. Arithmetic Coding [3] is a form of entropy encoding that encodes a message into a single number: an arbitrary-precision fraction q where 0.0 ≤ q < 1.0. In Huffman Coding there is a limit: each source character has to be coded with at least one bit.
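To illustrate this, the sketch below (an illustration only, not a full encoder) builds a Huffman tree with a heap and reports the code length assigned to each symbol; even the most frequent symbol is still coded with at least one bit:

```python
import heapq
from collections import Counter

def huffman_code_lengths(text):
    """Build a Huffman tree and return the code length, in bits, of every symbol."""
    freq = Counter(text)
    if len(freq) == 1:                       # degenerate case: a single distinct symbol
        return {sym: 1 for sym in freq}
    # Each heap entry: (total weight, tie-breaker, {symbol: depth so far})
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)       # merge the two least frequent subtrees
        w2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}   # everything moves one level deeper
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

lengths = huffman_code_lengths("ask not what your country can do for you")
print(sorted(lengths.items(), key=lambda kv: kv[1]))   # frequent symbols get the shortest codes
```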
This limitation does not apply to Arithmetic Coding. Textual substitution methods, often called dictionary methods or Lempel-Ziv methods after the important papers by Ziv and Lempel [4, 5], maintain a constantly changing dictionary of strings to adaptively compress a stream of characters by replacing common substrings with indices into the dictionary.
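In the spirit of [5, 6], the following is a compact sketch of LZW-style dictionary coding (a simplified illustration rather than a production codec: real implementations also bound the dictionary size and pack the emitted indices into bits):

```python
def lzw_encode(data: bytes):
    """LZW sketch: replace the longest known substring with its dictionary index."""
    dictionary = {bytes([i]): i for i in range(256)}   # start with all single bytes
    next_code = 256
    current = b""
    output = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:
            current = candidate                        # keep extending the match
        else:
            output.append(dictionary[current])         # emit the index of the longest match
            dictionary[candidate] = next_code          # learn the new substring
            next_code += 1
            current = bytes([byte])
    if current:
        output.append(dictionary[current])
    return output

data = b"ask not what your country can do for you - ask what you can do for your country"
codes = lzw_encode(data)
print(f"{len(data)} input bytes encoded as {len(codes)} dictionary indices")
```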
Lempel and Ziv proved that these schemes were practical as well as asymptotically optimal for a general source model. In 1984 Welch published [6], in which he described a practical implementation of the method outlined in [5]. Digital images can be defined as a set of two-dimensional arrays of integer data (the samples), represented with a given number of bits per component. Because of the many applications of digital images, there have been various processes of standardization involving image compression.
Perhaps the most widely adopted standard is JPEG, an acronym for Joint Photographic Experts Group, the unofficial name of the standardization committee that promoted the standard; see [7]. JPEG has four modes of operation. Today Lossless JPEG is far from the state of the art in lossless image compression, but it is used in this paper to show the performance of a simple image compression method when coupled with encryption.
The more recent JPEG 2000 standard is intended to offer unprecedented access to the image while it is still in the compressed domain. Thus, images can be accessed, manipulated, edited, transmitted, and stored in compressed form.
The lossless mode of the standard is very close to the current state of the art in lossless image compression. Video compression has long been an attractive research area, because a digital video may provide more information than a single image frame. The huge computational complexity and memory space required for video processing are now more attainable, thanks to the advanced computational capability available today.
MPEG-4 is used primarily for applications such as video telephony and digital television, for the transmission of movies via the web, and for storage. Current research in data compression also focuses on the compression of other three-dimensional data. For MPEG-4 we tested an on-line codec. A common way to encrypt a message is in blocks; naturally, such systems are known as block ciphers.
Triple DES (3DES) is a block cipher that builds on the success of DES, while increasing the key length so that brute-force attacks become almost impossible: it uses DES three times. This would be an unclassified, public encryption scheme.
For a complete comparative analysis of the above symmetric block cipher algorithms, see [15]. RC4 (Rivest Cipher 4) is a stream cipher. In the past it was widespread thanks to its simplicity and speed in software, but multiple vulnerabilities have since been discovered and it is no longer considered secure [16]. It is still largely used in noncritical applications.
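As a rough illustration of the compress-then-encrypt workflow in code, the sketch below uses the third-party Python cryptography package (an assumption made only for this example, rather than OpenSSL's C API; the key handling and the naive padding are not meant for production use):

```python
import os
import zlib
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def compress_then_encrypt(plaintext: bytes, key: bytes) -> bytes:
    """Compress with zlib, then encrypt with AES-256 in CBC mode."""
    compressed = zlib.compress(plaintext, 9)
    iv = os.urandom(16)                                        # fresh IV for every message
    padded = compressed + b"\x00" * (-len(compressed) % 16)    # naive zero padding; use PKCS#7 in practice
    encryptor = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    return iv + encryptor.update(padded) + encryptor.finalize()

key = os.urandom(32)                                           # 256-bit key, illustrative only
blob = compress_then_encrypt(b"ask not what your country can do for you " * 50, key)
print(f"{len(blob)} bytes of IV plus ciphertext")
```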
The code we experimented with for encryption is taken from the open-source library OpenSSL [17]. The need for secure and efficient data transmission has also had an impact on research into other applications of compression and security. The delicate balance between compression and cryptography may be framed in a slightly wider context, and it is possible to consider other points of view from which this deep and intriguing question has been studied; see, for example, [18, 19]. In this digital era, information owners are reluctant to distribute their data in an insecure environment.
As its name indicates, this is lossless compression: there is no data loss when the file is recovered, or restored to its original state. Not even a single bit is lost after restoration. Lossless compression is used for image and text file compression, where even a small amount of data loss cannot be tolerated, because a small change in bits changes the information. It is also used for executable files.
This type of compression is done to get a file that consumes less space. After passing your data through the compression algorithm you get a compressed output which consumes less space, so it is good from a storage point of view. The compressed file is also easier to send and receive over the internet. Once the file is encoded with a lossless compression algorithm, we can send it over the internet.
At the receiving end, the receiver receives the encoded file and then decodes, or decompresses, it back to its original form. Lossy compression, on the other hand, is used for audio and video, because a small amount of data loss is tolerable there and does not affect the viewer. By doing so we can save a lot of space on storage devices and a lot of bandwidth over the internet. Sometimes this type of compression is also used for images where quality does not matter that much.