GE Global Research
______________________________________________________________
Kolmogorov Complexity Estimation and Analysis
S.C. Evans, J.E. Hershey and G. Saulnier
2002GRC177, October 2002
Class 1
Technical Information Series
Copyright © 2002 International Institute of Informatics and Systems (IITS).
Used with permission.
GE Global Research Technical Report Abstract Page
Title: Kolmogorov Complexity Estimation and Analysis
Author(s): S.C. Evans, J.E. Hershey, G. Saulnier*
Phone: (518)38770148*8337014
Component: Information and Decision Technologies, Niskayuna
Report Number: 2002GRC177
Date: October 2002
Number of Pages: 6
Class: 1
Key WordsSource Coding; Lempel Ziv compression; Kolmogorov Complexity
Methods for discerning and measuring Kolmogorov Complexity are discussed and their relationships explored. A computationally efficient method of using the Lempel Ziv 78 universal compression algorithm to estimate complexity is introduced.

Manuscript received June 26, 2002
*RPI: Rensselaer Polytechnic Institute, Troy, NY
Submitted to the Sixth World Conference on Systemics, Cybernetics and Informatics, July 14-18, 2002, for a session on Complexity Theory and its Applications to Systems, Networks and Information Assurance
Kolmogorov Complexity Estimation and
Analysis
Scott C. Evans, John E. Hershey and Gary Saulnier
Abstract—Methods for discerning and measuring Kolmogorov Complexity are discussed and their relationships explored. A computationally efficient method of using the Lempel Ziv 78 universal compression algorithm to estimate complexity is introduced.
I. INTRODUCTION

Kolmogorov Complexity is a fundamental measure of information with growing applications and importance [2], [4]. Estimation of Kolmogorov Complexity is key to objective information system monitoring and analysis. References [7], [2]-[4] contain many applications of Kolmogorov Complexity; also see [1] for background on this subject. All applications of Kolmogorov Complexity are limited due to its incomputable nature and are impacted by improvements or innovations in the ability to estimate Kolmogorov Complexity well.
In this paper we review a generic method for estimating complexity, the Lempel Ziv 78 (LZ78) [11] universal compression algorithm, discuss its limitations in estimating complexity, and derive a computationally efficient method of using this algorithm to estimate complexity. We then develop two measures for the estimation of complexity: power spectral density based estimation and expected time of sequence production. We discuss the relationships between these methods of estimation and other estimators. Additionally we introduce a third parameter related to complexity, SPAN. Relationships between these measures of complexity are compared and discussed, and their relationships to compression-based estimates of Kolmogorov Complexity are explored.
II. KOLMOGOROV COMPLEXITY
A. Background
Kolmogorov Complexity is a measure of descriptive complexity contained in an object. It refers to the minimum length of a program such that a universal computer can generate a specific sequence. A good introduction to Kolmogorov Complexity is contained in [1], with a solid treatment in [7]. Kolmogorov Complexity is related to Shannon entropy, in that the expected value of K(x) for a random sequence is approximately the entropy of the source distribution for the process generating the sequence. However, Kolmogorov Complexity differs from entropy in that it relates to the specific string being considered rather than the source distribution. Kolmogorov Complexity can be described as follows, where j represents a universal computer, p represents a program, and x represents a string:
$$K_j(x) = \min_{j(p)=x} l(p).$$

Random strings have rather high Kolmogorov Complexity, on the order of their length, as patterns cannot be discerned to reduce the size of a program generating such a string. On the other hand, strings with a large amount of structure have fairly low complexity. Universal computers can be equated through programs of constant length, thus a mapping can be made between universal computers of different types. The Kolmogorov Complexity of a given string on two computers differs by known or determinable constants. The Kolmogorov Complexity K_j(y|x) of a string y, given string x as input, is described by the equation below:
$$K_j(y|x) = \begin{cases} \min_{j(p,x)=y} l(p) \\ \infty, & \text{if there is no } p \text{ such that } j(p,x)=y, \end{cases}$$

where l(p) represents the length of program p and j is a particular universal computer under consideration. Thus, knowledge or input of a string x may reduce the complexity or program size necessary to produce a new string y.
The major difficulty with Kolmogorov Complexity is that it cannot be computed. The length of any program that produces a given string is an upper bound on the Kolmogorov Complexity of that string, but no effective procedure can establish the matching lower bound.
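Because any lossless description of a string upper-bounds its Kolmogorov Complexity, an off-the-shelf compressor can serve as a crude upper-bound estimator. The sketch below uses Python's standard-library zlib (a DEFLATE coder, not the LZ78 scheme discussed in this report) purely as an illustration of the upper-bound idea; the function name is hypothetical.

```python
import zlib

def complexity_upper_bound(data: bytes) -> int:
    """Size in bits of a zlib-compressed description of data.

    Any lossless encoding that reproduces the string is an upper
    bound on its Kolmogorov Complexity, up to an additive constant
    for the decompressor itself. The true complexity may be far
    lower; no procedure can certify a matching lower bound.
    """
    return 8 * len(zlib.compress(data, 9))

# A highly structured 1000-byte string compresses to a tiny
# fraction of its raw size, reflecting its low complexity.
structured = b"01" * 500
print(complexity_upper_bound(structured))
```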
III. COMPLEXITY ESTIMATION
The authors are with GE Global Research, Niskayuna, NY and RPI in Troy,NY. Lockheed Martin Systems Integration Owego, NY, funded this work,technically transitioning ideas developed under DARPA InformationAssurance and Fault Tolerant Networks Projects contract F3060201C0182and managed by the Air Force Research Laboratory (AFRL) InformationDirectorate.
A. Empirical Entropy
Entropy is calculated from the source distribution producing a given string [10]. When the source distribution is not known, calculation of entropy is not possible until a distribution is measured. This is essentially the same problem as estimating the Kolmogorov Complexity of a given string, in that we are trying to find the smallest set (or probability density function) from which the string is a typical element.
Empirical Entropy is entropy measured from the data itself. Considering the case of binary sequences, first order empirical entropy compares the number of ones and zeros and develops a probability density function. Second order empirical entropy notes the frequency of occurrence of the pairs 01, 11, 10, 00. Third and following order empirical entropies develop probability density functions based on larger length patterns.

Empirical entropy as a measure of complexity can be extremely inaccurate. For example, the string 101010101010..., N times, has maximal first order empirical entropy since it has an equal number of ones and zeros, yet intuitively this string is extremely simple in descriptive complexity. Still, even crude complexity estimation such as this has been shown to be useful in classifying data and behavior (see for example [6]).
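The 101010... example above can be reproduced numerically. The sketch below estimates order-k empirical entropy from overlapping length-k blocks; this overlapping-block convention is an assumption (the report does not fix an estimator), and the function name is illustrative.

```python
import math
from collections import Counter

def empirical_entropy(s: str, order: int = 1) -> float:
    """Order-k empirical entropy of a string, in bits per symbol,
    measured from the frequencies of overlapping length-k blocks."""
    blocks = [s[i:i + order] for i in range(len(s) - order + 1)]
    n = len(blocks)
    h = -sum((c / n) * math.log2(c / n) for c in Counter(blocks).values())
    return h / order  # normalize from bits-per-block to bits-per-symbol

s = "10" * 32                    # the alternating string 1010...
print(empirical_entropy(s, 1))   # maximal (1.0): equal ones and zeros
print(empirical_entropy(s, 2))   # far lower: only the pairs 10 and 01 occur
```

The first-order estimate declares this trivially simple string maximally complex, while the second-order estimate already exposes its structure, illustrating how crude low-order empirical entropy can be.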
B. Lempel Ziv 78 for Complexity Estimation

LZ78 has been used as an estimator for complexity for various applications, including DNA sequence analysis [9] as well as information security [3]. Kolmogorov Complexity is the ultimate compression bound for a given finite string, thus a natural choice for estimation of complexity is the class of universal compression techniques. In [11], Lempel and Ziv define a measure of complexity for finite sequences rooted in the ability to produce these sequences from simple copy operations. The LZ78 universal compression algorithm harnesses this principle and can approach the entropy of an infinite sequence produced by an ergodic source. However, as discussed in [5], any universal sequential code will fall short for some individual sequences.
For example, consider the binary string created by concatenating the first 52 unique binary strings (from the sequence 0, 1, 00, 01, 10, 11, 000, ...). This string of length 218 bits can be encoded using LZ78 in 315 bits, a net loss in compression. However, as shown in Figure 1, simple circular right shifts change the compressibility of this string dramatically, with a variation of up to 28 bits. Since a circular right shift is a very simple operation that can be implemented in a Turing machine, the Kolmogorov complexity of a string undergoing such a shift should change minimally. Thus, better estimators than LZ78 must be found to truly harness complexity as a usable parameter.
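The circular-shift sensitivity can be reproduced with a short experiment. The sketch below is not the report's coder: it uses a common LZ78 cost model (phrase i costs ceil(log2(i)) index bits plus one literal bit), and its reconstruction of the concatenated string comes to 208 bits rather than the 218 quoted above, so the report's exact construction evidently differs slightly. All function names are illustrative.

```python
from math import ceil, log2

def lz78_phrases(s: str) -> list:
    """LZ78 incremental parsing: each phrase is a previously unseen
    one-symbol extension of an earlier phrase; the trailing phrase
    is kept even if it repeats an earlier one."""
    seen, phrases, cur = set(), [], ""
    for ch in s:
        cur += ch
        if cur not in seen:
            seen.add(cur)
            phrases.append(cur)
            cur = ""
    if cur:
        phrases.append(cur)
    return phrases

def lz78_bits(s: str) -> int:
    """Encoded size under an assumed LZ78 cost model: phrase i is
    coded as an index into the prior dictionary plus one new bit."""
    c = len(lz78_phrases(s))
    return sum(ceil(log2(i)) + 1 for i in range(1, c + 1))

def first_binary_strings(k: int) -> str:
    """Concatenate the first k binary strings 0, 1, 00, 01, 10, 11, 000, ..."""
    out, width = [], 1
    while len(out) < k:
        for v in range(2 ** width):
            out.append(format(v, f"0{width}b"))
            if len(out) == k:
                break
        width += 1
    return "".join(out)

s = first_binary_strings(52)
sizes = {lz78_bits(s[i:] + s[:i]) for i in range(len(s))}
print(min(sizes), max(sizes))  # circular shifts alone move the estimate
```

Even though every rotation of s has essentially the same Kolmogorov Complexity, the LZ78-based estimate varies across rotations, which is the shortfall Figure 1 illustrates.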
C. Improved LZ78 Complexity Estimation
Despite the difficulties discussed above, LZ78 is among the more accessible universal complexity estimators. However, complexity estimation using LZ78 usually amounts to performing the entire compression process and comparing inverse compression ratios as a measure of complexity. In fact, the simple Lempel Ziv partition contains enough data to estimate complexity without performing the entire compression encoding process.

Central to the LZ78 algorithm is the partitioning scheme introduced by Ziv and Lempel in [8]. The LZ78 algorithm partitions a string into prefixes that it has not seen before, forming a codebook that will (given a long enough string with enough repetition) enable long strings to be encoded with small indexes. Consider an example to illustrate how this algorithm works: LZ partitioning of the string
1011010010011010010011101001001100010
is performed by inserting commas each time a substring that has not yet been identified is seen. The following partition results:
1,0,11,01,00,10,011,010,0100,111,01001,001,100,010.

This can be represented by the binary tree shown in Figure 2.

Figure 1: LZ78 Compression of Circular Right Shifts of a String. Simple shifts of a string result in significant variation in the string's compression.
Figure 2: LZ78 binary tree representation of the partition for the binary string 1011010010011010010011101001001100010. Nodes contained in the partition are colored in black.
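The partition above follows mechanically from the incremental parsing rule, as the short sketch below shows (assuming the usual convention that the final phrase is kept even if it duplicates an earlier one; the function name is illustrative):

```python
def lz78_partition(s: str) -> list:
    """Insert a comma each time a previously unseen substring is
    completed (LZ78 incremental parsing)."""
    seen, phrases, cur = set(), [], ""
    for ch in s:
        cur += ch
        if cur not in seen:   # a new phrase: record it and restart
            seen.add(cur)
            phrases.append(cur)
            cur = ""
    if cur:                   # trailing phrase may repeat an earlier one
        phrases.append(cur)
    return phrases

parts = lz78_partition("1011010010011010010011101001001100010")
print(",".join(parts))
# 1,0,11,01,00,10,011,010,0100,111,01001,001,100,010
```

Each phrase extends some earlier phrase by one symbol, which is exactly why the partition maps onto the binary tree of Figure 2.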