ABSTRACT

Title of dissertation: A THEORY OF CRAMÉR-RAO BOUNDS FOR CONSTRAINED PARAMETRIC MODELS

Terrence Joseph Moore, Jr., Doctor of Philosophy, 2010

Dissertation directed by: Professor Benjamin Kedem, Department of Mathematics

A simple expression for the Cramér-Rao bound (CRB) is presented for the scenario of estimating parameters $\theta$ that are required to satisfy a differentiable constraint function $f(\theta)$. A proof of this constrained CRB (CCRB) is provided using the implicit function theorem, and the encompassing theory of the CCRB is proven in a similar manner. This theory includes connecting the CCRB to notions of identifiability of constrained parameters; the linear model under a linear constraint; the constrained maximum likelihood problem, its asymptotic properties, and the method of scoring with constraints; and hypothesis testing. The value of the tools developed in this theory is then demonstrated in the communications context for the convolutive mixture model and the calibrated array model.

A THEORY OF CRAMÉR-RAO BOUNDS FOR CONSTRAINED PARAMETRIC MODELS

by Terrence Joseph Moore, Jr.

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2010

Advisory Committee: Professor Benjamin Kedem, Chair/Advisor; Professor Radu Balan; Professor Paul Smith; Professor Doron Levy; Professor Prakash Narayan

© Copyright by Terrence Joseph Moore, Jr., 2010

Acknowledgments

There are perhaps too many people to thank here without the regret of omitting someone important to me. So many have helped me become a mathematician. To the very talented high school math teachers at West Hills High School in Santee, CA, who encouraged the best students to strive further; to the many remarkable professors at American University who delighted in interacting with their students and, no matter how difficult the material, ensured the experience was always enjoyable; and to the excellent class of professors at the University of Maryland who challenged and pushed their students to excel. To all these people, I thank you.

I would like to acknowledge my undergraduate and Master's thesis advisor, Dr. Stephen Casey at American University, for his direction in the beginning of my college education. I do need to acknowledge the contribution of my first advisor, Dr. Dennis Healy, who encouraged the construction of this current work from the beginning. He allowed me the freedom to develop the theory using an approach I developed and generated a positive exuberance for whatever results I discovered. I only wish he could have lived to see its completion. I also need to acknowledge my second advisor, Dr. Benjamin Kedem, for his willingness to allow me to complete this document as it was originally conceived, as well as for guiding me through the defense process and the selection of the committee.

I also need to acknowledge the financial support I have received from my employer, the U.S. Army Research Lab. Additionally, the professional support and mentoring I received from Dr. Brian Sadler and the advice from team members at the lab have been invaluable in helping me succeed throughout the process. I also need to thank Dr. John Gowens, Mr. Glenn Racine, Dr. Barbara Broome, Dr. Alex Kott, and Dr. Brian Rivera for allowing me the time at and away from work to complete this degree.

Table of Contents

List of Figures
List of Symbols and Abbreviations

1 INTRODUCTION
1.1 A note on the notation
2 THE CRAMÉR-RAO BOUND
2.1 Definition
2.1.1 Extensions
2.2 Identifiability
2.2.1 Local identifiability
2.2.2 Strong Identifiability
2.3 Linear Model
2.3.1 Best Linear Unbiased Estimators
2.3.2 Gaussian noise
2.4 Maximum likelihood
2.4.1 Efficient estimation
2.4.2 Asymptotic Normality
2.4.3 Scoring
2.5 Hypothesis testing
2.5.1 The Rao statistic
2.5.2 The Wald statistic
2.6 Discussion

3 THE CONSTRAINED CRAMÉR-RAO BOUND
3.1 The Constrained CRB
3.1.1 A proof of the CCRB
3.1.2 Alternative formulas
3.1.2.1 Gorman-Hero-Aitchison-Silvey CCRB
3.1.2.2 Aitchison-Silvey-Crowder CCRB
3.1.2.3 Xavier's & Barroso's Intrinsic Variance Lower Bound
3.1.3 Simple Extensions to the CCRB
3.1.4 Properties of the CCRB
3.1.5 Derivation of g(η)
3.1.5.1 A Taylor series derivation
3.1.5.2 A fixed point derivation
3.2 Identifiability
3.2.1 Local identifiability
3.2.1.1 Local identifiability in the Aitchison-Silvey-Crowder CCRB formula
3.2.2 Strong Identifiability
3.3 Linear Model
3.3.1 Best Linear Unbiased Estimation
3.3.2 Uniform Minimum Variance Estimation under Gaussian noise
3.4 Constrained Maximum Likelihood Estimation
3.4.1 Efficient estimation
3.4.2 Asymptotic Normality
3.4.3 The Method of Scoring Under Parametric Constraints
3.4.3.1 Convergence Properties
3.4.3.2 Linear constraints
3.5 Hypothesis testing
3.5.1 The Rao statistic
3.5.2 The Wald statistic
3.6 Discussion

4 Applications of the CCRB in Communications Models
4.1 Convolutive Mixture Model
4.1.1 Equivalent Convolutive Mixture Models
4.1.1.1 The Vector-Matrix Model
4.1.1.2 The Z transform model
4.1.2 Strict Identifiability
4.1.3 The Fisher information of the convolutive mixture model
4.1.3.1 Complex-valued Fisher information
4.1.3.2 CFIM for the convolutive mixture model
4.1.3.3 Properties of the CFIM
4.1.4 Constraints for the convolutive mixture model
4.1.4.1 Pathways to regularity
4.1.4.2 Norm channel + real-valued source constraint
4.1.4.3 Semiblind constraints: $s^{(k)}(t) = p(t)$ for $t \in T$
4.1.4.4 Unit Modulus constraint + Semiblind constraint
4.2 Calibrated Array Model
4.2.1 The Fisher information of the calibrated array model
4.2.1.1 Indirect derivation of the FIM
4.2.1.2 Direct derivation of the FIM
4.2.1.3 Properties of the FIM
4.2.2 Constraints for the calibrated array model
4.2.2.1 Constraints on the complex-valued channel gain: $\Gamma = I_{K\times K}$
4.2.2.2 Semiblind constraints: $s(t) = p(t)$ for $t \in T$
4.2.2.3 Finite alphabet constraint: $s^{(k)}(n) \in \mathbb{S}$
4.2.2.4 Unit modulus constraints: $|s(n)| = 1$ for all $n$
4.2.2.5 Unit modulus constraint; real-valued channel gain: $\mathrm{Im}(\gamma_k) = 0$ for all $k$
4.2.2.6 Semi-blind and unit modulus constraint
4.3 Discussion

A Appendices
A.1 A proof of the CCRB using the Chapman-Robbins version of the Barankin bound
A.2 A proof of the CCRB using the method of implicit differentiation
A.3 Alternative proof of asymptotic normality
B Proofs of Convergence Properties of Constrained Scoring
C Proofs of Theorems in Chapter 4
Bibliography

List of Figures

3.1 Reparameterization of f(θ) = 0 to θ = g(η).
3.2 Projection of the observations x onto the linear space $H_\Theta$ and the linear constraint space $H_{\Theta_f}$.
3.3 Path created by iterates from the method of scoring with constraints.
4.1 Convolutive Mixture: Finite Impulse Response (tapped delay line) model.
4.2 Convolutive Mixture: Norm-constrained channel estimation performance.
4.3 Convolutive Mixture: Example of multipath channel.
4.4 Convolutive Mixture: Source estimation with varying Δφ.
4.5 Calibrated array model geometries.
4.6 Calibrated Array: CCRBs on AOA for blind, constant modulus, and known signal models.
4.7 Calibrated Array: CCRBs on AOA for semiblind, constant modulus + semiblind, and known signal models.
4.8 Calibrated Array: CCRBs on signal phase for blind, constant modulus, semiblind, and constant modulus + semiblind.

List of Symbols and Abbreviations

Symbols
$b(\theta)$ — the bias of an estimator of $\theta$
$\mathbb{C}$ — the complex numbers
$\overline{\mathbb{C}}$ — $\mathbb{C} \cup \{\infty\}$
$\mathcal{CN}(a, B)$ — complex normal with mean $a$ and (co)variance matrix $B$
$\chi^2_r$ — chi-square distribution with $r$ degrees of freedom
$\xrightarrow{d}$ — convergence in distribution
$E_\theta(\cdot)$ — the expectation of $(\cdot)$ with respect to the density modeling $\theta$
$f(\theta) = 0$ — the constraint function
$F(\theta)$ — the Jacobian (or gradient) of the constraint function at $\theta$
$I(\theta)$ — the Fisher information for the parameter $\theta$
$I_{m\times m}$ or $I_m$ — the identity matrix of size $m$
$\mathrm{Im}(\cdot)$ — the imaginary part of $(\cdot)$
$s(x;\theta)$ — the Fisher score
$\mathcal{N}(a, B)$ — normal with mean $a$ and variance (covariance) matrix $B$
$p(x;\theta)$ — the probability density function (pdf)
$\mathbb{R}$ — the real numbers
$\mathrm{Re}(\cdot)$ — the real part of $(\cdot)$
$t(x)$ — an estimator of some parameter (vector)
$\theta$ — the parameter
$\Theta$ — the parameter space
$\Theta_f$ — the constraint manifold (parameter space)
$\hat\theta_{ML}(x)$ — the ML estimator of $\theta$ depending on the observation $x$
$x$ — the observation (sample)
$(\cdot)_+$ — $(\cdot)$ if the value is nonnegative, otherwise zero

Abbreviations
BER — bit-error rate
BLUE — best linear unbiased estimate (or estimator)
CCRB — constrained Cramér-Rao bound
CFIM — complex FIM
CLSE — constrained least squares estimate (or estimator)
CML — constrained maximum likelihood
CMLE — constrained maximum likelihood estimate (or estimator)
CRB — Cramér-Rao bound
FIM — Fisher information matrix
FIR — finite impulse response
iid — independent and identically distributed
LSE — least squares estimate (or estimator)
MIMO — multiple-input, multiple-output
ML — maximum likelihood
MLE — maximum likelihood estimate (or estimator)
MSE — mean-square error
MVUE — minimum variance unbiased estimate (or estimator)
NLB — nullity lower bound of the FIM (or CFIM)
SER — symbol-error rate
SIMO — single-input, multiple-output
SISO — single-input, single-output

Chapter 1
INTRODUCTION

The Cramér-Rao bound (CRB), the inverse of the Fisher information, is a limit on the performance of the estimation of parameters under certain conditions. Specifically, the variance of any unbiased estimator cannot be lower than this bound, and the CRB may therefore be interpreted in some sense as a measure of performance potential. For a large number of scenarios, it is of interest to gauge the performance of estimation under parametric constraints. The traditional approach to deriving CRBs for these cases is to find a reparameterization that represents the constraint. However, such an approach is not always feasible for a large class of constraints, and numerical approximations are often restricted to the particular model. An alternative and equivalent option to reparameterization, the constrained Cramér-Rao bound, is presented herein, which is also computationally simple to program.

In communications design research, estimation performance metrics are often interpreted as measures of performance potential, used to study the feasibility of a model to meet a certain level of desired reliability. This approach is often practical since it avoids the necessity of searching for the best performer over a class of estimators for a particular trial model, and since the CRB is analytically or numerically simple to compute. The downside to this approach is that the CRB needs to be derived anew for each model.
This effectively prohibits the practitioner from studying an overly large class of models or, in the context of this work, a large class of constraints on some base model. This restriction is to a great extent eliminated using the constrained Cramér-Rao bound, since the Fisher information for the base model, which involves an integration over many variables, only needs to be evaluated once.

Chapter 2 offers a quick review of several connections with the Cramér-Rao bound (CRB) within mathematical statistics and serves as a reference point for the study of parameters under parametric equality constraints, discussed in Chapter 3. With the possible exception of the identifiability relationship in section 2.2, much of this material is familiar and well represented in standard mathematical statistics texts.

In Chapter 3, a general theory of the constrained Cramér-Rao bound (CCRB) is presented. In section 3.1, the CCRB is defined and proven, alternative formulas are presented, and several interesting properties of the bound are detailed. In section 3.2, a connection is made between the CCRB and two different notions of identifiability under certain conditions. In section 3.3, the linear model with linear constraints is examined in the context of the CCRB. In section 3.4, connections between the CCRB and constrained maximum likelihood estimation are detailed, including an asymptotic normality result and an adaptation of the method of scoring to the constrained parameter scenario. The chapter concludes with the consideration of hypothesis testing under constraints in section 3.5. Chapters 2 and 3 are designed so that the section numbers correspond directly, i.e., section 3.x relates a concept for constrained parameters that section 2.x reviews for unconstrained parameters.

In Chapter 4, the analytic tools developed in Chapter 3 are applied in the communications context of the convolutive mixture model (section 4.1) and the calibrated array model (section 4.2). These models are defined and their Fisher information matrices developed in section 4.1.3 and section 4.2.1, respectively. A variety of constraints for these models are considered in sections 4.1.4 and 4.2.2.

1.1 A note on the notation

All scalar elements will be denoted in lowercase math font: $a$. All vectors will be column vectors and will be denoted in lowercase bold math font: $\boldsymbol{a}$. Hence, the $i$th element of the column vector $\boldsymbol{a}$ will be denoted $a_i$. (This should not be confused with $\boldsymbol{a}_i$, which is often used as a subvector of the vector $\boldsymbol{a}$ and will be defined in context.) All matrices will be denoted in uppercase bold math font: $\boldsymbol{A}$. All scalars, vectors, and matrices are assumed to have real-valued elements unless otherwise noted as complex-valued (where the complex number $i = j = \sqrt{-1}$ should be clear from context).

For vectors and matrices, $(\cdot)^T$ will denote the transpose operator (do not confuse with $(\cdot)'$, which is occasionally used here as a dummy variable), $(\cdot)^*$ will denote the conjugate operator (do not confuse with $(\cdot)^\star$, which is occasionally used as a variant of another vector or matrix), and $(\cdot)^H$ will denote the Hermitian (or conjugate transpose) operator. When a vector depends on another vector value, as in $\boldsymbol{a}(\boldsymbol{\theta})$, the Jacobian will be a matrix denoted $\boldsymbol{A}(\boldsymbol{\theta})$ whose $i$th row is the transpose of the vector $\frac{\partial}{\partial\boldsymbol{\theta}} a_i(\boldsymbol{\theta})$, where $a_i(\boldsymbol{\theta})$ is the $i$th element of $\boldsymbol{a}(\boldsymbol{\theta})$. For square matrices, $(\cdot)^{-1}$ will denote the inverse of the matrix and $(\cdot)^\dagger$ will denote the pseudoinverse of the matrix. Of course, $(\cdot)^2$ will denote the square of the matrix.
For symmetric matrices (or Hermitian matrices in the complex-valued case), the expression $A > B$ will denote that the matrix $A - B$ is positive definite. Similarly, $A \geq B$ will denote that the matrix $A - B$ is positive semidefinite.

Sets will be denoted in an uppercase blackboard math font, $\mathbb{A}$, or in an uppercase Greek letter math font, $\Theta$. Also, all sets will be assumed to be open unless otherwise noted.

For the reader's convenience in finding referenced items, numbered theorems, corollaries, and examples share numbering. Thus, the theorem immediately following Example x.4 in Chapter x is numbered Theorem x.5 even though the previous theorem is Theorem x.1.

Chapter 2
THE CRAMÉR-RAO BOUND

The Cramér-Rao bound (CRB) is a lower bound on the error covariance of any unbiased estimator under certain regularity conditions. As such, it is a measure of the optimal performance of an unbiased estimator under a given model. There are other mean-square error lower bounds, such as the Ziv-Zakai bound [78], the Hammersley-Chapman-Robbins bound [26, 16], the Barankin bound [8], the Bhattacharyya bound [11], etc., and indeed, depending on the model, there are numerous other possible performance measures, such as classification bounds like bit-error rate (BER) or symbol-error rate (SER), but the CRB remains a very popular benchmark due to its computational simplicity and its well-developed underlying theory.

This theory has led to numerous connections in areas of mathematical statistics, e.g., identifiability, linear models, maximum likelihood (including asymptotic normality and the method of scoring), and hypothesis testing. This short chapter is a quick review of just a few of these connections. A more complete discussion of the utility and application of the topics discussed in this chapter, as well as proofs of the definitions, theorems, and statements herein, may be found in many standard statistical inference texts, such as Shao [62] or Casella and Berger [14] for statisticians, Kay [36] for signal processors, or Van Trees [72] for engineers.

2.1 Definition

Suppose we have an observation $x \in \mathcal{X} \subseteq \mathbb{R}^n$ from a probability density function (pdf) $p(x;\theta)$, where $\theta$, in an open set $\Theta \subseteq \mathbb{R}^m$, is a vector of deterministic parameters. The Fisher information matrix (FIM) of this model is given by

$$I(\theta) \triangleq E_\theta\{ s(x;\theta)\, s^T(x;\theta) \}$$

where $s(x;\theta)$ is the Fisher score, defined by

$$s(x;\theta) \triangleq \left.\frac{\partial \log p(x;\theta')}{\partial \theta'}\right|_{\theta'=\theta},$$

and the expectation is evaluated at $\theta$, i.e.,

$$E_\theta(\cdot) = \int_{\mathcal{X}} (\cdot)\, p(x;\theta)\, dx.$$

And suppose, as regularity conditions, the pdf is differentiable with respect to $\theta$ and satisfies

$$\frac{\partial}{\partial \theta^T} E_\theta(h(x)) = E_\theta\left(h(x)\, s^T(x;\theta)\right) \qquad (2.1)$$

for $h(x) \equiv 1$ and $h(x) \equiv t(x)$, where $t(x)$ is an unbiased estimator of $\theta$ [62]. These conditions are assured under a number of scenarios, for example, when the Jacobian and Hessian of the density function $p(x;\theta)$ are absolutely integrable with respect to both $x$ and $\theta$ [72], and essentially permit switching the order of integration and differentiation. Under these assumptions, we have the following information inequality theorem [14, 62, 36, 72], independently developed by Cramér [17] and Rao [56].

Theorem 2.1. The Cramér-Rao bound is the inverse of the FIM,

$$\mathrm{CRB}(\theta) \triangleq I^{-1}(\theta), \qquad (2.2)$$

if it exists, and the variance of any unbiased estimator $t(x)$ satisfies the inequality

$$\mathrm{Var}(t(x)) \geq \mathrm{CRB}(\theta)$$

with equality if and only if $t(x) - \theta = \mathrm{CRB}(\theta)\, s(x;\theta)$ in the mean-square sense.

Example 2.2. Let $x \sim \mathcal{CN}(\mu, \sigma^2)$ with unknown complex-valued mean $\mu$ and known variance $\sigma^2$. In terms of real-valued parameters, the equivalent model is

$$\begin{bmatrix} \mathrm{Re}(x) \\ \mathrm{Im}(x) \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mathrm{Re}(\mu) \\ \mathrm{Im}(\mu) \end{bmatrix}, \frac{\sigma^2}{2} I_{2\times 2} \right).$$

From a well-known result on normal distributions [36, equation (3.31)],

$$I\left(\begin{bmatrix} \mathrm{Re}(\mu) \\ \mathrm{Im}(\mu) \end{bmatrix}\right) = \frac{2}{\sigma^2} I_{2\times 2}$$

and the CRB is $\frac{\sigma^2}{2} I_{2\times 2}$.
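Not part of the original text, the following is a minimal Monte Carlo sketch (NumPy assumed) illustrating theorem 2.1 for example 2.2: the unbiased estimator $t(x) = [\mathrm{Re}(x), \mathrm{Im}(x)]^T$ has covariance exactly equal to the CRB, $(\sigma^2/2) I_2$.

```python
import numpy as np

# Monte Carlo check of theorem 2.1 for example 2.2: for x ~ CN(mu, sigma^2),
# the estimator t(x) = [Re(x), Im(x)]^T is unbiased for [Re(mu), Im(mu)]^T
# and its covariance should equal the CRB, (sigma^2 / 2) * I_2.
rng = np.random.default_rng(0)
mu, sigma2, trials = 1.0 + 2.0j, 0.5, 200_000

# Draw complex Gaussian samples: real and imaginary parts each N(., sigma^2/2).
x = (mu.real + rng.normal(scale=np.sqrt(sigma2 / 2), size=trials)
     + 1j * (mu.imag + rng.normal(scale=np.sqrt(sigma2 / 2), size=trials)))

t = np.stack([x.real, x.imag])      # estimator t(x) for each trial
emp_cov = np.cov(t)                 # empirical 2x2 covariance
crb = (sigma2 / 2) * np.eye(2)      # CRB from example 2.2

print(np.round(emp_cov, 4))         # approximately [[0.25, 0], [0, 0.25]]
print(crb)
```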
2.1.1 Extensions

The performance of a function of the parameters, e.g., the transformation $\alpha = k(\theta)$, is often of more interest than the performance of the parameters themselves. If the Jacobian of the transformation function is

$$K(\theta) = \left.\frac{\partial k(\theta')}{\partial \theta'^T}\right|_{\theta'=\theta},$$

then the CRB on the performance of an unbiased estimator of $\alpha$ is [62, 36]

$$\mathrm{CRB}(\alpha) = K(\theta)\, I^{-1}(\theta)\, K^T(\theta),$$

i.e., if $S(x)$ is an unbiased estimator of $\alpha$, then $\mathrm{Var}(S(x)) \geq K(\theta)\,\mathrm{CRB}(\theta)\,K^T(\theta)$. Implicit in this inequality for the transformation is that $\alpha$ is differentiable with respect to $\theta$, and (2.1) must also be satisfied for $h(x) \equiv S(x)$. Consequently, if an estimator $t(x)$ is biased with bias $b(\theta) = E_\theta t(x) - \theta$, then the transformation formula above can be used to attain a bound for $\alpha = \theta + b(\theta)$. Then $\mathrm{Var}(t(x)) \geq \mathrm{CRB}(\theta + b(\theta))$, where

$$\mathrm{CRB}(\theta + b(\theta)) = (I_m + B(\theta))\,\mathrm{CRB}(\theta)\,\left(I_m + B^T(\theta)\right)$$

with $B(\theta) = \left.\frac{\partial b(\theta')}{\partial \theta'^T}\right|_{\theta'=\theta}$.

Theorem 2.1 requires a nonsingular Fisher information; however, there are a number of interesting cases where this requirement cannot be met yet the model is still of interest. For this scenario, the pseudoinverse of the FIM is occasionally used as a bound in place of the CRB, i.e.,

$$\mathrm{Var}(t(x)) \geq I^\dagger(\theta)$$

for an unbiased estimator $t(x)$. This bound inequality is trivial for nonidentifiable functions of the parameters [66], i.e., the variance is finite only if $H(\theta) = H(\theta)\, I(\theta)\, I^\dagger(\theta)$, where $H(\theta) = K(\theta) + B(\theta)$ and $t(x)$ is a biased estimator of $\alpha = k(\theta)$ with bias $b(\theta)$.

2.2 Identifiability

The ability to identify parameters determines the validity and utility of certain structural models. Criteria for the identifiability of parameters have numerous connections to parametric statistical measures, such as the Kullback-Leibler distance [13] and the Fisher information matrix [58, 29, 69]. In this section, two of these connections are developed to establish conditions under which a particular parametric model is identifiable.

2.2.1 Local identifiability

To examine the identifiability criterion from the CRB, a definition of identifiability is required. A parameter $\theta \in \Theta \subseteq \mathbb{R}^m$ is identifiable in the model $p(x;\theta)$ if there is no other $\theta' \in \Theta$ such that $p(x;\theta) = p(x;\theta')$ for all $x \in \mathbb{R}^n$. A parameter is locally identifiable if there exists an open neighborhood of $\theta$ in which $\theta$ is identifiable.

A parameter is (locally) identifiable in the additive noiseless case if the parameter is solvable (locally). Estimable parameters, i.e., expected values of functions of the observations [57], are also identifiable.¹ Hence, a non-identifiable parameter is not estimable regardless of the scheme or the number of observations. These scenarios exist when some inherent ambiguity exists in the model.

Example 2.3. The parameter vector $\theta = [\theta_1, \theta_2]^T$ in the model

$$x = \theta_1 + \theta_2 + w,$$

where $w$ represents the observation noise, is not identifiable since it is indistinguishable from the parameter vector $\theta' = [\theta_1 + a, \theta_2 - a]^T$ for any real number $a$.

The Fisher information matrix is called regular if $I(\theta)$ is full rank, and $\theta$ is then said to be a regular point of $I(\theta)$. If the FIM is singular, $\theta$ is a singular point.

¹The converse is not true. While it is possible to develop estimation schemes for any identifiable parameters, there is no guarantee that those estimators will be unbiased.
Example 2.4. The FIM for the model

$$x = \theta^2 + w,$$

where $w \sim \mathcal{N}(0,1)$, is $I(\theta) = 4\theta^2$. For any $\theta \neq 0$, $\theta$ is locally identifiable but not identifiable, and at the same time a regular point of the FIM. For $\theta = 0$, $\theta$ is a singular point, yet is identifiable.

Fisher information regularity implies local identifiability, but as the example demonstrates, the converse is not true. Rothenberg [58] found a connection between local identifiability and the FIM under certain conditions.

Theorem 2.5 (Rothenberg). Assume the FIM $I(\theta)$ has constant rank locally about $\theta$. Then $\theta$ is locally identifiable if and only if $I(\theta)$ is regular.

2.2.2 Strong Identifiability

Suppose that $p(x;\theta)$ is a normal pdf with mean $\mu(\theta) \in \mathbb{R}^p$ and variance $\Sigma(\theta)$, whose elements may be explicitly defined by a map $\phi: \Theta \to \mathbb{R}^q$, where $q \leq p + p(p+1)/2$ and it is assumed $m \leq q$. Then, by the given definitions, (local) identifiability holds when $\phi$ is injective (locally), and since, by a transformation on the FIM [12, p. 157],

$$I(\theta) = \frac{\partial \phi^T(\theta)}{\partial \theta}\, I(\phi(\theta))\, \frac{\partial \phi(\theta)}{\partial \theta^T},$$

regularity holds when the Jacobian $\frac{\partial \phi(\theta)}{\partial \theta^T}$ has full rank $m$.

Suppose there exists a set of indices $i_1, \ldots, i_m \in \{1,\ldots,q\}$ that make $\tilde\phi(\theta) = [\phi_{i_1}(\theta), \ldots, \phi_{i_m}(\theta)]^T$ injective on $\Theta$. Then each $\theta \in \Theta$ is strongly identifiable and $\tilde\phi$ is a representative mapping. By this definition, if $I(\theta)$ is regular at $\theta$ then $\theta$ is in a strongly identifiable open neighborhood, and if $\theta$ is strongly identifiable on $\Theta$ then it is also identifiable on $\Theta$. The converses are not generally true. The following theorem establishes conditions under which the converses are true [29].

Theorem 2.6 (Hochwald and Nehorai). Let $\phi: \Theta \to \mathbb{C}^q$ be a holomorphic mapping of $z \in \Theta = \bigcup_{\alpha \in A} \Theta_\alpha$, where $\Theta_\alpha \subseteq \mathbb{C}^m$ and $\Theta_\alpha$ is open in $\mathbb{C}^m$ for each $\alpha$. Then (a) if $I(z)$ is regular, there exists a strongly identifiable open neighborhood about $z$, and (b) if there exists a representative mapping $\tilde\phi_\alpha: \Theta_\alpha \to \mathbb{C}^q$ for each $\alpha$, then $I(z)$ is regular for every $z \in \Theta$.

Therefore, the existence of proper holomorphic function(s) equates Fisher information regularity with strong identifiability for normal distributions. And locally constant rank in the FIM equates regularity with local identifiability for arbitrary distributions.

2.3 Linear Model

Suppose we have observations $x$ from a linear model

$$x = H\theta + w \qquad (2.3)$$

on $\theta$, where $H$ is an observation matrix consisting of known elements and $w$ is the observation noise with mean zero and variance $C$.

2.3.1 Best Linear Unbiased Estimators

The Gauss-Markov theorem [14] states that the best linear unbiased estimator (BLUE) is given by the (weighted) least squares solution

$$\hat\theta_{LS}(x) = \left(H^T C^{-1} H\right)^{-1} H^T C^{-1} x, \qquad (2.4)$$

so called for minimizing the (weighted) least squares criterion $(x - H\theta)^T C^{-1} (x - H\theta)$. For any other LUE of $\theta$, say $Ax$, we have $\mathrm{Var}(Ax) \geq \mathrm{Var}(\hat\theta_{LS}(x))$, with equality if and only if $A = \left(H^T C^{-1} H\right)^{-1} H^T C^{-1}$. This assumes a full column rank observation matrix $H$. Otherwise, for any estimable function of the parameters $d^T\theta$, where $d$ is in the column space of the transposed observation matrix $H^T$, its least squares solution is $d^T \hat\theta_{LS}(x)$, where

$$\hat\theta_{LS}(x) = \left(H^T C^{-1} H\right)^{\dagger} H^T C^{-1} x, \qquad (2.5)$$

and $(\cdot)^\dagger$ is the generalized pseudoinverse of $(\cdot)$ [57, theorems 11.2B, 11.3A-D]. This solution is also the BLUE, with variance $d^T \left(H^T C^{-1} H\right)^{\dagger} d$.
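As a brief illustration of (2.4), the following sketch (not part of the original text; NumPy assumed, with an arbitrary random instance) computes the BLUE and the matrix $(H^T C^{-1} H)^{-1}$ that, by theorem 2.7 below, is both its covariance and the CRB under Gaussian noise.

```python
import numpy as np

# Minimal sketch of the weighted least squares / BLUE solution (2.4):
# theta_hat = (H^T C^{-1} H)^{-1} H^T C^{-1} x, whose covariance is
# (H^T C^{-1} H)^{-1}, the CRB under Gaussian noise (theorem 2.7).
rng = np.random.default_rng(1)
n, m = 8, 3
H = rng.normal(size=(n, m))                  # known full-rank observation matrix
C = np.diag(rng.uniform(0.5, 2.0, size=n))   # known noise covariance
theta = np.array([1.0, -2.0, 0.5])

Cinv = np.linalg.inv(C)
G = np.linalg.inv(H.T @ Cinv @ H)            # estimator covariance = CRB

w = rng.multivariate_normal(np.zeros(n), C)
x = H @ theta + w
theta_hat = G @ H.T @ Cinv @ x               # the BLUE of theta

print(theta_hat)   # close to theta
print(G)           # lower-bounds the covariance of any linear unbiased estimator
```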
2.3.2 Gaussian noise

Additionally, if the noise has a Gaussian distribution, i.e., $w \sim \mathcal{N}(0, C)$, then the least squares solution is also the maximum likelihood estimator (MLE) and the minimum variance unbiased estimator (MVUE).

Theorem 2.7. If the observations obey the linear model in (2.3), where $H$ is a known full column rank matrix, $\theta$ is an unknown parameter vector, and $w$ is a zero-mean normal random vector with variance $C$, then the MVUE is

$$\hat\theta_{LS}(x) = \left(H^T C^{-1} H\right)^{-1} H^T C^{-1} x \qquad (2.6)$$

with estimator covariance equaling the CRB, $I^{-1}(\theta) = \left(H^T C^{-1} H\right)^{-1}$.

Similarly, if $H$ is not full rank and $d^T\theta$ is an estimable function, then the MLE is $d^T \hat\theta_{MLE}(x)$, where $\hat\theta_{MLE}(x) = \hat\theta_{LS}(x)$ from (2.5). This MLE is also the MVUE [57, theorems 11.3F-G].

2.4 Maximum likelihood

Given observations $x$ from a likelihood (or pdf) $p(x;\theta)$ depending on an unknown parameter $\theta$, a popular method of estimating the parameter is the method of maximum likelihood. This approach chooses as the estimate the value $\hat\theta_{ML}(x)$ that, if true, would have the highest probability (the maximum likelihood) of resulting in the given observations $x$, i.e., the optimization problem

$$\hat\theta_{ML}(x) = \arg\max_\theta \log p(x;\theta),$$

where for convenience the log-likelihood is equivalently maximized, since $\log(\cdot)$ is monotone. An analytic solution for the MLE can be found from the first-order conditions on the log-likelihood by considering solutions $\hat\theta(x)$ of

$$s(x;\theta') = 0. \qquad (2.7)$$

This is the method of maximum likelihood. Provided $\Theta$ is an open set, $\hat\theta_{ML}(x)$ will satisfy (2.7).

2.4.1 Efficient estimation

If an efficient estimator exists, it is well known that the method of maximum likelihood finds the estimator [36, exercise 7.12], i.e., such an estimator must be a stationary point of the maximum-likelihood optimization problem. More formally, we have the following theorem.

Theorem 2.8. If $t(x)$ is an estimator of $\theta$ which is also efficient with respect to the CRB, then the estimator is a stationary point of the following optimization problem:

$$\max_\theta \log p(x;\theta).$$

2.4.2 Asymptotic Normality

Let the samples $x_1, x_2, \ldots, x_n$ be iid as $x$ from the pdf $p(x;\theta)$. Denote by $y_n = (x_1, x_2, \ldots, x_n)$ the collection of these samples, so that the likelihood is

$$p(y_n;\theta) = \prod_{i=1}^n p(x_i;\theta),$$

with the maximum likelihood estimate from these samples denoted $\hat\theta(y_n)$.

Theorem 2.9. Assuming the regularity conditions stated earlier on the pdf $p(x;\theta)$, the MLE of the parameter $\theta$ is asymptotically distributed according to

$$\sqrt{n}\left(\hat\theta(y_n) - \theta\right) \xrightarrow{d} \mathcal{N}\left(0, I^{-1}(\theta)\right),$$

where $I(\theta)$ is derived from the pdf $p(x;\theta)$, i.e., it is the Fisher information of a single observation or sample.

2.4.3 Scoring

There exist a number of approaches to finding the maximum likelihood estimate, in some cases requiring iterative techniques. One such technique is Fisher's method of scoring. Given an observation $x$ from a likelihood or pdf $p(x;\theta)$ depending on an unknown parameter $\theta$, and given some initial estimate $\hat\theta^{(1)}$ of $\theta$, iteratively applying the update

$$\hat\theta^{(k+1)} = \hat\theta^{(k)} + I^{-1}(\hat\theta^{(k)})\, s(x; \hat\theta^{(k)}) \qquad (2.8)$$

will find the MLE under certain conditions, e.g., provided the initial estimate is sufficiently close in a locally convex region.
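The following sketch (not part of the original text) illustrates the update (2.8) on a problem with no closed-form MLE; the Cauchy location model is chosen here purely for illustration, using the standard facts that its per-sample Fisher information is $1/2$ and its score is $2(x-\theta)/(1+(x-\theta)^2)$.

```python
import numpy as np

# Minimal sketch of Fisher scoring (2.8) for the location parameter of a
# Cauchy distribution. For n iid samples, the score is
# s(x; theta) = sum_i 2(x_i - theta) / (1 + (x_i - theta)^2)
# and the Fisher information is I(theta) = n / 2.
rng = np.random.default_rng(2)
theta_true, n = 3.0, 500
x = theta_true + rng.standard_cauchy(n)

theta = np.median(x)                 # initial estimate, robust for Cauchy data
for _ in range(20):
    r = x - theta
    score = np.sum(2 * r / (1 + r**2))
    theta = theta + score / (n / 2)  # update (2.8): theta + I^{-1} * score

print(theta)  # MLE of the location, near theta_true
```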
2.5 Hypothesis testing

Given the inclusion of the CRB quantity in the asymptotic normality results of section 2.4.2, it is not surprising that there also exist connections to some asymptotic hypothesis tests. Assume $h: \mathbb{R}^m \to \mathbb{R}^r$ is a consistent and nonredundant differentiable function, which defines the null hypothesis $H_0: h(\theta) = 0$ in the likelihood (or model) $p(y_n;\theta)$, versus the alternative hypothesis $H_1: h(\theta) \neq 0$.

2.5.1 The Rao statistic

The Rao (or score) test statistic is given by

$$\rho(y_n) = \frac{1}{n}\, s^T(y_n; \hat\theta_h(y_n))\, I^{-1}(\hat\theta_h(y_n))\, s(y_n; \hat\theta_h(y_n)) \qquad (2.9)$$

where the Fisher score at $y_n$ is

$$s(y_n; \hat\theta_h(y_n)) = \sum_{i=1}^n s(x_i; \hat\theta_h(y_n))$$

and $\hat\theta_h(y_n)$ is the MLE of $\theta$ under the null hypothesis $h(\theta) = 0$. A variant of this statistic, the Lagrange multiplier test [3, 63],

$$\hat\lambda_n^T\, H(\hat\theta_h(y_n))\, I^{-1}(\hat\theta_h(y_n))\, H^T(\hat\theta_h(y_n))\, \hat\lambda_n,$$

was developed by Silvey [63]. The equivalence between the Rao and Lagrange multiplier tests comes from the first-order condition to satisfy the constraint [45], i.e.,

$$s(y_n; \hat\theta_h(y_n)) + H^T(\hat\theta_h(y_n))\, \hat\lambda_n = 0, \qquad h(\hat\theta_h(y_n)) = 0,$$

where $\hat\lambda_n \in \mathbb{R}^r$ is a vector of Lagrange multiplier estimates. First-order Taylor series expansions of both equations about the true parameter $\theta$ produce

$$s(y_n;\theta) - I_n(\theta)\left(\hat\theta_h(y_n) - \theta\right) + H^T(\hat\theta_h(y_n))\,\hat\lambda_n = o(n^{-1/2}),$$
$$h(\theta) + H(\theta)\left(\hat\theta_h(y_n) - \theta\right) + o(n^{-1/2}) = h(\hat\theta_h(y_n)),$$

where $I_n(\theta) = n I(\theta)$ ($n$ times the Fisher information based on a single sample $x$). Hence, under the null hypothesis, the latter implies $H(\theta)(\hat\theta_h(y_n) - \theta) = o(n^{-1/2})$, and premultiplying the former by $H(\theta) I_n^{-1}(\theta)$ yields

$$H(\theta) I_n^{-1}(\theta)\, s(y_n;\theta) + H(\theta) I_n^{-1}(\theta) H^T(\hat\theta_h(y_n))\,\hat\lambda_n = o(n^{-1/2}).$$

Since $s(y_n;\theta) \sim \mathcal{N}(0, I_n(\theta))$ asymptotically, applying Slutsky's theorem and the continuity of the Fisher information and the hypothesis function gives

$$H(\hat\theta_h(y_n))\, I_n^{-1}(\hat\theta_h(y_n))\, H^T(\hat\theta_h(y_n))\,\hat\lambda_n \xrightarrow{d} \mathcal{N}\left(0,\ H(\theta) I_n^{-1}(\theta) H^T(\theta)\right).$$

Therefore,

$$\left( H(\hat\theta_h(y_n))\, I_n^{-1}(\hat\theta_h(y_n))\, H^T(\hat\theta_h(y_n)) \right)^{-1/2} \hat\lambda_n$$

is an $r$-dimensional standard normal variable in distribution, and $\rho(y_n) \xrightarrow{d} \chi^2_r$. Hypothesis $H_0$ is rejected if $\rho(y_n) > \chi^2_{r,\alpha}$, where $\chi^2_{r,\alpha}$ is the $(1-\alpha)$-quantile of the chi-square distribution with $r$ degrees of freedom.

2.5.2 The Wald statistic

The Wald test statistic is given by

$$\omega(y_n) = n\, h^T(\hat\theta(y_n)) \left( H(\hat\theta(y_n))\, I^{-1}(\hat\theta(y_n))\, H^T(\hat\theta(y_n)) \right)^{-1} h(\hat\theta(y_n)) \qquad (2.10)$$

where

$$H(\theta) = \left.\frac{\partial h(\theta')}{\partial \theta'^T}\right|_{\theta'=\theta}$$

and $\hat\theta(y_n)$ is an MLE of $\theta$. (The nonredundancy of $h$ implies that $H(\theta)$ is full row rank.) From section 2.1.1, the CRB of $h(\theta)$ in the model $p(x;\theta)$ is $H(\theta) I^{-1}(\theta) H^T(\theta)$, and therefore, using theorem 2.9,

$$\sqrt{n}\left( h(\hat\theta(y_n)) - h(\theta) \right) \xrightarrow{d} \mathcal{N}\left(0,\ H(\theta) I^{-1}(\theta) H^T(\theta)\right).$$

Hence, under $H_0$, using Slutsky's theorem, the convergence in probability of the MLE, and the continuity of the FIM and the Jacobian of the test function, $\omega(y_n) \xrightarrow{d} \chi^2_r$. Therefore, the hypothesis $H_0$ is rejected if $\omega(y_n) > \chi^2_{r,\alpha}$.
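A minimal sketch of the Wald test (2.10) follows (not part of the original text; NumPy and SciPy assumed), for the simplest case $x_i \sim \mathcal{N}(\theta, 1)$ with $h(\theta) = \theta$, where the statistic reduces to $n\bar{x}^2$ with $r = 1$.

```python
import numpy as np
from scipy.stats import chi2

# Minimal sketch of the Wald test (2.10) for H0: h(theta) = theta = 0, with
# x_i ~ N(theta, 1). Here the MLE is the sample mean, H(theta) = 1, and
# I(theta) = 1, so omega = n * xbar^2, asymptotically chi-square with r = 1.
rng = np.random.default_rng(3)
n, alpha = 400, 0.05
x = rng.normal(loc=0.1, scale=1.0, size=n)    # data with a small true offset

theta_hat = x.mean()                          # unrestricted MLE
omega = n * theta_hat**2                      # Wald statistic (2.10)
threshold = chi2.ppf(1 - alpha, df=1)         # (1 - alpha)-quantile of chi2_1

print(omega, threshold, omega > threshold)    # reject H0 when omega > threshold
```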
2.6 Discussion

In this chapter, the Cramér-Rao bound (CRB) was defined, and theorem 2.1 stated that it is a bound on the mean-square error performance of an unbiased estimator. The theory of the CRB's connection to a variety of useful theorems and equations in mathematical statistics was demonstrated. The CRB is included in conditions for local identifiability (theorem 2.5) as well as conditions for strict identifiability (theorem 2.6). The CRB is the covariance of the BLUE under a Gaussian model in equation (2.6). There is a connection between the existence of efficiency with respect to the CRB and the method of ML (theorem 2.8). The CRB is also the asymptotic variance of the ML estimator (theorem 2.9) and appears in the update formula (2.8) for the method of scoring. The CRB also appears in the formulas for the Rao test statistic in (2.9) and for the Wald test statistic in (2.10).

These connections are not exhaustive, e.g., the CRB formula can also be useful in defining confidence regions or as a cost function, but these are perhaps the most prevalent general topics in mathematical statistics theory, and for that reason they serve as a useful point of comparison for the constrained Cramér-Rao bound in chapter 3 and its connections in the theory.

Chapter 3
THE CONSTRAINED CRAMÉR-RAO BOUND

While the Cramér-Rao bound (CRB) is a useful measure of parametric estimation, it does not inherently measure the performance of estimators of parameters that satisfy side information in the form of a functional equality constraint

$$f(\theta) = 0. \qquad (3.1)$$

The statistical literature is surprisingly limited in addressing performance measures under this general scenario. The traditional practice is to find some equivalent reparameterization of the particular model and then find the CRB on the parameters of interest using the reparameterized transformation. This approach, however, does not lend itself to theoretical meaning beyond the particular reparameterized model. Typically, works that do examine (3.1) in a general manner are focused on developing methods for decisions (hypothesis testing) instead of measuring estimation performance. Nevertheless, these results have connections to a CRB incorporating the side information in (3.1), or a constrained CRB. A number of papers, including Aitchison and Silvey [3] and Crowder [18], using the method of Lagrange multipliers, examine the asymptotic normality of the constrained maximum likelihood estimator (CMLE) and as a consequence unintentionally develop a CRB under equality constraints. Given that, under certain conditions, the asymptotic variance of the MLE equals the CRB, it is credible that the asymptotic variance of the CMLE should equal the CRB under equality constraints, although the authors did not always appear cognizant of this fact. Others also use the Lagrange multiplier approach in developing a constrained bound, including Silvey [63] to develop his Lagrange multiplier test, Osborne [55] with linear constraints to develop a scoring algorithm, and Waldorp, Huizenga, and Grasman [73] to develop a Wald-type test. Again, since these authors were primarily focused on asymptotic properties or hypothesis testing, the nature and utility of this mathematical quantity in their work is not explicitly stated as a CRB or a bound on the performance of estimators of parameters under constraints. The creation of a constrained bound strictly for use in performance analysis wasn't achieved until Gorman and Hero [23], who derived such a measure by taking the limit of the Hammersley-Chapman-Robbins bound with test points restricted to exist only in the constraint space. This constrained Cramér-Rao bound (CCRB),

$$I^{-1}(\theta) - I^{-1}(\theta) F^T(\theta) \left( F(\theta)\, I^{-1}(\theta)\, F^T(\theta) \right)^{-1} F(\theta)\, I^{-1}(\theta), \qquad (3.2)$$

utilizes the Jacobian $F(\theta)$ of the functional constraint $f(\theta)$ and the inverse of the Fisher information matrix (FIM) $I(\theta)$ (based on the unconstrained model), which must be nonsingular. As with the CRB, there exist a number of alternative derivations.
The works of Crowder, Waldorp et al., Gorman and Hero, and Aitchison [2] include the formula in (3.2) in some manner in which the CRB might be used for the unconstrained scenario, a fact that implicitly proves the validity of the CCRB. With the explicit proof by Gorman and Hero as a guideline, Marzetta [47] provides an elementary proof of this CCRB, which avoids the application of the Cauchy-Schwarz inequality and the use of pseudoinverses, by examining the inequality created from the positive-semidefiniteness property of the variance of a properly defined random variable. A similar construction was used by Stoica and Ng [68] to formulate a more general CCRB,

$$U(\theta)\left( U^T(\theta)\, I(\theta)\, U(\theta) \right)^{-1} U^T(\theta), \qquad (3.3)$$

that incorporates the constraint information without the assumption of a nonsingular FIM. This CCRB utilizes the unconstrained FIM and an orthonormal complement matrix $U(\theta)$ whose column vectors span the null space of the constraint Jacobian matrix. Furthermore, when the FIM is nonsingular, the Stoica-Ng CCRB in (3.3) is equivalent to the Gorman-Hero version in (3.2). Hence, while much of the previous work used the formula in (3.2), the more general formula (3.3) is also applicable. Osborne, independently from much of this other work, developed a method of scoring with constraints that utilizes the Stoica-Ng CCRB formula in (3.3) as the projection matrix in place of the CRB. There are, of course, numerous instances of matrix structures of the same form as (3.3), for example as part of the projection matrix of the generalized least squares estimate of the mean of a linear model.

In this chapter, we develop a very simple derivation of the CCRB in (3.3). Rather than assuming the parameters satisfy functional constraints, we approach the problem theoretically from an alternative, yet equivalent, perspective and assume the parameters locally fit a reduced parametric model. This approach permits the extension of the existing classical theory underlying the CRB to the case of a model under parametric constraints. While it is true that several of these extensions
Incorporating this additional side information into the information from the observations x directly would require an alteration of the pdf?s dependence on the unknown parameter. Such an approach can often be analytically impractical or numerically complex. Hence, it is desirable to have a formulaic or prescriptive ap- proach to include the side information, or constraints, indirectly. To meet this need, Stoica and Ng developed a method to incorporate parametric equality constraints into the CRB [68, theorem 1]. Theorem 3.1 (Stoica & Ng). The constrained Cram?er-Rao bound on ? ? ?f is given by CCRB(?) =U(?)parenleftbigUT(?)I(?)U(?)parenrightbig?1UT(?) (3.5) where U(?) is a matrix whose column vectors form an orthonormal basis for the null space of the Jacobian F(?), i.e., F(?)U(?) = 0 , UT(?)U(?) =I(m?k)?(m?k). (3.6) Thus, ift(x) is an unbiased estimator of?, which satisfies the constraint (3.4), then Var(t(x))?CCRB(?) with equality if and only if t(x)?? = CCRB(?)s(x;?) in the mean-square sense.1 1The originaltheorem requires the estimator to satisfy the constraint. In general, the parameter and its unbiased estimator will not simultaneously satisfy the constraint since the implication, mainly that f(E?t(x)) = E?f(t(x)), is true only under particular conditions. However, the CCRB is the same if either assumption is made exclusively. In this treatise, I assume that the actual parameter ? satisfies the constraint and the unbiased estimator t(x) does not. (In section 3.4, the constrained maximum likelihood estimator (CMLE) is assumed to satisfy the constraint, but unbiasedness is not assumed.) 23 The Jacobian F(?) having full row rank is not necessary since the Jacobian does not explicitly appear in the CCRB formula in (3.5). Indeed, the requirement that the column vectors ofU(?) form an orthonormal basis is also unnecessary, only that they be linearly independent and that they span the basis of the null space of the row vectors of F(?), i.e., only that the column space of U(?) is an orthogonal complement of the row space of F(?), is required since it is clear from the structure of (3.5) that the CCRB is invariant to automorphisms inRm?rank(F(?)) onU(?). Re- gardless, for convenience,and except where otherwise noted, we will assume that the constraints are nonredundant and the columns of U(?) are orthonormal to ensure that rank(U(?)) = m?k and UT(?)U(?) = Im?k, respectively. The existence of the bound only requires thatUT(?)I(?)U(?) rather than the FIM be nonsingular.2 The original proof of this theorem given by Stoica and Ng considers the variance inequality generated by the random variable t(x)???WU(?)UT(?)s(x;?) and maximizesW to attain the tightest bound for Var(t(x)??). Example 3.2 (Unit Modulus Constraint). Suppose ? in example 2.2 is constrained to be unit modulus, i.e., f(?) =|?|2?1. Then its gradient in terms of the param- eter vector ? = bracketleftbiggRe(?) Im(?) bracketrightbigg is F(?) = bracketleftbig2Re(?), 2Im(?)bracketrightbig, which has an orthonormal complementU(?) = bracketleftbigg?Im(?) Re(?) bracketrightbigg . The CCRB for this constraint is then CCRB(?) = ? 2 2 bracketleftbigg Im(?)2 ?Im(?)Re(?) ?Re(?)Im(?) Re(?)2 bracketrightbigg . 2A corresponding regularity condition to that mentioned in section 2.1 will be discussed in section 3.1.4. 
24 3.1.1 A proof of the CCRB While the proof given by Stoica and Ng is sufficient to establish the validity of the CCRB, these proofs ignore the existing classical theory encompassing the CRB and FIM, which is already sufficient to prove the CCRB. However, prior to developing the foundation for the CCRB from the existing theory, we require a foray into multivariable calculus and, specifically, the use of the implicit function theorem. The reward for this approach will be a seamless presentation of statistical inference involving the constrained Cram?er-Rao bound. From the perspective of multivariable calculus, the constraint f(?) = 0 effec- tively restricts? to a manifold ?f of the original vector space ?, with the manifold having dimension m?k since k degrees of freedom are lost when rank(F(?)) = k for all ? ? ?f. More formally, the following theorem [65, theorems 5-1 and 5-2] applies. Theorem 3.3 (Implicit Function Theorem). Let U ? Rm be an open set and assume f : U?Rk is a differentiable function such that F(?) has rank k whenever f(?) = 0. Then ?f ?U is an (m?k)-dimensional manifold in Rm, and for every ? ? ?f ?U there is an open set V owner ?, an open set W ? Rm?k, and a 1-1 differentiable function g? : W?Rm such that (a) g?(W) = ?f ?U?V, and (b) the Jacobian of g?(?prime) has rank m?k for each ?prime ?W. Therefore, there exists a function g? : Rm?k ? Rm, and sets O owner? and P open in ?f and Rm?k, respectively, such that g? : P?O is a diffeomorphism on P, 25 i.e., a continuously differentiable bijection with a continuously differentiable inverse. A geometric example is shown in Figure 3.1. Note this diffeomorphism depends on the parameter ? as the reparameterization is only guaranteed to exist in a local neighborhood of ?; however, for convenience, we will omit this notation in this subsection so that g=g?. Thus, we may proceed under the assumption that every ?prime ?O ? ?f is the image of a unique reduced parameter vector ?prime ? P?Rm?k under g, or simply ?prime =g(?prime). (3.7) Necessarily, there exists some unique ??P for which ? =g(?). We will denote the Jacobian of g to be G(?prime) = ?g(? prime) ??primeT , which also implicitly depends on ?. 4 6 ?0.8 ?0.6 ?0.4 ?0.2 0 0.2 0.4 0.6 ?o ?o 2D manifold in 3D space Reparameterized Space f(?)=0 ? g(?) g?1(?) ? Figure 3.1: Reparameterization of f(?) = 0 to ? =g(?). Example 3.4 (Unit Modulus Constraint). As an example of this principle, con- sider a complex parameter ? with a modulus constraint (as in example 3.2). The parameter vector in this case may be ? = [?1,?2]T = [Re(?),Im(?)]T ? R2 with the constraint being f(?) = ?21 + ?22 ?1 = 0. By the implicit function theorem, 26 constraining ? to be unit modulus is tantamount to assuming the existence of a ??R such that ?= e?j?. Hence, ? is a function of ?, i.e., ? = bracketleftbiggRe(?) Im(?) bracketrightbigg = bracketleftbigg cos(?) ?sin(?) bracketrightbigg =g(?). (3.8) Also, g(R) ={??R2 : f(?) = 0} and G(?) = bracketleftbig?sin(?) , ?cos(?)bracketrightbigT has rank 1 for every ?. For the model in example 3.2, then CCRB(?) = ? 2 2 bracketleftbigg sin2(?) ?sin(?)cos(?) ?cos(?)sin(?) cos2(?) bracketrightbigg , which is exactly as before. It must be noted thatgwillnot be unique. For the previousexample,?= ej? is another possible reparameterization. Nor is anyg satisfying the theorem necessarily a 1-1 correspondence between Rm?k and ?f; again, for the previous example, g is periodic. Thus, the bijection is only guaranteed locally. Finding a particularg for a givenf and? 
may not be obvious. Methods for approximating an implicit function will be discussed in section 3.1.5. Regardless, as shall be shown in the context of the CCRB, knowledge of any particular g is unnecessary; only its existence, given by the implicit function theorem, is necessary. Why? Using the implicit function theorem to assume a locally equivalent reparameterization for the constraint limits the information from the observations to the density function?s local dependence on the unknown parameter. But as the CRB (and hence CCRB) is a local bound that only characterizes the local noise ambiguities in the model, i.e., the average local curvature of the density at the parameter value of interest, this limitation is invariant to the local curvature restricted to a space determined by the constraints. 27 Theorem 3.5 ([50]). The CRB on??O under the assumption of (3.7) is given by G(?)parenleftbigGT(?)I(g(?))G(?)parenrightbig?1GT(?) (3.9) and if g in (3.7) is an implicit function of f(?) = 0 then this bound is equivalent to the constrained Cram?er-Rao bound in (3.5). Proof. This CRB can be developedfrom a transformation of parameters on the FIM and the CRB. From the CRB transformation of parameters on the CRB (see section 2.1.1), if t(x) is an unbiased estimator3 of ? =g(?) then Var(t(x))?CRB(g(?)) =G(?)?I?1(?)GT(?) (3.10) with equality if and only if t(x)?g(?) = G(?)?I?1(?)?s(x;?) in the mean-square sense [36, Appendix 3B], where ?I(?) = E??s(x;?)?sT(x;?) is the FIM on ? and ?s(x;?) = ? logq(x;? prime) ??prime vextendsinglevextendsingle vextendsingle ?prime=? is the Fisher score of the pdf with respect to?, this pdf be- ing q(x;?) = p(x;g(?)). By application of the derivative chain rule the Fisher score of x with respect to ? is ?s(x;?) = GT(?)s(x;?). Hence, from the transformation of parameters on the FIM [12, p. 157], the FIM on ? is4 ?I(?) = Ebraceleftbig?s(x;?)?sT(x;?)bracerightbig = EbraceleftbigGT(?)s(x;?)sT(x;?)G(?)bracerightbig = GT(?)I(?)G(?) = GT(?)I(g(?))G(?). (3.11) 3Again, there is no actual use of the assumption here that t(x)??f although if t(x)?g(P) then t(x) does indeed satisfy the constraint. The theorem result for unbiased estimators holds regardless and this will depend on the regularity condition detailed in section 3.1.4. 4Implicitly, it is assumed that the regularity conditions of section 2.1 hold with respect to ?. For how this applies to ?, see section 3.1.4. 28 Substituting (3.11) into (3.10) establishes the CRB under the assumption? =g(?). To establish the equivalence with the CCRB, note sincef?g(?) = 0, then by taking the Jacobian with respect to ? we have 0 = ???T 0 = ???f(g(?)) =F(g(?))G(?) =F(?)G(g?1(?)). Hence, the columns of G(?) reside in the null space of the row vectors of F(?). And since g has an inverse locally at ? = g?1(?), then G(?) has full column rank m?k locally about ?. (This is true on the whole set P ? Rm?k.) Therefore, span(G(g?1(?))) = span(U(?)) and there exists an m?k?m?k full rank trans- formation matrix S(?) such that G(g?1(?))S(?) = U(?). (This is true on the whole set O??f.) This matrixS(?) is merely an orthonormalizing change of basis on the columns of G(g?1(?)). Therefore, G(?)parenleftbigGT(?)I(g(?))G(?)parenrightbig?1GT(?) = G(g?1(?))parenleftbigS?T(?)UT(?)I(?)U(?)S?1(?)parenrightbig?1GT(g?1(?)) = G(g?1(?)S(?)parenleftbigUT(?)I(?)U(?)parenrightbig?1ST(?)G(g?1(?)) = U(?)parenleftbigUT(?)I(?)U(?)parenrightbig?1UT(?). Also, Var(t(?))?CCRB(?) with equality if and only if t(x)?g(?) = G(?)?I?1(?)?s(x;?) t(x)?? = G(?)?I?1(?)GT(?)s(x;?) 
= CCRB(?)s(x;?). 29 This proof of the CCRB uses the implicit function theorem, the CRB transfor- mation formula, the FIM transformation formula, as well as well-known properties of rank and the derivative chain rule. An advantage of establishing the CCRB from these classical results will become clear as we establish the connection throughout statistical information theory. An example of this is immediately evident in an alternative proof of a proposition of Stoica and Ng [68, proposition 1]. Corollary 3.6 (Stoica & Ng). Given the regularity conditions on ?, a necessary and sufficient condition for the existence of a finite CCRB of ? is vextendsinglevextendsingleUT(?)I(?)U(?)vextendsinglevextendsinglenegationslash= 0, i.e.,UT(?)I(?)U(?) is nonsingular. Proof. From the prior theorem, it is clear thatUT(?)I(?)U(?) is nonsingular if and only if GT(?)I(g(?))G(?) is nonsingular if and only if ?I(?) is nonsingular. Since a necessary and sufficient condition for the existence of a finite CRB of g(?) is that ?I(?) is nonsingular, the corollary is proven. Thus, with the usual regularity conditions (see section 2.1) being maintained for ?, the conditions for the existence of the CCRB with respect to? are equivalent to the conditions for the existence of the CRB in the reduced parameter space of ?. Moreover, the matrix UT(?)I(?)U(?) is nonsingular if and only if U(?) has fullcolumnrankm?kandno linearcombinationof the columnsofU(?)reside inthe null space of I(?). The first condition is satisfied always by definition. The second condition implies,for example, that ifL(?)LT(?) is the Cholesky decomposition [20, 30 p.194] of the FIM, whereL(?)?Rm?rank(I(?)) is a full column rank lower triangular matrix with strictly positive values on the diagonal, then U(?) = L(?)A(?) + K(?)B(?) where K(?) is an orthogonal complement of L(?) and for some full column rank A(?)?Rrank(I(?))?m?k. Further properties of the CCRB will be discussed in section 3.1.4. 3.1.2 Alternative formulas The formula Stoica and Ng used to express the constrained Cram?er-Rao bound is a generalization of an expression developed earlier first by Gorman and Hero [23, theorem 1] and later by Marzetta [47, theorem 2]. This is the CCRB formula in (3.12). Although Gorman and Hero?sformula requires a nonsingular Fisher informa- tion, the version developed by Stoica and Ng appears to be inspired by a result [23, (19) in lemma 2] in Gorman and Hero?s paper that unnecessarily assumes a positive definite FIM. However, this result was, in essence, not unknown in the literature. Gorman and Hero were aware of the prior work of Aitchison and Silvey [3, theorem 2 and P on p.823], which is concerned with the asymptotic variance of the maximum likelihood estimator subject to restraints. But they were perhaps unaware (by lack of citation) of a later paper on hypothesis tests associated with the maximum like- lihood, in which Aitchison and Silvey suggest a solution to the problem of singular information matrices [4, section 6]. This is the CCRB formula in (3.14). This ver- sion of the CCRB was also proven by Crowder [18, theorem 3]. Concurrent to the Gorman and Hero effort, Hendriks [27] and later Oller and Corcuera [53] developed 31 an extension of the Cram?er-Rao bound intrinsic to the manifold using Riemannian geometry. More recently, Xavier and Barroso [74, 75] specified the lower bound on the geodesic of estimators to the true parameter. The original and latest versions of their bound are expressed in (3.16) and (3.17), respectively. 
While these CCRB expressions are not the focus of the current treatise, they are still important for possible insights into the constrained Cram?er-Rao bound. In this section, aspects of these insights will be briefly discussed as well as conditions for equality with the CCRB expression in (3.5). 3.1.2.1 Gorman-Hero-Aitchison-Silvey CCRB Aitchison and Silvey used the method of Lagrange multipliers to show that the weighted asymptotic variance of the constrained maximum likelihood estimator, which should be the CCRB (implicitly), tends to CCRB2(?) =I?1(?)?I?1(?)FT(?)parenleftbigF(?)I?1(?)FT(?)parenrightbig?1F(?)I?1(?) (3.12) when the Fisher information is nonsingular. Alternatively, Gorman and Hero de- veloped this same CCRB by restricting test points in the Chapman-Robbins bound (a Barankin-type bound) to be in the constraint space ?f and then finding the derivatives as the limit of the finite difference expressions in the Chapman-Robbins bound.5 A simpler proof was provided by Marzetta by considering the positive semidefiniteness of a properly chosen random variable. A particular advantage to this presentation of the CCRB is the explicit quan- 5A definition of the Chapman-Robbins bound as well as a variant of the Gorman and Hero proof, which allows for a singular FIM, can be found in Appendix A.1. 32 tification of the gain in performance potential. Imposing a constraint f(?) = 0 on a set of parameters improves (lowers) the unconstrained bound from I?1(?) by exactly I?1(?)FT(?)parenleftbigF(?)I?1(?)FT(?)parenrightbig?1F(?)I?1(?). Since the CRB of f(?) is F(?)I?1(?)FT(?) then its inverse is the Fisher information of f(?) or 0. And FT(?)parenleftbigF(?)I?1(?)FT(?)parenrightbig?1F(?) is the Fisher information of ? generated from the constraint f(?) = 0. A disadvantage is the requirement of a nonsingular FIM. There exist numerous scenarios that require constraints for the original model to be identifiable (see Chapter 4). Additionally, this CCRB formula requires nonredun- dant constraints, i.e., the Jacobian F(?) must be full row rank. Similarities include the computational complexity of both formulas, which is O(m3). And when the Fisher information is nonsingular (and the constraints nonredundant), both formulas are equivalent. Theorem 3.7. When the Fisher information is nonsingular and the constraints nonredundant, then an equivalent formula for the CCRB in (3.5) is CCRB2(?). Proof. This is a different proof than the one provided in [68, corollary 1].6 The existence of the Gorman-Hero-Marzetta formula assumes that the FIM I(?) and the CRB of the constraint F(?)I?1(?)FT(?) are regular (non-singular) [23, 47]. Correspondingly, the existence of the Stoica and Ng CCRB formula assumes that 6In addition to this alternate proof, they reference an algebraic identity from [37] that is useful in establishing the result. Lemma 3.8 (Khatri). Suppose A is p?q and B is p?p?q have ranks q and p?q respectively such that BTA= 0. Then for any symmetric positive definite matrix S, S?1?S?1A(ATS?1A)?1ATS?1 =B(BTSB)?1BT. Substituting I(?) for S, F(?) for AT, and U(?) for B shows the equivalence between the two CCRBs for when the FIM is nonsingular. 33 UT(?)I(?)U(?) is non-singular. Now, right-multiplying both formulas by FT(?) returns the results CCRB(?)FT(?) = U(?)parenleftbigUT(?)I(?)U(?)parenrightbig?1UT(?)FT(?) = 0, CCRB2(?)FT(?) = I?1(?)FT(?)?I?1(?)FT(?) = 0 since F(?)U(?) = 0 in the first equation and the elimination of the inverse in CRB(f(?)) in the second. 
Alternatively, right-multiplying both formulas by the matrix $I(\theta)U(\theta)$ returns

$$\mathrm{CCRB}(\theta)I(\theta)U(\theta) = U(\theta), \qquad \mathrm{CCRB}_2(\theta)I(\theta)U(\theta) = U(\theta),$$

again by eliminating the inverse in the first equation and since $F(\theta)U(\theta) = 0$ in the second. Hence we have the equality

$$\mathrm{CCRB}(\theta)\bigl[F^T(\theta)\;\; I(\theta)U(\theta)\bigr] = \mathrm{CCRB}_2(\theta)\bigl[F^T(\theta)\;\; I(\theta)U(\theta)\bigr],$$

and if it can be shown that the matrix $\bigl[F^T(\theta),\; I(\theta)U(\theta)\bigr]$ is regular, then the two CCRB formulas are equivalent. Suppose there exist vectors $\lambda\in\mathbb{R}^k$ and $\gamma\in\mathbb{R}^{m-k}$ such that

$$F^T(\theta)\lambda + I(\theta)U(\theta)\gamma = 0. \qquad (3.13)$$

Pre-multiplying (3.13) by $U^T(\theta)$ implies that $U^T(\theta)I(\theta)U(\theta)\gamma = 0$. Since $U^T(\theta)I(\theta)U(\theta)$ is regular, $\gamma = 0$. Likewise, pre-multiplying (3.13) by $F(\theta)I^{-1}(\theta)$ implies that $F(\theta)I^{-1}(\theta)F^T(\theta)\lambda = 0$. Since $F(\theta)I^{-1}(\theta)F^T(\theta)$ is regular, $\lambda = 0$. Hence $F^T(\theta)\lambda + I(\theta)U(\theta)\gamma = 0$ implies $\lambda = 0$ and $\gamma = 0$, which proves that $\bigl[F^T(\theta)\;\; I(\theta)U(\theta)\bigr]$ is full rank.

3.1.2.2 Aitchison-Silvey-Crowder CCRB

The solution for resolving invertibility of singular FIMs in the variance and test results of Aitchison and Silvey [4] was to load the Fisher information with a matrix of the form $F^T(\theta)F(\theta)$. This was made more rigorous by Crowder [18], who replaced the Fisher information $I(\theta)$ with a loaded FIM $D(\theta) = I(\theta) + F^T(\theta)KF(\theta)$, where $K$ is any positive semidefinite matrix such that $D(\theta)$ is regular. Hence, a generalization of (3.12) is

$$\mathrm{CCRB}_3(\theta) = D^{-1}(\theta) - D^{-1}(\theta)F^T(\theta)\bigl(F(\theta)D^{-1}(\theta)F^T(\theta)\bigr)^{-1}F(\theta)D^{-1}(\theta). \qquad (3.14)$$

This extension permits a singular FIM. The formulation is also independent of the choice of $K$.

Theorem 3.9. An equivalent formula for the CCRB in (3.5) is $\mathrm{CCRB}_3(\theta)$.

Proof. Replace $I(\theta)$ with $D(\theta)$ in the proof of theorem 3.7. (Footnote 7: In appendix A.3, this Crowder formula for the asymptotic variance of the constrained maximum likelihood estimate is shown to be equivalent to the CCRB using lemma 3.8; also see section 3.4.2.)

As the computational complexity of the two formulas is $O(m^3)$, there is no direct advantage of one CCRB version over the other. Certainly, in the nonsingular FIM case, $\mathrm{CCRB}_3(\theta)$ is the same as $\mathrm{CCRB}_2(\theta)$ (with $K = 0$). Unfortunately, this CCRB formula does not appear to have simple, inherent connections to other areas of statistical inference. Nevertheless, this possible inefficacy in theoretical applications does not affect its practical use.
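The independence of (3.14) from the loading matrix is easy to check numerically. The following sketch (illustrative, not from the text) builds a singular FIM, loads it with two different positive definite choices of $K$, and confirms the resulting bounds agree; it assumes NumPy, and the random instance is generic enough that $D(\theta)$ is regular.

    import numpy as np

    rng = np.random.default_rng(0)
    m, k = 4, 2
    F = rng.standard_normal((k, m))          # constraint Jacobian, full row rank
    A = rng.standard_normal((m, m - 1))
    I_fim = A @ A.T                          # rank-deficient (singular) FIM

    def ccrb3(I, F, K):
        # Crowder's loaded-FIM form (3.14); D = I + F^T K F must be regular.
        D = I + F.T @ K @ F
        Dinv = np.linalg.inv(D)
        G = F @ Dinv @ F.T
        return Dinv - Dinv @ F.T @ np.linalg.inv(G) @ F @ Dinv

    K1 = np.eye(k)
    B = rng.standard_normal((k, k))
    K2 = B @ B.T + np.eye(k)                 # a different positive definite loading
    print(np.allclose(ccrb3(I_fim, F, K1), ccrb3(I_fim, F, K2)))  # True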
3.1.2.3 Xavier's and Barroso's Intrinsic Variance Lower Bound

The prior CCRB metrics were in Euclidean $\mathbb{R}^m$ space, i.e., the lower bound is on the measurement of the distance (in some direction or dimension) of the estimator to the true value of the parameter, measured by "cutting through" the manifold. In some scenarios, it may be of interest to know the bound on the measurement of the distance "over the surface" of the manifold. Since dimensional directions can be somewhat ambiguous depending on the manifold, of particular interest is the geodesic, or shortest distance. For this scenario, Xavier and Barroso [74, 75] formulated an inequality, the intrinsic variance lower bound (IVLB), for the variance of the geodesic to an unbiased estimator intrinsic to the manifold. Their results are derived from those of Hendriks [27] and Oller and Corcuera [53]. Ignoring elements of Riemannian geometric theory that are beyond the scope of this presentation, their IVLB result essentially relies on the inequality

$$\sqrt{C\,\mathrm{var}(\theta)}\,\cot\!\bigl(\sqrt{C\,\mathrm{var}(\theta)}\bigr) \;\le\; \sqrt{\frac{\mathrm{var}(\theta)}{\bar\sigma}}, \qquad (3.15)$$

where $C$ is an upper bound on the sectional curvature of the manifold, $\bar\sigma$ is a bound on the variance of the Euclidean estimator error, and $\mathrm{var}(\theta)$ is the variance of the geodesic. Precisely, $C = \max_{\theta\in\Theta_f} K_\theta$, where

$$K_\theta = \max_{\substack{v_1,v_2\ \mathrm{orthonormal}\\ F(\theta)v_i = 0}} \langle\alpha(v_1,v_1),\alpha(v_2,v_2)\rangle - \langle\alpha(v_1,v_2),\alpha(v_1,v_2)\rangle$$

and $\alpha(\cdot,\cdot)$ is the second fundamental form [34] of $\Theta_f$, defined by

$$\alpha(a,b) = -F^T(\theta)\bigl(F(\theta)F^T(\theta)\bigr)^{-1}\Bigl[a^T\tfrac{\partial F_i(\theta)}{\partial\theta}\,b\Bigr]_{k\times 1}$$

on $\mathcal{U}\times\mathcal{U}$, where $\mathcal{U} = \mathrm{span}\{U(\theta)\}$. Xavier and Barroso use a polynomial bound on the cotangent to reduce (3.15) to a quadratic inequality solvable for a lower bound on $\mathrm{var}(\theta)$. In an earlier variant of the IVLB [74], Xavier and Barroso chose $\bar\sigma^{-1} = \max_{v\in\mathcal{U},\|v\|=1} v^TI(\theta)v$ and bounded $t\cot(t) \ge 1 - \tfrac{2}{3}t^2$, for $0\le t\le T\approx 1.35$, in (3.15), where $t = \sqrt{C\,\mathrm{var}(\theta)}$, to bound the variance of the estimator's geodesic to the mean by

$$\mathrm{var}(\theta) \;\ge\; \frac{4C + 3\bar\sigma^{-1} - \sqrt{\bar\sigma^{-1}\bigl(9\bar\sigma^{-1} + 24C\bigr)}}{\tfrac{8}{3}C^2}. \qquad (3.16)$$

Unfortunately, this bound was optimistic in the limit for the simple Euclidean case ($C = 0$). In a more recent paper [75], the authors improved the bound by choosing $\bar\sigma = \mathrm{tr}\bigl[(U^T(\theta)I(\theta)U(\theta))^{-1}\bigr]$ and an alternative lower bound for $t\cot(t)$ to obtain

$$\mathrm{var}(\theta) \;\ge\; \frac{\bar\sigma C + 1 - \sqrt{2\bar\sigma C + 1}}{\tfrac{1}{2}C^2\bar\sigma}. \qquad (3.17)$$

Although the authors omitted a proof (footnote 8: Xavier and Barroso stated in [75] that the proof would "be found in the companion paper [14]", but this companion paper appears never to have been published), the alternative lower bound for $t\cot(t)$ appears to be $t\cot(t) \ge 1 - \tfrac{1}{2}t^2$ for $0\le t\le T$. An immediate benefit of this improved bound is agreement with the simple ($C = 0$) Euclidean case, i.e., $\mathrm{var}(\theta) \ge \bar\sigma = \mathrm{tr}\bigl[(U^T(\theta)I(\theta)U(\theta))^{-1}\bigr]$. This consistency is entirely due to the choice of $\bar\sigma$. Even so, greater improvement of the bound, not found in the literature, is possible in the curved scenarios ($C > 0$), at least in the bound of $t\cot(t)$. Note, for $0\le t$ [...] $q(x;\hat\xi(x))$ and $g_{\hat\theta(x)}(\xi')\notin\Theta_f$.

3.4.1 Efficient estimation

In section 2.4.1, it was explained that when an efficient estimator exists, the method of maximum likelihood finds it [36, 62]. It is useful to note the connection between efficiency and the method of constrained maximum likelihood, since Stoica and Ng ignored this extension in their paper [68], despite Marzetta having shown that the result extends to the constrained case when the FIM is nonsingular [47, theorem 3]. What follows is the general extension of this result, including the case of singular FIMs.

Theorem 3.31. If $t(x)$ is a constrained estimator of $\theta$, required to satisfy the constraint $f(t(x)) = 0$, which is also efficient with respect to the CCRB, then the estimator is a stationary point of the constrained optimization problem in (3.31).

Proof. This is perhaps more easily proven strictly from the constrained parameter perspective, since the global maximum of the likelihood relative to the implicit reparameterization may not correspond to a global maximum in $\Theta_f$ relative to the constrained parameterization. Since $t(x)$ is efficient, in the mean-square sense we have $t(x) - \theta = \mathrm{CCRB}(\theta)s(x;\theta)$ as a function of $\theta$. Then as $\theta\to t(x)$ (this assumes the observations are consistent with $\theta$) we have $0 \leftarrow t(x) - \theta = \mathrm{CCRB}(\theta)s(x;\theta)$. The continuity of the CCRB and the Fisher score implies $s(x;t(x)) = F^T(t(x))\lambda$ for some $\lambda\in\mathbb{R}^k$, or

$$s(x;t(x)) - F^T(t(x))\lambda = 0,$$

which defines the stationarity condition (3.32) of the constrained optimization problem, with $\lambda$ being the vector of Lagrange multipliers.

3.4.2 Asymptotic Normality

The asymptotic properties of the MLE can be found in section 2.4.2.
Therein, it was mentioned that the maximum likelihood estimator is asymptotically unbiased and efficient, with variance asymptotically equivalent to the CRB. A corresponding relationship exists between the CMLE and the CCRB. As before, let the samples $x_1,x_2,\dots,x_n$ be iid as $x$ from the likelihood $p(x;\theta)$, where $\theta$ is assumed to lie in $\Theta_f$. Denote $y_n = (x_1,x_2,\dots,x_n)$ as the collection of these samples, so that the likelihood is $p(y_n;\theta) = \prod_{i=1}^n p(x_i;\theta)$. Hence, the asymptotic CMLE will be denoted $\hat\theta(y_n)$.

Theorem 3.32. Assuming the pdf satisfies the regularity conditions (see (3.19) and the discussion after the proof), the CMLE is asymptotically distributed according to

$$\sqrt{n}\bigl(\hat\theta(y_n) - \theta\bigr) \xrightarrow{d} \mathcal{N}\bigl(0,\mathrm{CCRB}(\theta)\bigr).$$

There exist a number of results in the literature regarding the asymptotic characteristics of the CMLE (e.g., the works of Aitchison and Silvey [3, 63, 4, 2] and of Crowder [18]). For example, Crowder shows that

$$\sqrt{n}\bigl(\hat\theta(y_n) - \theta\bigr) \xrightarrow{d} \mathcal{N}\Bigl(0,\,D^{-1}(\theta) - D^{-1}(\theta)F^T(\theta)\bigl(F(\theta)D^{-1}(\theta)F^T(\theta)\bigr)^{-1}F(\theta)D^{-1}(\theta)\Bigr),$$

where $D(\theta) = I(\theta) + F^T(\theta)KF(\theta)$ for any positive semidefinite matrix $K$ that ensures the nonsingularity of $D(\theta)$. And while it would be sufficient to use these existing results to verify the connection between the CMLE and the CCRB, it is also insightful (and the point of this treatise) to examine the problem entirely from the perspective of the reduced parameter space, i.e., using the implicit function or a null space approach. (Footnote 13: Nevertheless, a proof directly from Crowder's asymptotic normality result is detailed in appendix A.3.)

Proof. By the implicit function theorem (theorem 3.3), there exist an open set $O\subset\Theta_f$ containing $\theta$, an open set $P\subset\mathbb{R}^{m-k}$, and a continuously differentiable bijection $g_\theta : P\to O$ such that $\theta = g_\theta(\xi)$ for some $\xi\in P$. The likelihood for $\xi$ is given by $q(y_n;\xi) = p(y_n;g_\theta(\xi))$. Let $\hat\xi(y_n)$ be the MLE of $\xi$ based on the likelihood $q(y_n;\xi)$. Since the MLE is consistent and asymptotically efficient,

$$\sqrt{n}\bigl(\hat\xi(y_n) - \xi\bigr) \xrightarrow{d} \mathcal{N}\bigl(0,\tilde{I}^{-1}(\xi)\bigr). \qquad (3.34)$$

In particular, since $\hat\xi(y_n)\to\xi$ as $n\to\infty$, for $n$ sufficiently large, say $n > N$, $\hat\xi(y_n)\in P$. Let $\hat\theta(y_n)$ be the CMLE of $\theta$ based on the likelihood $p(y_n;\theta)$ and the constraint $f(\theta) = 0$. By the invariance property [36, 62], for $n > N$, $\hat\theta(y_n) = g_\theta(\hat\xi(y_n))$ and $\hat\theta(y_n)\in O$. Therefore, since $\hat\xi(y_n)\to\xi$ as $n\to\infty$, $\hat\theta(y_n)\to\theta$ also, and the CMLE is consistent. The Taylor series expansion (see section 3.1.5) of $g_\theta(\xi')$ can be truncated using a Lagrange remainder term [38, p. 232] as in

$$g_\theta(\hat\xi(y_n)) = g_\theta(\xi) + G_\theta(\xi,\hat\xi(y_n))\bigl(\hat\xi(y_n) - \xi\bigr),$$

where $G_\theta(\xi,\hat\xi(y_n))$ represents a matrix of the form of $G_\theta(\xi)$ in which each row is evaluated at a possibly different point $\xi'(i)$, $i = 1,\dots,m-k$, each lying on the line segment starting at $\xi$ and ending at $\hat\xi(y_n)$. From the invariance property of the MLE, this can be rewritten as

$$\hat\theta(y_n) - \theta = G_\theta(\xi,\hat\xi(y_n))\bigl(\hat\xi(y_n) - \xi\bigr).$$

Since the MLE $\hat\xi(y_n)$ is consistent, $G_\theta(\xi,\hat\xi(y_n)) \xrightarrow{p} G_\theta(\xi)$. Given this and (3.34), by Slutsky's theorem [62, p. 60]

$$\sqrt{n}\bigl(\hat\theta(y_n) - \theta\bigr) \xrightarrow{d} \mathcal{N}\bigl(0,\,G_\theta(\xi)\tilde{I}^{-1}(\xi)G_\theta^T(\xi)\bigr),$$

which by theorem 3.5 shows $\hat\theta(y_n)$ is asymptotically efficient with respect to the CCRB.
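Theorem 3.32 can be illustrated with a small Monte Carlo sketch (not from the text; the model, seed, and sizes are illustrative). For $x_i\sim\mathcal{N}(\theta, I_2)$ with $\|\theta\| = 1$, the per-sample FIM is $I_2$, $F(\theta) = 2\theta^T$, $U(\theta)$ is the unit tangent, the per-sample CCRB is $UU^T$, and the CMLE is $\bar{x}/\|\bar{x}\|$:

    import numpy as np

    rng = np.random.default_rng(1)
    theta = np.array([0.6, 0.8])
    U = np.array([-theta[1], theta[0]])[:, None]   # unit tangent, F @ U = 0
    ccrb = U @ U.T

    n, trials = 2000, 5000
    err = np.empty((trials, 2))
    for t in range(trials):
        xbar = theta + rng.standard_normal(2) / np.sqrt(n)  # law of the sample mean
        that = xbar / np.linalg.norm(xbar)                  # constrained MLE
        err[t] = np.sqrt(n) * (that - theta)

    print(np.cov(err.T))   # ~ ccrb = [[0.64, -0.48], [-0.48, 0.36]]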
The conditions for asymptotic normality with respect to the CCRB are the conditions that $\hat\xi(y_n)$ be asymptotically normal [14, p. 516]. For the MLE these include (a) differentiability of the Fisher score, (b) Fisher information continuous with respect to the parameter and nonzero at $\theta$, and (c) consistency. For the CMLE and theorem 3.32, this translates to (a) differentiability of the Fisher score and the existence of first and second derivatives of any implicit function (or, equivalently, of the constraint $f$), (b) $U^T(\theta')I(\theta')U(\theta')$ continuous with respect to $\theta'$ and regular at $\theta' = \theta$, and (c) consistency of the CMLE.

3.4.3 The Method of Scoring Under Parametric Constraints

The method of scoring for unconstrained parameters is detailed in section 2.4.3. Here, we examine scoring with constraints. Assume we have an iterate $\hat\theta^{(k)}\in\Theta_f$ and we wish to improve this iterate in the sense of the optimization problem expressed in (3.31). The method of scoring does not directly apply, since any projection step will not take the constraint into account, i.e., the direction of steepest ascent is likely not the appropriate path in terms of maximizing the likelihood subject to the functional equality constraints. Thus, it is desirable to have projected direction and restoration steps that take the constraints into consideration.

Given an initial estimate $\hat\theta^{(k)}$, there exist a set $O\ni\hat\theta^{(k)}$ open in $\Theta_f$, a set $P$ open in $\mathbb{R}^{m-k}$, and a continuously differentiable function $g_{\hat\theta^{(k)}} : \mathbb{R}^{m-k}\to\mathbb{R}^m$ such that $g_{\hat\theta^{(k)}}$ is a diffeomorphism on $P$, $g_{\hat\theta^{(k)}}(P) = O$, and in particular there exists a $\hat\xi^{(k)}\in P$ such that $g_{\hat\theta^{(k)}}(\hat\xi^{(k)}) = \hat\theta^{(k)}$. Scoring can now be applied in the reduced parameter space $\mathbb{R}^{m-k}$. For the given set of observations $x$ and this corresponding initial estimate $\hat\xi^{(k)}$, the method of scoring suggests the projection step

$$\hat\xi^{(k+1)} = \hat\xi^{(k)} + \tilde{I}^{-1}(\hat\xi^{(k)})\,\tilde{s}(x;\hat\xi^{(k)})$$

to generate a better estimate $\hat\xi^{(k+1)}$ in the sense of maximizing the likelihood $q(x;\xi') = p(x;g_{\hat\theta^{(k)}}(\xi'))$. As with many iterative procedures, convergence is guaranteed only under certain initial conditions. If the projection step or shift is too large, then $\hat\xi^{(k+1)}$ may not be a usable point, i.e., an iterate that increases the value of the likelihood function. To add stability, a step-size rule or shift-cutting is often employed. This amounts to the inclusion of a multiplicative factor $\alpha^{(k)}\in[0,1]$, modifying the projection step to

$$\hat\xi^{(k+1)} = \hat\xi^{(k)} + \alpha^{(k)}\tilde{I}^{-1}(\hat\xi^{(k)})\,\tilde{s}(x;\hat\xi^{(k)}).$$

Choosing an appropriate step-size rule for $\alpha^{(k)}$ will guarantee convergence, although typically at a cost to the rate of convergence.

The Taylor series expansion (see section 3.1.5) of $g_{\hat\theta^{(k)}}$ about $\hat\xi^{(k)}$, evaluated at $\hat\xi^{(k+1)}$, is

$$g_{\hat\theta^{(k)}}(\hat\xi^{(k+1)}) = g_{\hat\theta^{(k)}}(\hat\xi^{(k)}) + G_{\hat\theta^{(k)}}(\hat\xi^{(k)})\bigl(\hat\xi^{(k+1)} - \hat\xi^{(k)}\bigr) + o\bigl(\|\hat\xi^{(k+1)} - \hat\xi^{(k)}\|\bigr),$$

where the $o(\cdot)$ term shrinks faster than $\|\hat\xi^{(k+1)} - \hat\xi^{(k)}\|$ as $k\to\infty$. Ignoring this error term generates an iteration in the larger-dimensional parameter space $\theta\in\mathbb{R}^m$ by defining the next iterate $\hat\theta^{(k+1)} = g_{\hat\theta^{(k)}}(\hat\xi^{(k+1)})$. That is,

$$\hat\theta^{(k+1)} = g_{\hat\theta^{(k)}}(\hat\xi^{(k)}) + G_{\hat\theta^{(k)}}(\hat\xi^{(k)})\bigl(\hat\xi^{(k+1)} - \hat\xi^{(k)}\bigr)$$
$$= \hat\theta^{(k)} + \alpha^{(k)}G_{\hat\theta^{(k)}}(\hat\xi^{(k)})\,\tilde{I}^{-1}(\hat\xi^{(k)})\,\tilde{s}(x;\hat\xi^{(k)})$$
$$= \hat\theta^{(k)} + \alpha^{(k)}G_{\hat\theta^{(k)}}(\hat\xi^{(k)})\bigl(G^T_{\hat\theta^{(k)}}(\hat\xi^{(k)})I(\hat\theta^{(k)})G_{\hat\theta^{(k)}}(\hat\xi^{(k)})\bigr)^{-1}G^T_{\hat\theta^{(k)}}(\hat\xi^{(k)})\,s(x;\hat\theta^{(k)})$$
$$= \hat\theta^{(k)} + \alpha^{(k)}U(\hat\theta^{(k)})\bigl(U^T(\hat\theta^{(k)})I(\hat\theta^{(k)})U(\hat\theta^{(k)})\bigr)^{-1}U^T(\hat\theta^{(k)})\,s(x;\hat\theta^{(k)}).$$

In comparison with the classical method of scoring, this iteration

$$\hat\theta^{(k+1)} = \hat\theta^{(k)} + \alpha^{(k)}\,\mathrm{CCRB}(\hat\theta^{(k)})\,s(x;\hat\theta^{(k)}) \qquad (3.35)$$

is essentially a replacement of the CRB with the CCRB. This is the projection step of the method of scoring with parametric equality constraints. (Footnote 14: Osborne [55] used a Lagrange multiplier approach to develop the method of scoring, but his scenario was restricted to linear constraints and hence lacked the restoration step. Additionally, he makes no mention that the matrix projecting the negative Jacobian of the objective is a constrained Cramér-Rao bound. Note that the structure of this projection step is well known, as a nonstatistical formulation exists for the conventional optimization problem in [25, p. 178].)
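In code, the projection step (3.35) amounts to solving a reduced-dimension linear system. A minimal sketch under the stated assumptions (NumPy; the caller supplies the score $s(x;\theta)$ and FIM $I(\theta)$; $F$ is assumed full row rank):

    import numpy as np

    def scoring_projection_step(theta, score, fim, F, alpha=1.0):
        # One projection step (3.35): theta + alpha * CCRB(theta) * s(x; theta).
        # The trailing right singular vectors of F span its null space.
        k = F.shape[0]
        U = np.linalg.svd(F)[2][k:].T          # orthonormal, F @ U = 0
        M = U.T @ fim @ U                      # reduced FIM; must be regular
        return theta + alpha * (U @ np.linalg.solve(M, U.T @ score))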
Intuitively, this should seem appropriate, since the CCRB is a generalization of the CRB. However, even with an appropriate step-size rule to generate usable iterates, since there is no certainty that $\hat\xi^{(k+1)}\in P$, it is likely that $\hat\theta^{(k+1)}$ will not be a feasible point. To correct this, an encompassing restoration step is required to produce the next iterate, i.e.,

$$\hat\theta^{(k+1)} = \pi\bigl[\hat\theta^{(k)} + \alpha^{(k)}\,\mathrm{CCRB}(\hat\theta^{(k)})\,s(x;\hat\theta^{(k)})\bigr] \qquad (3.36)$$

where $\pi[\cdot]$ is the natural projection of $\mathbb{R}^m$ onto $\Theta_f$. This is the method of scoring with parametric equality constraints. With this additional restoration step, usability of the iterate would be tested and accepted or rejected after (3.36), as opposed to after (3.35).

Figure 3.3: Path created by iterates from the method of scoring with constraints (gradient projection steps between level surfaces in the constraint space).

For a convex set, the natural projection is well defined. In general, though, some other rule will likely need to be applied in cases for which there does not exist a unique shortest distance to $\Theta_f$, e.g., reducing the step size $\alpha^{(k)}$. Simple projections, e.g., onto planar or spherical constraints, are relatively simple operations; however, it may more commonly be the case that the restoration cannot be expressed analytically. To ensure the iterates satisfy the constraints approximately, one approach is to apply an additional iterative process [25]

$$\hat\theta^{(k,l+1)} = \hat\theta^{(k,l)} - F^T(\hat\theta^{(k,l)})\bigl(F(\hat\theta^{(k,l)})F^T(\hat\theta^{(k,l)})\bigr)^{-1}f(\hat\theta^{(k,l)}),$$

where $\hat\theta^{(k,1)} = \hat\theta^{(k)}$ and $\hat\theta^{(k+1)} = \pi\bigl[\hat\theta^{(k)}\bigr] = \hat\theta^{(k,l)}$ when $f(\hat\theta^{(k,l)})\approx 0$ to a desired degree of accuracy, provided the iterate is still usable. Alternatively, a penalty can be added to the cost (objective) function, e.g., as in

$$\log p(x;\theta') + \kappa\sum_{i=1}^k |f_i(\theta')|$$

for some positive $\kappa$, where $f_i$ is the $i$th constraint equation, to limit the divergence of the iterations away from $\Theta_f$.
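The restoration iteration above is a Newton-type correction toward the constraint set. A self-contained sketch (illustrative names; assumes NumPy and a full row rank Jacobian):

    import numpy as np

    def restore_to_constraint(theta, f, F, tol=1e-10, max_iter=50):
        # Approximate the restoration pi[.] in (3.36) by the iteration from [25]:
        # theta <- theta - F^T (F F^T)^{-1} f(theta), stopping once f(theta) ~ 0.
        # f returns the constraint value (shape (k,)); F returns its Jacobian.
        for _ in range(max_iter):
            val = f(theta)
            if np.linalg.norm(val) < tol:
                break
            J = F(theta)
            theta = theta - J.T @ np.linalg.solve(J @ J.T, val)
        return theta

    # Example: restore a point to the unit circle, f(theta) = ||theta||^2 - 1.
    theta = restore_to_constraint(
        np.array([1.5, 0.5]),
        lambda t: np.array([t @ t - 1.0]),
        lambda t: 2.0 * t[None, :],
    )
    print(theta @ theta)   # ~ 1, i.e., the iterate is (approximately) feasible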
3.4.3.1 Convergence Properties

There is a large class of conditions that guarantee convergence in fixed point theorems, some of which can be found in [64, 25, 10]. The most general statement is that, given an initialization "sufficiently close" to the maximum value $\hat\theta(x)$, the sequence $\{\hat\theta^{(k)}\}$ generated by the algorithm will converge to this CMLE. As the method of scoring with parametric equality constraints is a Newton-type method, convergence properties that already exist for these methods can be adapted here. As it is impossible to cover all the potential approaches to developing properties for this constrained scoring algorithm, this section focuses on properties similar to those found in Goldstein [22]. To ease reading, the proofs of the theorems in this section are presented in appendix B.

First, define $\Theta_{\hat\theta^{(k)}} = \{\theta'\in\Theta_f : p(x;\theta')\ge p(x;\hat\theta^{(k)})\}$ as the set of all feasible and usable iterates after the $k$th iterate $\hat\theta^{(k)}$. The step rule for the properties in this section is as follows: for a fixed $\beta\in(0,1)$, choose the least positive integer $m(k)$ such that $\alpha^{(k)} = \beta^{m(k)}$ satisfies the inequality

$$\alpha^{(k)}\bigl(\log p(x;\hat\theta^{(k+1)}) - \log p(x;\hat\theta^{(k)})\bigr) \;\ge\; \bigl\|\hat\theta^{(k+1)} - \hat\theta^{(k)}\bigr\|^2_{I(\hat\theta^{(k)})} \qquad (3.37)$$

where $\hat\theta^{(k+1)}$ is defined by (3.36). If no finite $m(k)$ exists, then choose $\alpha^{(k)} = 0$. This type of step-size rule enforces a stepwise Lipschitz condition. For theorems 3.33 and 3.37, we require that $\Theta_f$ be convex. (Footnote 15: For any nonlinear equality constraint, $\Theta_f$ will not be convex. However, locally, the restoration of the linear projection onto the tangent space of the CCRB has the appearance of the restoration onto a convex set for sufficiently small $\alpha^{(k)}$, i.e., $\Theta_f$ appears locally convex. While it may be possible to restrict the step size to this local convexity, such an enhancement is beyond the scope of the work presented here. For inequality constraints, $\Theta_f$ may be convex; although such constraints, as in example 3.15, will not inform the projection update in (3.35), they might inform the restoration update in (3.36).)

Theorem 3.33. If for any iterate $\hat\theta^{(k)}\in\Theta_f$ there does not exist an $\alpha^{(k)} > 0$ that satisfies (3.37), then $\hat\theta^{(k)}$ is a stationary point.

Therefore, when the step rule forces the choice $\alpha^{(k)} = 0$, the method of scoring with parametric equality constraints has converged. The next theorem details a property of the sequence of likelihood values generated by the iterates.

Theorem 3.34. The sequence $\{p(x;\hat\theta^{(k)})\}$ is a monotone increasing sequence. Furthermore, if $p(x;\theta)$ is bounded above, then $\{p(x;\hat\theta^{(k)})\}$ converges.

Thus, $\Theta_{\hat\theta^{(k)}}$ is a decreasing sequence of closed (nested) sets, i.e., $\Theta_{\hat\theta^{(k+1)}}\subseteq\Theta_{\hat\theta^{(k)}}$, or, for any given sequence, $\hat\theta^{(i)}\in\Theta_{\hat\theta^{(j)}}$ provided $i\ge j$. That is, using a proper step-size rule will guarantee usable iterates. The monotonicity of $\{p(x;\hat\theta^{(k)})\}$, even if bounded above, does not imply monotonicity of the sequence $\{\log p(x;\hat\theta^{(k+1)}) - \log p(x;\hat\theta^{(k)})\}$. However, it does guarantee convergence.

Theorem 3.35. If the likelihood $p(x;\theta)$ is bounded above, then the sequence $\{\log p(x;\hat\theta^{(k+1)}) - \log p(x;\hat\theta^{(k)})\}$ vanishes.

Hence, a bounded likelihood function guarantees the existence of a maximum likelihood solution (or solutions). This can also be shown by the nested interval theorem [38, problem 2-1-12]. These previous properties would be a consequence of any rule that chooses feasible, usable iterates. The value of the rule in (3.37) is that it allows statements to be made about the sequence $\{\hat\theta^{(k)}\}$ itself.

Theorem 3.36. If the likelihood $p(x;\theta)$ is bounded above, then the sequence $\{\|\hat\theta^{(k+1)} - \hat\theta^{(k)}\|_{I(\hat\theta^{(k)})}\}$ vanishes as $k\to\infty$.

This theorem does not guarantee that the sequence $\{\hat\theta^{(k)}\}$ will converge or even have a limit point. (Footnote 16: A point $a$ is a limit point of a sequence $\{a_n\}$ if for any integer $K$ and any $\varepsilon > 0$ there exists a $k > K$ such that $|a_k - a| < \varepsilon$.) (The series $\sum_{j=1}^k \frac{1}{j}$ is an example of a sequence satisfying the above theorem with no real-valued limit points.) If, however, the set $\Theta_{\hat\theta^{(k)}}$ is bounded for some $k$, then the Bolzano-Weierstrass theorem [38, p. 52, theorem 2-12] implies the existence of a limit point of the sequence.
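The step rule (3.37) is a backtracking search over $m(k)$. A sketch of that search, under the assumptions above (illustrative names; `step_fn` applies the projection-plus-restoration update (3.36) for a given step size):

    import numpy as np

    def step_rule(theta, loglik, step_fn, fim, beta=0.5, max_m=30):
        # Least positive integer m with alpha = beta**m satisfying (3.37);
        # returns alpha = 0 (convergence) if no such m <= max_m exists.
        l0 = loglik(theta)
        for m in range(1, max_m + 1):
            alpha = beta ** m
            theta_new = step_fn(theta, alpha)
            d = theta_new - theta
            if alpha * (loglik(theta_new) - l0) >= d @ fim @ d:
                return alpha, theta_new
        return 0.0, theta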
Theorem 3.37. If $\Theta_{\hat\theta^{(1)}}$ is compact and convex, then limit points of the sequence $\{\hat\theta^{(k)}\}$ are also stationary points.

The theorem remains true if $\Theta_{\hat\theta^{(k)}}$ is compact and convex for any $k$.

Theorem 3.38. If $\Theta_{\hat\theta^{(1)}}$ is compact for all sequences in a closed set of $\Theta_f$ and if there is a unique limit point $\theta^\star$ for all such sequences, then $\lim_{k\to\infty}\hat\theta^{(k)} = \theta^\star$ for every sequence $\{\hat\theta^{(k)}\}$. Also, $\theta^\star$ is the maximum of $p(x;\theta)$.

3.4.3.2 Linear constraints

Linear constraints on the parameter typically restrict the parameter to a set of the form $\Theta_f = \{\theta'\in\Theta : F\theta' + v = 0\}$ (see section 3.3). Under a linear constraint, the restoration operation $\pi[\cdot]$ is redundant, since any step remains in the constraint space, i.e., since $\hat\theta^{(k)}\in\Theta_f$ and $F\cdot\mathrm{CCRB}(\hat\theta^{(k)}) = 0$. Thus the method of scoring with parametric equality constraints in (3.36) simplifies to the iteration

$$\hat\theta^{(k+1)} = \hat\theta^{(k)} + \alpha^{(k)}\,\mathrm{CCRB}(\hat\theta^{(k)})\,s(x;\hat\theta^{(k)}).$$

Example 3.39 (Linear model with linear constraints). In the linear model with normal noise of section 3.3.2, we have $x = H\theta + w$ with $w\sim\mathcal{N}(0,C)$. In this case, the negative Hessian is the FIM, and the optimization problem becomes a null space quadratic exercise [21], i.e., the minimization of a quadratic objective subject to a linear constraint. The $\mathrm{CCRB} = U\bigl(U^TH^TC^{-1}HU\bigr)^{-1}U^T$, which is constant with respect to the parameter, and the Fisher score is $s(x;\theta') = H^TC^{-1}\bigl(x - H\theta'\bigr)$. Hence, if $\hat\theta^{(1)}$ is any feasible vector, e.g., $\hat\theta^{(1)} = -F^T\bigl(FF^T\bigr)^{-1}v$, then the method of scoring with constraints finds the CMLE in one step:

$$\hat\theta^{(2)} = \hat\theta^{(1)} + \mathrm{CCRB}\cdot H^TC^{-1}\bigl(x - H\hat\theta^{(1)}\bigr),$$

which is exactly the formula in (3.29) from theorem 3.30. The next iterate $\hat\theta^{(3)} = \hat\theta^{(2)} + \mathrm{CCRB}\cdot H^TC^{-1}\bigl(x - H\hat\theta^{(2)}\bigr)$ reveals that the procedure reaches a fixed point, since

$$\mathrm{CCRB}\,H^TC^{-1}\bigl(x - H\hat\theta^{(2)}\bigr) = \mathrm{CCRB}\,H^TC^{-1}\Bigl(x - H\bigl(\hat\theta^{(1)} + \mathrm{CCRB}\,H^TC^{-1}(x - H\hat\theta^{(1)})\bigr)\Bigr)$$
$$= \mathrm{CCRB}\,H^TC^{-1}\bigl(x - H\hat\theta^{(1)}\bigr) - \mathrm{CCRB}\,H^TC^{-1}H\,\mathrm{CCRB}\,H^TC^{-1}\bigl(x - H\hat\theta^{(1)}\bigr) = 0.$$

Therefore, $\hat\theta_{\mathrm{CML}}(x) = \hat\theta^{(2)}$.
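Example 3.39 can be reproduced directly in a few lines. A sketch with randomly drawn (illustrative) $H$, $F$, and $v$, assuming NumPy:

    import numpy as np

    rng = np.random.default_rng(2)
    n_, m, k = 12, 5, 2
    H = rng.standard_normal((n_, m))
    C = np.eye(n_)
    F = rng.standard_normal((k, m))
    v = rng.standard_normal(k)

    U = np.linalg.svd(F)[2][k:].T                       # F @ U = 0
    ccrb = U @ np.linalg.inv(U.T @ H.T @ np.linalg.inv(C) @ H @ U) @ U.T

    theta_true = -F.T @ np.linalg.solve(F @ F.T, v)     # a feasible parameter
    x = H @ theta_true + rng.standard_normal(n_)
    theta1 = theta_true + U @ rng.standard_normal(m - k)  # any feasible start

    theta2 = theta1 + ccrb @ H.T @ np.linalg.inv(C) @ (x - H @ theta1)
    theta3 = theta2 + ccrb @ H.T @ np.linalg.inv(C) @ (x - H @ theta2)
    print(np.allclose(theta2, theta3), np.allclose(F @ theta2 + v, 0))  # True True

One scoring step reaches the CMLE, and further steps leave it fixed, as the example's algebra predicts.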
Example 3.40 (Jamshidian's GP algorithm). Jamshidian [33] developed a gradient projection (GP) algorithm for maximizing the likelihood subject to linear parameter constraints using the iteration

$$\hat\theta^{(k+1)} = \hat\theta^{(k)} + \alpha^{(k)}\Bigl(W^{-1} - W^{-1}F^T\bigl(FW^{-1}F^T\bigr)^{-1}FW^{-1}\Bigr)s(x;\hat\theta^{(k)}) \qquad (3.38)$$

for some positive definite matrix $W$. An optimal choice with regard to the algorithm's rate of convergence, Jamshidian suggests, is a possibly diagonally loaded Hessian of the log-likelihood,

$$W(x,\hat\theta^{(k)}) = -\frac{\partial}{\partial\theta^T}s(x;\hat\theta^{(k)}) + \nu^{(k)}I_{m\times m},$$

where $\nu^{(k)}\ge 0$ is chosen sufficiently large to ensure the positive definiteness of $W$. This formulation is closely connected to the method of scoring with constraints. The GP iteration is equivalent to scoring by choosing $W(x,\hat\theta^{(k)})$ to be the FIM, when it is nonsingular. Indeed, the projecting matrix is similar to the Marzetta form of the CCRB in [47], with $W(x,\hat\theta^{(k)})$ replacing the FIM. This fact produces a slight generalization of the GP iteration, given by

$$\hat\theta^{(k+1)} = \hat\theta^{(k)} + \alpha^{(k)}U\bigl(U^TW(x,\hat\theta^{(k)})U\bigr)^{-1}U^Ts(x;\hat\theta^{(k)}),$$

where $U$ is defined as in (3.6). In this formulation, the projecting metric $W(x,\hat\theta^{(k)})$ only needs to be positive semidefinite. Alternatively, the Aitchison and Silvey [4] substitution for the FIM, $I(\hat\theta^{(k)}) + F^T(\hat\theta^{(k)})KF(\hat\theta^{(k)})$, could be used in (3.38) instead of a diagonally loaded Hessian of the log-likelihood (or even a diagonally loaded Fisher information matrix). In this sense, the two iterations are equivalent for the linear model, where the FIM is simply the negative Hessian. This occurs when the log-likelihood is quadratic (normal). This also suggests the adaptation of Jamshidian's GP algorithm to cases of nonlinear constraints. Likewise, asymptotically, with $y_n$ denoted as in section 3.4.2, the law of large numbers gives $W(y_n,\hat\theta^{(k)})\to nI(\hat\theta^{(k)}) + \nu^{(k)}I_{m\times m}$ for an arbitrarily small (possibly zero) $\nu^{(k)} > 0$, as $I(\hat\theta^{(k)})$ is positive semidefinite. The projection step update then becomes

$$\hat\theta^{(k+1)} = \hat\theta^{(k)} + \alpha^{(k)}\,\mathrm{CCRB}(\hat\theta^{(k)})\,s(y_n;\hat\theta^{(k)}) + \alpha^{(k)}\varepsilon^{(k)}s(y_n;\hat\theta^{(k)})$$

for a negligible $\varepsilon^{(k)}$, i.e., essentially equivalent to the method of scoring with parametric equality constraints in (3.35).

3.5 Hypothesis testing

In section 2.5, hypothesis testing (the Rao and Wald tests) using the CRB was reviewed. In this section, hypothesis testing is considered under a constrained alternative. Assume $h : \mathbb{R}^m\to\mathbb{R}^r$ is a consistent and nonredundant differentiable function, which is also consistent and nonredundant with the differentiable function $f : \mathbb{R}^m\to\mathbb{R}^k$. Hence, $\Theta_h = \{\theta' : h(\theta') = 0\}\subseteq\Theta_f$, where $\Theta_f$ is defined as in (3.4), and also $\mathrm{rank}\Bigl(\begin{bmatrix}H(\theta)\\ F(\theta)\end{bmatrix}\Bigr) = r + k \le m$ and $r$ [...] $L_k$ and $L = \max_k L_k$. Furthermore, under certain conditions, the rank of $H_M$ is determined by the characterization of the basis $H(z)$.

Theorem 4.2. Assume $M > K$ and $N\ge\sum_{k=1}^K L_k$. Then $H_M$ is full column rank if and only if $H(z)$ is a minimum polynomial basis.

Proof. This can be shown by a variant of the proof in Loubaton and Moulines [44, theorem 1].

The input sequence $s^{(k)}$ is said to have $p_k$ modes (Footnote 2: There are a number of alternative definitions of modes [30, 43], [35, p. 168].) if it can be written as a linear combination

$$s^{(k)}(n) = \sum_{i=1}^{p_k} c_i m_i^{\,n+L_k},$$

where $c_i$, $i = 1,\dots,p_k$, are complex-valued weights and $m_i$, $i = 1,\dots,p_k$, are the $\mathbb{C}$-valued roots of the polynomial $a(0) + a(1)z^{-1} + a(2)z^{-2} + \cdots + a(p_k)z^{-p_k}$, whose coefficients satisfy $\sum_{j=0}^{p_k} s^{(k)}(i+j)a(j) = 0$ for $i = -L_k,\dots,N-p_k-1$. Hence, the Toeplitz matrix

$$S^{(k)}(n) = \begin{bmatrix}
s^{(k)}(n) & s^{(k)}(n-1) & \cdots & s^{(k)}(-L_k)\\
s^{(k)}(n+1) & s^{(k)}(n) & \cdots & s^{(k)}(-L_k+1)\\
\vdots & \vdots & & \vdots\\
s^{(k)}(N-1) & s^{(k)}(N-2) & \cdots & s^{(k)}(N-n-L_k-1)
\end{bmatrix} \qquad (4.8)$$

has rank $\min\{N-n,\,p_k,\,n+L_k+1\}$ [76, lemma 1]. Note that $S^{(k)}(0) = S^{(k)}$ and $S^{(k)}(-L_k) = s^{(k)}$ from section 4.1.1.1. If one of the modes of $s^{(k)}$ is a common zero of the channel transfer function $H^{(k)}(z)$, say $m_v$, then the channel makes no distinction between $s^{(k)}$ and $s^{(k)\prime}$, defined by $s^{(k)\prime}(n) = \sum_{i=1,i\neq v}^{p_k} c_i m_i^{\,n+L_k}$, even though $s^{(k)}\neq s^{(k)\prime}$. Ironically, if a channel lacks sufficient diversity, the more modes an input has, the greater the risk of lost information about the input, but with potentially less meaningful loss depending on the weights.

Theorem 4.3. The matrix

$$S(n) = \bigl[S^{(1)}(n)\;\; S^{(2)}(n)\;\;\cdots\;\; S^{(K)}(n)\bigr] \qquad (4.9)$$

is full column rank only if $N\ge(K+1)n + K + \sum_{j=1}^K L_j$, $p_{\mathrm{total}}\ge Kn + K + \sum_{k=1}^K L_k$, and $p_k\ge L_k + 1 + n$ for each $k = 1,\dots,K$; and it is full column rank if $N\ge(K+1)n + K + (K+2)\sum_{j=1}^K L_j$, $p_{\mathrm{total}}\ge Kn + K + (K+1)\sum_{k=1}^K L_k$, and $p_k\ge L_k + 1 + n + \sum_{j=1}^K L_j$.

Proof. This is a variation of the results in [76, 1, 48, 49].

In particular, for $n = 0$, $S(0)$ requires $N\ge K + \sum_{j=1}^K L_j$, $p_{\mathrm{total}}\ge K + \sum_{j=1}^K L_j$, and $p_k\ge L_k + 1$, for each $k = 1,\dots,K$, to be full rank.
Or conversely, $S(0)$ is full rank if $N\ge K + (K+2)\sum_{j=1}^K L_j$, $p_{\mathrm{total}}\ge K + (K+1)\sum_{k=1}^K L_k$, and $p_k\ge L_k + 1 + \sum_{j=1}^K L_j$.

Before continuing to the development of the Fisher information for this convolutive mixture model, it is relevant to consider a notion of identifiability using the concepts of the Z-transform model in (4.7). However, this next section is a brief aside that may be skipped if not of interest to the reader.

4.1.2 Strict Identifiability

The K-user, M-channel FIR system in (4.2) or (4.3) is said to be strictly identifiable (SI) if and only if $H_Ms = H'_Ms'$ implies $H'(z) = H(z)A$ and $s'(n) = A^{-1}s(n)$ for some nonsingular matrix $A$. The converse statement is always true, so strict identifiability depends on the conditional statement. The term strict identifiability is a misnomer: as is clear from the definition, the deterministic parameters are not "identifiable" when they are strictly identifiable. Instead, the parameters are identifiable up to some minimal ambiguity. When this situation occurs, it is possible to treat the channel (signal) components statistically to truly identify the deterministic signal (channel) parameters using stochastic approaches, e.g., the subspace method [24, 44]. In the context of this thesis, where the parameters are treated not as random but as deterministic, the reduction to a minimal ambiguity set also reduces the number of constraints necessary to eliminate it.

Strict identifiability in the convolutive channel model, it shall be shown, has some interesting connections with the Fisher information matrix. This is not surprising, given the results in section 3.2. It is perhaps also intuitive to expect that, as strict identifiability is a notion of near-identifiability, the corresponding Fisher information will satisfy some notion of near-regularity. Abed-Meraim and Hua [1] developed necessary and sufficient conditions for strict identifiability in terms of characteristics of the Z-transform model, i.e., no channel zeros, the number of signal modes, more sensors than sources, etc. The following two theorems are presented without proof, as they were in [1]; proofs I developed can be found in [48].

Theorem 4.4 (SI necessary conditions). The M-channel K-source FIR system is strictly identifiable only if (Footnote 3: In [1], condition (a) omits the column-reducedness requirement, condition (c) did not include the special case, and condition (d) is originally, and I think erroneously, stated as $N\ge 2K + \sum_{j=1}^K L_j$.)
(a) $H(z)$ is irreducible and column-reduced,
(b) $p_{\mathrm{total}}\ge K + \sum_{j=1}^K L_j$,
(c) $p_k\ge L_k + 2$ for $k = 1,\dots,K$, or $p_k\ge 1$ if $L_k = 0$, and
(d) $N\ge K + \sum_{j=1}^K L_j$.

Theorem 4.5 (SI sufficiency conditions). The M-channel K-source FIR system is strictly identifiable if
(a) $H(z)$ is irreducible and column-reduced,
(b) $p_{\mathrm{total}}\ge K + (K+1)\sum_{j=1}^K L_j$,
(c) $p_k\ge L_k + 1 + \sum_{j=1}^K L_j$ for $k = 1,\dots,K$, and
(d) $N\ge K + (K+2)\sum_{j=1}^K L_j$.

Yet other notions of near-identifiability in the convolutive channel model exist in the literature, e.g., cross-relation-based identifiability [43], which has an equivalence to strict identifiability in the SIMO case [30]. The necessary and sufficient conditions presented here are independent of the number of channels, although one should expect that increasing the number of channels would increase channel diversity, thereby weakening the requirements on the other channel or source characteristics. Hence, theorems 4.4 and 4.5 are in some sense conditions for the least diversified, strictly identifiable scenario $M = K + 1$.
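The dimension parts of these conditions are mechanical to check. An illustrative checker (not from the text; names hypothetical), which assumes the algebraic conditions on $H(z)$ (irreducibility, column-reducedness) hold, as they cannot be verified from sizes alone:

    def si_conditions(K, N, L, p):
        # Dimension parts of theorems 4.4/4.5; L and p hold per-source
        # channel orders L_k and mode counts p_k.
        SL, p_total = sum(L), sum(p)
        necessary = (p_total >= K + SL
                     and all(pk >= Lk + 2 or (Lk == 0 and pk >= 1)
                             for pk, Lk in zip(p, L))
                     and N >= K + SL)
        sufficient = (p_total >= K + (K + 1) * SL
                      and all(pk >= Lk + 1 + SL for pk, Lk in zip(p, L))
                      and N >= K + (K + 2) * SL)
        return necessary, sufficient

    # Row K = 2 of table 4.1 (L_k = 5, richly excited sources):
    print(si_conditions(K=2, N=12, L=[5, 5], p=[100, 100]))  # (True, False)
    print(si_conditions(K=2, N=42, L=[5, 5], p=[100, 100]))  # (True, True)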
Table 4.1 shows the growth in the necessary and sufficient conditions as the number of users increases, for a fixed channel size and fixed channel orders.

Table 4.1: Necessary and sufficient data sizes for strict identifiability
(M = 6 channels, $L_k$ = 5 for all sources)

    # of sources K            1    2    3    4    5
    necessary data size N     6   12   18   24   30
    sufficient data size N   16   42   78  124  180

4.1.3 The Fisher information of the convolutive mixture model

4.1.3.1 Complex-valued Fisher information

Before presenting the Fisher information of the model in (4.2), certain details about Fisher information matrices for complex-valued parameters are necessary. The structure of the FIM depends on the ordering of the parameters. For complex-valued parameters, the FIM parameter vector consists of the real and imaginary parts of the parameters. If the complex-valued parameters are collected in the vector $\eta$, the (FIM) real parameter vector is defined to be $\theta^T = \bigl[\mathrm{Re}(\eta^T),\,\mathrm{Im}(\eta^T)\bigr]$, and the real-valued parameter FIM has the structure

$$I(\theta) = \begin{bmatrix} E_1(\eta) & -E_2(\eta)\\ E_2(\eta) & E_1(\eta)\end{bmatrix}, \qquad (4.10)$$

then the complex-valued parameter Fisher information matrix (CFIM) may be defined as (Footnote 4: This CFIM is a submatrix of the complex-valued parameter FIM developed by van den Bos [70], which would be the preferred FIM for use in a performance metric or in applying constraints [32]. However, in this and the next sections, the CFIM presented here is only used to obtain properties relevant to the real-valued parameter FIM.)

$$\mathcal{I}(\eta) = E\,\rho(x;\eta)\rho^H(x;\eta) = \tfrac{1}{2}\bigl(E_1(\eta) + jE_2(\eta)\bigr),$$

where $\rho(x;\eta) = \partial\log p(x;\eta)/\partial\eta^*$. (Footnote 5: The complex derivative is defined to be $\frac{\partial}{\partial\eta} = \frac{\partial}{\partial\mathrm{Re}\,\eta} - j\frac{\partial}{\partial\mathrm{Im}\,\eta}$ in this thesis.) A number of properties of the real-valued parameter FIM can be gleaned from properties of the CFIM.

Theorem 4.6. The null space of a FIM of the form (4.10) has dimension exactly twice the dimension of the null space of the corresponding CFIM, i.e., $\mathrm{nullity}(I(\theta)) = 2\cdot\mathrm{nullity}(\mathcal{I}(\eta))$.

Proof. Note $\begin{bmatrix} E_1 & -E_2\\ E_2 & E_1\end{bmatrix}\begin{bmatrix} a\\ b\end{bmatrix} = 0$ if and only if $E_1a - E_2b = 0$ and $E_2a + E_1b = 0$, if and only if $\bigl(E_1 + jE_2\bigr)\bigl(a + jb\bigr) = \bigl(E_1a - E_2b\bigr) + j\bigl(E_2a + E_1b\bigr) = 0$. Also, $\begin{bmatrix} E_1 & -E_2\\ E_2 & E_1\end{bmatrix}\begin{bmatrix} a\\ b\end{bmatrix} = 0$ if and only if $\begin{bmatrix} E_1 & -E_2\\ E_2 & E_1\end{bmatrix}\begin{bmatrix} -b\\ a\end{bmatrix} = 0$. Finally, the column space of $\begin{bmatrix} a & -b\\ b & a\end{bmatrix}$ has dimension 2 unless $a = b = 0$; however, the dimension of $\bigl[a+jb,\; -b+ja\bigr]$ is only 1, since $-b + ja = j\,(a + jb)$.

A version of Bang's formula [36, a variant of (15.52) on p. 525] can be developed for complex-valued parameters, i.e.,

$$\mathcal{I}_{ij}(\eta) = \mathrm{tr}\Bigl[C^{-1}(\eta)\frac{\partial C(\eta)}{\partial\eta_i^*}C^{-1}(\eta)\frac{\partial C(\eta)}{\partial\eta_j}\Bigr] + \frac{\partial\mu^H(\eta)}{\partial\eta_i^*}C^{-1}(\eta)\frac{\partial\mu(\eta)}{\partial\eta_j} + \frac{\partial\mu^T(\eta)}{\partial\eta_i^*}C^{-1}(\eta)\frac{\partial\mu^*(\eta)}{\partial\eta_j}$$

for any observations $y\sim\mathcal{CN}\bigl(\mu(\eta),C(\eta)\bigr)$.
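Theorem 4.6 is easy to confirm numerically. A quick sketch (illustrative, not from the text): build a Hermitian positive semidefinite CFIM-like matrix with a one-dimensional null space, embed it in the real structure (4.10), and compare nullities, assuming NumPy:

    import numpy as np

    rng = np.random.default_rng(3)
    n = 4
    Q = rng.standard_normal((n - 1, n)) + 1j * rng.standard_normal((n - 1, n))
    cfim = Q.conj().T @ Q                  # Hermitian PSD, nullity 1 (generically)

    E1, E2 = cfim.real, cfim.imag
    fim = np.block([[E1, -E2], [E2, E1]])  # real-parameter FIM structure (4.10)

    def nullity(A, tol=1e-9):
        return A.shape[1] - np.linalg.matrix_rank(A, tol)

    print(nullity(cfim), nullity(fim))     # 1, 2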
4.1.3.2 CFIM for the convolutive mixture model

For the convolutive mixture model in (4.3) and (4.6), only the mean vector depends on the unknown parameters; hence

$$\mathcal{I}(\eta) = \frac{\partial\mu^H(\eta)}{\partial\eta^*}C^{-1}\frac{\partial\mu(\eta)}{\partial\eta^T} + \frac{\partial\mu^T(\eta)}{\partial\eta^*}C^{-1}\frac{\partial\mu^*(\eta)}{\partial\eta^T}.$$

The mean vector is

$$\mu(\eta) = \sum_{k=1}^K H_M^{(k)}s^{(k)} = \sum_{k=1}^K\bigl(I_M\otimes S^{(k)}\bigr)h^{(k)}.$$

Therefore, if the complex-valued parameter vector of the model in (4.3) is defined by

$$\eta^T = \bigl[h^{(1)T}\;\; s^{(1)T}\;\;\cdots\;\; h^{(K)T}\;\; s^{(K)T}\bigr] \qquad (4.11)$$

then the complex-valued FIM of the model can be shown to be

$$\mathcal{I}(\eta) = \frac{1}{\sigma^2}Q^HQ = \frac{1}{\sigma^2}\begin{bmatrix}
Q_1^HQ_1 & Q_1^HQ_2 & \cdots & Q_1^HQ_K\\
Q_2^HQ_1 & Q_2^HQ_2 & \cdots & Q_2^HQ_K\\
\vdots & \vdots & & \vdots\\
Q_K^HQ_1 & Q_K^HQ_2 & \cdots & Q_K^HQ_K
\end{bmatrix} \qquad (4.12)$$

where $Q = \bigl[Q_1,\dots,Q_K\bigr]$ and $Q_k = \bigl[I_M\otimes S^{(k)}\;\;\; H_M^{(k)}\bigr]$.

4.1.3.3 Properties of the CFIM

In this section, I develop properties of this CFIM. In particular, the singularity of the CFIM is proven and a limit on the dimension of this singularity is detailed, as well as necessary and sufficient conditions on the parameter characteristics to attain this limit.

Given the inherent relationship between regularity of the FIM and identifiability detailed in section 3.2, it should not be surprising that the FIM, and hence the CFIM, is singular. The model presented in (4.3) or (4.6) has a multiplicative ambiguity for any source interacting with its corresponding channel, i.e., for any nonsingular matrix $A\in\mathbb{C}^{(L_k+1)\times(L_k+1)}$, the input source matrix $S^{(k)}$ and the channel vector $h_m^{(k)}$ are indistinguishable from $S^{(k)}A$ and $A^{-1}h_m^{(k)}$, respectively. Over all the channels, this presumes a complex-valued multiplicative ambiguity of at least $ML_k$. Additionally, cross-source ambiguities exist between a source input and a different source's channel. The limit on the minimal degrees of freedom, or the rank of the ambiguity, is given in the following theorem.

Theorem 4.7. The CFIM is singular, and the dimension of its null space is lower bounded as

$$\mathrm{nullity}(\mathcal{I}(\eta)) \;\ge\; \sum_{i=1}^K\sum_{j=1}^K (L_i - L_j + 1)^+, \qquad (4.13)$$

where $(a)^+ = a$ for $a\ge 0$ and $(a)^+ = 0$ for $a < 0$. This limit quantity is the nullity lower bound (NLB).

The proof can be found in appendix C. The proof constructs the following matrix, whose linearly independent columns are a basis for the null space of $\mathcal{I}(\eta)$:

$$N = \begin{bmatrix}
h^{(1)} & H^{(2)}(1) & 0 & 0 & \cdots & H^{(K)}(1) & 0 & \cdots & 0 & \cdots & 0 & 0\\
-s^{(1)} & 0 & 0 & -S^{(2)}(1) & \cdots & 0 & 0 & \cdots & 0 & \cdots & 0 & -S^{(K)}(1)\\
0 & 0 & h^{(2)} & H^{(1)}(2) & \cdots & 0 & H^{(K)}(2) & \cdots & 0 & \cdots & 0 & 0\\
0 & -S^{(1)}(2) & -s^{(2)} & 0 & \cdots & 0 & 0 & \cdots & 0 & \cdots & -S^{(K)}(2) & 0\\
\vdots & \vdots & \vdots & \vdots & & \vdots & \vdots & & \vdots & & \vdots & \vdots\\
0 & 0 & 0 & 0 & \cdots & 0 & 0 & \cdots & h^{(K)} & \cdots & H^{(2)}(K) & H^{(1)}(K)\\
0 & 0 & 0 & 0 & \cdots & -S^{(1)}(K) & -S^{(2)}(K) & \cdots & -s^{(K)} & \cdots & 0 & 0
\end{bmatrix} \qquad (4.14)$$

where $H^{(i)}(j)$ and $S^{(i)}(j)$ are defined in (C.2) and (C.4), respectively.

The nullity lower bound (NLB) seems an unusual quantity. The term $(L_i - L_j + 1)^+$ represents the ambiguity for the "tap window" of the $i$th channel masking the interaction of the $j$th channel with its corresponding source. This quantity also reveals that an ambiguity exists only if the coherence-propagation delay of the $i$th channel is at least as great as that of the $j$th channel. Intuitively, a greater spread between the channel orders increases the overlapping window in which the $j$th channel's interaction can be masked, and thereby increases the ambiguity. Conversely, when the channel orders have narrow differences, the diversity of the channel is increased by limiting the window in which one channel can cover that of another.

A simple corollary follows, stating that the NLB, or the dimension of the ambiguity set, is at least the square of the number of sources.

Corollary 4.8. $\mathrm{nullity}(\mathcal{I}(\eta))\ge K^2$.

Proof. Note $(L_i - L_j + 1)^+ + (L_j - L_i + 1)^+ \ge 2$, with equality if and only if $|L_i - L_j|\le 1$.
Thus,

$$\sum_{i=1}^K\sum_{j=1}^K (L_i - L_j + 1)^+ = \sum_{i=1}^K\sum_{\substack{j=1\\ j\neq i}}^K (L_i - L_j + 1)^+ + \sum_{i=1}^K (L_i - L_i + 1)^+$$
$$= \sum_{i=1}^K\sum_{j=i+1}^K \bigl[(L_i - L_j + 1)^+ + (L_j - L_i + 1)^+\bigr] + K \;\ge\; \frac{K(K-1)}{2}\cdot 2 + K = K^2.$$

The proof also reveals that this $K^2$ degree of uncertainty can be attained only if any two orders differ by at most 1 tap, which agrees with the intuition that channel diversity is enhanced when the channel orders are not widely spread. In addition to the ambiguity due to the channel order spread, the degrees of freedom of the model depend on the number of parameters and the number of observations.

Theorem 4.9. $\mathrm{nullity}(\mathcal{I}(\eta)) \ge \sum_{k=1}^K\bigl(N + L_k + M(L_k+1)\bigr) - MN$.

Proof. Since $\mathcal{I}(\eta) = \frac{1}{\sigma^2}Q^HQ$, $\mathrm{rank}(\mathcal{I}(\eta)) = \mathrm{rank}(Q)$. Since $Q$ is $MN\times\sum_{k=1}^K\bigl(N + L_k + M(L_k+1)\bigr)$, $\mathrm{rank}(\mathcal{I}(\eta))\le\min\bigl\{MN,\,\sum_{k=1}^K(N + L_k + M(L_k+1))\bigr\}$, and therefore

$$\mathrm{nullity}(\mathcal{I}(\eta)) \;\ge\; \mathrm{colsize}(\mathcal{I}(\eta)) - \mathrm{rank}(\mathcal{I}(\eta)) \;\ge\; \max\Bigl\{0,\;\sum_{k=1}^K\bigl(N + L_k + M(L_k+1)\bigr) - MN\Bigr\}.$$

The number of columns of $Q$ (or $\mathcal{I}(\eta)$) corresponds to the number of unknown parameters. Likewise, the number of rows of $Q$ corresponds to the number of equations (observations) in the model (4.2). In most scenarios, having more equations (rows) than unknowns (columns) is a necessary requirement for the unknowns to be solvable. Thus, the degrees of freedom of the ambiguity space is at least the number of unknowns (parameters) less the number of equations. However, if the equations are linearly dependent (redundant), as is the case in the convolutive mixture model, then the degrees of freedom are potentially greater. The following corollary combines theorems 4.7 and 4.9.

Corollary 4.10. $\mathrm{nullity}(\mathcal{I}(\eta)) \ge \max\Bigl\{\sum_{i=1}^K\sum_{j=1}^K(L_i - L_j + 1)^+,\;\sum_{k=1}^K\bigl(N + L_k + M(L_k+1)\bigr) - MN\Bigr\}$.

The NLB depends only on the channel orders and the number of sources, whereas the "equations vs. unknowns" nullity bound depends on the channel orders, the sample size per source, the number of subchannels per source, and the number of sources. Hence, it is of interest to determine under what conditions on the sample size $N$ and the number of channels $M$ per source this second bound is of no consequence. Control over the dimensions $K$ (the number of transmitters), $M$ (the number of receivers), and $N$ (the number of time snapshots or transmission length) is possible in the design of many communications models. The next two results determine conditions on the dimensions of the model that allow the NLB to be the minimum possible degrees of freedom. The first condition requires more subchannels per source than sources (or more receivers than transmitters in a communications context).

Theorem 4.11. The CFIM $\mathcal{I}(\eta)$ can attain the CFIM nullity lower bound only if $M > K$.

Proof. If $M\le K$, then $H_M$, which is an $MN\times\bigl(KN + \sum_{k=1}^K L_k\bigr)$ matrix, cannot be full column rank by theorem 4.2. This implies the existence of a null space of larger dimension than the space spanned by the columns of $N$ in (4.14). Hence $\mathrm{nullity}(\mathcal{I}(\eta)) > \sum_{i=1}^K\sum_{j=1}^K(L_i - L_j + 1)^+$.

Under the assumption that there are more receivers than transmitters in the model, corollary 4.10 implies a minimal requirement on the transmission snapshots.
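Both bounds in corollary 4.10 reduce to a few lines of arithmetic. An illustrative sketch (names hypothetical):

    def nlb(L):
        # Nullity lower bound (4.13): sum over source pairs of (L_i - L_j + 1)^+.
        return sum(max(Li - Lj + 1, 0) for Li in L for Lj in L)

    def nullity_bound(L, M, N):
        # Corollary 4.10: NLB vs. the "unknowns minus equations" count (thm 4.9).
        params = sum(N + Lk + M * (Lk + 1) for Lk in L)
        return max(nlb(L), params - M * N)

    print(nlb([5, 5]))                      # 4 = K^2 when all orders are equal
    print(nlb([5, 8]))                      # 1 + 0 + 4 + 1 = 6
    print(nullity_bound([5, 5], M=6, N=42)) # 4: the NLB dominates here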
Theorem 4.12. Provided $M > K$, the CFIM $\mathcal{I}(\eta)$ can attain the CFIM nullity lower bound only if

$$N \;\ge\; \frac{K+1}{M-K}\sum_{j=1}^K L_j + \sum_{j=1}^K L_j + K + \frac{1}{M-K}\Bigl(K^2 - \sum_{i=1}^K\sum_{j=1}^K (L_i - L_j + 1)^+\Bigr). \qquad (4.15)$$

Proof. We desire $\sum_{i=1}^K\sum_{j=1}^K(L_i - L_j + 1)^+ \ge \sum_{k=1}^K\bigl(N + L_k + M(L_k+1)\bigr) - MN$. Solving for $N$ shows the result.

As the channel diversity (in terms of the number of channels per source) increases, the necessary data size to attain the CFIM NLB decreases. As $M\to\infty$, the bound on the sample size becomes simply $N\ge K + \sum_{j=1}^K L_j$, which is comparable to theorem 4.4(d). However, for any finite $M$ and nontrivial channel orders, the bound is effectively $N\ge K + \sum_{j=1}^K L_j + 1$, since $-\sum_{i=1}^K\sum_{j=1}^K(L_i - L_j + 1)^+ \ge -K\sum_{j=1}^K(L_j + 1)$, and hence a looser bound than (4.15) is

$$N \;\ge\; \frac{K+1}{M-K}\sum_{j=1}^K L_j + \sum_{j=1}^K L_j + K + \frac{1}{M-K}\Bigl(K^2 - K\sum_{j=1}^K (L_j + 1)\Bigr) = \sum_{j=1}^K L_j + K + \frac{1}{M-K}\sum_{j=1}^K L_j.$$

In the scenario requiring the most data samples, $M = K + 1$, the system requires

$$N \;\ge\; (K+1)\sum_{j=1}^K L_j + \sum_{j=1}^K L_j + K + \Bigl(K^2 - \sum_{i=1}^K\sum_{j=1}^K (L_i - L_j + 1)^+\Bigr).$$

(This last term is always nonpositive.) In the SIMO scenario ($K = 1$), this necessary condition becomes $N\ge 3L + 1$, which agrees with the requirement in [31]. (Footnote 6: For the SIMO ($K = 1$) case, this nullity lower bound is exactly 1, and parameters for which the CFIM attains this bound are said to be Fisher information identifiable in [30]. The notion of identifiability for the SIMO case is sensible, since by scaling a single parameter it is possible to obtain a bound on all the remaining parameters relative to the scaled parameter. This naturally connects with the notion of strict identifiability in section 4.1.2. This interpretation does not extend simply to the MIMO ($K > 1$) case; however, it shall be shown that there does exist an inherent connection between a CFIM attaining the NLB and the notion of strict identifiability.)

Given that $M > K$ and $N$ satisfies (4.15), the minimum possible nullity of $\mathcal{I}(\eta)$ is the nullity lower bound $\sum_{i=1}^K\sum_{j=1}^K(L_i - L_j + 1)^+$, i.e., it is possible to limit the ambiguity strictly to the mixing of sources in the convolutive channel. If it is possible to understand the conditions under which the CFIM attains this nullity lower bound, then it is known that the null space is completely characterized by the columns of $N$ in (4.14). The importance of this is that, when the null space can be parameterized, it is possible to use theorems 3.23 and 3.24 to determine constraints that lead to regularity of the CFIM (and FIM). Using the concepts of signal excitation (modes) and channel diversity (irreducibility and column-reducedness) defined in section 4.1.1.2, the following theorems also establish a correlation of conditions between the idea of near-regularity, when the FIM attains the NLB, and the near-identifiability (or strict identifiability) of section 4.1.2.

Theorem 4.13 (CFIM NLB necessary conditions). The M-channel K-source FIR system Fisher information matrix has a nullity of exactly the NLB in (4.13) only if
(a) $H(z)$ is irreducible and column-reduced,
(b) $p_{\mathrm{total}}\ge K + \sum_{j=1}^K L_j$,
(c) $p_k\ge L_k + 2$ for $k = 1,\dots,K$, or $p_k\ge 1$ if $L_k = 0$,
(d) $N\ge K + \sum_{j=1}^K L_j$, and
(e) $M > K$.

Theorem 4.14 (CFIM NLB sufficiency conditions).
The M-channel K-source FIR system FIM has a nullity of exactly the NLB in (4.13) if
(a) $H(z)$ is irreducible and column-reduced,
(b) $p_{\mathrm{total}}\ge K + (K+1)\sum_{j=1}^K L_j$,
(c) $p_k\ge L_k + 1 + \sum_{j=1}^K L_j$ for $k = 1,\dots,K$,
(d) $N\ge K + (K+2)\sum_{j=1}^K L_j$, and
(e) $M > K$.

Table 4.2: Necessary and sufficient data sizes for the FIM to attain the NLB
(M = 6 channels, $L_k$ = 5 for all sources)

    # of sources K            1    2    3    4    5
    necessary data size N     8   20   38   74  180
    sufficient data size N   16   42   78  124  180

With the exception of condition (e), theorems 4.13 and 4.14 are identical to theorems 4.4 and 4.5, respectively. This is an expected result, since the Fisher information for identifiable parametric components within the model should meet certain regularity conditions (see theorem 2.5). For theorem 4.13(d), the condition is weaker than the condition in theorem 4.12, but it is left weaker to agree with the necessary condition for strict identifiability in theorem 4.4(d). (It is unclear whether that condition would change if dependence on the channel size $M$ were considered.) Table 4.2 shows the growth in the necessary and sufficient data size $N$ as the number of sources increases, for fixed channel size and channel orders, using the stronger condition in theorem 4.12 instead of that in theorem 4.13(d). It is clear by comparison that, as the number of sources approaches the number of channels, the necessary data size grows to match the sufficient data size, which leads to theorem 4.15 and corollary 4.16.

First, the special case where the CFIM attains the minimal ambiguity is considered. As stated earlier, this occurs when the windows of the channels are, in some sense, minimally spread.

Theorem 4.15. Given sufficient channel diversity and modes in the signals, e.g., the sufficient conditions of theorem 4.14, $\mathrm{nullity}(\mathcal{I}(\eta)) = K^2$ if and only if $L_j\in\{L_0, L_0+1\}$ for all $j = 1,\dots,K$, for some integer $L_0$.

Proof. Under the conditions of theorem 4.14, $\mathrm{nullity}(\mathcal{I}(\eta)) = \mathrm{NLB}$, and from the proof of corollary 4.8, $\mathrm{NLB} = K^2$ if and only if $(L_i - L_j + 1)^+ + (L_j - L_i + 1)^+ = 2$ for each $i,j = 1,\dots,K$. Either $(L_i - L_j + 1)^+ = (L_j - L_i + 1)^+ = 1$, or $(L_i - L_j + 1)^+ = 2$ and $(L_j - L_i + 1)^+ = 0$, or vice versa.

The next corollary details the necessary and sufficient data size for the sources under the special case of equal channel orders.

Corollary 4.16. Given sufficient signal modes and channel diversity, e.g., as in theorem 4.14, if $L_k = L$ for each $k$ and $M = K + 1$, then the CFIM attains the NLB if and only if $N\ge(K+2)KL + K$.

4.1.4 Constraints for the convolutive mixture model

In the previous section, necessary and sufficient conditions on characteristics of the signal sources and channel properties were detailed for the CFIM to have a null space of minimal dimension equal to the nullity lower bound $\sum_{i=1}^K\sum_{j=1}^K(L_i - L_j + 1)^+$. In this section, pathways to regularity of the CFIM (and hence the FIM), as well as several typical constraint sets, are considered.

4.1.4.1 Pathways to regularity

Requiring constraints to attain regularity of the Fisher information is unnecessary for using this CCRB theory (theorem 3.17), but often, without constraints, the identifiable parameters are not in a framework useful to the practitioner. In the communications context, constraints on a model assumed a priori are simply a common method for maintaining a particular parametric structure in the model. Guided by theorems 3.23, 3.24, 3.25, 3.27, and 3.29, an objective of communications system design is then to discover constraints under which identifiability of the parameters, as they are defined, is achieved, i.e., what properties can be imposed on the signal or channel to guarantee parametric identifiability. For the convolutive mixture model in (4.2), this involves an examination of the Fisher information in (4.10) with the CFIM in (4.12). In some sense, this has already been done: in section 4.1.3.3, conditions were derived under which the CFIM has a null space spanned by the columns of the matrix $N$ defined in (4.14). Suppose $I(\theta)$ is a Fisher information with Cholesky factorization $L(\theta)L^T(\theta)$, and let $V(\theta)$ be an orthogonal complement of $L(\theta)$ [20, p. 194]. Then theorem 3.23 implies that, for any constraint to achieve local identifiability, its Jacobian must satisfy $F(\theta) = AL^T(\theta) + BV^T(\theta)$, where $BV^T$ has full row rank. Similarly, theorem 3.24 implies the constraints must have an orthonormal complement $U(\theta)$ satisfying $U(\theta) = L(\theta)C + V(\theta)D$, where $L(\theta)C$ is full column rank.
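Numerically, the identifiability condition behind theorems 3.23 and 3.24 reduces to checking that the null spaces of $I(\theta)$ and $F(\theta)$ intersect only at the origin, i.e., that $U^TIU$ is regular. A sketch under that reading (illustrative names, NumPy assumed); the toy FIM is that of $x\sim\mathcal{N}(ab,1)$, whose scaling ambiguity reappears in example 4.21:

    import numpy as np

    def locally_identifiable(I_fim, F, tol=1e-9):
        # U spans null(F); the constrained model is locally identifiable
        # when U^T I U is regular.
        _, s, Vt = np.linalg.svd(F)
        rank_F = int(np.sum(s > tol * max(s.max(), 1.0)))
        U = Vt[rank_F:].T
        M = U.T @ I_fim @ U
        return np.linalg.matrix_rank(M, tol) == M.shape[0]

    a, b = 2.0, 3.0
    I_fim = np.array([[b * b, a * b], [a * b, a * a]])   # singular FIM
    print(locally_identifiable(I_fim, np.array([[1.0, 0.0]])))  # True: pins a
    print(locally_identifiable(I_fim, np.array([[b, a]])))      # False: the
    # ambiguity direction (a, -b) remains unconstrained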
4.1.4.2 Norm channel + real-valued source constraint

The norm channel constraint

$$\bigl\|h^{(1)}\bigr\|^2 = 1 \qquad (4.16)$$

is a common scaling "trick" used in SIMO ($K = 1$ source) models to obtain channel estimates under second-order statistics assumptions. Combined with the (rotational) constraint restricting the source elements to be real-valued, i.e.,

$$\mathrm{Im}\bigl(s^{(1)}(n)\bigr) = 0 \qquad (4.17)$$

for $n = -L_1, -L_1+1, \dots, N-1$, it is clear that the multiplicative ambiguity which is the basis of $N$ is eliminated. For the parameter vector $\theta = \bigl[\mathrm{Re}(\eta^T),\mathrm{Im}(\eta^T)\bigr]^T$ with $\eta$ defined in (4.11), the constraints essentially separate as

$$F(\theta) = \begin{bmatrix} F_{\mathrm{Re}(h)} & 0 & F_{\mathrm{Im}(h)} & 0\\ 0 & F_{\mathrm{Re}(s)} & 0 & F_{\mathrm{Im}(s)}\end{bmatrix}$$

where $h = h^{(1)}$ and $s = s^{(1)}$. An orthonormal complement, or a basis for the null space, of $\bigl[F_{\mathrm{Re}(s)}\,,\,F_{\mathrm{Im}(s)}\bigr]$ is simply $\begin{bmatrix} I_{(N+L_1)\times(N+L_1)}\\ 0\end{bmatrix}$, but an analytic formula for the null space relating to the channel components is not so simple and requires numerical programming for arbitrary sizes $L_1$ and $N$.

Example 4.17. As a particular example of this constraint, consider the $M = 2$ SIMO convolutive channel model with $L = 3$ taps, where the channel is predefined by

$$\begin{bmatrix} h_1^{(1)T}\\ h_2^{(1)T}\end{bmatrix} = \begin{bmatrix} 0.3079+j0.0698 & 0.1657+j0.2304 & 0.0198-j0.3823 & 0.0929-j0.1853\\ -0.1841+j0.3294 & 0.4484-j0.1689 & 0.0156+j0.1526 & 0.4750-j0.0952\end{bmatrix},$$

under a white noise assumption (with known variance $\sigma^2$) and the constraints in (4.16) and (4.17). The 54 ($N = 50$) signal elements are BPSK ($\pm 1$) symbols randomly generated with equal probability (i.e., Bernoulli($\tfrac{1}{2}$)) and fixed for the simulation. The subspace method [52, 68] is used with a smoothing factor of 8. Additionally, scoring with constraints (CSA) using (3.36) is applied to the subspace method's estimate $\hat{h}^{(1)}$ for possible improvement. The results are shown in figure 4.2, where the mean-square error (MSE) per real channel coefficient is evaluated by

$$\frac{1}{16R}\sum_{r=1}^R\bigl\|\hat{h}^{(1)} - h^{(1)}\bigr\|^2$$

over $R = 100$ runs or trials, and the SNR is given by $10\log_{10}\frac{1}{\sigma^2}$. This example is also in [68] and shows how the subspace method, which had no prior theory-based performance metric, tracks with the CCRB.
The additional overlaid estimation using the method of scoring demonstrates the local maximum likelihood properties of the subspace method.

Figure 4.2: Norm-constrained channel estimation performance (MSE per channel coefficient versus SNR for the subspace method, CSA, and the CCRB).

4.1.4.3 Semiblind constraints: $s^{(k)}(t) = p(t)$ for $t\in T$

For multiple sources, designing communications with channel constraints such as those in the previous section is less tenable, since it is not always possible to establish guarantees on functions of the channel elements. The sources, however, are often entirely designable elements, by restricting the class of signals to be passed through the channels. The class only needs to be defined by some functional constraint. Perhaps the simplest constraint is knowledge of a parameter element, i.e., $\theta_i = a$. This constraint produces a row vector $e_i^T$ in the Jacobian $F(\theta)$, where $e_i$ is the unit vector with unity in the $i$th position and zeros elsewhere. Therefore, the corresponding orthonormal complement $U(\theta)$ for this unit vector eliminates, or nulls out, the $i$th row and column of the Fisher information $I(\theta)$ while preserving the other elements. Any other nonredundant constraint in $f$ does not change this.

Theorem 4.18. Assume the conditions of theorem 4.14, so that $\mathrm{nullity}(I(\theta)) = \sum_{i=1}^K\sum_{j=1}^K(L_i - L_j + 1)^+$. Let $I^\circ(\theta)$ denote $I(\theta)$ with the rows $\{i_p : p = 1,\dots,\sum_{i=1}^K\sum_{j=1}^K(L_i - L_j + 1)^+\}$ and the corresponding columns removed. Then $\mathrm{nullity}(I^\circ(\theta)) = 0$ if and only if $N^\circ$ is full column rank, where $N^\circ$ is the matrix formed by taking the rows $\{i_p\}$ of $N$ in (4.14).

Proof. Let $F$ be a matrix that, when multiplied by the matrix $N$, selects the rows $\{i_p\}$, i.e., $F$ is the matrix consisting of the row vectors $\{e_i^T : i\in\{i_p\}\}$. Then $FN = N^\circ$, a square full rank matrix. Therefore $F = AI(\theta) + BN^T$ for some $A$ and some $B$, where $BN^T$ is full row rank. As discussed in section 4.1.4.1, this identifies the parameters, and hence $\mathrm{nullity}(I^\circ(\theta)) = 0$ by theorem 3.24.

This theorem can be used to specify either channel or source parameters to achieve identifiability and FIM regularity. For example, if only source signal parameters are specified, then it is necessary and sufficient to specify $\sum_{j=1}^K(L_i - L_j + 1)^+$ parameters of $s^{(i)}$ for each $i = 1,\dots,K$, under the conditions of theorem 4.18.

4.1.4.4 Unit modulus constraint + semiblind constraint

The unit or constant modulus constraint is a particularly powerful and reasonable assumption. All the source elements are assumed to have unit modulus, i.e.,

$$\bigl|s^{(k)}(n)\bigr|^2 = 1, \qquad (4.18)$$

for every $n = -L_k,\dots,N-1$ and for each $k = 1,\dots,K$. This constraint is useful for modeling P-ary phase-shift keying (PSK) in communications models with unknown $P$, where the signals are assumed to be derived from a finite constellation of $P$ equispaced points on the unit circle. While in practice this assumption is often viewed as a single constraint, it is ultimately $\sum_{i=1}^K(N + L_i)$ constraints. Nevertheless, despite this, it is insufficient to identify the parameters by itself. For the parameter vector in (4.11), the constraint has a gradient

$$F(\theta) = \begin{bmatrix}
0 & 2\mathrm{Re}(S_d^{(1)}) & 0 & 0 & \cdots & 0 & 2\mathrm{Im}(S_d^{(1)}) & 0 & 0 & \cdots\\
0 & 0 & 0 & 2\mathrm{Re}(S_d^{(2)}) & \cdots & 0 & 0 & 0 & 2\mathrm{Im}(S_d^{(2)}) & \cdots\\
 & \vdots & & & & & \vdots & & &
\end{bmatrix},$$

which has an orthonormal complement

$$U(\theta) = \begin{bmatrix}
0 & I_{M(L_1+1)} & 0 & 0 & 0 & 0 & \cdots\\
-\mathrm{Im}(S_d^{(1)}) & 0 & 0 & 0 & 0 & 0 & \cdots\\
0 & 0 & 0 & 0 & I_{M(L_2+1)} & 0 & \\
0 & 0 & 0 & -\mathrm{Im}(S_d^{(2)}) & 0 & 0 & \\
\vdots & & & & & \vdots & \\
0 & 0 & I_{M(L_1+1)} & 0 & 0 & 0 & \cdots\\
\mathrm{Re}(S_d^{(1)}) & 0 & 0 & 0 & 0 & 0 & \cdots\\
0 & 0 & 0 & 0 & 0 & I_{M(L_2+1)} & \\
0 & 0 & 0 & \mathrm{Re}(S_d^{(2)}) & 0 & 0 & \\
\vdots & & & & & \vdots &
\end{bmatrix}$$
satisfying (3.6), where $S_d^{(i)} = \mathrm{diag}(s^{(i)})$. Using this complement with (4.10) and (4.12), the $(i,j)$ subblock of $U^T(\theta)I(\theta)U(\theta)$ has the structure

$$C(i,j) = \begin{bmatrix}
\mathrm{Re}\bigl[S_d^{(i)H}H_M^{(i)H}H_M^{(j)}S_d^{(j)}\bigr] & \mathrm{Im}\bigl[S_d^{(i)H}H_M^{(i)H}(I_M\otimes S^{(j)})\bigr] & \mathrm{Re}\bigl[S_d^{(i)H}H_M^{(i)H}(I_M\otimes S^{(j)})\bigr]\\
-\mathrm{Im}\bigl[(I_M\otimes S^{(i)})^HH_M^{(j)}S_d^{(j)}\bigr] & \mathrm{Re}\bigl[I_M\otimes S^{(i)H}S^{(j)}\bigr] & \mathrm{Im}\bigl[I_M\otimes S^{(i)H}S^{(j)}\bigr]\\
\mathrm{Re}\bigl[H_M^{(i)H}H_M^{(j)}S_d^{(j)}\bigr] & \mathrm{Im}\bigl[H_M^{(i)H}(I_M\otimes S^{(j)})\bigr] & \mathrm{Re}\bigl[H_M^{(i)H}(I_M\otimes S^{(j)})\bigr]
\end{bmatrix}.$$

Lack of identifiability is noted, since

$$C(k,k)\begin{bmatrix} 1_{N+L_k}\\ \mathrm{Im}(h^{(k)*})\\ \mathrm{Re}(h^{(k)*})\end{bmatrix} = 0$$

for each $k = 1,\dots,K$. Careful examination determines these are the only vectors in the null space of $U^T(\theta)I(\theta)U(\theta)$, resulting in the following theorem.

Theorem 4.19. Assume $\mathrm{nullity}(I(\theta)) = 2\sum_{i=1}^K\sum_{j=1}^K(L_i - L_j + 1)^+$. If all sources are assumed to be unit modulus and one complex-valued parameter for each source is assumed known, then $U^T(\theta)I(\theta)U(\theta)$ will be regular (and the model locally identifiable).

Intuitively, the unit modulus constraint eliminates the convolutive mixture and the intra-source multiplicative amplitude ambiguity, leaving only an intra-source multiplicative phase ambiguity.

Example 4.20. Consider a $K = 2$ source, $M = 2$ sensor linear instantaneous mixing model ($L_k = 0$ for all sources), with $N = 30$ samples per source, under $R = 3$-ray multipath subchannels, i.e., the spatial signature of the $k$th source is expressed as a weighted sum of steering vectors, $h_k = \sum_{r=1}^R \gamma_{kr}a(\phi_{kr})$, where $\gamma_{kr}$ and $\phi_{kr}$ are the complex-valued amplitude and the real-valued AOA of the $r$th multipath of the $k$th source (see figure 4.3).

Figure 4.3: Example of multipath channel (a source reaching the sensor array over direct and side paths).

The AOAs and corresponding amplitudes are $\{\phi-1,\,\phi,\,\phi+4\}$ and $\{-0.2e^{-j\pi/6},\,-0.5,\,-0.15e^{-j\pi/5}\}$ for source $s^{(1)}$, and $\{\phi+\Delta\phi-5,\,\phi+\Delta\phi,\,\phi+\Delta\phi+6\}$ and $\{-0.15e^{-j\pi/5},\,-0.6,\,-0.25e^{j\pi/3}\}$ for source $s^{(2)}$. (See section 4.2 for a greater description of steering vectors.) The source elements come from an 8PSK alphabet with signal powers $\mathrm{SNR}(s^{(1)}) = 20$ dB and $\mathrm{SNR}(s^{(2)}) = 15$ dB. The channel elements are normalized so that $\mathrm{SNR}(s^{(k)}) = \frac{\|h^{(k)}\|^2}{M\sigma^2}$ with $\sigma^2 = 1$. The constraints assumed are unit modulus (8PSK) as well as knowledge of the first $T = 2$ symbols for each source, more than sufficient for identifiability and FIM regularity (theorem 4.19). An initial estimate is obtained using the zero-forcing (bias reducing) variant of the algebraic constant modulus algorithm (ZF-ACMA) [71]. This algorithm is a useful tool for estimation of constant modulus source parameters in short-data-length experiments (only $N\ge K^2$ or $2K$ required), but it has no means of incorporating the training side information. The ZF-ACMA estimate is projected onto $\Theta_f$ to establish an initialization for the method of scoring with constraints. The step-size rule chose $\alpha^{(k)} = 2^{-m}$ for the least positive integer $m$ satisfying (3.37). The average MSE (per real parameter coefficient) at each iteration over 5000 trials is compared with the CCRB for each source in figure 4.4.

Figure 4.4: Source estimation with varying $\Delta\phi$ (MSE of the ZF-ACMA and scoring estimates against the CCRB for both sources).
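For this constraint set, the restoration $\pi[\cdot]$ used above has a simple natural form: each source symbol is renormalized to the unit circle, and the known training symbols are reset. A sketch (illustrative names, not from the text), assuming NumPy:

    import numpy as np

    def project_unit_modulus(s, idx=None, s_train=None):
        # Natural restoration onto the unit-modulus set (4.18), with optional
        # semiblind training symbols s_train at positions idx; zero symbols
        # are left untouched to avoid division by zero.
        s = np.asarray(s, dtype=complex).copy()
        mag = np.abs(s)
        nz = mag > 0
        s[nz] = s[nz] / mag[nz]
        if idx is not None:
            s[idx] = s_train
        return s

    s = np.array([1.2 + 0.4j, -0.3 + 0.2j, 0.9j])
    print(np.abs(project_unit_modulus(s)))   # all ones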
As should be expected, the estimation performance degrades as the primary source angles overlap, but even in the worst-case scenario with $\Delta\theta = 0°$ of separation, the approximate corresponding 8PSK bit-error rates (BER) for ZF-ACMA and for scoring with constraints are 0.2914 and 0.0254, meaning the estimation schemes result in bit decision errors roughly 29% and 3% of the time, respectively. The departure of the estimation performance from the CCRB as $\Delta\theta$ approaches $0°$ is possibly due to a loss of unbiasedness in the estimation.

4.2 Calibrated Array Model

The narrowband (calibrated) array model may be written as
$$y_m(n) = \sum_{k=1}^{K} a_m(\theta_k)\,\alpha_k\,s^{(k)}(n) + w_m(n) \qquad (4.19)$$
for $n = 1,\dots,N$ and $m = 1,\dots,M$, where $s^{(k)}(n)$ is the value of the $k$th input source at time index $n$, $\alpha_k$ is the complex-valued channel gain for the $k$th input, $a_m(\theta_k)$ is the $m$th sensor's response to the $k$th input source, $\theta_k$ is the angle of arrival (AOA) of the $k$th source, and $w_m(n)$ is the noise, modeled as zero-mean circular Gaussian with variance $\sigma^2$, i.i.d. in both time and space. In vector-matrix notation, the model for each time slot can be written as
$$y(n) = \sum_{k=1}^{K} a(\theta_k)\alpha_k s^{(k)}(n) + w(n) = A(\boldsymbol\theta)\Gamma s(n) + w(n) \qquad (4.20)$$
where the input is given by $s^T(n) = \big[s^{(1)}(n),\dots,s^{(K)}(n)\big]$, the channel gain matrix is $\Gamma = \operatorname{diag}(\alpha_1,\dots,\alpha_K)$, and the response matrix is given by
$$A(\boldsymbol\theta) = \big[a(\theta_1)\ a(\theta_2)\ \cdots\ a(\theta_K)\big] \qquad (4.21)$$
with the array response vectors $a(\theta_k)$ depending on the physical design of the array and the AOA. Calibration posits that this array geometry is known. The examples considered in this section use the uniform linear array (ULA) and the uniform circular array (UCA), as shown in figure 4.5.

Figure 4.5: Calibrated array model geometries for (a) uniform linear and (b) uniform circular arrays.

For example, for the ULA the response vectors typically have a Vandermonde structure with base $e^{-i\pi\sin(\theta_k)}$.

4.2.1 The Fisher information of the calibrated array model

4.2.1.1 Indirect derivation of the FIM

As mentioned in section 4.1, the calibrated array model can be obtained as a special case of the convolutive mixture model using constraints. The method for transforming the model is to keep the parameters of the instantaneous mixing model and to add extra parameters for the calibrated array and the channel gain to the parameter vector. When evaluating the FIM, the elements in the rows and columns corresponding to these extra parameters are zero. The constraint then represents the model reparameterization; for this case, we choose the constraints that define $H = A(\boldsymbol\theta)\Gamma$, element by element. The resulting CCRB submatrix corresponding to the extra parameters will be equivalent to the CRB of those parameters. This procedure is made clear in the following example.

Example 4.21. Assume $x \sim N(ab, 1)$. The FIM for $\theta^T = [a, b]$ is
$$I(\theta) = \begin{bmatrix} b^2 & ab \\ ab & a^2 \end{bmatrix},$$
which is singular. Suppose we wish to reparameterize the model, replacing $ab$ with $c$. This is equivalent to the constraint $f(\theta^\star) = ab - c$ for the expanded parameter vector $\theta^{\star T} = [a, b, c]$. For the complement
$$U(\theta^\star) = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ b & a \end{bmatrix}$$
of the Jacobian of the constraint, $U^T(\theta^\star)I(\theta^\star)U(\theta^\star) = I(\theta)$, which is still singular. Hence $\mathrm{CCRB}(\theta^\star) = U(\theta^\star)I^\dagger(\theta)U^T(\theta^\star)$, where the pseudoinverse is $I^\dagger(\theta) = \frac{1}{(a^2+b^2)^2}I(\theta)$, or
$$\mathrm{CCRB}(\theta^\star) = \frac{1}{(a^2+b^2)^2}\begin{bmatrix} b^2 & ab & b^3 + a^2b \\ ab & a^2 & ab^2 + a^3 \\ b^3 + a^2b & ab^2 + a^3 & b^4 + 2a^2b^2 + a^4 \end{bmatrix}.$$
The component corresponding to $c$ is $\mathrm{CCRB}(c) = \frac{1}{(a^2+b^2)^2}(b^4 + 2a^2b^2 + a^4) = 1$, which agrees with the value of $\mathrm{CRB}(c)$ for the model $x \sim N(c, 1)$.
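Example 4.21 is easy to check numerically. The sketch below is my own verification, not from the dissertation: it builds the bordered FIM for $\theta^\star = [a, b, c]$ (the $c$ row and column are zero since the model does not depend on $c$) and evaluates the CCRB as $U(U^T I U)^{+}U^T$ with a pseudoinverse, recovering $\mathrm{CCRB}(c) = 1$.

```python
import numpy as np

a, b = 1.3, -0.7
# FIM of x ~ N(ab, 1) in the expanded parameters (a, b, c):
I_star = np.zeros((3, 3))
I_star[:2, :2] = np.outer([b, a], [b, a])

U = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [b,   a  ]])           # complement of F = [b, a, -1]

ccrb = U @ np.linalg.pinv(U.T @ I_star @ U) @ U.T
print(np.round(ccrb[2, 2], 10))     # 1.0 = CRB(c) for x ~ N(c, 1)
```

The pseudoinverse handles the remaining singularity in $U^T I U$ exactly as in the closed-form expression above.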
Note that the original model need not be identifiable, nor need the replacement model be. Additional constraints under either model can also be included in the constraint function. The example verifies that, even in the more difficult case, it is possible to find the CRB and CCRB for the calibrated array model indirectly from the instantaneous mixing model using constraints. However, in the interests of clarity, in the next section the FIM and CCRB will be derived for the calibrated array model directly.

4.2.1.2 Direct derivation of the FIM

The calibrated model in (4.19) has likelihood
$$p(y(1),\dots,y(N);\xi) = \frac{1}{(\pi\sigma^2)^{MN}}\exp\left\{-\frac{1}{\sigma^2}\sum_{n=1}^{N}\big(y(n)-A(\boldsymbol\theta)\Gamma s(n)\big)^H\big(y(n)-A(\boldsymbol\theta)\Gamma s(n)\big)\right\}. \qquad (4.22)$$
For clarity, since there is a mixture of complex and real parameters requiring estimation, we use the parameter vector⁷
$$\xi = \big[\operatorname{Re}(s(1))^T,\operatorname{Im}(s(1))^T,\dots,\operatorname{Re}(s(N))^T,\operatorname{Im}(s(N))^T,\operatorname{Re}(\boldsymbol\alpha)^T,\operatorname{Im}(\boldsymbol\alpha)^T,\boldsymbol\theta^T\big]^T, \qquad (4.23)$$
with $\boldsymbol\alpha^T = [\alpha_1,\dots,\alpha_K]$ and $\boldsymbol\theta^T = [\theta_1,\dots,\theta_K]$. The Fisher information matrix is given by
$$I(\xi) = \begin{bmatrix}
\mathcal{M} & 0 & \cdots & 0 & \mathcal{M}_1 & \mathcal{L}_1 \\
0 & \mathcal{M} & & 0 & \mathcal{M}_2 & \mathcal{L}_2 \\
\vdots & & \ddots & & \vdots & \vdots \\
0 & 0 & \cdots & \mathcal{M} & \mathcal{M}_N & \mathcal{L}_N \\
\mathcal{M}_1^T & \mathcal{M}_2^T & \cdots & \mathcal{M}_N^T & \mathcal{K}_\alpha & \mathcal{L} \\
\mathcal{L}_1^T & \mathcal{L}_2^T & \cdots & \mathcal{L}_N^T & \mathcal{L}^T & K_\theta
\end{bmatrix} \qquad (4.24)$$
where
$$\mathcal{M} = \begin{bmatrix} \operatorname{Re}(M) & -\operatorname{Im}(M) \\ \operatorname{Im}(M) & \operatorname{Re}(M) \end{bmatrix}, \qquad M = \frac{2}{\sigma^2}\Gamma^H A^H(\boldsymbol\theta)A(\boldsymbol\theta)\Gamma, \qquad (4.25)$$
$$\mathcal{M}_n = \begin{bmatrix} \operatorname{Re}(M_n) & -\operatorname{Im}(M_n) \\ \operatorname{Im}(M_n) & \operatorname{Re}(M_n) \end{bmatrix}, \qquad M_n = \frac{2}{\sigma^2}\Gamma^H A^H(\boldsymbol\theta)A(\boldsymbol\theta)S(n), \qquad (4.26)$$
$$\mathcal{L}_n = \begin{bmatrix} \operatorname{Re}(L_n) \\ \operatorname{Im}(L_n) \end{bmatrix}, \qquad L_n = \frac{2}{\sigma^2}\Gamma^H A^H(\boldsymbol\theta)D(\boldsymbol\theta)\Gamma S(n), \qquad (4.27)$$
$$\mathcal{L} = \begin{bmatrix} \operatorname{Re}(L) \\ \operatorname{Im}(L) \end{bmatrix}, \qquad L = \frac{2}{\sigma^2}\sum_{n=1}^{N} S^H(n)A^H(\boldsymbol\theta)D(\boldsymbol\theta)\Gamma S(n), \qquad (4.28)$$
$$\mathcal{K}_\alpha = \begin{bmatrix} \operatorname{Re}(K_\alpha) & -\operatorname{Im}(K_\alpha) \\ \operatorname{Im}(K_\alpha) & \operatorname{Re}(K_\alpha) \end{bmatrix}, \qquad K_\alpha = \frac{2}{\sigma^2}\sum_{n=1}^{N} S^H(n)A^H(\boldsymbol\theta)A(\boldsymbol\theta)S(n), \qquad (4.29)$$
and
$$K_\theta = \frac{2}{\sigma^2}\sum_{n=1}^{N}\operatorname{Re}\big(S^H(n)\Gamma^H D^H(\boldsymbol\theta)D(\boldsymbol\theta)\Gamma S(n)\big). \qquad (4.30)$$
In the equations above, we redefine $S(n) = \operatorname{diag}\big(s^{(1)}(n),\dots,s^{(K)}(n)\big)$ and
$$D(\boldsymbol\theta) = \left[\frac{\partial a(\theta_1)}{\partial\theta_1}\ \ \frac{\partial a(\theta_2)}{\partial\theta_2}\ \ \cdots\ \ \frac{\partial a(\theta_K)}{\partial\theta_K}\right].$$

⁷It is also possible to include the noise variance parameter $\sigma^2$ in the parameter vector. However, this parameter decouples from the other parameters, resulting in an optimistic CRB of $\sigma^4/(MN)$ [16, 59], and so is uninteresting to the results herein.

4.2.1.3 Properties of the FIM

The model in (4.19) admits an ambiguity, as $\alpha_k s^{(k)}(n)$ is indistinguishable from $(c_k\alpha_k)\big(s^{(k)}(n)/c_k\big)$ for any nonzero $c_k \in \mathbb{C}$. From section 2.2, it should then be expected that the FIM in (4.24) is singular. This is indeed the case; e.g., note that
$$\begin{bmatrix}
M & 0 & \cdots & 0 & M_1 \\
0 & M & & 0 & M_2 \\
\vdots & & \ddots & & \vdots \\
0 & 0 & \cdots & M & M_N \\
M_1^H & M_2^H & \cdots & M_N^H & K_\alpha \\
L_1^H & L_2^H & \cdots & L_N^H & L^H
\end{bmatrix}\begin{bmatrix} S(1) \\ S(2) \\ \vdots \\ S(N) \\ -\Gamma \end{bmatrix} = 0,$$
since the diagonal matrices $S(n)$ and $\Gamma$ commute. And since the FIM consists of the real and imaginary parts of this matrix, it too has a null space: namely, the columns of
$$N(\xi) = \begin{bmatrix}
\operatorname{Re}(S(1)) & -\operatorname{Im}(S(1)) \\
\operatorname{Im}(S(1)) & \operatorname{Re}(S(1)) \\
\vdots & \vdots \\
\operatorname{Re}(S(N)) & -\operatorname{Im}(S(N)) \\
\operatorname{Im}(S(N)) & \operatorname{Re}(S(N)) \\
-\operatorname{Re}(\Gamma) & \operatorname{Im}(\Gamma) \\
-\operatorname{Im}(\Gamma) & -\operatorname{Re}(\Gamma) \\
0 & 0
\end{bmatrix}$$
are a basis for the null space (or at least a null subspace) of $I(\xi)$. So the nullity, or dimension of the null space, of $I(\xi)$ is at least $2K$. In fact, it is exactly $2K$ (provided $N \geq \frac{K}{M-K}$, for reasons similar to those given in theorem 4.12).
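The claim that the nullity is exactly $2K$ can be checked by brute force for a small instance. The sketch below is my own finite-dimensional check, not the dissertation's code: it assembles the complex Jacobian of the stacked mean with respect to the real parameter vector (4.23), forms $I(\xi) = \tfrac{2}{\sigma^2}\operatorname{Re}(J^H J)$, and counts the near-zero singular values. The half-wavelength ULA steering vector and the parameter ordering are assumptions consistent with the text; the rank threshold is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
K, M, N, sigma2 = 2, 4, 8, 1.0
theta = np.array([0.2, -0.5])                 # AOAs in radians
alpha = rng.standard_normal(K) + 1j * rng.standard_normal(K)
s = rng.standard_normal((K, N)) + 1j * rng.standard_normal((K, N))

m = np.arange(M)
A  = np.exp(-1j * np.pi * np.outer(m, np.sin(theta)))      # ULA steering
dA = (-1j * np.pi * np.outer(m, np.cos(theta))) * A        # d a / d theta

# Jacobian of the stacked mean w.r.t. xi in the ordering of (4.23).
P = 2 * K * N + 3 * K
J = np.zeros((M * N, P), dtype=complex)
for n in range(N):
    rows = slice(M * n, M * (n + 1))
    for k in range(K):
        J[rows, 2 * K * n + k]         = alpha[k] * A[:, k]          # Re s_k(n)
        J[rows, 2 * K * n + K + k]     = 1j * alpha[k] * A[:, k]     # Im s_k(n)
        J[rows, 2 * K * N + k]         = s[k, n] * A[:, k]           # Re alpha_k
        J[rows, 2 * K * N + K + k]     = 1j * s[k, n] * A[:, k]      # Im alpha_k
        J[rows, 2 * K * N + 2 * K + k] = alpha[k] * s[k, n] * dA[:, k]  # theta_k

fim = (2.0 / sigma2) * np.real(J.conj().T @ J)
sv = np.linalg.svd(fim, compute_uv=False)
print(int(np.sum(sv < 1e-9 * sv[0])))         # 2K = 4, matching N(xi) above
```

The four zero directions are exactly the per-source amplitude and phase exchanges between $\alpha_k$ and $s^{(k)}$, i.e., the columns of $N(\xi)$.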
4.2.2 Constraints for the calibrated array model

4.2.2.1 Constraints on the complex-valued channel gain: $\Gamma = I_{K\times K}$

One approach to eliminating the ambiguity between the complex-valued gain and the source input is to incorporate this gain into the signal. Instead of remodeling the mean of (4.19) to be
$$\mu(n,\xi) = \sum_{k=1}^{K} a(\theta_k)s^{(k)}(n) \qquad (4.31)$$
by eliminating the unknown gain, it is equivalent to impose the constraints $\alpha_k = 1$ for $k = 1,\dots,K$. The model in (4.31) is the model presented in a paper on "direction finding with narrow-band sensor arrays" by Stoica and Nehorai [67]. Hence, if the theory of chapter 3 is to be trusted, the results found by imposing the proper constraints that $\Gamma = I_{K\times K}$ (or $\boldsymbol\alpha = 1_K$) should be equivalent to the results of Stoica and Nehorai. The $K$ constraints on the complex-valued parameter $\boldsymbol\alpha$ are $2K$ constraints on the corresponding real-valued parameters, i.e., the vector $f$ of constraints can be defined as
$$f_k(\xi) = \operatorname{Re}(\alpha_k) - 1 = 0, \qquad f_{K+k}(\xi) = \operatorname{Im}(\alpha_k) = 0$$
for $k = 1,\dots,K$. For the parameter vector in (4.23), the Jacobian of these constraints is given by
$$F(\xi) = \big[\,0_{2K\times 2KN}\ \ I_{2K\times 2K}\ \ 0_{2K\times K}\,\big].$$
Since $F(\xi)N(\xi) = -I_{2K\times 2K}$ is nonsingular, this constraint is sufficient by theorem 3.23 to (locally) identify the parameters, and by theorem 3.24 the matrix $U^T(\xi)I(\xi)U(\xi)$ will be regular for any matrix $U(\xi)$ satisfying (3.6). An orthonormal basis for the null space of the Jacobian is given by the columns of
$$U(\xi) = \begin{bmatrix} I_{2KN\times 2KN} & 0_{2KN\times K} \\ 0_{2K\times 2KN} & 0_{2K\times K} \\ 0_{K\times 2KN} & I_{K\times K} \end{bmatrix},$$
and this generates a reduced "FIM"
$$U^T(\xi)I(\xi)U(\xi) = \begin{bmatrix}
\mathcal{M} & 0 & \cdots & 0 & \mathcal{L}_1 \\
0 & \mathcal{M} & & 0 & \mathcal{L}_2 \\
\vdots & & \ddots & & \vdots \\
0 & 0 & \cdots & \mathcal{M} & \mathcal{L}_N \\
\mathcal{L}_1^T & \mathcal{L}_2^T & \cdots & \mathcal{L}_N^T & K_\theta
\end{bmatrix},$$
which is equivalent to the Fisher information in [67, equation (E.9)]. Since the $\boldsymbol\alpha$ parameters are known, it is unnecessary to characterize the performance potential (or CCRB) of a known parameter; it is only of interest to have a bound on the performance of estimators of the transformation
$$\kappa(\xi) = \big[\operatorname{Re}(s(1))^T,\operatorname{Im}(s(1))^T,\dots,\operatorname{Re}(s(N))^T,\operatorname{Im}(s(N))^T,\boldsymbol\theta^T\big]^T,$$
which has the Jacobian $\frac{\partial\kappa}{\partial\xi^T} = U^T(\xi)$. Hence $\mathrm{CCRB}(\kappa) = \big(U^T(\xi)I(\xi)U(\xi)\big)^{-1}$ is the same as the CRB found by Stoica and Nehorai. This equivalence serves as further validation of the CCRB approach.

4.2.2.2 Semiblind constraints: $s(t) = p(t)$ for $t \in \mathcal{T}$

An alternative approach to eliminating the ambiguity between the source and the coefficient is to have prior knowledge of some of the source signals. These known elements are often referred to as training or pilot symbols in communications. Knowledge of any $k$th source element $s^{(k)}(t) = p^{(k)}(t)$ at any time sample $t$ resolves the ambiguity in $\alpha_k s^{(k)}(n)$ for all $n$, since $\alpha_k$ is solvable from the observation corresponding to time sample $t$ and can thus be used to solve for the unknown source values when $n \notin \mathcal{T}$. This model can be written as
$$\mu(n,\xi) = \sum_{k=1}^{K} a(\theta_k)\big(\alpha_k p^{(k)}(n)\delta_{n\in\mathcal{T}} + \alpha_k s^{(k)}(n)\delta_{n\notin\mathcal{T}}\big) \qquad (4.32)$$
where $\delta_{\text{statement}} = 1$ when the statement is true and $0$ when the statement is false. This model is equivalent to the model designed by Kozick and Sadler [39, equation (9)], except with $\alpha_k s^{(k)}(n)$ being simply $s^{(k)}(n)$, a distinction that still allows a match with the CCRB of a properly chosen transformation. The model is also equivalent to the model designed by Li and Compton [41, equation (2)] when $\mathcal{T} = \{1,2,\dots,N\}$. Equivalence of the reparameterized CRBs in [39, 41] with the CCRB can be found in [59]. As in the previous example, the CCRB on the unknown parameters is the inverse of the (unconstrained) FIM after the elimination of its rows and columns corresponding to the known or specified parameters.
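Both equivalences above reduce to one computation: given a FIM and a constraint Jacobian, form a null-space basis $U$ and invert the reduced FIM. A generic helper along these lines follows; this is my own sketch, with `scipy.linalg.null_space` producing an orthonormal $U$ and the regularity of $U^T I U$ (theorem 3.24) assumed.

```python
import numpy as np
from scipy.linalg import null_space

def ccrb(fim, F):
    """CCRB = U (U^T I U)^{-1} U^T for constraint Jacobian F (sketch).

    Assumes the constraints make U^T I U nonsingular; otherwise a
    pseudoinverse would be needed, as in example 4.21.
    """
    U = null_space(F)                      # orthonormal basis of null(F)
    return U @ np.linalg.inv(U.T @ fim @ U) @ U.T

# Example: pinning parameter 0 of a three-parameter model (toy PD FIM).
I3 = np.diag([0.0, 2.0, 3.0]) + 1.0
F = np.array([[1.0, 0.0, 0.0]])            # f(xi) = xi_0 - a
print(ccrb(I3, F))                         # zero row/column for the pinned xi_0
```

For the known-parameter constraints of sections 4.2.2.1 and 4.2.2.2, $F$ consists of unit rows, and the result agrees with inverting the FIM after deleting the corresponding rows and columns.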
4.2.2.3 Finite alphabet constraint: $s^{(k)}(n) \in \mathcal{S}$

This constraint derives from the assumption that the source elements exist in a (finite) discrete set $\mathcal{S}$. In communications models, this corresponds to digital modulation designs such as pulse amplitude modulation (PAM), quadrature amplitude modulation (QAM), phase-shift keying (PSK), etc. As such, the model can be the same as any of the previous models in (4.19), (4.31), or (4.32); the defining characteristic is that the source samples only exist on a discrete set. There do not exist any CRB-type bounds for this model in the literature, due to the problem of differentiability with respect to the parameters.⁸ It is certainly possible to constrain any single real-valued parameter to a discrete set by creating a polynomial (or sine function) whose zeros match the set values. But what information can be gained from a constraint formulation of this discrete-alphabet model? This is perhaps best answered with the following example.

Example 4.22. Reconsider the model in example 2.2, $y \sim \mathcal{CN}(\theta, \sigma^2)$. Suppose the parameter is required to satisfy the constraints
$$f_1(\theta) = \big(\operatorname{Re}(\theta)\big)^2 - \tfrac{1}{2} = 0, \qquad f_2(\theta) = \big(\operatorname{Im}(\theta)\big)^2 - \tfrac{1}{2} = 0.$$
This constraint restricts the real and imaginary parts of $\theta$ to reside in the discrete set $\big\{\tfrac{\sqrt2}{2}, -\tfrac{\sqrt2}{2}\big\}$, i.e.,
$$\theta \in \mathcal{S} = \left\{\tfrac{\sqrt2}{2} + i\tfrac{\sqrt2}{2},\ \tfrac{\sqrt2}{2} - i\tfrac{\sqrt2}{2},\ -\tfrac{\sqrt2}{2} + i\tfrac{\sqrt2}{2},\ -\tfrac{\sqrt2}{2} - i\tfrac{\sqrt2}{2}\right\}.$$
(In the communications vernacular, this is quadrature PSK, or 4-QAM.) The Jacobian for this constraint is
$$F(\theta) = \begin{bmatrix} 2\operatorname{Re}(\theta) & 0 \\ 0 & 2\operatorname{Im}(\theta) \end{bmatrix},$$
which is full column rank for the possible set of values of $\theta$. Hence, to satisfy (3.6), $U(\theta) = [\ ]$ is a $2\times 0$ null matrix. This is analogous to the determinate case of example 3.14; therefore, knowledge that a parameter exists in a discrete set is equivalent to complete knowledge of the parameter value in terms of mean-square error performance potential: the result is a Cramér-Rao bound of zero. This does not mean that the mean-square error will be zero (the decision of which set value the parameter actually takes can be wrong given sufficient power in the noise), but it does mean that the mean-square error bound is trivial and not particularly helpful. This result is not surprising considering the degrees of freedom of the parameter that are removed by such constraints: the restriction of a real-valued parameter to satisfying a single root equation eliminates its single degree of freedom.

⁸A Chapman-Robbins or Barankin-type bound, which does not require differentiability with respect to the parameters, would be possible, but even this approach does not seem to exist in the literature. Many communications engineers also discount the importance of a mean-square error bound for a digital signal and rely on alternative performance criteria, such as the bit-error rate (BER).
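Numerically, the degenerate bound in example 4.22 falls out of the same null-space computation: the null space of the full-rank square Jacobian is empty, so the bound collapses to the zero matrix. A tiny sketch of my own illustrating this:

```python
import numpy as np
from scipy.linalg import null_space

re, im = np.sqrt(0.5), np.sqrt(0.5)        # one of the four QPSK points
F = np.array([[2 * re, 0.0],
              [0.0, 2 * im]])              # Jacobian from example 4.22

U = null_space(F)
print(U.shape)                             # (2, 0): no free directions remain
# U^T I U is an empty 0x0 matrix, so U (U^T I U)^{-1} U^T collapses to the
# 2x2 zero matrix regardless of the FIM: the CCRB is trivially zero.
print(U @ np.zeros((0, 0)) @ U.T)
```

This is the "determinate" case: every degree of freedom is consumed by a constraint, so the bound carries no information about decision errors.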
4.2.2.4 Unit modulus constraints: $|s(n)| = 1$ for all $n$

Since knowledge of parameters from a discrete (finite) alphabet results in a trivial bound, a relaxation of the side information is necessary to obtain a relevant and useful measure of performance potential. One such relaxation is a constant (unit) modulus constraint on the source elements, as in section 4.1.4.4 for the convolutive mixture model. This approach remodels the mean as
$$\mu(n,\xi) = \sum_{k=1}^{K} a(\theta_k)\alpha_k e^{j\phi^{(k)}(n)}. \qquad (4.33)$$
This is the constraint considered in example 3.4, applied to the communications context. Therefore, imposing the constraints
$$f_{(n-1)K+k}(\xi) = \big|s^{(k)}(n)\big|^2 - 1 = 0, \qquad k = 1,\dots,K,\ n = 1,\dots,N,$$
is an alternative to rederiving the Fisher information for the model in (4.33). The Jacobian for this constraint has the form
$$F(\xi) = \Big[\operatorname*{blkdiag}_{n=1,\dots,N}\big[\,2\operatorname{Re}(S(n))\ \ 2\operatorname{Im}(S(n))\,\big]\ \ \ 0_{NK\times 3K}\Big], \qquad (4.34)$$
which has a null space generated by the columns of the matrix
$$U(\xi) = \begin{bmatrix} \operatorname*{blkdiag}_{n=1,\dots,N}\begin{bmatrix}-\operatorname{Im}(S(n))\\ \operatorname{Re}(S(n))\end{bmatrix} & 0_{2NK\times 3K} \\ 0_{3K\times NK} & I_{3K\times 3K}\end{bmatrix}. \qquad (4.35)$$
From this we can check the (local) identifiability of the model under the (unit) constant modulus constraint using theorem 3.24. The matrix $U^T(\xi)I(\xi)U(\xi)$ is
$$\begin{bmatrix}
\operatorname{Re}(S^*(1)MS(1)) & & & \operatorname{Im}(S^*(1)M_1) & \operatorname{Re}(S^*(1)M_1) & \operatorname{Im}(S^*(1)L_1) \\
& \ddots & & \vdots & \vdots & \vdots \\
& & \operatorname{Re}(S^*(N)MS(N)) & \operatorname{Im}(S^*(N)M_N) & \operatorname{Re}(S^*(N)M_N) & \operatorname{Im}(S^*(N)L_N) \\
-\operatorname{Im}(M_1^HS(1)) & \cdots & -\operatorname{Im}(M_N^HS(N)) & \operatorname{Re}(K_\alpha) & -\operatorname{Im}(K_\alpha) & \operatorname{Re}(L) \\
\operatorname{Re}(M_1^HS(1)) & \cdots & \operatorname{Re}(M_N^HS(N)) & \operatorname{Im}(K_\alpha) & \operatorname{Re}(K_\alpha) & \operatorname{Im}(L) \\
-\operatorname{Im}(L_1^HS(1)) & \cdots & -\operatorname{Im}(L_N^HS(N)) & \operatorname{Re}(L^H) & -\operatorname{Im}(L^H) & K_\theta
\end{bmatrix}.$$
Since
$$U^T(\xi)I(\xi)U(\xi)\begin{bmatrix} I_{K\times K} \\ \vdots \\ I_{K\times K} \\ -\operatorname{Im}(\Gamma) \\ \operatorname{Re}(\Gamma) \\ 0_{K\times K} \end{bmatrix} = 0,$$
$U^T(\xi)I(\xi)U(\xi)$ is singular, and by theorem 3.24 the (unit) constant modulus constraints are not sufficient for identifiability. This was also the case for the convolutive mixture model in section 4.1.4.4. Furthermore, reviewing the model in (4.33), this result is to be expected. The original identifiability issue in the calibrated model (4.19) is the multiplicative ambiguity between the sources and the channel gain. While the constant modulus constraint resolves any amplitude ambiguity of the channel gain, i.e., $|\alpha_k| = \big|\alpha_k s^{(k)}(n)\big|$ for any $n$, there still exists a phase-rotation ambiguity. It is clear, from both the model and theorem 3.23, that for each source $k$, knowledge of either the real or the imaginary part of either the channel gain or one source sample will be sufficient for identifiability and a regular $U^T(\xi)I(\xi)U(\xi)$. Of course, given this constant unit modulus constraint, knowledge of the real (imaginary) part of any source sample is equivalent to restricting the imaginary (real) part of that source sample to a finite discrete alphabet, which, as discussed in section 4.2.2.3, is the same as knowledge of the parameter in regard to the CCRB performance potential, but not in regard to the estimation. Hence, for estimation performance it is a necessity to constrain the real and/or imaginary part of the source sample.
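The algebra behind (4.34) and (4.35) is the commuting of diagonal matrices: $2\operatorname{Re}(S)(-\operatorname{Im}(S)) + 2\operatorname{Im}(S)\operatorname{Re}(S) = 0$. A short sketch of my own that assembles both matrices for random unit-modulus sources and confirms $F(\xi)U(\xi) = 0$ and the expected dimensions:

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(2)
K, N = 2, 5
s = np.exp(1j * 2 * np.pi * rng.random((N, K)))   # unit-modulus sources

blocks_F, blocks_U = [], []
for n in range(N):
    S = np.diag(s[n])
    blocks_F.append(np.hstack([2 * S.real, 2 * S.imag]))      # (4.34) blocks
    blocks_U.append(np.vstack([-S.imag, S.real]))             # (4.35) blocks

F = np.hstack([block_diag(*blocks_F), np.zeros((N * K, 3 * K))])
U = block_diag(block_diag(*blocks_U), np.eye(3 * K))

print(np.abs(F @ U).max())        # ~0: U spans null(F)
print(F.shape, U.shape)           # (NK, 2NK+3K) and (2NK+3K, NK+3K)
```

Because $|s^{(k)}(n)| = 1$, each column of the source blocks of $U$ has unit norm, so $U$ is indeed an orthonormal complement as the text asserts.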
4.2.2.5 Unit modulus constraint with real-valued channel gain: $\operatorname{Im}(\alpha_k) = 0$ for all $k$

This model is that given in (4.33), except with $\alpha_k$ real-valued. As seen in sections 4.2.2.2 and 4.2.2.3, the constraint is knowledge of the imaginary parts of $\boldsymbol\alpha$, and the effect of the corresponding $U(\xi)$ matrix is the elimination of the rows and columns corresponding to the $\operatorname{Im}(\alpha_k)$ parameters. This model is equivalent to that of Leshem and van der Veen [40]. Verification that the CCRB is equivalent can be found in [59]. The bound is essentially the inverse of the reduced-parameter-space Fisher information $U^T(\xi)I(\xi)U(\xi)$ from section 4.2.2.4, after the elimination of the rows and columns corresponding to the channel gain parameters.

4.2.2.6 Semiblind and unit modulus constraints

This model merges the models in (4.33) and (4.32). Without loss of generality, assume the source elements are known for the first $T$ time slots for each source. Then the constraint Jacobian is
$$F(\xi) = \begin{bmatrix} I_{2TK\times 2TK} & 0_{2TK\times 2(N-T)K} & 0_{2TK\times 3K} \\ 0_{(N-T)K\times 2TK} & \operatorname*{blkdiag}_{t=T+1,\dots,N}\big[\,2\operatorname{Re}(S(t))\ \ 2\operatorname{Im}(S(t))\,\big] & 0_{(N-T)K\times 3K} \end{bmatrix}, \qquad (4.36)$$
which has an orthonormal complement
$$U(\xi) = \begin{bmatrix} 0_{2TK\times(N-T)K} & 0_{2TK\times 3K} \\ \operatorname*{blkdiag}_{t=T+1,\dots,N}\begin{bmatrix}-\operatorname{Im}(S(t))\\ \operatorname{Re}(S(t))\end{bmatrix} & 0_{2(N-T)K\times 3K} \\ 0_{3K\times(N-T)K} & I_{3K\times 3K}\end{bmatrix}. \qquad (4.37)$$
The reduced FIM is then
$$U^T(\xi)I(\xi)U(\xi) = \begin{bmatrix}
G(T+1) & & & & C(T+1) & B(T+1) \\
& G(T+2) & & & C(T+2) & B(T+2) \\
& & \ddots & & \vdots & \vdots \\
& & & G(N) & C(N) & B(N) \\
C^T(T+1) & C^T(T+2) & \cdots & C^T(N) & \mathcal{K}_\alpha & \mathcal{L} \\
B^T(T+1) & B^T(T+2) & \cdots & B^T(N) & \mathcal{L}^T & K_\theta
\end{bmatrix} \qquad (4.38)$$
where $G(t) = \operatorname{Re}\big[S^*(t)MS(t)\big]$ and $M$ is defined as in (4.25); $C(t) = \big[\operatorname{Im}[S^*(t)M_t]\ \ \operatorname{Re}[S^*(t)M_t]\big]$ and $M_n$ is defined in (4.26); and $B(t) = \operatorname{Im}\big[S^*(t)L_t\big]$ and $L_t$ is defined in (4.27). To analytically invert this matrix, the Schur complement
$$\Delta = \begin{bmatrix}
\mathcal{K}_\alpha - \sum_{t=T+1}^{N} C^T(t)G^{-1}(t)C(t) & \mathcal{L} - \sum_{t=T+1}^{N} C^T(t)G^{-1}(t)B(t) \\
\mathcal{L}^T - \sum_{t=T+1}^{N} B^T(t)G^{-1}(t)C(t) & K_\theta - \sum_{t=T+1}^{N} B^T(t)G^{-1}(t)B(t)
\end{bmatrix} \qquad (4.39)$$
is useful. If $\Delta$ is partitioned into corresponding subblocks $\begin{bmatrix}\Delta_{\mathcal{K}_\alpha} & \Delta_{\mathcal{L}} \\ \Delta_{\mathcal{L}}^T & \Delta_{K_\theta}\end{bmatrix}$, then the CCRB subblocks for the unknown elements are given by
$$\mathrm{CCRB}(\boldsymbol\alpha) = \big[\Delta_{\mathcal{K}_\alpha} - \Delta_{\mathcal{L}}\Delta_{K_\theta}^{-1}\Delta_{\mathcal{L}}^T\big]^{-1}, \qquad \mathrm{CCRB}(\boldsymbol\theta) = \big[\Delta_{K_\theta} - \Delta_{\mathcal{L}}^T\Delta_{\mathcal{K}_\alpha}^{-1}\Delta_{\mathcal{L}}\big]^{-1},$$
$$\mathrm{CCRB}\left(\begin{bmatrix}\operatorname{Re}(s(t))\\ \operatorname{Im}(s(t))\end{bmatrix}\right) = \begin{bmatrix}-\operatorname{Im}(S(t))\\ \operatorname{Re}(S(t))\end{bmatrix}X(t)\big[-\operatorname{Im}(S(t))\ \ \operatorname{Re}(S(t))\big]$$
where
$$X(t) = G^{-1}(t) + G^{-1}(t)\big[C(t)\ \ B(t)\big]\Delta^{-1}\begin{bmatrix}C^T(t)\\ B^T(t)\end{bmatrix}G^{-1}(t).$$
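The arrow structure of (4.38) is what makes the bound computable for large $N$: only the $K\times K$ diagonal blocks $G(t)$ and the small corner system $\Delta$ ever need factoring. The following sketch of my own carries out this block inversion given the named blocks; `np.linalg.solve` stands in for the analytic inverses.

```python
import numpy as np

def ccrb_from_arrow(G, C, B, K_alpha, L, K_theta):
    """CCRB(alpha), CCRB(theta) from the arrow matrix (4.38) via (4.39).

    G[t]: K x K, C[t]: K x 2K, B[t]: K x K for t = T+1..N (as lists).
    Only the small blocks are inverted, never the full matrix (4.38).
    """
    twoK, K = C[0].shape[1], B[0].shape[1]
    corner = np.block([[K_alpha, L], [L.T, K_theta]])
    acc = np.zeros_like(corner)
    for Gt, Ct, Bt in zip(G, C, B):
        CB = np.hstack([Ct, Bt])                  # border block [C(t) B(t)]
        acc += CB.T @ np.linalg.solve(Gt, CB)     # C^T G^{-1} C terms, etc.
    D = corner - acc                              # Schur complement (4.39)
    D_a, D_l, D_t = D[:twoK, :twoK], D[:twoK, twoK:], D[twoK:, twoK:]
    ccrb_alpha = np.linalg.inv(D_a - D_l @ np.linalg.solve(D_t, D_l.T))
    ccrb_theta = np.linalg.inv(D_t - D_l.T @ np.linalg.solve(D_a, D_l))
    return ccrb_alpha, ccrb_theta
```

The per-sample source bounds then follow from $X(t)$ with one more solve against $\Delta$, reusing the already-factored $G(t)$.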
Example 4.23. A distinct advantage of the CCRB, as implemented in this treatise, is the ability to compare the performance potential of a number of different models seamlessly. Suppose we consider $M = 5$ omnidirectional sensors, with a beamwidth of approximately $23.6°$, in a uniform linear array receiving $K = 2$ source signals over $N = 100$ time samples. Then it is possible to compare various communications design scenarios, e.g.,

(a) the "blind" case⁹: $s_k(1)$ known for $k = 1,2$;
(b) the unit modulus case: $|s_k(t)|^2 = 1$ for $k = 1,2$ and $t = 1,\dots,100$;
(c) the semiblind case: $s_k(t)$ known for $k = 1,2$ and $t = 1,\dots,T = 20$;
(d) the unit modulus and semiblind case: $s_k(t)$ known for $k = 1,2$ and $t = 1,\dots,T = 20$, and $|s_k(t)|^2 = 1$ for $k = 1,2$ and $t = T+1,\dots,100$; and
(e) the known signal case: $s_k(t)$ known for $k = 1,2$ and $t = 1,\dots,100$.

⁹This is not a truly blind scenario, but it is often referred to as blind in the literature.

Figure 4.6: CCRBs on AOA for blind, constant modulus, and known signal models (RMS error in degrees on the source 2 AOA, at SNR₂ = 5 dB and 15 dB).

Figure 4.7: CCRBs on AOA for semiblind, constant modulus + semiblind, and known signal models (RMS error in degrees on the source 2 AOA, at SNR₂ = 5 dB and 15 dB).

Figure 4.6 displays a comparison among the blind, unit modulus constraint, and known signal model designs of the CCRBs on angle-of-arrival (AOA) estimation for the second source signal, over varying directions and different signal-to-noise ratios (SNRs), with the first source signal arriving at $0°$. Figure 4.7 displays a comparison of the CCRBs on AOA estimation among the semiblind constraint, unit modulus with semiblind constraint, and known signal model designs. And finally, figure 4.8 displays CCRBs on signal phase estimation for the blind, unit modulus constraint, semiblind constraint, and mixed unit modulus and semiblind constraint models.

Figure 4.8: CCRBs on signal phase for blind, constant modulus, semiblind, and constant modulus + semiblind models (mean CRB on the source 2 signal phase, RMS error in degrees, at SNR₂ = 5 dB with 20% training).

The known signal model is the best-case scenario for AOA estimation potential and is a useful guideline for the more desirable scenarios in which information (data or unknown parameters) is included in the transmission. Figures 4.6 and 4.7 demonstrate the characteristic loss of performance when the sources' AOAs differ by roughly the beamwidth. The value of the unit modulus constraint is evident when the sources' AOAs are closely spaced, as the CCRB performance potential then approximates that of the known signal model. In figures 4.7 and 4.8, the estimation potential actually improves for closely spaced sources under the semiblind and unit modulus constraint mixture.

4.3 Discussion

This chapter includes extensions of only a brief sampling of my prior research [39, 48, 49, 51, 59] as it relates to the practical application of the CCRB. In this chapter, the convolutive mixture model and the calibrated array model were treated as base models for which the Fisher information was derived a single time, and then a series of variations on the models, in the form of differentiable parametric constraints, was considered. This approach presents a simple procedure to compare and contrast a large class of constraints, essentially different models, in an efficient manner, to determine the value of a particular formulation in terms of the performance potential measured by the CCRB.

Appendix A
Appendices

A.1 A proof of the CCRB using the Chapman-Robbins version of the Barankin bound

Gorman and Hero developed a CCRB using the multiparameter version of the Hammersley-Chapman-Robbins bound (HCRB) [16, 26]. However, the result produced a variant form of the CCRB,
$$I^{-1}(\theta) - I^{-1}(\theta)F^T(\theta)\big(F(\theta)I^{-1}(\theta)F^T(\theta)\big)^{-1}F(\theta)I^{-1}(\theta),$$
which requires a nonsingular FIM. What follows is a shorter variation of their approach that does not assume a nonsingular FIM, starting with a brief description of the HCRB.

Rather than relying on the Fisher score, which is the derivative of the log-likelihood, i.e.,
$$s(x;\theta) = \frac{\partial}{\partial\theta}\log p(x;\theta) = \frac{1}{p(x;\theta)}\frac{\partial p(x;\theta)}{\partial\theta},$$
the regularity conditions requiring a differentiable likelihood can be relaxed by considering finite differences, i.e., for each $i = 1,\dots,m$,
$$\frac{1}{p(x;\theta)}\cdot\frac{p(x;\theta + \epsilon_i e_i) - p(x;\theta)}{\epsilon_i},$$
where the $e_i$ are canonical unit vectors. If the likelihood is differentiable, then the limit as each $\epsilon_i \to 0$ is the Fisher score. Of course, the finite differences need not be taken with respect to the canonical axes, and the number of finite differences need not equal the dimension of the parameters. This generalization of the CRB is known as the Hammersley-Chapman-Robbins (HCR) version of the Barankin bound.

Theorem A.1. If $t(x)$ is an unbiased estimator of $h(\theta) \in h(\mathbb{R}^m) \subset \mathbb{R}^t$, and $\Theta = \big[\theta^{(1)},\dots,\theta^{(p)}\big]$ is a matrix whose columns are test points in $\mathbb{R}^m$, all distinct from each other as well as from $\theta$, then the variance of $t(x)$ is bounded below by the inequality
$$\operatorname{Var}(t(x)) \succeq \sup_{\theta^{(1)},\dots,\theta^{(p)},\,p} \Delta(\theta,\Theta)\Phi^{-1}(\theta,\Theta)\Delta^T(\theta,\Theta),$$
where $\Delta(\theta,\Theta)$ is called a translation matrix, defined by
$$\Delta(\theta,\Theta) = \big[h(\theta^{(1)}) - h(\theta),\ \cdots,\ h(\theta^{(p)}) - h(\theta)\big],$$
and $\Phi(\theta,\Theta)$ is called an HCR information matrix, defined by
$$\Phi_{ij}(\theta,\Theta) = E_\theta\!\left[\frac{p(x;\theta^{(i)}) - p(x;\theta)}{p(x;\theta)}\cdot\frac{p(x;\theta^{(j)}) - p(x;\theta)}{p(x;\theta)}\right].$$
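For a scalar Gaussian example the HCR machinery is computable in closed form: with $x \sim N(\theta, 1)$, $h(\theta) = \theta$, and a single test point $\theta + \epsilon$, one gets $\Phi = e^{\epsilon^2} - 1$, so the bound is $\epsilon^2/(e^{\epsilon^2} - 1)$, which approaches the CRB of 1 as $\epsilon \to 0$. A quick sketch of my own confirming the $\Phi$ expectation by Monte Carlo (the sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
theta, eps, n = 0.4, 0.5, 10**6
x = theta + rng.standard_normal(n)            # samples from p(x; theta)

# Likelihood ratio p(x; theta+eps) / p(x; theta) for N(mean, 1):
L = np.exp(eps * (x - theta) - 0.5 * eps**2)
phi_mc = np.mean((L - 1.0) ** 2)              # Monte Carlo Phi
phi_exact = np.expm1(eps**2)                  # e^{eps^2} - 1

print(phi_mc, phi_exact)                      # agree to a few digits
print(eps**2 / phi_exact)                     # HCR bound ~0.88 < CRB = 1
```

Each finite test point gives a bound weaker than the CRB; the CRB, and under constraints the CCRB, emerges only in the limit of shrinking test points, which is the route taken next.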
This result encompasses the CRB when the test points satisfy $\theta^{(i)} = \theta + \epsilon_i e_i$ and $\epsilon_i \to 0$ for $i = 1,\dots,m = p$, i.e., as a properly chosen set of test points approaches the parameter. If the set of vectors $\theta^{(1)} - \theta,\dots,\theta^{(m)} - \theta$ spans an $m$-dimensional space, then the limit as the finite differences approach derivatives will still yield the CRB.

Since $f(\theta) = 0$, it makes sense to restrict the test points to also satisfy the constraints, $f(\theta^{(i)}) = 0$, and examine the limit of the HCRB as the finite differences approach derivatives. The Taylor series approximation of $f(\theta^{(i)})$ about $\theta$ is
$$f(\theta^{(i)}) = f(\theta) + F(\theta)\big(\theta^{(i)} - \theta\big) + o\big(\|\theta^{(i)} - \theta\|\big).$$
Since $f(\theta^{(i)}) = f(\theta) = 0$, the difference $\theta^{(i)} - \theta$ almost entirely resides in $\operatorname{null}(F(\theta))$. So, without loss of generality, we can take
$$\theta^{(i)} = \theta + \epsilon_i u_i(\theta) + o\big(\|\theta^{(i)} - \theta\|\big)$$
where $u_i(\theta)$ is the $i$th column of the matrix $U(\theta)$ satisfying (3.6). Then
$$\frac{p(x;\theta^{(i)}) - p(x;\theta)}{\epsilon_i\,p(x;\theta)} = \frac{p\big(x;\theta + \epsilon_i u_i(\theta) + o(\|\theta^{(i)} - \theta\|)\big) - p(x;\theta)}{\epsilon_i\,p(x;\theta)} \longrightarrow u_i^T(\theta)\frac{\partial p(x;\theta)}{\partial\theta}\frac{1}{p(x;\theta)} = u_i^T(\theta)s(x;\theta)$$
and
$$\frac{\theta^{(i)} - \theta}{\epsilon_i} = u_i(\theta) + \frac{1}{\epsilon_i}o\big(\|\theta^{(i)} - \theta\|\big) \longrightarrow u_i(\theta)$$
as $\epsilon_i \to 0$. This is true for any $i$, so if the test points are chosen such that each column of $U(\theta)$ is used, this yields the CCRB
$$U(\theta)\big(U^T(\theta)I(\theta)U(\theta)\big)^{-1}U^T(\theta)$$
as the limit of a constrained HCRB when the finite differences become derivatives.

A.2 A proof of the CCRB using the method of implicit differentiation

Suppose $\theta$ is restricted to the zeros of $f : \mathbb{R}^m \to \mathbb{R}^k$ with Jacobian $F(\theta) = \frac{\partial f(\theta)}{\partial\theta^T}$ having rank $k$ whenever $f(\theta) = 0$. The method of implicit differentiation assumes that parameters that would be eliminated under a reparameterization can be written in terms of the remaining parameters, since the conditions satisfy the implicit function theorem. The actual function generally remains unknown, but its derivative can be calculated by first taking partial derivatives of the constraint function and then using linear algebra to solve for the partials of the eliminated parameters in terms of the remaining parameters. This approach was also used by Marzetta [47, proof of theorem 1] to prove the regularity conditions given in (3.20).

The parameter vector may be separated as $\theta = \begin{bmatrix}\theta_1 \\ \theta_2\end{bmatrix}$, and the constraint $f$ may be rewritten as $f^\star : \mathbb{R}^{m-k}\times\mathbb{R}^k \to \mathbb{R}^k$ via the mapping $f^\star(\theta_1,\theta_2) = f\Big(\begin{bmatrix}\theta_1 \\ \theta_2\end{bmatrix}\Big)$. Then the Jacobian of $f$ can be represented as
$$F(\theta) = \big[f^\star_{\theta_1}(\theta_1,\theta_2)\ \ f^\star_{\theta_2}(\theta_1,\theta_2)\big]$$
where $f^\star_{\theta_i}(\theta_1,\theta_2) = \frac{\partial}{\partial\theta_i^T}f^\star(\theta_1,\theta_2)$ for each $i = 1,2$. Without loss of generality, assume $\theta_2 \in \mathbb{R}^k$ is a function of $\theta_1 \in \mathbb{R}^{m-k}$, i.e., $\theta_2 = \theta_2(\theta_1)$ is an implicit function. Therefore $f^\star$ is implicitly a function of $\theta_1$ alone, and
$$\frac{\partial}{\partial\theta_1^T}f^\star\big(\theta_1,\theta_2(\theta_1)\big) = f^\star_{\theta_1}\big(\theta_1,\theta_2(\theta_1)\big) + f^\star_{\theta_2}\big(\theta_1,\theta_2(\theta_1)\big)\frac{\partial\theta_2(\theta_1)}{\partial\theta_1^T}.$$
If this derivative is taken only where $f^\star(\theta_1,\theta_2) = 0$, then in matrix form,
$$\big[f_{\theta_1}(\theta_1,\theta_2)\ \ f_{\theta_2}(\theta_1,\theta_2)\big]\begin{bmatrix} I_{m-k,m-k} \\ \frac{\partial\theta_2(\theta_1)}{\partial\theta_1}\end{bmatrix} = 0.$$
The first matrix above is $F(\theta)$; the second matrix consists of $m-k$ linearly independent columns which lie in the null space of the row vectors of $F(\theta)$, i.e., the second matrix is merely some transformation of a matrix $U(\theta)$ defined as in (3.6).
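The construction of appendix A.2 is easy to realize numerically whenever a partition with invertible $f^\star_{\theta_2}$ is available: the columns of $\big[I;\ -(f^\star_{\theta_2})^{-1}f^\star_{\theta_1}\big]$ span $\operatorname{null}(F)$. A small sketch of my own with a random full-rank Jacobian:

```python
import numpy as np

rng = np.random.default_rng(4)
m, k = 5, 2
F = rng.standard_normal((k, m))           # rank-k constraint Jacobian
F1, F2 = F[:, :m - k], F[:, m - k:]       # partition [f*_{theta1}, f*_{theta2}]

# Implicit-function construction: d(theta2)/d(theta1) = -F2^{-1} F1, so the
# columns of [I; -F2^{-1} F1] span null(F), a transformation of U in (3.6).
U = np.vstack([np.eye(m - k), -np.linalg.solve(F2, F1)])

print(np.abs(F @ U).max())                # ~1e-16: F U = 0
print(np.linalg.matrix_rank(U))           # m - k = 3 independent columns
```

Orthonormalizing these columns (e.g., by QR) recovers a $U(\theta)$ satisfying (3.6) exactly.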
A.3 Alternative proof of asymptotic normality

Crowder [18] proved the following theorem.

Theorem A.2 (Crowder). If

1. there is a consistent solution $(\hat\theta_n, \hat\lambda_n)$ of the likelihood equations,
2. $s(x;\theta_0) \xrightarrow{d} N\big(0, I(\theta_0)\big)$,
3. $D^{-1}(\bar\theta_n)\Big(I(\theta_0) + \frac{\partial}{\partial\theta^T}s(x;\theta_0,\bar\theta_n)\Big) \xrightarrow{p} 0$,
4. $Q(\bar\theta_n) - Q(\theta_0) \xrightarrow{p} 0$, and
5. $\det Q(\bar\theta_n) \geq K$ for some $K > 0$,

then $\hat\theta_n$ is asymptotically normal.

Theorem (Theorem 3.33). If there is no $\eta > 0$ that satisfies (3.37), then $\hat\theta^{(k)}$ is a stationary point.

Proof. Let $\hat\theta \in \Theta_f$ and define $\hat\theta(\eta) = \pi\big[\hat\theta + \eta\,\mathrm{CCRB}(\hat\theta)s(x;\hat\theta)\big]$. By a property of the natural projection onto convex sets,
$$\big\|\hat\theta(\eta) - \hat\theta\big\|_{I(\hat\theta)} \leq \eta\,\big\|\mathrm{CCRB}(\hat\theta)s(x;\hat\theta)\big\|_{I(\hat\theta)}.$$
Hence it is sufficient to show that there exists an $\eta > 0$ such that
$$\log p\big(x;\hat\theta(\eta)\big) - \log p(x;\hat\theta) \geq \eta\alpha\, s^T(x;\hat\theta)\mathrm{CCRB}(\hat\theta)s(x;\hat\theta).$$
To show this by contradiction, assume not, and take the limit as $\eta \to 0$. Then
$$s^T(x;\hat\theta)\mathrm{CCRB}(\hat\theta)s(x;\hat\theta) \leq \alpha\,s^T(x;\hat\theta)\mathrm{CCRB}(\hat\theta)s(x;\hat\theta),$$
i.e., $0 \leq (\alpha - 1)\,s^T(x;\hat\theta)\mathrm{CCRB}(\hat\theta)s(x;\hat\theta)$. Since $\alpha < 1$, this inequality implies $\mathrm{CCRB}(\hat\theta)s(x;\hat\theta) = 0$ and $s(x;\hat\theta) \in \operatorname{span}\big(F^T(\hat\theta)\big)$, so $\hat\theta$ satisfies the stationarity condition (3.32).

Theorem (Theorem 3.34). The sequence $\{p(x;\hat\theta^{(k)})\}$ is a monotone increasing sequence. Furthermore, if $p(x;\theta)$ is bounded above, then $\{p(x;\hat\theta^{(k)})\}$ converges.

Proof. Since $\eta \geq 0$ and $\|\hat\theta^{(k+1)} - \hat\theta^{(k)}\|^2_{I(\hat\theta^{(k)})} \geq 0$, by the rule in (3.37) the value of the likelihood function can only increase after each iteration. The second statement is a consequence of the monotone convergence principle, i.e., a bounded monotone sequence converges [38, p. 44, theorem 2-6].

Theorem (Theorem 3.35). If the likelihood $p(x;\theta)$ is bounded above, then the sequence $\{\log p(x;\hat\theta^{(k+1)}) - \log p(x;\hat\theta^{(k)})\}$ vanishes.

Proof. Since $\{p(x;\hat\theta^{(k)})\}$ converges, $\frac{p(x;\hat\theta^{(k+1)})}{p(x;\hat\theta^{(k)})} \to 1$.¹ And since $\log(\cdot)$ is continuous, $\log\frac{p(x;\hat\theta^{(k+1)})}{p(x;\hat\theta^{(k)})} \to 0$.

Theorem (Theorem 3.36). If the likelihood $p(x;\theta)$ is bounded above, then the sequence $\{\|\hat\theta^{(k+1)} - \hat\theta^{(k)}\|_{I(\hat\theta^{(k)})}\}$ vanishes as $k \to \infty$.

Proof. Again by the rule in (3.37), the sequence $\{\|\hat\theta^{(k+1)} - \hat\theta^{(k)}\|^2_{I(\hat\theta^{(k)})}\}$ is bounded above by the product of a bounded sequence $\{\eta^{(k)}\}$ and a vanishing sequence $\{\log p(x;\hat\theta^{(k+1)}) - \log p(x;\hat\theta^{(k)})\}$, and clearly each element of the sequence is nonnegative. Hence, by the squeezing theorem,² the sequence vanishes as $k \to \infty$.

¹If $a_k \to a$, $b_k \to b$, and $b_k \neq 0$ for all $k$, then $a_k/b_k \to a/b$ [38, p. 41, theorem 2-4(d)].
²If $0 \leq a_k \leq b_k$ and $b_k \to 0$, then $a_k \to 0$ [38, p. 43, theorem 2-5(c)].
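The projection $\pi$ onto the constraint set $\Theta_f$ that these proofs manipulate abstractly can be realized, for differentiable $f$, by iterated Gauss-Newton restoration. The sketch below is my own illustration of one standard realization (it could serve as the `project` callable in the scoring sketch of section 4.1.4.4); it is not a construction from the dissertation and assumes $F$ has full row rank near the point being projected.

```python
import numpy as np

def project_onto_constraints(theta, f, F, tol=1e-12, max_iter=50):
    """Approximate pi[theta]: a nearby point with f(theta) = 0 (sketch).

    Gauss-Newton restoration: repeatedly remove the constraint violation
    along range(F^T), which converges locally for smooth f.
    """
    th = np.asarray(theta, dtype=float).copy()
    for _ in range(max_iter):
        val = f(th)
        if np.linalg.norm(val) < tol:
            break
        # Minimum-norm correction solving F(th) dth = -f(th):
        th += np.linalg.lstsq(F(th), -val, rcond=None)[0]
    return th

# Example: project onto the unit circle, f(x) = |x|^2 - 1 in R^2.
f = lambda x: np.array([x @ x - 1.0])
F = lambda x: (2 * x)[None, :]
print(project_onto_constraints(np.array([2.0, 1.0]), f, F))  # ~[0.894, 0.447]
```

Because each correction lies in $\operatorname{range}(F^T)$, normal to the constraint surface, the restored point behaves like the natural projection used in the convergence arguments, at least locally.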
Theorem (Theorem 3.37). If $\Theta_{\hat\theta^{(1)}}$ is compact and convex, then limit points of the sequence $\{\hat\theta^{(k)}\}$ are also stationary points.

Proof. Let $\theta^\star$ be a limit point of the sequence $\{\hat\theta^{(k)}\}$. Then, by virtue of Bolzano-Weierstrass [38, p. 54, theorem 2-14], there exists a convergent subsequence $\{\hat\theta^{(k_i)}\}$ that converges to $\theta^\star$. Additionally, since $\{\eta^{(k_i)}\}$ is a bounded sequence, it contains a convergent subsequence $\{\eta^{(k_{i_j})}\}$ with a limit point we shall denote $\eta^\star$. It is still true that $\hat\theta^{(k_{i_j})} \to \theta^\star$ [38, p. 49, theorem 2-10]. Then we can bound the norm-distance between $\pi\big[\theta^\star + \eta^\star\mathrm{CCRB}(\theta^\star)s(x;\theta^\star)\big]$ and $\theta^\star$ using the triangle inequality, e.g.,
$$\begin{aligned}
&\big\|\pi\big[\theta^\star + \eta^\star\mathrm{CCRB}(\theta^\star)s(x;\theta^\star)\big] - \theta^\star\big\|_{I(\theta^\star)} \\
&\quad\leq \big\|\pi\big[\theta^\star + \eta^\star\mathrm{CCRB}(\theta^\star)s(x;\theta^\star)\big] - \pi\big[\hat\theta^{(k_{i_j})} + \eta^{(k_{i_j})}\mathrm{CCRB}(\hat\theta^{(k_{i_j})})s(x;\hat\theta^{(k_{i_j})})\big]\big\|_{I(\theta^\star)} \\
&\qquad + \big\|\pi\big[\hat\theta^{(k_{i_j})} + \eta^{(k_{i_j})}\mathrm{CCRB}(\hat\theta^{(k_{i_j})})s(x;\hat\theta^{(k_{i_j})})\big] - \hat\theta^{(k_{i_j})}\big\|_{I(\theta^\star)} + \big\|\hat\theta^{(k_{i_j})} - \theta^\star\big\|_{I(\theta^\star)} \\
&\quad\leq \big\|\theta^\star + \eta^\star\mathrm{CCRB}(\theta^\star)s(x;\theta^\star) - \hat\theta^{(k_{i_j})} - \eta^{(k_{i_j})}\mathrm{CCRB}(\hat\theta^{(k_{i_j})})s(x;\hat\theta^{(k_{i_j})})\big\|_{I(\theta^\star)} \\
&\qquad + \big\|\hat\theta^{(k_{i_j}+1)} - \hat\theta^{(k_{i_j})}\big\|_{I(\theta^\star)} + \big\|\hat\theta^{(k_{i_j})} - \theta^\star\big\|_{I(\theta^\star)}.
\end{aligned}$$
The second inequality is a result of a property of projections onto convex sets (for the first term) and of the definition of our method of scoring with constraints (for the second term). Note that this second term will vanish by theorem 3.36 as $k_{i_j} \to \infty$. Also, the third term will vanish as $k_{i_j} \to \infty$, since $\theta^\star$ is the limit of the sequence $\{\hat\theta^{(k_{i_j})}\}$. The first term is bounded, using the triangle inequality again, as in
$$\begin{aligned}
&\big\|\theta^\star + \eta^\star\mathrm{CCRB}(\theta^\star)s(x;\theta^\star) - \hat\theta^{(k_{i_j})} - \eta^{(k_{i_j})}\mathrm{CCRB}(\hat\theta^{(k_{i_j})})s(x;\hat\theta^{(k_{i_j})})\big\|_{I(\theta^\star)} \\
&\quad\leq \big\|\hat\theta^{(k_{i_j})} - \theta^\star\big\|_{I(\theta^\star)} + \big\|\eta^\star\mathrm{CCRB}(\theta^\star)s(x;\theta^\star) - \eta^{(k_{i_j})}\mathrm{CCRB}(\hat\theta^{(k_{i_j})})s(x;\hat\theta^{(k_{i_j})})\big\|_{I(\theta^\star)}.
\end{aligned}$$
Again, this first term will vanish as $k_{i_j} \to \infty$. The last term, using the triangle inequality once more, satisfies
$$\begin{aligned}
&\big\|\eta^\star\mathrm{CCRB}(\theta^\star)s(x;\theta^\star) - \eta^{(k_{i_j})}\mathrm{CCRB}(\hat\theta^{(k_{i_j})})s(x;\hat\theta^{(k_{i_j})})\big\|_{I(\theta^\star)} \\
&\quad\leq \big\|\eta^\star\mathrm{CCRB}(\theta^\star)s(x;\theta^\star) - \eta^{(k_{i_j})}\mathrm{CCRB}(\theta^\star)s(x;\theta^\star)\big\|_{I(\theta^\star)} \\
&\qquad + \big\|\eta^{(k_{i_j})}\mathrm{CCRB}(\theta^\star)s(x;\theta^\star) - \eta^{(k_{i_j})}\mathrm{CCRB}(\hat\theta^{(k_{i_j})})s(x;\hat\theta^{(k_{i_j})})\big\|_{I(\theta^\star)} \\
&\quad\leq \big|\eta^{(k_{i_j})} - \eta^\star\big|\,\big\|\mathrm{CCRB}(\theta^\star)s(x;\theta^\star)\big\|_{I(\theta^\star)} \\
&\qquad + \big|\eta^{(k_{i_j})}\big|\,\big\|\mathrm{CCRB}(\hat\theta^{(k_{i_j})})s(x;\hat\theta^{(k_{i_j})}) - \mathrm{CCRB}(\theta^\star)s(x;\theta^\star)\big\|_{I(\theta^\star)}.
\end{aligned}$$
The second inequality used the distributive property of norms [19, p. 170, theorem 6.9.2]. By compactness, $\big\{\|\mathrm{CCRB}(\theta^\star)s(x;\theta^\star)\|_{I(\theta^\star)}\big\}$ is bounded, so the first term above will vanish as $k_{i_j} \to \infty$, since $\eta^\star$ is the limit of the sequence $\{\eta^{(k_{i_j})}\}$. Similarly, $\{|\eta^{(k_{i_j})}|\}$ is a bounded sequence, and so the second term above will vanish as $k_{i_j} \to \infty$, since $\mathrm{CCRB}(\hat\theta^{(k_{i_j})}) \to \mathrm{CCRB}(\theta^\star)$ and $s(x;\hat\theta^{(k_{i_j})}) \to s(x;\theta^\star)$ by continuity [38, p. 78, corollary 4-2]. Therefore,
$$\pi\big[\theta^\star + \eta^\star\mathrm{CCRB}(\theta^\star)s(x;\theta^\star)\big] = \theta^\star$$
and one of the following holds: (a) $\eta^\star = 0$, (b) $\mathrm{CCRB}(\theta^\star)s(x;\theta^\star) = 0$, or (c) the step projection $\theta^\star + \eta^\star\mathrm{CCRB}(\theta^\star)s(x;\theta^\star)$ is perpendicular to $\Theta_f$ at $\theta^\star$. This last case (c) is impossible, since the step is directed by linear combinations of the vectors of $U(\theta^\star)$, which are tangent to the constraint space at $\theta^\star$. The first case (a) implies stationarity by applying continuity to the step size rule condition (3.37) and then theorem 3.33. And (b) implies $s(x;\theta^\star)$ is some linear combination of the columns of $F^T(\theta^\star)$, i.e., $\theta^\star$ satisfies (3.32). Therefore, $\theta^\star$ is a stationary point.
Theorem (Theorem 3.38). If $\Theta_{\hat\theta^{(1)}}$ is compact for all sequences in a closed set of $\Theta_f$, and if there is a unique limit point $\theta^\star$ for all such sequences, then $\lim_{k\to\infty}\hat\theta^{(k)} = \theta^\star$ for every sequence $\{\hat\theta^{(k)}\}$. Also, $\theta^\star$ is the maximum of $p(x;\theta)$.

Proof. Let $\hat\theta^{(2)}$ be any point in the compact set $\Theta_{\hat\theta^{(1)}}$. Since $\{\hat\theta^{(k)}\}$ resides in a compact set, it has a limit point (Bolzano-Weierstrass [38, p. 52, theorem 2-12]), which must be unique, and therefore $\lim_{k\to\infty}\hat\theta^{(k)} = \theta^\star$. By theorem 3.34, $p(x;\hat\theta^{(k+d)}) \geq p(x;\hat\theta^{(k)})$, and by continuity $p(x;\hat\theta^{(k)}) \to p(x;\theta^\star)$. Hence $p(x;\theta^\star) \geq p(x;\theta')$ for every $\theta' \in \Theta_{\hat\theta^{(1)}}$.

Appendix C
Proofs of Theorems in Chapter 4

Theorem (Theorem 4.7). The CFIM is singular, and the dimension of its null space is lower bounded as
$$\mathrm{nullity}(I(\theta)) \geq \sum_{i=1}^{K}\sum_{j=1}^{K}(L_i - L_j + 1)^+, \qquad (C.1)$$
where $(a)^+ = a$ for $a \geq 0$ and $(a)^+ = 0$ for $a < 0$. This limit quantity is the nullity lower bound (NLB).

Proof. This proof is by construction. We shall develop a null subspace of the submatrix $[Q_i\ Q_j]$ of $Q$. In particular, consider the submatrix consisting of the $i$th source elements and the $j$th channel elements corresponding to the $m$th channel, i.e., $\big[S^{(i)}\ \ H^{(j)}(m)\big]$. There exist three cases to consider: (1) $L_i = L_j$, (2) $L_i > L_j$, and (3) $L_i < L_j$; in each case the construction contributes $(L_i - L_j + 1)^+$ vectors toward the null space. In case (2), $H_m^{(j)(i)}$ in (C.3) has rank $L_i - L_j + 1$ unless $h_m^{(j)} = 0$, which is impossible by definition. Likewise, $S^{(i)(j)}$ is full column rank unless $s^{(i)}$ has fewer than $L_i - L_j + 1$ modes or $N < L_i - L_j + 1$.
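The NLB itself is a purely combinatorial function of the channel orders, so it is trivial to tabulate. A small helper of my own for (C.1):

```python
def nullity_lower_bound(L):
    """NLB of (C.1): sum over source pairs of (L_i - L_j + 1)^+ ."""
    return sum(max(Li - Lj + 1, 0) for Li in L for Lj in L)

# Channel orders L_k per source: equal orders give K^2, mixed orders differ.
print(nullity_lower_bound([0, 0]))   # 4  (K = 2 instantaneous mixing)
print(nullity_lower_bound([2, 1]))   # 1 + 2 + 0 + 1 = 4
```

Under the regularity conditions of theorem 4.14 this bound is attained with equality, which is what makes the row-removal recipe of theorem 4.18 well defined.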
Proof (of the NLB necessity conditions, theorem 4.13). If any of these conditions fail, then the null space of $I(\theta)$ is greater than the NLB.

(a) If $H(z)$ is reducible or not column-reduced, then by theorem 4.2 $H_M$ is not full column rank. Let $v \in \operatorname{null}(H_M)$ and partition $v$ as $v = \big[v^{(1)T}, v^{(2)T},\dots, v^{(K)T}\big]^T$, where $v^{(k)}$ is an $N + L_k$ length vector. (It is assumed that $v$ is independent of $s$; otherwise $E_\theta y = 0$ and the model itself is not identifiable.) Then
$$\bar v = \begin{bmatrix} 0^{(1)} \\ v^{(1)} \\ \vdots \\ 0^{(K)} \\ v^{(K)} \end{bmatrix} \in \operatorname{null}(I(\theta)),$$
where $0^{(k)}$ is an $M(L_k+1)$ length zero vector. Assume $\bar v$ is a linear combination of the columns of $N$ in (4.14). Then for some $k$ the columns of $\big[H^{(1)(k)}\ H^{(2)(k)}\ \cdots\ H^{(K)(k)}\big]$ must be linearly dependent. (Without this assumption, $\bar v \notin \operatorname{span}(N)$; this corresponds to the $0^{(k)}$ subvector in $\bar v$.) Let
$$u(k) = \begin{bmatrix} u^{(1)}(k) \\ u^{(2)}(k) \\ \vdots \\ u^{(K)}(k) \end{bmatrix} \in \operatorname{null}\Big(\big[H^{(1)(k)}\ H^{(2)(k)}\ \cdots\ H^{(K)(k)}\big]\Big)$$
with $u^{(i)}(k)$ a $(L_k - L_i + 1)^+$ length vector. Then
$$U(k) = \begin{bmatrix} U^{(1)}(k) \\ U^{(2)}(k) \\ \vdots \\ U^{(K)}(k) \end{bmatrix} \in \operatorname{null}(H_M), \qquad U^{(i)}(k) = \begin{bmatrix} u^{(i)T}(k) & 0 & \cdots & 0 \\ 0 & u^{(i)T}(k) & & 0 \\ & & \ddots & \\ 0 & \cdots & & u^{(i)T}(k) \end{bmatrix}_{(N+L_i)\times(N+L_k)}.$$
(If $L_i > L_k$, then $U^{(i)}(k)$ is a null matrix. Also, $U^{(k)}(k) = u^{(k)}I_{(N+L_k)\times(N+L_k)}$.) The matrix $U(k)$ can be arranged to create $N + L_k$ linearly independent columns in the null space of $I(\theta)$, where the submatrices $U^{(i)}(k)$ correspond to the rows of $N$ containing $\big[-S^{(1)(i)}\ -S^{(2)(i)}\ \cdots\ -S^{(K)(i)}\big]$, which has rank at most $\sum_{j=1}^{K}(L_j - L_i + 1)^+ \leq \sum_{j=1}^{K}(L_j + 1)$. For the columns of $N$ to be a basis of the null space of $I(\theta)$, it is then needed that
$$\sum_{j=1}^{K}(L_j + 1) \geq \sum_{j=1}^{K}(L_j - L_k + 1)^+ \geq N + L_k \geq N + L_i,$$
which implies at most $N = K + \sum_{j=1}^{K}L_j$, contradicting theorem 4.12, unless all the channel orders are zero. In this latter case, note that $N\big(I_{K\times K}\otimes u(k)\big)$ will not be full rank, and hence neither can $N$ be.

(b) If $p_{\text{total}} < K + \sum_{k=1}^{K}L_k$, then there exists $v \in \operatorname{null}(S)$, where $S = \big[S^{(1)}\ S^{(2)}\ \cdots\ S^{(K)}\big]$. If $v$ is partitioned as $v = \big[v^{(1)T},\dots,v^{(K)T}\big]^T$, where $v^{(k)}$ is an $L_k + 1$ length vector, then
$$\bar v = \begin{bmatrix} 1_{M\times 1}\otimes v^{(1)} \\ 0^{(1)} \\ \vdots \\ 1_{M\times 1}\otimes v^{(K)} \\ 0^{(K)} \end{bmatrix} \in \operatorname{null}(I(\theta))$$
with $0^{(j)}$ an $N + L_j$ length zero vector. If $\bar v \in \operatorname{span}(N)$, then for some $k$ we have $\operatorname{rank}\big(\big[S^{(1)(k)}\ \cdots\ S^{(K)(k)}\big]\big) < \sum_{j=1}^{K}(L_j - L_k + 1)^+$. Then, by a construction similar to part (a), $\mathrm{nullity}(S) \geq L_k + 1$, and this contribution to the null space of $I(\theta)$ has rank at least $M(L_k + 1)$, with submatrices that coincide with $\big[H^{(1)(k)}\ \cdots\ H^{(K)(k)}\big]$, which has rank at most $\sum_{j=1}^{K}(L_k - L_j + 1)^+ \leq \sum_{j=1}^{K}(L_k + 1) = K(L_k+1) < M(L_k+1)$, since $M > K$. Therefore, $\mathrm{nullity}(I(\theta)) > \sum_{i=1}^{K}\sum_{j=1}^{K}(L_i - L_j + 1)^+$.

(c) If $p_k < L_k + 1$, then by [31, lemma 1], $S^{(k)}$, and consequently $S$, has a null space, and the argument in (b) applies. So assume $p_k = L_k + 1$. If $N < L_k + 1$ (and $L_k \neq 0$), then $S^{(k)}$ has a null space [76, lemma 1], so assume $N \geq L_k + 1$. From [31, the proof of theorem 1], it is possible to construct a $v$ independent from $s^{(k)}$ such that $\operatorname{span}(V) = \operatorname{span}(S^{(k)})$, for $V$ defined similarly to $S^{(k)}$. So for any $h^{(k)}$ there exists an $\bar h$ such that $H_M^{(k)}v = (I_{M\times M}\otimes V)h^{(k)} = (I_{M\times M}\otimes S^{(k)})\bar h$. Therefore both $\begin{bmatrix}-v \\ \bar h\end{bmatrix}$ and $\begin{bmatrix}-s^{(k)} \\ h^{(k)}\end{bmatrix}$ reside in $\operatorname{null}\big(\big[H_M^{(k)}\ \ (I_{M\times M}\otimes S^{(k)})\big]\big)$, which increases the nullity lower bound by at least one.

(d) This is a looser bound, and hence is required by theorem 4.12.

Theorem (CFIM NLB sufficiency conditions, theorem 4.14). The $M$-channel, $K$-source FIR system FIM has a nullity of exactly the NLB in (4.13) if
(a) $H(z)$ is irreducible and column-reduced,
(b) $p_{\text{total}} \geq K + (K+1)\sum_{j=1}^{K}L_j$,
(c) $p_k \geq L_k + 1 + \sum_{j=1}^{K}L_j$ for $k = 1,\dots,K$,
(d) $N \geq K + (K+2)\sum_{j=1}^{K}L_j$, and
(e) $M > K$.

This result is conceptually easier to prove from yet another alternative matrix model than the ones in (4.3) or (4.6), using $S(n)$ as defined in (4.9). Define the $(L_k + 1 + n)\times(n+1)$ matrix
$$H^{(k)(m)}(n) = \begin{bmatrix} h_m^{(k)}(0) & & \\ \vdots & \ddots & \\ h_m^{(k)}(L_k) & & h_m^{(k)}(0) \\ & \ddots & \vdots \\ & & h_m^{(k)}(L_k) \end{bmatrix},$$
which is the impulse response matrix for the $m$th subchannel of the $k$th source. Then define $H^{(k)}(n) = \big[H^{(k)(1)}(n),\dots,H^{(k)(M)}(n)\big]$ and the $\big(K(n+1) + \sum_{k=1}^{K}L_k\big)\times M(n+1)$ matrix
$$H(n) = \begin{bmatrix} H^{(1)}(n) \\ \vdots \\ H^{(K)}(n) \end{bmatrix}.$$
The observations for the $m$th channel (receiver) can be collected into the $(N-n)\times(n+1)$ matrix
$$Y^{(m)}(n) = \begin{bmatrix} y_m(n) & \cdots & y_m(0) \\ \vdots & & \vdots \\ y_m(N-1) & \cdots & y_m(N-1-n) \end{bmatrix}$$
with $Y(n) = \big[Y^{(1)}(n),\dots,Y^{(M)}(n)\big]$. The noise matrix $W(n)$ is defined similarly. Then the alternative model is
$$Y(n) = S(n)H(n) + W(n). \qquad (C.5)$$
Before this theorem is proven, a lemma will be needed. This lemma is a generalization of a result in [52, theorem 3]. The proof was originally shown in [49, Appendix].
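The data matrices in (C.5) are plain windowed (Hankel/Toeplitz) arrangements, so they are mechanical to build. The sketch below is my own single-subchannel illustration, not code from the dissertation: it forms $Y^{(m)}(n)$ from a noiseless FIR convolution and verifies the factorization against a banded impulse-response matrix with entries $h(i-j)$, the same shape as $H^{(k)(m)}(n)$.

```python
import numpy as np

def data_matrix(y, n):
    """Y^(m)(n) of (C.5): rows [y(t), y(t-1), ..., y(t-n)], t = n..N-1."""
    return np.array([[y[t - j] for j in range(n + 1)]
                     for t in range(n, len(y))])

rng = np.random.default_rng(5)
L, N, n = 2, 12, 3
h = rng.standard_normal(L + 1)
s_full = rng.standard_normal(N + L)           # s(-L), ..., s(N-1)
y = np.array([h @ s_full[t + L - np.arange(L + 1)] for t in range(N)])

# Source-window matrix (rows [s(t), ..., s(t-n-L)]) and the banded
# impulse-response matrix H[i, j] = h(i - j), as in H^(k)(m)(n):
S = np.array([[s_full[t + L - i] for i in range(n + L + 1)]
              for t in range(n, N)])
H = np.array([[h[i - j] if 0 <= i - j <= L else 0.0
               for j in range(n + 1)] for i in range(n + L + 1)])

print(np.allclose(data_matrix(y, n), S @ H))   # True: Y(n) = S(n) H(n)
```

Stacking such blocks across the $M$ subchannels and $K$ sources reproduces the full model (C.5) up to the noise term.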
Lemma C.1. Assume $H(n)$ is full row rank and $h'^{(k)}$ is any nontrivial $M(L_k+1)$ length vector, and define $H'^{(k)}(\bar n)$ to be the $(L_k + 1 + \bar n)\times M(1 + \bar n)$ matrix composed from $h'^{(k)}$ as in (4.4) and (4.5). Then the following two statements are equivalent:

(i) $\operatorname{corange}\{H'^{(k)}(\bar n)\} \subset \operatorname{corange}\{H^{(1)}(\bar n),\dots,H^{(K)}(\bar n)\} = \operatorname{corange}\{H(\bar n)\}$;
(ii) $h'^{(k)} \in \operatorname{range}\{H^{(1)(k)},\dots,H^{(K)(k)}\}$.

We shall prove a more general version of the lemma, which is essentially an extension of the identifiability theorem of [52, theorem 3] from the SIMO to the MIMO scenario. First, define $h^{(j)}(l) = \big[h^{(j)(1)}(l),\dots,h^{(j)(M)}(l)\big]^T$ and the $Mn\times(n+L_j)$ matrix
$$\mathcal{H}^{(j)}(n) = \begin{bmatrix} h^{(j)}(L_j) & h^{(j)}(L_j-1) & \cdots & h^{(j)}(0) & & \\ & h^{(j)}(L_j) & h^{(j)}(L_j-1) & \cdots & h^{(j)}(0) & \\ & & \ddots & & & \ddots \\ & & & h^{(j)}(L_j) & \cdots & h^{(j)}(0) \end{bmatrix}$$
so that $\mathcal{H}(N) = \big[\mathcal{H}^{(1)}(N),\dots,\mathcal{H}^{(K)}(N)\big]$ is similar to $H_M$ in the original model (4.3), except for the order of the rows. Let $L$ satisfy $\min_j\{L_j\} \leq L \leq \max_j\{L_j\}$. Also define $H_L^{(j)}$ as the $M(L+1)\times(L-L_j+1)^+$ matrix
$$H_L^{(j)} = \begin{bmatrix} h^{(j)} & 0_{M\times 1} & \cdots & 0_{M(L-L_j-1)\times 1} \\ 0_{M\times 1} & h^{(j)} & & \vdots \\ \vdots & & \ddots & 0_{M\times 1} \\ 0_{M(L-L_j-1)\times 1} & \cdots & 0_{M\times 1} & h^{(j)} \end{bmatrix}$$
where $h^{(j)} = \big[h^{(j)}(0)^T,\dots,h^{(j)}(L_j)^T\big]^T$, an $M(L_j+1)\times 1$ matrix. (The matrix is not to scale, i.e., $0_{M\times 1}$ does not align with $h^{(j)}$.) Note that $H_L^{(j)}$ is null for $L < L_j$. Assume $N > \sum_{j=1}^{K}L_j$ and that $\mathcal{H}(N-1)$ is full column rank. Let $h'$ be any $M(L+1)\times 1$ nonzero complex vector, and define $\mathcal{H}'(N)$ to be the $MN\times(N+L)$ channel matrix composed from $h'$. We first assume (1), and show (2). Note that
$$\mathcal{H}^{(j)}(N) = \begin{bmatrix} h^{(j)}(0) & p^{(j)}(N-1) \\ 0_{M(N-1)\times 1} & \mathcal{H}^{(j)}(N-1) \end{bmatrix} = \begin{bmatrix} \mathcal{H}^{(j)}(N-1) & 0_{M(N-1)\times 1} \\ q^{(j)}(N-1) & h^{(j)}(L_j) \end{bmatrix}$$
where $p^{(j)}(N-1) = \big[h^{(j)}(1),\dots,h^{(j)}(L_j),0_{M\times N-1}\big]$ and $q^{(j)}(N-1) = \big[0_{M\times N-1},h^{(j)}(0),\dots,h^{(j)}(L_j-1)\big]$. Note that statement (1) implies the first column of $\mathcal{H}'(N)$ satisfies
$$\begin{bmatrix} h'(0) \\ 0_{M(N-1)\times 1} \end{bmatrix} = \sum_{j=1}^{K}\begin{bmatrix} h^{(j)}(0) & p^{(j)}(N-1) \\ 0_{M(N-1)\times 1} & \mathcal{H}^{(j)}(N-1) \end{bmatrix}\begin{bmatrix} \lambda_0^{(j)} \\ a_0^{(j)} \end{bmatrix}$$
where $\lambda_0^{(j)}$ is a constant and $a_0^{(j)}$ is an $(N-1+L_j)\times 1$ vector. Hence, we have the following two linear systems:

(A1) $h'(0) = \sum_{j=1}^{K}\lambda_0^{(j)}h^{(j)}(0) + p^{(j)}(N-1)a_0^{(j)}$;
(A2) $0_{M(N-1)\times 1} = \sum_{j=1}^{K}\mathcal{H}^{(j)}(N-1)a_0^{(j)}$.

But $\mathcal{H}(N-1)$ is full column rank; thus $a_0^{(j)} = 0_{(N+L_j-1)\times 1}$ for all $j$. Therefore,
$$h'(0) = \sum_{j=1}^{K}\lambda_0^{(j)}h^{(j)}(0). \qquad (C.6)$$
Likewise, the next column of $\mathcal{H}'(N)$ is given by
$$\begin{bmatrix} h'(1) \\ h'(0) \\ 0_{M(N-2)\times 1} \end{bmatrix} = \sum_{j=1}^{K}\begin{bmatrix} h^{(j)}(0) & p^{(j)}(N-1) \\ 0_{M(N-1)\times 1} & \mathcal{H}^{(j)}(N-1) \end{bmatrix}\begin{bmatrix} \lambda_1^{(j)} \\ a_1^{(j)} \end{bmatrix}$$
where, again, $\lambda_1^{(j)}$ is a constant and $a_1^{(j)}$ is an $(N+L_j-1)\times 1$ vector. Now we have the following two systems:

(B1) $h'(1) = \sum_{j=1}^{K}\lambda_1^{(j)}h^{(j)}(0) + p^{(j)}(N-1)a_1^{(j)}$;
(B2) $\begin{bmatrix} h'(0) \\ 0_{M(N-2)\times 1}\end{bmatrix} = \sum_{j=1}^{K}\mathcal{H}^{(j)}(N-1)a_1^{(j)}$.

Note that (C.6) implies
$$\begin{bmatrix} h'(0) \\ 0_{M(N-2)\times 1}\end{bmatrix} = \sum_{j=1}^{K}\lambda_0^{(j)}\begin{bmatrix} h^{(j)}(0) \\ 0_{M(N-2)\times 1}\end{bmatrix},$$
i.e., it is a linear combination of column vectors of $\mathcal{H}(N-1)$. Since $\mathcal{H}(N-1)$ is full column rank, (B2) implies that
$$a_1^{(j)} = \begin{bmatrix} \lambda_0^{(j)} \\ 0_{(N+L_j-2)\times 1}\end{bmatrix}.$$
Thus, evaluating (B1), we have
$$h'(1) = \sum_{j=1}^{K}\lambda_1^{(j)}h^{(j)}(0) + \lambda_0^{(j)}h^{(j)}(1). \qquad (C.7)$$
Continuing in this fashion, we arrive at the general expression
$$h'(l) = \sum_{j=1}^{K}\sum_{i=0}^{l}\lambda_{l-i}^{(j)}h^{(j)}(i) \qquad (C.8)$$
for $0 \leq l \leq L$.
For convenience, we define $h^{(j)}(i)$ to be null if $i < 0$ or $i > L_j$. If $L > L_j$, then $\lambda_l^{(j)} = \lambda_{L_j+l}^{(j)}$ for $0 \leq l \leq L - L_j$, and $\lambda_l^{(j)}, \lambda_{L-l}^{(j)} = 0$ for $L - L_j + 1 \leq l \leq L$. Define an $(L - L_j + 1)^+\times 1$ vector $\lambda^{(j)} = \big[\lambda_0^{(j)},\dots,\lambda_{L-L_j}^{(j)}\big]^T$ for each $j$. If $L$