ABSTRACT

Title of dissertation: ALGORITHMS FOR SOLVING LINEAR AND POLYNOMIAL SYSTEMS OF EQUATIONS OVER FINITE FIELDS WITH APPLICATIONS TO CRYPTANALYSIS

Gregory Bard, Doctor of Philosophy, 2007

Dissertation directed by: Professor Lawrence C. Washington, Department of Mathematics

This dissertation contains algorithms for solving linear and polynomial systems of equations over GF(2). The objective is to provide fast and exact tools for algebraic cryptanalysis and other applications. Accordingly, it is divided into two parts.

The first part deals with polynomial systems. Chapter 2 contains a successful cryptanalysis of Keeloq, the block cipher used in nearly all luxury automobiles. The attack is more than 16,000 times faster than brute force, but queries 0.6 · 2^32 plaintexts. The polynomial systems of equations arising from that cryptanalysis were solved via SAT-solvers. Therefore, Chapter 3 introduces a new method of solving polynomial systems of equations by converting them into CNF-SAT problems and using a SAT-solver. Finally, Chapter 4 contains a discussion of how SAT-solvers work internally.

The second part deals with linear systems over GF(2), and other small fields (and rings). These occur in cryptanalysis when using the XL algorithm, which converts polynomial systems into larger linear systems. We introduce a new complexity model and data structures for GF(2)-matrix operations. This is discussed in Appendix B but applies to all of Part II. Chapter 5 contains an analysis of "the Method of Four Russians" for multiplication and a variant for matrix inversion, which is a factor of log n faster than Gaussian Elimination, and can be combined with Strassen-like algorithms. Chapter 6 contains an algorithm for accelerating matrix multiplication over small finite fields. It is feasible, but the memory cost is so high that it is mostly of theoretical interest. Appendix A contains some discussion of GF(2)-linear algebra and how it differs from linear algebra in R and C.
Appendix C discusses algorithms faster than Strassen's algorithm, and contains proofs that matrix multiplication, matrix squaring, triangular matrix inversion, LUP-factorization, general matrix inversion, and the taking of determinants are equicomplex. These proofs are already known, but are here gathered into one place in the same notation.

ALGORITHMS FOR SOLVING LINEAR AND POLYNOMIAL SYSTEMS OF EQUATIONS OVER FINITE FIELDS, WITH APPLICATIONS TO CRYPTANALYSIS

by Gregory V. Bard

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2007

Advisory Committee: Professor Lawrence C. Washington, Chair/Advisor; Professor William Adams; Professor Steven Tretter; Professor William Gasarch; Assistant Professor Jonathan Katz.

© Copyright by Gregory V. Bard 2007

Preface

Pigmaei gigantum humeris impositi plusquam ipsi gigantes vident.¹ (Attributed to Bernard of Chartres, ????)

One of the many reasons the subject of Mathematics is so beautiful is the continuing process of building one theorem, algorithm, or conjecture upon another. This can be compared to the construction of a cathedral, where each stone gets laid upon those that came before it with great care. As each mason lays his stone he can only be sure to put it in its proper place, and see that it rests plumb, level, and square with its neighbors. From that vantage point, it is impossible to tell what role it will play in the final edifice, or even if it will be visible. Could George Boole have imagined the digital computer?

Another interesting point is that the cathedrals of Europe almost universally took longer than one lifetime to build. Therefore those that laid the foundations had no chance at all of seeing the completed work. This is true in mathematics, also.
Fermat's Last Theorem, the Kepler Conjecture, the Insolubility of the Quintic, the Doubling of the Cube, and other well-known problems were only solved several centuries after they were proposed. And thus scholarly publication is a great bridge, which provides communication of ideas (at least in one direction) across the abyss of death.

One example is to contemplate the conic sections of Apollonius of Perga (circa ??? bc). Can one imagine how few of the ordinary or extraordinary persons of Western Europe in perhaps the seventh century ad would have known of them? Yet, ?? centuries after their introduction, they would be found, by Kepler, to describe the motions of astronomical bodies. In the late twentieth century, conic sections were studied, at least to some degree, by all high school graduates in the United States of America, and certainly in other countries as well.

Such is the nature of our business. Some papers might be read by only ten persons in a century. All we can do is continue to work, and hope that the knowledge we create is used for good and not for ill.

¹ Dwarves, standing on the shoulders of giants, can see further than giants.

An old man, going a lone highway,
Came at the evening, cold and gray,
To a chasm, vast and deep and wide,
Through which was flowing a sullen tide.
The old man crossed in the twilight dim;
The sullen stream had no fears for him;
But he turned when safe on the other side
And built a bridge to span the tide.

"Old man," said a fellow pilgrim near,
"You are wasting strength with building here;
Your journey will end with the ending day;
You never again must pass this way;
You have crossed the chasm, deep and wide;
Why build you the bridge at the eventide?"

The builder lifted his old gray head:
"Good friend, in the path I have come," he said,
"There followeth after me today
A youth whose feet must pass this way.
This chasm that has been naught to me
To that fair-haired youth may a pitfall be.
He, too, must cross in the twilight dim;
Good friend, I build this bridge for him."

by Will Allen Dromgoole

Foreword

The ignoraunte multitude doeth, but as it was euer wonte, enuie that knoweledge, whiche thei can not attaine, and wishe all men ignoraunt, like unto themself... Yea, the pointe in Geometrie, and the unitie in Arithmetike, though bothe be undiuisible, doe make greater woorkes, & increase greater multitudes, then the brutishe bande of ignoraunce is hable to withstande... But yet one commoditie moare... I can not omitte. That is the filyng, sharpenyng, and quickenyng of the witte, that by practice of Arithmetike doeth insue. It teacheth menne and accustometh them, so certainly to remember thynges paste: So circumspectly to consider thynges presente: And so prouidently to forsee thynges that followe: that it maie truelie bee called the File of witte. (Robert Recorde, ????, quoted from [vzGG03, Ch. 17])

Dedication

With pleasure, I dedicate this dissertation to my parents. To my father, who taught me much in mathematics, most especially calculus, years earlier than I would have been permitted to see it. And to my mother, who has encouraged me in every endeavor.

Acknowledgments

I am deeply indebted to my advisor, Professor Lawrence C. Washington, who has read and re-read many versions of this document, and every other research paper I have written to date, provided me with countless hours of his time, and given me sound advice on matters technical, professional, and personal. He has had many students, and all I have spoken to are in agreement that he is a phenomenal advisor; it is humbling to know that we can never possibly equal him. We can only try to emulate him.

I have been very fortunate to work with Dr. Nicolas T. Courtois, currently Senior Lecturer at University College London. It suffices to say that the field of algebraic cryptanalysis has been revolutionized by his work.
He permitted me to study with him for four months in Paris, and later even to stay at his home in England. Much of my work is joint work with him.

My most steadfast supporter has been my best friend and paramour, Patrick Studdard. He has read every paper I have written to date, even this dissertation. He has listened to me drone on about topics like group theory or ring theory and has tolerated my workaholism. My life and my work would be much more empty without him.

My first autonomous teaching experience was at American University, a major stepping stone in my life. The Department of Computer Science, Audio Technology and Physics had faith in me and put their students in my hands. I am so very grateful for those opportunities. In particular I would like to thank Professor Michael Gray and Professor Angela Wu, who gave me much advice. The departmental secretary, Yana Shabaev, was very kind to me on numerous occasions and I am grateful for all her help. I also received valuable teaching advice from my future father-in-law, Emeritus Professor Albert Studdard, of the Philosophy Department at the University of North Carolina at Pembroke.

My admission to the Applied Mathematics and Scientific Computation (AMSC) Program at the University of Maryland was unusual. I had been working as a PhD student in Computer Engineering for several years when I decided to add a Masters in AMSC as a side task. Later, I switched, and made it the focus of my PhD. Several faculty members were important in helping me make that transition, including Professor Virgil D. Gligor, Professor Richard Schwartz, and Professor Jonathan Rosenberg. But without the program chair, Professor David Levermore, and my advisor, Professor Lawrence C. Washington, this would have been impossible. I am grateful that many rules were bent.

Several governmental agencies have contributed to my education.
In reverse chronological order:

- The National Science Foundation (NSF) VIGRE grant to the University of Maryland provided the Mathematics Department with funds for "Dissertation Completion Fellowships." These are exceedingly generous semester-long gifts, and I had the pleasure to receive two of them. I am certain this dissertation would look very different without that valuable support.

- I furthermore received travel funding and access to super-computing under the NSF grant for the Sage project [sag], Grant Number 0555776.

- The European Union established the ECRYPT organization to foster research among cryptographers. The ECRYPT foundation was kind enough to support me during my research visit in Paris. Much of the work of this dissertation was done while working there.

- The Department of the Navy, Naval Sea Systems Command, Carderock Station, supported me with a fascinating internship during the summer of 2003. They furthermore allowed me to continue some cryptologic research toward my graduate degrees while working there.

- The National Security Agency was my employer from May of 1998 until September of 2002. Not only did I have the chance to work alongside and learn from some of the most dedicated and professional engineers, computer scientists and mathematicians who have ever been assembled into one organization, but also I was given five semesters of financial support, plus several summer sessions, which was very generous, especially considering that two of those semesters and one summer were during wartime. Our nation owes an enormous debt to the skill, perseverance, self-discipline, and creativity of the employees of the National Security Agency. Several employees there were supporters of my educational hopes, but security regulations and agency traditions would forbid my mentioning them by name.

Professor Virgil D. Gligor, Professor Jonathan Katz, and Professor William R.
Franklin (of RPI) have all been my academic advisors at one time or another, and I thank them for their guidance. Each has taught me from their own view of computer security, and it is interesting how many points of view there really are. Our field is enriched by this diversity of scholarly background. Likewise, Dr. Harold Snider gave me important advice during my time at Oxford. My Oxford days were an enormous influence on my decision to switch into Applied Mathematics from Computer Engineering.

My alma mater, the Rensselaer Polytechnic Institute, has furnished me with many skills. I cannot imagine a better training ground. My engineering degree was sufficiently mathematically rigorous that I succeeded in a graduate program in Applied Mathematics without having a Bachelor's degree in Mathematics; in fact, only two courses, Math 403 (Abstract Algebra) and Math 404 (Field Theory), were my transition. Yet I was practical enough to be hired into, and promoted several times at, the NSA, from a Grade 6 to a Grade 11 (out of a maximum of 15), including several commendations and a Director's Star. I am particularly indebted to the faculty there, and its founder, Stephen Van Rensselaer, former governor of New York and (twice) past Grand Master of Free & Accepted Masons of the State of New York. It was at Rensselaer that I learned of and was initiated into Freemasonry, and the Institute's mission of educating "the sons of farmers and mechanics" with "knowledge and thoroughness" echoes the teachings of Freemasonry. My two greatest friends, chess partners, and Brother Freemasons, Paul Bello and Abram Claycomb, were initiated there as well, and I learned much in discussions with them.

I learned of cryptography at a very young age, perhaps eight years old or so. Dr.
Haig Kafafian, also a Freemason and a family friend, took me aside after Church for many years and taught me matrices, classical cryptography, and eventually allowed me to be an intern at his software development project to make devices to aid the handicapped. There I met Igor Kalatian, who taught me the C language. I am grateful to them both.

Table of Contents

List of Tables
List of Figures
List of Abbreviations

1 Summary

I Polynomial Systems

2 An Extended Example: The Block-Cipher Keeloq
  2.1 Special Acknowledgment of Joint Work
  2.2 Notational Conventions and Terminology
  2.3 The Formulation of Keeloq
    2.3.1 What is Algebraic Cryptanalysis?
    2.3.2 The CSP Model
    2.3.3 The Keeloq Specification
    2.3.4 Modeling the Non-linear Function
    2.3.5 I/O Relations and the NLF
    2.3.6 Disposing of the Secret Key Shift-Register
    2.3.7 Describing the Plaintext Shift-Register
    2.3.8 The Polynomial System of Equations
    2.3.9 Variable and Equation Count
    2.3.10 Dropping the Degree to Quadratic
    2.3.11 Fixing or Guessing Bits in Advance
    2.3.12 The Failure of a Frontal Assault
  2.4 Our Attack
    2.4.1 A Particular Two Function Representation
    2.4.2 Acquiring an f^(8)_k-oracle
    2.4.3 The Consequences of Fixed Points
    2.4.4 How to Find Fixed Points
    2.4.5 How Far Must We Search?
    2.4.6 Fraction of Plainspace Required
    2.4.7 Comparison to Brute Force
    2.4.8 Some Lemmas
    2.4.9 Cycle Lengths in a Random Permutation
  2.5 Summary
  2.6 A Note about Keeloq's Utilization

3 Converting MQ to CNF-SAT
  3.1 Summary
  3.2 Introduction
    3.2.1 Application to Cryptanalysis
  3.3 Notation and Definitions
  3.4 Converting MQ to SAT
    3.4.1 The Conversion
      3.4.1.1 Minor Technicality
    3.4.2 Measures of Efficiency
    3.4.3 Preprocessing
    3.4.4 Fixing Variables in Advance
      3.4.4.1 Parallelization
    3.4.5 SAT-Solver Used
      3.4.5.1 Note About Randomness
  3.5 Experimental Results
    3.5.1 The Source of the Equations
    3.5.2 The Log-Normal Distribution of Running Times
    3.5.3 The Optimal Cutting Number
    3.5.4 Comparison with MAGMA, Singular
  3.6 Previous Work
  3.7 Conclusions
  3.8 Cubic Systems
  3.9 NP-Completeness of MQ
  3.10 Final Note on Sparse MQ and F5

4 How do SAT-Solvers Operate?
  4.1 The Problem Itself
    4.1.1 Conjunctive Normal Form
  4.2 Chaff and its Descendants
    4.2.1 Variable Management
    4.2.2 The Method of Watched Literals
    4.2.3 How to Actually Make This Happen
    4.2.4 Back-Tracking
  4.3 Enhancements to Chaff
    4.3.1 Learning
    4.3.2 The Alarm Clock
    4.3.3 The Third Finger
  4.4 Walk-SAT
  4.5 Economic Motivations
    4.5.1 An Interesting Coincidence

II Linear Systems

5 The Method of Four Russians
  5.1 Origins and Previous Work
    5.1.1 Strassen's Algorithm
  5.2 Rapid Subspace Enumeration
  5.3 The Four Russians Matrix Multiplication Algorithm
    5.3.1 Role of the Gray Code
    5.3.2 Transposing the Matrix Product
    5.3.3 Improvements
    5.3.4 A Quick Computation
    5.3.5 M4RM Experiments Performed by SAGE Staff
  5.4 The Four Russians Matrix Inversion Algorithm
    5.4.1 Stage 1
    5.4.2 Stage 2
    5.4.3 Stage 3
    5.4.4 A Curious Note on Stage 1 of M4RI
    5.4.5 Triangulation or Inversion?
  5.5 Experimental and Numerical Results
  5.6 Exact Analysis of Complexity
    5.6.1 An Alternative Computation
    5.6.2 Full Elimination, not Triangular
    5.6.3 The Rank of 3k Rows, or Why k + ε is not Enough
    5.6.4 Using Bulk Logical Operations
    5.6.5 M4RI Experiments Performed by SAGE Staff
      5.6.5.1 Determination of k
      5.6.5.2 Comparison to Magma
      5.6.5.3 The Transpose Experiment
  5.7 Pairing With Strassen's Algorithm for Matrix Multiplication
  5.8 The Unsuitability of Strassen's Algorithm for Inversion
    5.8.1 Bunch and Hopcroft's Solution
    5.8.2 Ibarra, Moran, and Hui's Solution

6 An Impractical Method of Accelerating Matrix Operations in Rings of Finite Size
  6.1 Introduction
    6.1.1 Feasibility
  6.2 The Algorithm over a Finite Ring
    6.2.1 Summary
    6.2.2 Complexity
    6.2.3 Taking Advantage of z ≠ 1
    6.2.4 The Transpose of Matrix Multiplication
  6.3 Choosing Values of b
    6.3.1 The "Conservative" Algorithm
    6.3.2 The "Liberal" Algorithm
    6.3.3 Comparison
  6.4 Over Finite Fields
    6.4.1 Complexity Penalty
    6.4.2 Memory Requirements
    6.4.3 Time Requirements
    6.4.4 The Conservative Algorithm
    6.4.5 The Liberal Algorithm
  6.5 Very Small Finite Fields
  6.6 Previous Work
  6.7 Notes
    6.7.1 Ring Additions
    6.7.2 On the Ceiling Symbol

A Some Basic Facts about Linear Algebra over GF(2)
  A.1 Sources
  A.2 Boolean Matrices vs GF(2) Matrices
  A.3 Why is GF(2) Different?
    A.3.1 There are Self-Orthogonal Vectors
    A.3.2 Something that Fails
    A.3.3 The Probability a Random Square Matrix is Singular or Invertible

B A Model for the Complexity of GF(2)-Operations
  B.1 The Cost Model
    B.1.1 Is the Model Trivial?
    B.1.2 Counting Field Operations
    B.1.3 Success and Failure
  B.2 Notational Conventions
  B.3 To Invert or to Solve?
  B.4 Data Structure Choices
    B.4.1 Dense Form: An Array with Swaps
    B.4.2 Permutation Matrices
  B.5 Analysis of Classical Techniques with our Model
    B.5.1 Naïve Matrix Multiplication
    B.5.2 Matrix Addition
    B.5.3 Dense Gaussian Elimination
    B.5.4 Strassen's Algorithm for Matrix Multiplication

C On the Exponent of Certain Matrix Operations
  C.1 Very Low Exponents
  C.2 The Equicomplexity Theorems
    C.2.1 Starting Point
    C.2.2 Proofs
    C.2.3 A Common Misunderstanding

Bibliography

List of Tables

2.1 Fixed points of random permutations and their 8th powers
3.1 CNF Expression Difficulty Measures for Quadratic Systems, by Cutting Number
3.2 Running Time Statistics in Seconds
3.3 Speeds of Comparison Trials between Magma, Singular and ANFtoCNF-MiniSAT
3.4 CNF Expression Difficulty Measures for Cubic Systems, by Cutting Number
5.1 M4RM Running Times versus Magma
5.2 Confirmation that k = 0.75 log2 n is not a good idea
5.3 Probabilities of a Fair-Coin-Generated n x n Matrix over GF(2) Having a Given Nullity
5.4 Experiment 1: Optimal Choices of k, and Running Time in Seconds
5.5 Running Times, in msec, Optimization Level 0
5.6 Percentage Error for Offset of k, From Experiment 1
5.7 Results of Experiment 3: Running Times, Fixed k = 8
5.8 Experiment 2: Running Times in Seconds under Different Optimizations, k = 8
5.9 Trials between M4RI and Gaussian Elimination (msec)
5.10 The Ineffectiveness of the Transpose Trick
5.11 Optimization Level 3, Flexible k
6.1 The Speed-Up and Extra Memory Usage Given by the Four Choices
6.2 Memory Required (in bits) and Speed-Up Factor (w/ Strassen) for Various Rings
B.1 Algorithms and Performance, for m x n Matrices

List of Figures

2.1 The Keeloq Circuit Diagram
3.1 The Distribution of Running Times, Experiment 1
3.2 The Distribution of the Logarithm of Running Times, Experiment 1
5.1 A Plot of M4RI's System Solving in Sage vs Magma
C.1 The Relationship of the Equicomplexity Proofs

List of Algorithms

1 Method of Four Russians, for Matrix Multiplication
2 Method of Four Russians, for Inversion
3 The Finite-Ring Algorithm
4 Fast M_b(R) multiplications for R a finite field but not GF(2)
5 Gram-Schmidt, over a field of characteristic zero
6 To compose two row swap arrays r and s, into t
7 To invert a row swap array r, into s
8 Naïve Matrix Multiplication
9 Dense Gaussian Elimination, for Inversion
10 Dense Gaussian Elimination, for Triangularization
11 Strassen's Algorithm for Matrix Multiplication

List of Abbreviations

f(x) = o(g(x))      lim_{x→∞} f(x)/g(x) = 0
f(x) = O(g(x))      there exist c, x0 such that for all x > x0, f(x) ≤ c·g(x)
f(x) = Ω(g(x))      there exist c, x0 such that for all x > x0, f(x) ≥ c·g(x)
f(x) = ω(g(x))      lim_{x→∞} g(x)/f(x) = 0
f(x) = Θ(g(x))      f(x) = O(g(x)) while simultaneously f(x) = Ω(g(x))
f(x) ~ g(x)         lim_{x→∞} f(x)/g(x) = 1

Cnf-Sat   The conjunctive normal form satisfiability problem
Des       The Data Encryption Standard
GF(q)     The finite field of size q
Hfe       Hidden Field Equations (a potential trap-door OWF)
M4rm      The Method of Four Russians for matrix multiplication
M4ri      The Method of Four Russians for matrix inversion
Mc        The Multivariate Cubic problem
Mq        The Multivariate Quadratic problem
Owf       One-way Function
Quad      The stream cipher defined in [BGP05]
Rref      Reduced Row Echelon Form
Sage      Software for Algebra and Geometry Experimentation
Sat       The satisfiability problem
Uutf      Unit Upper Triangular Form

Chapter 1

Summary

Generally speaking, it has been the author's objective to generate efficient and reliable computational tools to assist algebraic cryptanalysts.
In practice, this is a question of developing tools for systems of equations, which may be dense or sparse, linear or polynomial, over GF(2) or one of its extension fields. In addition, the author has explored the cryptanalysis of block ciphers and stream ciphers, targeting the block ciphers Keeloq and the Data Encryption Standard (Des), and the stream cipher Quad. Only Keeloq is presented here, because the work on the last two is still underway. The work on Des has seen preliminary publication in [CB06].

The dissertation is divided into two parts. The first deals with polynomial systems and actual cryptanalysis. The second deals with linear systems. Linear systems are important to cryptanalysis because of the XL algorithm [CSPK00], a standard method of solving overdefined polynomial systems of equations over finite fields by converting them into linear systems.

Chapter 2 is the most practical, and contains a detailed study of the block cipher Keeloq and presents a successful algebraic cryptanalysis attack. The cipher Keeloq is used in the key-less entry systems of nearly all luxury automobiles. Our attack is 2^14.77 times faster than brute force, but requires 0.6 · 2^32 plaintexts.

Chapter 3 has the largest potential future impact, and deals not with linear systems but with polynomial systems. Also, it deals not only with dense systems (as Part II does), but with sparse systems also. Since solving a quadratic system of equations is NP-hard, solving the Cnf-Sat problem is NP-hard, and all NP-Complete problems are polynomially reducible to each other, it makes sense to look for a method of using one to solve the other. This chapter shows how to convert quadratic systems of equations into Cnf-Sat problems, and that running off-the-shelf Sat-solvers on the result is an efficient method of solving this difficult problem.

Chapter 4 describes in general terms how Sat-solvers work. This tool is often viewed as a black box, which is unfortunate.
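The flavor of Chapter 3's conversion can be shown with a toy sketch (hypothetical Python, helper names mine). Each monomial receives a dummy variable via a standard AND encoding, and the GF(2) sum of monomials becomes an XOR constraint on those dummies. Here the XOR is expanded directly, which costs 2^(n-1) clauses for an n-term sum; that blow-up is exactly why the chapter "cuts" long sums into short pieces (Section 3.5.3, "The Optimal Cutting Number"):

```python
from itertools import product

def and_clauses(out, a, b):
    # CNF for out <-> (a AND b): the dummy-variable step, 3 clauses.
    return [[-a, -b, out], [a, -out], [b, -out]]

def xor_clauses(lits, parity):
    # CNF for lits[0] XOR ... XOR lits[-1] = parity (0 or 1).
    # Direct expansion: forbid every assignment of the wrong parity.
    clauses = []
    for bits in product([0, 1], repeat=len(lits)):
        if sum(bits) % 2 != parity:
            clauses.append([l if b == 0 else -l for l, b in zip(lits, bits)])
    return clauses

def satisfies(clauses, assign):
    # assign maps variable number -> 0/1; a clause needs one true literal.
    return all(any(assign[abs(l)] == (1 if l > 0 else 0) for l in c)
               for c in clauses)

# Encode the GF(2) equation x1*x2 + x3 = 1.
# Variables 1..3 are x1..x3; variable 4 is the dummy t for the monomial x1*x2.
cnf = and_clauses(4, 1, 2) + xor_clauses([4, 3], 1)
```

A real instance would hand `cnf` to a Sat-solver in DIMACS form; the brute-force `satisfies` check above is only for verifying the toy encoding.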
Chapter 4 contains no novel work, except that the author knows of no other exposition of how these tools operate, either for experts or for a general audience.

The second part begins with Chapter 5, and contains the Method of Four Russians, an algorithm published in the 1970s, but mostly forgotten since, for calculating a step of the transitive closure of a digraph, and thus also for squaring boolean matrices. Later it was adapted to matrix multiplication. This chapter provides an analysis of that algorithm, but also presents a related algorithm for matrix inversion that was anecdotally known among some French cryptanalysts. That inversion algorithm was not frequently used, however, because it was unclear how to eliminate the probability of abortion, how to handle memory management, and how to optimize the algorithm. The author's changes have made the probability of abortion negligible, and the algorithm has been implemented so that it outperforms Magma [mag] in some cases. The software tool Sage [sag], an open-source competitor to Magma, has adopted the author's software library (built on the Method of Four Russians) for its dense GF(2)-matrix operations, and this code is now deployed.

Chapter 6 has an algorithm of theoretical interest, but one which may find some practical application as well. This algorithm is for finite-ring matrix multiplication, with special attention and advantage to the finite-field case. The algorithm takes any "baseline" matrix multiplication algorithm which works over general rings, or over a class of rings that includes finite rings, and produces a faster version tailored to a specific finite ring. However, the memory required is enormous. Nonetheless, it is feasible for certain examples. The algorithm is based on the work of Atkinson and Santoro [AS88], but introduces many more details, optimizations, techniques, and detailed analysis. This chapter also modifies Atkinson and Santoro's complexity calculations.
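The multiplication variant of the Method of Four Russians (Chapter 5) can be sketched in a few lines. The following is a hypothetical Python toy (rows stored as integers, names mine), showing the two ingredients the chapter analyzes: a precomputed table of all 2^k GF(2) combinations of k rows of B, built at one XOR per entry by walking a Gray code (Section 5.3.1), then one table lookup per row per k-column chunk of A:

```python
def m4rm(A, B, n, k=4):
    # A, B: n x n GF(2) matrices; row i is an n-bit int, bit j = column j.
    C = [0] * n
    for start in range(0, n, k):
        w = min(k, n - start)
        # Table of all 2^w combinations of rows start..start+w-1 of B.
        # Gray-code order: consecutive indices differ in one bit, so each
        # table entry costs a single row XOR.
        T = [0] * (1 << w)
        prev = 0
        for i in range(1, 1 << w):
            gray = i ^ (i >> 1)
            j = (gray ^ prev).bit_length() - 1  # the bit that flipped
            T[gray] = T[prev] ^ B[start + j]
            prev = gray
        for r in range(n):
            idx = (A[r] >> start) & ((1 << w) - 1)
            C[r] ^= T[idx]  # add the precomputed combination of B's rows
    return C

def naive_mult(A, B, n):
    # Schoolbook comparison: row r of C is the XOR of the rows m of B
    # for which bit m of row r of A is set.
    C = [0] * n
    for r in range(n):
        for m in range(n):
            if (A[r] >> m) & 1:
                C[r] ^= B[m]
    return C
```

With k near log n, the table work amortizes away and each row costs n/k lookups instead of n row additions, which is the source of the log n speed-up over Gaussian-Elimination-style schoolbook work mentioned in the abstract.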
Three appendices round out Part II. Appendix A contains some discussion of GF(2)-linear algebra and how it differs from linear algebra in R and C. These facts are well-known. We introduce a new complexity model and data structures for GF(2)-matrix operations. This is discussed in Appendix B, but applies to all of Part II. Appendix C discusses algorithms faster than Strassen's algorithm, and contains proofs that matrix multiplication, matrix squaring, triangular matrix inversion, LUP-factorization, general matrix inversion, and the taking of determinants are equicomplex. These proofs are already known, but are here gathered into one place in the same notation. Finally, two software libraries were created during the dissertation work. The first was a very carefully coded GF(2)-matrix operations and linear algebra library, which included the Method of Four Russians. This library was adopted by Sage, and is written in traditional ANSI C. The second relates to Sat-solvers, and is a Java library for converting polynomials into Cnf-Sat problems. These two libraries are to be made publicly available as soon as the formalities of submitting the dissertation are completed. (The first is already available under the GPL, the Gnu Public License.)

Part I: Polynomial Systems

Chapter 2: An Extended Example: The Block-Cipher Keeloq

The purpose of this chapter is to supply a new, feasible, and economically relevant example of algebraic cryptanalysis. The block cipher "Keeloq"1 is used in the keyless-entry systems of most luxury automobiles. It has a secret key consisting of 64 bits, takes a plaintext of 32 bits, and outputs a ciphertext of 32 bits. The cipher consists of 528 rounds. Our attack is faster than brute force by a factor of around 2^14.77, as shown in Section 2.4.7 on page 27. A summary will be given in Section 2.5 on page 33.
This attack requires around 0.6·2^32 plaintexts, or 60% of the entire dictionary, as calculated in Section 2.4.5 on page 23. This chapter is written in the "chosen plaintext attack" model, in that we assume we can request the encryption of any plaintext and receive the corresponding ciphertext as encrypted under the secret key that we are trying to guess. This will be represented mathematically by oracle access to E_k(P) = C. However, it is easy to see that random plaintexts would permit the attack to proceed identically.

(Footnote 1: "Keeloq" is to be pronounced "key lock.")

2.1 Special Acknowledgment of Joint Work

The work described in this chapter was performed during a two-week visit to the Information Security Department of the University College of London's Ipswich campus. During that time the author worked with Nicolas T. Courtois. The content of this chapter is joint work. Nonetheless, the author has rewritten this text in his own words and notation, distinct from the joint paper [CB07]. Some proofs are found here which are not found in the paper.

2.2 Notational Conventions and Terminology

Evaluating the function f eight times will be denoted f^(8). For any ℓ-bit sequence, the least significant bit is numbered 0 and the most significant bit is numbered ℓ−1. If h(x) = x for some function h, then x is a fixed point of h. If h(h(x)) = x but h(x) ≠ x, then x is a "point of order 2" of h. In like manner, if h^(i)(x) = x but h^(j)(x) ≠ x for all j < i, then x is a point of order i of h.

... µ > 1 for the system to be expected to have at most one solution. As always with algebraic cryptanalysis, unless we make an assumption that is false, we always know the system of equations has at least one solution, because a message was sent. And thus we have a unique solution when µ > 1.
L_i = P_i                                                       for all i in [0, 31]

L_i = k_{i−32 mod 64} ⊕ L_{i−32} ⊕ L_{i−16} ⊕ L_{i−23} ⊕ L_{i−30} ⊕ L_{i−1}L_{i−12} ⊕ β_i
      ⊕ L_{i−6}L_{i−12} ⊕ L_{i−6}L_{i−30} ⊕ L_{i−12}L_{i−23} ⊕ L_{i−23}L_{i−30}
      ⊕ β_i L_{i−23} ⊕ β_i L_{i−12} ⊕ γ_i L_{i−23} ⊕ γ_i L_{i−12}   for all i in [32, 559]

β_i = L_{i−1}L_{i−6}                                            for all i in [32, 559]
γ_i = L_{i−1}L_{i−30}                                           for all i in [32, 559]
C_{i−528} = L_i                                                 for all i in [528, 559]

Even with µ = 2 this comes to 3168 equations and 3168 unknowns, well beyond the threshold of size for feasible polynomial system solving at the time this dissertation was written.

2.3.11 Fixing or Guessing Bits in Advance

Sometimes in Gröbner-basis algorithms or the XL algorithm, one fixes bits in advance [Cou04b, et al]. For example, in GF(2), there are only two possible values. Thus if one designates g particular variables, there are 2^g possible settings for them, but one needs to try only 2^g/2 on average if exactly one solution exists. For each guess, one rewrites the system of equations either by substituting the guessed values, or, if not, by adding additional equations of the form k_1 = 1, k_2 = 0, .... If the resulting Gröbner or XL running time is more than 2^g/2 times faster, this is a profitable move. In cryptanalysis, however, one generates a key, encrypts µ messages, and writes equations based on the plaintext-ciphertext pairs and various other constraints and facts. Therefore one knows the key. Instead of guessing all 2^g possible values, we simply guess correctly. However, two additional steps are then required. First, we must adjust the final running time by a factor of 2^g. Second, we must ensure that the system identifies a wrong guess as fast as, or faster than, it solves the system in the event of a correct guess. When the second condition does not hold, an alarm clock can be used. If, for example, the correctly guessed key bits cause a good solution after t_1 seconds, and after t_2 seconds the incorrectly guessed key bits cause the Sat-solver to declare the system unsatisfiable, then t_1 ≪ t_2 suggests the following technique: Guess some key bits.
After c·t_1 seconds, where 1 < c < t_2/t_1 is some constant, declare the system unsatisfiable and try another guess. Repeat until a good guess is found (i.e., the Sat-solver returns a valid key). We make no comment now on how to find the optimal c, but six standard deviations of t_1 is a good starting point.

2.3.12 The Failure of a Frontal Assault

First we tried a simple CSP. For various values of µ, we encrypted µ plaintext messages under one key, obtained the ciphertexts, and wrote equations as described already in Section 2.3.10 on page 15. We also used fewer rounds than 528, to see the impact of the number of rounds, as is standard. The experiments were an obvious failure, and so we began to look for a more efficient attack.

• With 64 rounds, µ = 4, and 10 key bits guessed, Singular required 70 seconds, and ElimLin required 10 seconds.

• With 64 rounds, and µ = 2 but the two plaintexts differing only in one bit (the least significant), Singular required 5 seconds, and ElimLin 20 seconds. MiniSat [ES05], using the techniques of Chapter 3, required 0.19 seconds. Note, it is natural that these attacks are faster, because many internal variables during the encryption will be identically valued for the first and second message.

• With 96 rounds, µ = 4, and 20 key bits guessed, MiniSat and the techniques of Chapter 3 required 0.3 seconds.

• With 128 rounds, and µ = 128, with a random initial plaintext and each subsequent plaintext an increment of the previous, and 30 key bits guessed, ElimLin required 3 hours.

• With 128 rounds, and µ = 2, with the plaintexts differing only in the least significant bit, and 30 key bits guessed, MiniSat required 2 hours.

These results on 128 rounds are slower than brute force. Therefore we did not try any larger number of rounds, or finish trying each possible combination of software and trial parameters. Needless to say, the 528-round versions did not terminate. Therefore, we need a new attack.
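The alarm-clock technique described above can be sketched as a simple search loop. In this illustrative Python, the solver interface and its running times are simulated stand-ins (measured in abstract time units), not a real Sat-solver:

```python
def alarm_clock_search(guesses, run_solver, t1, c=1.5):
    # Abort any solver run after c*t1 time units (the "alarm clock"), on the
    # assumption that a correct guess solves in about t1 while a wrong guess
    # would run far longer before being declared unsatisfiable.
    budget = c * t1
    total = 0.0
    for g in guesses:
        cost, key = run_solver(g)          # (time units consumed, key or None)
        if cost <= budget and key is not None:
            return key, total + cost       # good guess found
        total += min(cost, budget)         # aborted runs only cost the budget
    return None, total

# Simulated solver: guess 16 is "correct" (t1 = 1 unit); every wrong guess
# would take t2 = 100 units before reporting unsatisfiability.
def run_solver(g):
    return (1.0, "secret-key") if g == 16 else (100.0, None)

key, cost = alarm_clock_search(range(17), run_solver, t1=1.0)
```

With t_1 = 1 and t_2 = 100, the 16 wrong guesses cost only 1.5 units each instead of 100, so the total work is about 25 units rather than 1601, which is the entire point of the alarm clock.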
2.4 Our Attack

2.4.1 A Particular Two-Function Representation

Recall that every 64th round uses the same key bit. In other words, the same bit is used in rounds t, t+64, t+128, .... Note further that 528 = 8·64 + 16. Thus the key bits k_15,...,k_0 are used nine times, and the key bits k_63,...,k_16 are used eight times. With this in mind, it is clear that the operation of the cipher can be represented as

E_k(P) = g_k(f_k(f_k(··· f_k(P) ···)))   [with f_k applied 8 times]   = g_k(f_k^(8)(P)) = C

where f_k represents 64 rounds, and g_k the final 16 "extra" rounds.

2.4.2 Acquiring an f_k^(8)-oracle

Suppose we simply guess the 16 bits of the key denoted k_15,...,k_0. Of course, we will succeed with probability 2^−16. But at that point, we can evaluate g_k or its inverse g_k^−1. Then

g_k^−1(E_k(P)) = g_k^−1(g_k(f_k^(8)(P))) = f_k^(8)(P)

and our oracle for E_k now gives rise to an oracle for f_k^(8).

2.4.3 The Consequences of Fixed Points

For the moment, assume we find x and y such that f_k(x) = x and f_k(y) = y. At first, this seems strange to discuss at all. Because f_k(x) = x and therefore f_k^(8)(x) = x, we know E_k(x) = g_k(f_k^(8)(x)) = g_k(x). But g_k(x) is the part of the cipher that we can remove by guessing a quarter (16 bits) of the key. Therefore, if we "know something" about x, we know something about multiple internal points, the input, and the output of E_k(x). Now we will make this idea more precise. Intuitively, we now know 64 bits of input and 64 bits of output of the function f (32 bits of each from each message). This forms a very rigid constraint, and it is highly likely that only one key could produce these outputs. This means that if we solve the system of equations for that key, we will get exactly one answer, which is the secret key. The only question is whether the system of equations is rapidly solvable or not. The resulting system has equations for the 64 rounds of f.
For both x and y, there are equations for L_0,...,L_95 and 32 additional output equations, but the first 32 and the last 32 of these (in both cases) are of the forms L_i = x_i and L_{i+64} = x_i, and can be eliminated by substitution. Thus there are actually 96 + 32 − 32 − 32 = 64 equations (again, one per round) for each of x and y, and thus 128 total equations. We emphasize that this is the same system of equations as in Section 2.3.8 on page 13, but with only 64 rounds for each message. The x_i's and y_i's are known. Thus the unknowns are the 64 bits of the key, and the 32 "intermediate" values of L_i for each of x and y. This is 128 total unknowns. After translating from cubic into quadratic format, it becomes 384 equations and 384 unknowns. This is much smaller than the 3168 equations and 3168 unknowns we had before. In each case, ElimLin, Magma, Singular, and the methods of Chapter 3 solved the system for k_0,...,k_63 in time too short to measure accurately (i.e., less than 1 minute). It should be noted that we require two fixed points, not merely one, to make the attack work. One fixed point alone is not enough of a constraint to narrow the keyspace sufficiently. However, two fixed points were sufficient each time it was tried. Therefore, we will assume f has two or more fixed points, and adjust our probabilities of success accordingly. One way to look at this is to say that only those keys which result in two or more fixed points are vulnerable to our attack. However, since the key changes rapidly in most applications (see Section 2.6 on page 35), and since approximately 26.42% of random permutations GF(2)^32 → GF(2)^32 have this property (see Section 2.4.8 on page 30), we do not believe this to be a major drawback.

2.4.4 How to Find Fixed Points

Obviously a fixed point of f_k is a fixed point of f_k^(8) as well, but the reverse is not necessarily true. Stated differently, the set of fixed points of f_k^(8) will contain the set of all fixed points of f_k.
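This containment, and the way that guessing the "extra"-round key bits yields an f^(8)-oracle (Section 2.4.2), can be illustrated with a toy cipher. The 8-bit f and g below are invented stand-ins, not Keeloq itself:

```python
def f(k, x):
    # Toy keyed permutation on 8-bit values (a stand-in for 64 Keeloq rounds).
    x = (x + k) & 0xFF
    return ((x << 3) | (x >> 5)) & 0xFF           # rotate left by 3 bits

def g(k, x):                                      # toy "extra rounds" g_k
    return x ^ (k & 0xFF)

def g_inv(k, x):                                  # g_k is easy to invert
    return x ^ (k & 0xFF)

def f8(k, x):                                     # f_k^(8)
    for _ in range(8):
        x = f(k, x)
    return x

def E(k, x):                                      # E_k = g_k composed with f_k^(8)
    return g(k, f8(k, x))

key = 0x5A
# Correctly guessing the g-part of the key turns an E_k-oracle into an
# f^(8)-oracle, exactly as in Section 2.4.2:
oracle = lambda x: g_inv(key, E(key, x))
assert all(oracle(x) == f8(key, x) for x in range(256))

# Every fixed point of f is a fixed point of f^(8), but not conversely:
fix_f  = {x for x in range(256) if f(key, x) == x}
fix_f8 = {x for x in range(256) if f8(key, x) == x}
assert fix_f <= fix_f8
```

The attack then enumerates fix_f8 via the oracle and tests pairs of its elements as candidate fixed points of f, as described next.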
We will first calculate the set of fixed points of f_k^(8), which will be very small. We will try the attack given in the previous subsection using every pair of fixed points. If it is the case that f_k has two or more such fixed points, then one pair which we try will indeed be a pair of fixed points of f_k. This will produce the correct secret key. The other pairs will produce spurious keys or inconsistent systems of equations. But this is not a problem, because spurious keys can be easily detected and discarded. The running time required to solve the system of equations is too short to measure accurately, with a valid or an invalid pair. Recall that this is 384 equations and 384 unknowns, as compared to 3168, as explained in Section 2.4.3 on page 21. There are probably very few fixed points of f_k^(8), as we will prove below. And thus the running time of the entire attack depends only upon finding the set of fixed points of f_k^(8). One approach would be to iterate through all 2^32 possible plaintexts, using the f_k^(8)-oracle. This would clearly uncover all possible fixed points of f_k^(8), and if f_k has any fixed points, they would be included. However, this is not efficient. Instead, one can simply try plaintexts in sequence using the f_k^(8)-oracle. When the i-th fixed point x_i is found, one tries the attack with the i−1 pairs (x_1, x_i),...,(x_{i−1}, x_i). If two fixed points of f_k are to be found in x_1,...,x_i, the attack will succeed at this point, and we are done. Otherwise, continue until x_{i+1} is found and try the pairs (x_1, x_{i+1}),...,(x_i, x_{i+1}), and so forth.

Table 2.1: Fixed points of random permutations and their 8th powers

Size                     2^12    2^12     2^13     2^14     2^15     2^16
Experiments              1000    10,000   10,000   10,000   10,000   100,000
Abortions (n_1 < 2)      780     7781     7628     7731     7727     76,824
Good Examples (n_1 ≥ 2)  220     2219     2372     2269     2273     23,176
Average n_1              2.445   2.447    2.436    2.422    2.425    2.440
Average n_8              4.964   5.684    5.739    5.612    5.695    5.746
Average Location         2482    2483     4918     9752     19,829   39,707
Percentage (α)           60.60%  60.62%   60.11%   59.59%   60.51%   60.59%

2.4.5 How Far Must We Search?

One could generate a probability distribution on the possible values of n_1 and n_8, the numbers of fixed points of f_k and f_k^(8). However, if all we need to know is how many plaintexts must be tried until two fixed points of f are discovered, then this can be computed by an experiment. We generated 1000 or 10,000 random permutations of sizes 2^12, 2^13, 2^14, and 2^15, and 100,000 of size 2^16. Then we checked to see if they had two or more fixed points, and aborted if this was not the case. If two or more fixed points were indeed present, we tabulated the number of fixed points of the eighth power of that permutation under composition. Finally, we examined at which value the second fixed point of f was found, when iterating through the values of f^(8) and searching for its fixed points. The data is given in Table 2.1 on page 23. It shows that we must check around 60% of the possible plaintexts. It also confirms the values of n_1 = 2.39 (calculated in Section 2.4.8 on page 30) and n_8 = 5.39 (calculated in Section 2.4.9 on page 31).

2.4.6 Fraction of Plainspace Required

As calculated in Section 2.4.9 on page 31, we expect f to have a total of 3 points of orders 2, 4, or 8, in expectation. This is independent of the number of fixed points, so long as the number of fixed points is small. The probability of f having p fixed points is 1/(p!e), as calculated in Lemma 3 on page 29. Upon conditioning on f having at least two fixed points, this becomes

1 / (p!e(1 − 2/e))

Our plan is to check each fixed point of f^(8), and see if it is a fixed point of f (denoted "yes"), or not (denoted "no"), the latter meaning it is a point of order 2, 4, or 8. Upon the second "yes" result, we stop. How many checks must we perform until we stop? Denote the number of checks required as k. Now recall that we expect 3 points of order 2, 4, or 8, independent of the number of fixed points of f. This is shown in Section 2.4.9 on page 31.
If we have p fixed points of f, we will expect 3 + p fixed points of f^(8). For example, suppose there are p = 2 fixed points of f. In expectation, we will find 5 fixed points of f^(8). The possible arrangements are

k = 2: YYNNN
k = 3: YNYNN, NYYNN
k = 4: YNNYN, NYNYN, NNYYN
k = 5: YNNNY, NYNNY, NNYNY, NNNYY

The expected value of k in the above example is 4. Since the fixed points of f^(8) are expected to be uniformly distributed in the plainspace, we can model them as placed uniformly in the intervals {(0, 0.2), (0.2, 0.4),...,(0.8, 1.0)} (viewed as fractions of the plainspace). Alternatively, this is expecting the fixed points to be located at approximately {0.1, 0.3, 0.5, 0.7, 0.9}. This is obviously a crude model, but the aforementioned experiments confirm it. In our case, since k is expected to be 4, then 0.7 or 70% of the plaintext must be checked, in expectation, for p = 2. Of course, if the fixed points were not uniformly distributed, and we knew the distribution, we would find them faster than this. In general, 2 ≤ k ≤ 5, and (2k−1)/(2(p+3)) is the fraction of plaintext that needs to be computed. Call this value α. To find the probability distribution of α, we need only find a probability distribution for k, since we already have one for p. This tabulation will be simplified by observing that to the left of position k there is one yes, and all remaining are nos. To the right of position k one finds p−2 yeses, and all remaining are nos. The left of k has k−1 slots, and the right of k has p+3−k slots. Writing C(n, r) for the binomial coefficient, this gives us a number of arrangements:

C(k−1, 1) · C(p+3−k, p−2) = (k−1) · C(p+3−k, 5−k)

Since k ∈ {2, 3, 4, 5}, the total number of arrangements is

(1)·C(p+1, 3) + (2)·C(p, 2) + (3)·C(p−1, 1) + (4)·C(p−2, 0)

Thankfully, that totals to C(p+3, 3), the number of ways of putting three nos into p+3 slots.
This fact can be seen algebraically by expanding out the choice function above in each case. The probability of k = K is thus given by

Pr{k = K | p = P} = (K−1) C(P+3−K, 5−K) / C(P+3, 3)

Then we can apply the probability distribution of p:

Pr{k = K | p ≥ 2} = Σ_{p=2}^{∞} [(K−1) C(p+3−K, 5−K) / C(p+3, 3)] · 1/(p!e(1 − 2/e))

From this, the expected value of k can be found:

E[k | p ≥ 2] = Σ_{k=2}^{5} Σ_{p=2}^{∞} [k(k−1) C(p+3−k, 5−k) / C(p+3, 3)] · 1/(p!e(1 − 2/e))

Knowing that α, the fraction of the plainspace that we must search, is given by

α = (2k−1)/(2(p+3)) = (k − 1/2)/(p+3)

as shown above, we can substitute to obtain:

E[α | p ≥ 2] = Σ_{k=2}^{5} Σ_{p=2}^{∞} [(k − 1/2)/(p+3)] · [(k−1) C(p+3−k, 5−k) / C(p+3, 3)] · 1/(p!e(1 − 2/e))

This evaluates to 0.6297, which is close to the value given in Table 2.1 on page 23, considering the crudeness of the model.

2.4.7 Comparison to Brute Force

Recall that f has two or more fixed points with probability 1 − 2/e, and that we require f to have two or more. Our success probability is 2^−16 (1 − 2/e) ≈ 2^−17.92. A brute-force attack which would itself have probability 2^−17.92 of success would consist of guessing 2^46.08 possible keys and then aborting, because 46.08 + 17.92 = 64, the length of the key. Therefore, our attack must be faster than 2^46.08 encryptions' worth of guesses, or 528 · 2^46.08 ≈ 2^55.12 rounds. Our attack requires g_k^−1(E_k(P)), which needs an additional 16 rounds per plaintext. Even if we use the whole dictionary of 2^32 possible plaintexts, this comes to (528 + 16) · 2^32 ≈ 2^41.09 rounds, which is about 2^14.04 times faster than brute force. If instead we use (528 + 16)(3/5)2^32 ≈ 2^40.35 rounds (now an expected value, based on the last paragraph of the previous section), the attack is about 2^14.77 times faster than brute force.
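The constants used in this section can be spot-checked numerically. The short computation below follows the formulas of Sections 2.4.6 and 2.4.7 (truncating the infinite sums over p at 60 terms, where they have long since converged):

```python
from math import e, factorial, comb, log2

# Probability a large random permutation has >= 2 fixed points (Corollary 1):
p_ge2 = 1 - 2 / e                                          # ~ 0.26424

# Expected number of fixed points, conditioned on having at least two:
n1 = sum(p / (e * factorial(p)) for p in range(2, 60)) / p_ge2   # ~ 2.3922

# Points of order 2, 4, 8 contribute 3 more fixed points to f^(8):
n8 = n1 + 3                                                # ~ 5.39

# E[alpha | p >= 2], the expected fraction of the plainspace searched:
alpha = sum((k - 0.5) / (p + 3)
            * (k - 1) * comb(p + 3 - k, 5 - k) / comb(p + 3, 3)
            / (factorial(p) * e * p_ge2)
            for k in range(2, 6) for p in range(2, 60))    # ~ 0.6297

# log2 of the speedup over a brute-force attack of equal success probability:
full_dict = log2(528 * 2**46.08) - log2((528 + 16) * 2**32)        # ~ 14.04
expected  = log2(528 * 2**46.08) - log2((528 + 16) * 0.6 * 2**32)  # ~ 14.77
```

The computed E[α] ≈ 0.630 matches both the 0.6297 of the model and the roughly 60% observed in Table 2.1.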
2.4.8 Some Lemmas

This section provides some of the probability calculations needed in the previous sections. The argument in this section is that if (for random k) the function f_k : GF(2)^n → GF(2)^n is computationally indistinguishable from a random permutation from S_{2^n}, then f_k and f_k^(8) have various properties. Our f_k and f_k^(8) are not random permutations, but are based on the Keeloq specification. Since we are discussing the cryptanalysis of a block cipher, we conjecture that modeling f_k as a random permutation is a good model (as is common). If not, much easier attacks might exist. This is a standard assumption. However, we need only 3 facts from this analysis. First, the expected number of fixed points, if there are at least two, is about 2.3922. Second, the probability of having two or more fixed points is about 26.424%. Third, the expected number of fixed points of f^(8) is around 5.39. These particular facts were verified by the simulations given in Table 2.1 on page 23, and found to be reasonable.

Lemma 1 Both f and g are bijections.

Proof: Note that E_k is a permutation (bijection) for any specific fixed key, as must be the case for all block ciphers. Then further note that E_k(P) = g_k(f_k^(8)(P)) implies that f and g are bijections. Of course, if the domain or range of these functions were not finite, this argument would be incorrect. The only conclusion would be that the outermost function is surjective and that the innermost is injective. []

Lemma 2 If h : GF(2)^n → GF(2)^n is computationally indistinguishable from a random permutation, then h′(x) = h(x) ⊕ x is computationally indistinguishable from a random function.

Proof: If h′ is computationally distinguishable from a random function, that means there exists an Algorithm A which, in time polynomial in n and with probability ε, can distinguish between an oracle Ω being either h′ or a random function of appropriate domain and range.
Presumably this requires queries to Ω, and only polynomially many queries in n, since Algorithm A runs in time polynomial in n. Finally, ε is non-negligible in n, or more simply, 1/ε is bounded above by a polynomial in n. We create Algorithm B, which will distinguish between an oracle Ω being h or a random permutation with appropriate domain and range. First, run Algorithm A. Whenever Algorithm A asks for a query Ω(x), return Ω(x) ⊕ x. If Ω is h, then Ω(x) ⊕ x = h(x) ⊕ x = h′(x). Likewise, if Ω is a random permutation, then x ↦ Ω(x) ⊕ x acts as computationally indistinguishable from a random function, since it is a well-known theorem that random functions and random permutations cannot be distinguished in polynomial time [GGM86]. Algorithm B outputs "pseudorandom" or "random" if and only if Algorithm A does. This is a perfect simulation in either case, and therefore Algorithm A will be correct with probability ε, and therefore Algorithm B will be correct with probability ε. Thus h and a random permutation are computationally distinguishable. We have now proven that h′ being computationally distinguishable from a random function implies that h is computationally distinguishable from a random permutation. The contrapositive is our lemma. []

Lemma 3 If h : GF(2)^n → GF(2)^n is a random permutation, then the limit as n → ∞ of the probability that h has p fixed points is 1/(p!e).

Proof: If h′(x) = h(x) ⊕ x, and if h(y) = y, then h′(y) = 0. Thus the set of fixed points of h is merely the preimage of 0 under h′. By Lemma 2, h′ behaves as a random function. Thus the value of h′(y) for any particular y is an independently and identically distributed uniform random variable. The "Bernoulli trials" model therefore applies.
If |h′^(−1)(0)| is the size of the preimage of 0 under h′, then

lim_{n→∞} Pr{|h′^(−1)(0)| = p} = C(2^n, p) (2^−n)^p (1 − 2^−n)^(2^n − p)
  = C(2^n, p) (2^−n)^p (1 − 2^−n)^(2^n) (1 − 2^−n)^(−p)
  ≈ [2^n (2^n − 1)(2^n − 2)(2^n − 3)···(2^n − p + 1) / p!] (2^−n)^p (e^−1)(1)
  ≈ (1)(1 − 1·2^−n)(1 − 2·2^−n)(1 − 3·2^−n)···(1 − (p−1)·2^−n) e^−1 / p!
  ≈ 1/(p!e)

Thus h has p fixed points with probability 1/(p!e). []

Corollary 1 If h : GF(2)^n → GF(2)^n is a random permutation, then h has two or more fixed points with probability 1 − 2/e ≈ 0.26424. Assuming h has 2 or more fixed points, it has exactly 2 with probability (1/2e)/(1 − 2/e) ≈ 0.69611, and 3 or more with probability 1 − (1/2e)/(1 − 2/e) ≈ 0.30389.

Therefore it is useful to calculate the expected number of fixed points given that we assume there are 2 or more. This calculation is given by

[1/(1 − 2/e)] Σ_{i=2}^{∞} i/(e·i!) ≈ 2.3922

2.4.9 Cycle Lengths in a Random Permutation

It is well known that in a random permutation, the expected number of cycles of length m is 1/m, which is proven in the lemma at the end of this subsection. Thus the expected numbers of cycles of length 1, 2, 4, and 8 in f_k are 1, 1/2, 1/4, and 1/8. All of these cycles consist of fixed points of f_k^(8), providing 1, 2, 4, and 8 fixed points each, or a total of 1, 1, 1, and 1 expected fixed points, or 4 expected fixed points in total. Thus n_8 = 4 in the general case. In the special case of f_k having at least 2 fixed points, we claim the number should be largely unchanged. Remove the 2.39 expected fixed points from the domain of f_k. This new function has a domain and range of size 2^n − 2 and is still a permutation. Since we are assuming n is large, 2^n − 2 ≈ 2^n.
Indeed, if one makes a directed graph with one vertex for each x in the domain, with x having only one exit edge, which points to f(x), then those two fixed points are both individual islands disconnected from the main graph. Clearly, the remainder of the graph of the permutation f_k is unchanged by this removal. But that further implies that f_k, restricted to the original domain with the two fixed points removed, is unchanged on its other values. Thus f_k^(8) after the removal would still have 4 fixed points in expectation. We can then estimate 4 − 1 + 2.39 = 5.39 fixed points for f_k^(8), because 1.0 fixed points are expected from f_k in general, and 2.39 when f_k has at least two fixed points. An alternative way to look at this is that the fixed points of f_k^(8) are precisely the points of orders 1, 2, 4, and 8 of f_k. These four sets are mutually exclusive, and their cardinalities (as random variables for a random permutation) are asymptotically independent as the size of the domain goes to infinity. To see why this is true, imagine a directed graph G = (V, E) with one vertex for each domain point and one edge pointing from x to f_k(x) for all x in the domain. The fixed points of f are vertices that have only a self-loop as their edges. Therefore, a few can be deleted from the graph without changing its characteristics. Thus the expected number of points of order 2, 4, and 8 of a random permutation should remain unchanged upon requiring the permutation to have 2 or more fixed points. This comes to 2·(1/2) + 4·(1/4) + 8·(1/8) = 3 expected points. Since we require f_k to have two or more fixed points, it will have 2.39 in expectation, as calculated in Section 2.4.8 on page 30. Thus the expected number of fixed points of f_k^(8) is 5.39, when f is restricted to having two or more fixed points. See also the note at the end of this subsection.

Lemma 4 The expected number of cycles of length m in a random permutation h : D → D, in the limit as |D| → ∞, is 1/m.

Proof: Consider x, f(x), f(f(x)), ...
= y_1, y_2, y_3, .... Call the first repeated value among the y_i to be y_r. More precisely, y_r = y_1, and y_r ≠ y_i for 1 < i < r.