ABSTRACT

Title of Dissertation: SCHEDULING AND RATE PROVISIONING FOR INPUT-BUFFERED CELL BASED SWITCH FABRICS

Vahid Tabatabaee, Doctor of Philosophy, 2003

Dissertation directed by: Professor Leandros Tassiulas, Department of Electrical and Computer Engineering

In this dissertation, we develop and analyze algorithms for scheduling in input-buffered switch fabrics. We introduce new deterministic and randomized scheduling algorithms that are capable of rate provisioning, achieve 100% throughput, and have lower complexity than other proposed solutions. We consider QoS provisioning in general, and rate provisioning in particular, as the basic requirements for the next generation of switch fabrics.

To do rate provisioning, we extend the concept of packetized policies tracking fluid policies to input-buffered switches. It is assumed that the speedup of the switch is one and that the fluid policy is feasible, i.e., the utilization of all ports is less than one. For 2×2 switches, we show that ideal non-anticipative tracking policies always exist. By ideal, we mean a tracking policy that is at most one cell behind the corresponding fluid policy. Using a 3×3 counterexample, we show that such non-anticipative policies do not exist in general. For N×N switches, a heuristic tracking policy is provided.

The encouraging results for the heuristic policy motivated us to look for analytical results on its performance. This effort led to the introduction of maximum node contained matching (MNCM), a new class of deterministic maximal size matching algorithms. We use fluid model techniques to prove that these algorithms achieve 100% throughput with no speedup. The only assumption on the arrival pattern is that it satisfies the strong law of large numbers. We also introduce a new weighted matching algorithm in MNCM, maximum first matching (MFM), with complexity O(N^2.5). MFM is, to the best of our knowledge, the lowest complexity deterministic algorithm that delivers 100% throughput. We extend the concept of MNCM schedulers and introduce the Maximum Size Unit Interval Matching (MSUIM) algorithm for rate provisioning.

Finally, we propose a general parallel architecture for self-randomized algorithms that is appropriate for practical applications. We introduce the concept of max-min fair self-randomized scheduling algorithms for rate provisioning. Using fluid model techniques, we provide analytical results for the performance of the self-randomized schedulers.

SCHEDULING AND RATE PROVISIONING FOR INPUT-BUFFERED CELL BASED SWITCH FABRICS

by Vahid Tabatabaee

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2003

Advisory Committee:
Professor Leandros Tassiulas, Chairman and Advisor
Professor Anthony Ephremides
Professor Samir Khuller, Dean's Representative
Professor Richard La
Professor Uzi Vishkin

© Copyright by Vahid Tabatabaee, 2003

DEDICATION

To my mother who taught me the love for life, to my father who taught me the love for knowledge, to my wife Mojgan who is my source of inspiration and, to my son Omid who reminds me of all these every day.

ACKNOWLEDGEMENTS

I would like to thank my advisor, Professor Leandros Tassiulas, for his support and guidance during my studies. He introduced me to the fascinating field of communication networks and gave me the opportunity to work on several interesting problems in this field.
His support and understanding have quite often gone beyond academic matters during this period.

I am especially grateful to Professor Leonidas Georgiadis from Aristotle University. During my research, Leo has been a second advisor to me, and working with him was a great experience. I have learned a lot from his research and problem solving methodology. In fact, the results of chapter 2 are the outcome of our joint work.

I am thankful to the members of the dissertation committee, Professors Anthony Ephremides, Samir Khuller, Richard La and Uzi Vishkin, for kindly reviewing this dissertation and serving on my thesis committee. In particular, I would like to thank Professor Samir Khuller for several discussions that we had on the complexity of alternative matching algorithms for scheduling. These early discussions motivated me to pursue a path that resulted in the introduction of the maximum node contained matching algorithms in chapter 3.

During the course of my academic studies, I was fortunate to work and become friends with some of the brightest and kindest people that I have ever known. I cherish and honor their friendship and thank them for what I have learned from them during this long journey.

Last, but not least, I express sincere gratitude to my parents for all the sacrifices, to my wife Mojgan for all the support and patience, and to my dear son Omid for his cheerfulness.

TABLE OF CONTENTS

List of Tables
List of Figures

1 Introduction
1.1 Overview
1.1.1 Data Path in a packet switching system
1.1.2 Switch Fabric Architectures
1.2 Problem Statement
1.3 Related Work
1.3.1 Maximum Size Matching
1.3.2 Maximum Weighted Matching
1.3.3 Parallel Matching Algorithms
1.3.4 Randomized Scheduling Algorithms
1.3.5 Fluid Model Techniques
1.4 Contributions of Dissertation
1.4.1 Packetized Tracking Policies
1.4.2 The Maximum Node Contained Matchings
1.4.3 Self-randomized Scheduling Algorithms and Rate Provisioning

2 Rate Provisioning and Tracking Fluid Policies in Input-Buffered Switches
2.1 Fluid and Packetized Tracking Policies
2.2 Multi-periodic TDMA Satellite Switches
2.3 The 2×2 Switch
2.4 Heuristic Algorithms
2.4.1 Port Based Tracking
2.4.2 Critical Ports and Links
2.4.3 Fixed Rate Scheduler Simulation
2.5 Summary and Conclusion

3 MNCM: A class of efficient maximal size matching scheduling algorithms with no speedup
3.1 Model and Definition
3.2 Link and Port Fluid Models
3.3 Stability Results
3.3.1 Comparison of the MNCM and LPF
3.4 Maximum First Matching (MFM)
3.5 Rate Provisioning and the MSUIM Scheduling Algorithm
3.6 Simulations
3.7 Summary

4 Analysis and design of self-randomized max-min fair schedulers
4.1 Models and Definitions
4.1.1 Randomized Scheduling Algorithms and Fluid model
4.2 Stability of Randomized Schedulers
4.3 General Architecture
4.4 Backlogged weight function
4.4.1 B1: Self-Randomized Based on Instantaneous Arrivals
4.4.2 B2: Self-Randomized Based on Total Arrivals
4.4.3 Simulation Results
4.5 Max-min fair scheduling weight function
4.5.1 A1: Self-Randomized Based on Instantaneous Token and Cell Arrivals
4.5.2 A2: Self-Randomized Based on Total Token and Cell Arrivals
4.5.3 A4: Self-Randomized Based on Total Link Weight and Cell Arrivals
4.5.4 Simulation Results

5 Summary and concluding remarks

Bibliography

LIST OF TABLES

2.1 Example of Sub-permutation Matrices
2.2 Performance for different inspection horizons (U=(0.85, 0.8), N=32)
2.3 Performance for different switch sizes (U=(0.85, 0.8))
2.4 Performance for different utility pairs (N=32)

LIST OF FIGURES

1.1 Block Diagram of a Data Switching System.
1.2 Schematic Diagram of a Chassis Based Switch.
1.3 A Three Layer Hierarchical Scheduler.
1.4 Sources of Cell Overhead in a Switching System.
1.5 Schematic Architecture for a 4×4 Shared Memory Switching and Scheduling Elements.
1.6 Schematic Architecture for a 4×4 Buffered Crossbar Switching and Scheduling Elements.
1.7 Schematic Architecture for a 4×4 Input Buffered Switch Fabric.
1.8 Schematic Architecture of an Input-Buffered Switch Fabric.
2.1 Effect of Cell p3 on completion times of p1 and p2.
2.2 Cell transfer from input port i to output port j.
2.3 Arrangement of cells for Case 1 in the proof of Theorem 1.
2.4 Arrangement of cells for Case 2 in the proof of Theorem 1.
2.5 The backlogged cells for the fluid and packetized tracking.
2.6 Example of node and link weight matching in bipartite graphs.
3.1 Four possible forms of sub-graphs in a combined matching. Links of M1 are shown with solid lines and M2 with dashed lines. Marked nodes are shown as black nodes.
3.2 Average Delay vs. Throughput for different matching algorithms. The output queueing system is modelled as an M/D/1 system.
4.1 The generic block diagram adopted for self-randomized schedulers.
4.2 Mean delay for B1 under diagonal traffic.
4.3 Mean delay for B1 under log-diagonal traffic.
4.4 Mean delay for B1 under uniform traffic.
4.5 Mean delay for B2 under diagonal traffic.
4.6 Mean delay for B2 under log-diagonal traffic.
4.7 Mean delay for B2 under uniform traffic.
4.8 Mean delay for A1 under diagonal traffic.
4.9 Mean delay for A1 under log-diagonal traffic.
4.10 Mean delay for A1 under uniform traffic.
4.11 Mean delay for A2 under diagonal traffic.
4.12 Mean delay for A2 under log-diagonal traffic.
4.13 Mean delay for A2 under uniform traffic.
4.14 Mean delay for A4 under diagonal traffic.
4.15 Mean Delay for A4 under log-diagonal traffic.
4.16 Mean Delay for A4 under uniform traffic.
4.17 Mean delay for different schemes under diagonal traffic.
4.18 Mean delay for different schemes under log-diagonal traffic.
4.19 Mean delay for different schemes under uniform traffic.
4.20 Serving rate adaptation to arrival pattern changes.

Chapter 1

Introduction

1.1 Overview

Due to the progress in optical transmission technology and the increase in Internet traffic, very fast switching technologies are necessary for Internet core and edge switching and routing systems. In this section, we introduce the basic architecture and components of a switching system in communication networks. We then focus on the switch fabric component and review a number of common architectures that are used for switch fabrics.

1.1.1 Data Path in a packet switching system

Consider the generic packet switching system in figure 1.1. We focus on the data plane components. There are two cards shown in the picture: the line card and the switch card. In a real chassis based switching system, there are multiple line cards, and, depending on the capacity and the redundancy model of the system, there can be one or more switch cards (figure 1.2).

Figure 1.1: Block Diagram of a Data Switching System.

The line cards are the interface between the switching system and the outside world. Packets or cells enter the system through the ingress interface of the line cards, and exit the system through their egress interface. The switch card's task is to let data units that enter the system through an ingress line card reach their targeted egress line card.

Line Card Functions and Components

We start from the line interface side of the line card. The first block is the optics, which does the optical-to-electrical and electrical-to-optical conversions. The next block performs clock and data recovery (CDR) and deserialization (serialization) of the bit stream in the ingress (egress) direction. The framer/mapper extracts the payload from the layer 1 framing protocol (SONET) in the ingress direction and maps data back into the frames in the egress direction.
Packet processing starts at the network processor block, which is mainly a packet classifier. In the ingress direction it identifies the flows based on a multi-field lookup. A multi-field lookup can, for instance, be based on the IP N-tuple lookup or the MPLS header. After flow identification it determines the egress forwarding port address, marks the QoS parameters, and appends the system-internal header information for the traffic manager.

Figure 1.2: Schematic Diagram of a Chassis Based Switch.

The next block is the traffic manager, which performs the following functions:

- Police ingress traffic, for instance based on IETF DiffServ or ITU GCRA.
- Provide congestion management based on WRED profiles.
- Enqueue packets into class based queues per destination port.
- Schedule ingress traffic into the fabric interface using appropriate scheduling techniques.
- Segment packets into fixed sized switch-cells for the switch fabric.
- Reassemble switch-cells back into packets in the egress direction.
- Shape the egress traffic onto the outgoing lines.
- Schedule the egress traffic using the appropriate dequeueing method.

The switch fabric interface unit is a queueing management unit that resides on the line cards. Even though it is located on the line card, it is considered part of the switch fabric. It buffers switch cells and sends requests for connections to the other ports. Once a request is granted by the switch fabric scheduling element, it schedules a cell from one of the class based queues for that egress port. The fabric interface unit also manages the flow control messaging between the line card and the switch card(s). Note that switch fabrics are usually designed based on a lossless model, whereas the traffic manager is entitled to discard packets under congestion, for example using WRED profiles.

Traffic managers are usually capable of supporting a large number of flows using a hierarchical scheduling model [3]. Consider the three layer hierarchical scheduler shown in figure 1.3. In practice there can be more than three layers of scheduling. The third layer scheduler invokes one of the second layer schedulers. The selected second layer scheduler invokes one of the first layer schedulers, which in turn selects a flow. In a desirable design with k layers of scheduling in the traffic manager, the number of queues supported in the switch fabric interface unit should match the number of schedulers in layer k−1 of the traffic manager. In this way, we do not lose queue granularity in the switch fabric interface, and the fabric interface can potentially function as an extension of the traffic manager scheduler.

Figure 1.3: A Three Layer Hierarchical Scheduler.

The performance of the traffic manager scheduler is highly dependent on the switch fabric interface behavior. Switch fabrics usually have limited space for buffering cells, and rely on a flow control mechanism between the fabric interface and the traffic manager to control the data flow and avoid data loss. In essence, the switch fabric can block the traffic manager scheduler and impact its performance. The scheduling techniques that are generally employed in traffic managers, such as weighted round robin (WRR) and weighted fair queueing (WFQ), can provide service guarantees to the flows [10], [11], [32], [33], [18].
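To make the many-to-one scheduling idea concrete, the following is a minimal sketch of a weighted round robin scheduler over a set of class based queues. The class structure, weights, and cell naming are illustrative assumptions for this sketch, not the dissertation's design.

```python
from collections import deque

class WeightedRoundRobin:
    """Minimal WRR: queue i may send up to weight[i] cells per round;
    empty queues are simply skipped."""

    def __init__(self, weights):
        self.weights = weights
        self.queues = [deque() for _ in weights]

    def enqueue(self, qid, cell):
        self.queues[qid].append(cell)

    def serve_round(self):
        """Serve one full round and return the cells in service order."""
        served = []
        for qid, w in enumerate(self.weights):
            for _ in range(w):
                if self.queues[qid]:
                    served.append(self.queues[qid].popleft())
        return served

# Example: three flows with weights 3, 2, 1 share one egress link.
wrr = WeightedRoundRobin([3, 2, 1])
for i in range(4):
    for q in range(3):
        wrr.enqueue(q, f"cell-{q}-{i}")
print(wrr.serve_round())  # flow 0 gets 3 slots, flow 1 gets 2, flow 2 gets 1
```

A WFQ-style scheduler would replace the fixed per-round quanta with virtual finishing times, but the proportional-share intent is the same.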
In order to maintain a manageable and non-disruptive transfer of data through the switch fabric, it is necessary to be able to control and manage the serving rate of the queues in the switch fabric too.

Switch Card

The interface between the line and switch cards is established through high speed serial links (HSSLs) that pass through the backplane. The reliable data rate over the backplane is limited to 3.125 Gbps with today's technology, even though a number of companies and research labs are working on higher speed prototypes. Therefore, to achieve the required rates, each line card uses multiple high speed serial links in parallel. The problem becomes even more complicated when we realize that the serial link overhead reduces the effective data rate even further. As an example, consider an HSSL with a raw data rate of 3.125 Gbps. With 8b/10b line coding the effective data rate reduces to 2.5 Gbps. Therefore, for a line rate of 10 Gbps and a speedup factor of 2, we need to run 8 parallel active and 8 redundant HSSLs from each line card. In a system with 16 line cards this adds up to 256 bidirectional high speed serial links. Each pair of HSSLs of a line card is connected to a separate switching element in the switch card. For instance, in the previous example, there will be 16 separate switching elements working in parallel. The number of HSSLs and switching elements is directly related to the complexity of the backplane design, the number of switch cards, the power dissipation, and ultimately the cost of the system. Therefore, it should be clear that efficient scheduling algorithms that do not rely on extensive speedup factors are highly desirable.

The scheduling unit works as the arbiter between the line cards. While the traffic manager schedulers perform a many-to-one decision between flows residing on a single line card, the scheduling unit of the switch fabric performs a many-to-many arbitration between the line cards. Generally, the switch interface unit sends a series of connection requests to the switch scheduling unit based on the occupancy of its local queues. The scheduling unit arbitrates between the requests, and at every time accepts a set of non-conflicting requests. Accept messages are sent back to the switch interface units, which then send the accepted cells to the switching unit.

Effective Speedup

In the previous section, we emphasized that speedup is costly and cannot be increased arbitrarily. However, there is a common misconception in the industry that scheduling inefficiencies can be compensated for with moderate speedup factors around 2. The problem is in the definition of speedup and the way it is calculated. We defined speedup based on the switch interface unit interfaces in figure 1.1. Let r_SF be the cell rate over the line between the switch interface unit and the switch card, and r_TM the cell rate over the line between the switch interface unit and the traffic manager. The speedup factor s is

s = r_SF / r_TM.   (1.1)

The discrepancy in definition occurs when, for some switch fabrics, the incoming rate of data to the line card is used instead of r_TM in speedup calculations.

Figure 1.4: Sources of Cell Overhead in a Switching System.
For instance, for a 10 Gbps line rate, r_TM can be higher than 15 Gbps due to the system overhead. There are several factors contributing to the system overhead, illustrated in figure 1.4. Cell tax is due to the segmentation of variable size packets into fixed size cells. This process usually ends with a cell that carries the last part of the packet and is padded with zeros to fill the cell size. The traffic manager and the switch fabric headers are two other main sources of overhead. For instance, consider a 40 byte packet that has to be carried over a switch with a 72 byte cell size, with an 8 byte header for the fabric and a 12 byte header for the traffic manager. In this system a 10 Gbps line rate results in an r_TM of 18 Gbps, and with r_SF = 20 Gbps the effective speedup factor is 1.1 rather than 2.
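The overhead arithmetic above is easy to mechanize. Below is a small sketch that reproduces the example's numbers; the function name and argument layout are our own illustration, not part of the dissertation.

```python
import math

def effective_speedup(line_rate_gbps, pkt_bytes, cell_bytes,
                      fabric_hdr, tm_hdr, r_sf_gbps):
    """Effective speedup s = r_SF / r_TM, where r_TM reflects the
    cell tax and header overhead of carrying packets as fixed cells."""
    payload_per_cell = cell_bytes - fabric_hdr - tm_hdr
    cells_per_pkt = math.ceil(pkt_bytes / payload_per_cell)
    # Each pkt_bytes of line traffic becomes cells_per_pkt * cell_bytes
    # of fabric-bound traffic, inflating the cell rate r_TM.
    r_tm = line_rate_gbps * cells_per_pkt * cell_bytes / pkt_bytes
    return r_sf_gbps / r_tm

# 40-byte packets, 72-byte cells, 8B fabric + 12B TM headers, 10 Gbps
# line rate, r_SF = 20 Gbps: r_TM = 18 Gbps, so s is about 1.11, not 2.
print(effective_speedup(10, 40, 72, 8, 12, 20))
```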
Our discussion of the system overview and the role of the switch fabric in the system provides enough background to highlight the major requirements of a switch fabric:

- Number of Queues: The number of queues in the switch fabric interface unit should match the hierarchical scheduler structure of the traffic manager unit.
- Queue Management: Similar to traffic managers, switch fabrics should provide service guarantees to each one of the queues that they recognize.
- Efficient Scheduling: The scheduling algorithm should be effective enough that it does not rely on speedup to achieve high throughput.

1.1.2 Switch Fabric Architectures

In the previous section we gave an overview of a switching system's architecture and components. The discussion ended with a short list of major performance issues and requirements for next generation switch fabrics. In this section, we build on that discussion by reviewing the main switch fabric architectures, their challenges and their advantages.

Output-Buffered Switch Fabric

In this architecture all cells are buffered on the egress side of the switch fabric, and there is a fixed delay for all cells through the fabric. This architecture requires the interconnection through the fabric to work N times faster than the cell rate on the line card, i.e., r_SF = N r_TM, so that all ingress ports can forward one cell to their targeted egress ports in every cell time. Moreover, the buffers in the egress ports should support N writes and 1 read in every cell time.

There is no contention in this architecture, since every ingress port has its dedicated reserved time for forwarding cells. Buffering occurs only at the output side of the switch fabric, and conventional many-to-one scheduling algorithms are sufficient to provide service guarantees. However, this architecture has a serious scalability problem. As the number of ports and the line rate increase, it is simply not possible to meet the interconnection bandwidth and memory speed requirements of this architecture.

Shared Memory Switch Fabric

In this architecture the switching element is a shared memory. Consider the 4×4 switching element in figure 1.5. The ingress interface units forward the cells to the switching unit. The switching unit is in essence a shared memory composed of multiple queues per egress port. In figure 1.5 there are 4 queues, supporting 4 classes of service, for each egress port. The scheduling unit consists of 4 many-to-one schedulers. Each scheduler is designated to one egress port and performs scheduling between the multiple queues of that egress port.

Figure 1.5: Schematic Architecture for a 4×4 Shared Memory Switching and Scheduling Elements.

Scalability is still an issue with this structure. This architecture has resolved the N times overspeed requirement of the output-buffered switch fabric by using a shared memory in the middle. However, it requires even higher bandwidth memories, since at every cell time we need to perform N writes and N reads on the shared memory.

The centralized buffering structure has had other problematic consequences for switch fabrics that are based on this architecture. In general, in order to control and provide rate provisioning, there should be dedicated queues for every input, output and class of service. However, it is not simple to implement and schedule the required number of queues in a centralized architecture such as a shared memory. For instance, in a 32×32 switch fabric with 8 classes of service this requires 8192 distinct queues. A common solution to this problem is aggregation of queues. For instance, cells that come from different ingress ports and have the same priority are buffered in the same queue, and it then becomes impossible to provide individual service guarantees to them. As an example, suppose that we want to provide 0.1 of egress port 1's capacity to input 1 and 0.2 to input 2. Since cells coming from input 1 and input 2 share the same queue in the switch, it is impossible to guarantee the rates individually. This aggregation of traffic manager queues in the switch fabric makes it impossible for the switch fabric to provide service guarantees to each of the traffic manager queues.

Another practical problem with this architecture is the flow control mechanism that must be used between the shared memory element in the switch card and the switch interface units in the ingress line cards (Fig. 1.1). Due to the finite size of the shared memory element that is used for buffering the cells, and to avoid cell drops in the shared memory, sophisticated flow control signalling must be used between the shared memory and the switch interface units. Basically, at every cell time, the shared memory element should inform the line card whenever one of its queues is full and cannot accept any more cells. This incurs considerable flow control overhead as well as performance degradation. The problem gets more serious when we realize that the flow control mechanism should be conservative and take the flow control latency into account, since it will take several cell times for a flow control message to reach the ingress line card from the switching elements.

Buffered Crossbar Switch Fabric

The buffered crossbar architecture has recently received a lot of attention, and is considered to resolve some of the inherent problems of the shared memory architecture [7]. In this architecture, the switching element consists of separate buffer units per input, output and class. Consider the 4×4 buffered crossbar switch fabric in figure 1.6. There are 2 classes of service supported in this switch fabric. A many-to-one scheduler is used for each egress port to perform the scheduling between queues that are targeting that egress port.
At every cell time there is at most one read and one write to each memory section; therefore the memory speed requirements are resolved by using dedicated memory instead of shared memory for the queues. However, the number of queues, as well as the size of the queues, are limiting factors in this architecture. Implementation constraints on the number of queues restrict the number of classes of service in buffered crossbars. This results in aggregation of classes of service in the traffic manager, and because of that the switch fabric cannot provide service guarantees with the required granularity.

Figure 1.6: Schematic Architecture for a 4×4 Buffered Crossbar Switching and Scheduling Elements.

Input-Buffered Switch Fabric

Consider the 4×4 input-buffered switch fabric in figure 1.7. Cells are buffered in the switch interface units, and the switching unit is a simple buffer-less crossbar. Cells are queued in separate queues called virtual output queues (VOQs). Each VOQ is associated with one output, and cells are buffered based on their destination and class of service.

Figure 1.7: Schematic Architecture for a 4×4 Input Buffered Switch Fabric.

The introduction of VOQs resolves the HOL blocking problem in input buffered switches [16]. HOL blocking occurs if a single FIFO queue is used for all cells on the ingress side. If the cell at the head of the queue cannot be forwarded, due to contention with cells from other ingress units, the other cells in the queue are also blocked, even though they may have reachable destinations that are idle. In fact, without VOQs, for Bernoulli i.i.d. arrivals with uniform destination distribution the maximum achievable throughput is 58.6% [26].

In this architecture, buffering is distributed between the ingress interface units, and there is no concentrated buffering in the switching element. Therefore, the scalability problems related to memory speed, queue size, and number of queues are not an issue in this architecture. When contention happens, the crossbar scheduler determines which cells will be forwarded and which ones will stay in their ingress buffers. It is clear that the performance of this switch fabric is determined by the crossbar scheduler. In essence, crossbar scheduler design is the main challenge of this architecture.

In summary, due to its distributed queue structure, the input buffered switch fabric is the appropriate, scalable architecture for next generation switch fabrics. However, the challenge is to come up with a low complexity scheduling algorithm that can provide service guarantees to the queues and provide high throughput at the same time. The design and analysis of such schedulers is the problem that we have tackled in this thesis.

1.2 Problem Statement

We consider input queued switches that serve fixed size packets. Each input and output has the capacity of serving 1 cell per unit time. At most s packets per unit time can be transferred from the input ports to a given output port, where s is the speedup factor of the switch.
To avoid head of line (HOL) blocking, we consider that the buffer at input i is partitioned into N queues, Queue(i,1), ..., Queue(i,N). The scheduling policy is basically a matching algorithm that, based on the state of the switch, selects a matching between the inputs and outputs in every time slot. If input i is matched to output j, and Queue(i,j) is not empty, a cell is sent from input i to output j. A matching can be represented by a permutation matrix π. Input ports are represented by the rows and output ports by the columns of this matrix, and input i is matched to output j if and only if π_ij = 1.

Figure 1.8: Schematic Architecture of an Input-Buffered Switch Fabric.

We assume that cells arrive at the switch at the beginning of a time slot. A cell that has arrived at the beginning of time slot n can be scheduled at the same time slot and depart the switch during the same time slot n. Let A_ij(n) be the number of packets that have arrived at input i and are destined for output j up to time n. We assume that there are no arrivals before time 0, i.e., A_ij(0) = 0. The arrival processes {A_ij(·), i, j = 1, ..., N} satisfy the strong law of large numbers; that is, with probability one,

lim_{n→∞} A_ij(n)/n = λ_ij,  i, j = 1, ..., N.   (1.2)

The arrival process is considered admissible if no input or output is oversubscribed, i.e.,

Σ_{i=1}^{N} λ_ij ≤ 1, j = 1, ..., N;  Σ_{j=1}^{N} λ_ij ≤ 1, i = 1, ..., N.   (1.3)

In this thesis, we are looking for different matching algorithms that can be used for scheduling (arbitration) in input-buffered switch fabrics. There are a number of desirable features for the scheduling algorithms that we attempt to achieve:

- Stability: For a given admissible traffic pattern, we define a matching algorithm to be stable (rate stable) if, with probability one,

  lim_{n→∞} D_ij(n)/n = λ_ij,  i, j = 1, ..., N,   (1.4)

  where D_ij(n) is the number of cells departed from input i to output j up to time n. (A small sketch of the admissibility and rate stability checks appears at the end of this section.)

- Delay Efficient: The average latency of cells in a switch fabric is a key performance metric. A scheduling algorithm is delay efficient if it has good delay performance compared to benchmark matching algorithms. Note that we consider delay efficiency as a comparative metric rather than an absolute metric. The benchmark algorithms will be introduced later.

- No Starvation: A queue is starved if, for a given traffic pattern, it remains unserved and non-empty indefinitely. Note that starvation focuses on one queue, and perhaps the latency of one specific cell in that queue, while the delay efficiency criterion focuses on the average latency in the switch fabric.

- Complexity: The matching algorithms that are considered should be implementable in special purpose hardware for application in high speed switches. Therefore, the algorithm should be simple enough to be realized in hardware with finite logic.

- Guaranteed Service Delivery: Providing performance (delay, jitter, throughput) guarantees will be a key requirement for next generation switching systems. This cannot be achieved at the system level unless we have granular control over the services of the switch fabric input queues. The scheduling algorithm should be able to effectively control and manage the serving rate of each of the queues, without sacrificing the throughput of the system.
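As a concrete illustration of conditions (1.3) and (1.4), the following is a minimal sketch that checks admissibility of a rate matrix and empirically tests rate stability from departure counts. The tolerance parameter and data layout are assumptions made for the example.

```python
import numpy as np

def is_admissible(lam):
    """Condition (1.3): no input row or output column oversubscribed."""
    lam = np.asarray(lam, dtype=float)
    return bool((lam >= 0).all()
                and lam.sum(axis=1).max() <= 1    # each input i
                and lam.sum(axis=0).max() <= 1)   # each output j

def looks_rate_stable(D_n, n, lam, tol=1e-2):
    """Empirical check of (1.4): D_ij(n)/n should approach lambda_ij
    as n grows, where D_n holds cumulative departures up to slot n."""
    return bool(np.max(np.abs(np.asarray(D_n) / n - lam)) < tol)

lam = np.array([[0.5, 0.4],
                [0.4, 0.5]])
print(is_admissible(lam))  # True: all row and column sums are 0.9
```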
1.3 Related Work

In section 1.1, we briefly reviewed and studied alternative switch fabric architectures. Based on the requirements of next generation switch fabrics, we reached the conclusion that the crossbar based input-buffered switch fabric is the appropriate switch fabric architecture. We emphasized that the crossbar scheduler determines the performance of input-buffered switch fabrics, and that it is one of their main design challenges. In this section, we review some of the major techniques and analytical results for scheduling in input-buffered switches.

1.3.1 Maximum Size Matching

In section 1.2, we explained that the crossbar scheduling problem is equivalent to the matching problem in a bipartite graph, where the ingress ports are the first group of vertices (nodes) and the egress ports are the second group. There is an edge between two vertices if and only if the corresponding ingress-egress pair is backlogged, i.e., there is a cell in the ingress port destined to the egress port. The scheduling problem is to select a set of connections between backlogged ingress and egress ports such that no port is involved in more than one connection. Equivalently, the problem is to select a matching, i.e., to find a set of the bipartite graph's edges such that no vertex is connected to more than one edge in the set. Many of the scheduling algorithms proposed for input-buffered switches originate in graph theory, where they were studied and analyzed as matching algorithms for bipartite graphs.

The maximum size (cardinality) matching for bipartite graphs is one of the most famous problems in graph theory. Given a bipartite graph G, the maximum size matching of G is a matching set of G's edges with maximum cardinality. In other words, the maximum size matching has the maximum possible number of edges among all possible matchings. Several algorithms have been proposed for the maximum size matching; the most efficient one has O(N^2.5) complexity [23]. Even though the maximum size matching schedules the maximum possible number of cells at every scheduling time, it is not rate stable [29]. Note that for a given bipartite graph the maximum size matching is not necessarily unique. We will later prove that a specifically characterized maximum size matching is rate stable.

1.3.2 Maximum Weighted Matching

Consider a bipartite graph G where every edge of the graph has a weight value. The maximum weighted matching is a matching of G with maximum weight, where the weight of a matching is the sum of its edge weights. In the scheduling problem, the weight of a link can be the number of backlogged cells for that connection, or the delay of the first cell in the associated VOQ. For both of these weight functions it is proved that the maximum weighted matching algorithm is rate stable. The stability proofs were originally for Bernoulli i.i.d. arrivals [37], [29], but were later extended to all feasible arrival patterns [13].

In the previous section, we explained that even though the maximum size matching scheduler transfers the maximum possible number of cells in every time slot, it is not rate stable. To achieve 100% throughput the scheduling algorithm needs to take other factors into account. In a rate stable maximum weighted matching algorithm the required information is carried in the weights of the graph. The maximum weighted matching algorithm is also known as the assignment problem, and can be described by a concise linear program. The corresponding linear program, due to its specific structure, can be solved using the Hungarian method [27]. The complexity of the Hungarian method for solving the weighted matching problem on a complete bipartite graph is O(N^3).
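To illustrate MWM scheduling concretely, the sketch below solves the assignment problem with SciPy's Hungarian-style solver, using VOQ occupancies as edge weights. The queue matrix is made-up example data.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Q[i][j] = number of backlogged cells from input i to output j (example).
Q = np.array([[3, 0, 5],
              [2, 4, 0],
              [0, 1, 6]])

# Maximum weighted matching: pick the permutation maximizing total weight.
rows, cols = linear_sum_assignment(Q, maximize=True)
print(list(zip(rows, cols)), "total weight:", Q[rows, cols].sum())
# A maximum SIZE matching could instead pick (0,2),(1,1),(2,0) with
# weight 5+4+0 = 9; the weighted solver returns (0,0),(1,1),(2,2)
# with weight 13, serving the most backlogged links.
```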
In chapter 3, we will introduce maximum node containing matching (MNCM), a class of weighted scheduling algorithms that are not necessarily maximum weighted matchings but are rate stable. We also introduce a specific matching algorithm, maximum first matching (MFM), that is rate stable and has complexity O(N^2.5). To the best of our knowledge, MFM is the lowest complexity rate stable deterministic scheduling algorithm introduced for input-buffered switches.

1.3.3 Parallel Matching Algorithms

Most of the solutions proposed for the maximum size and maximum weighted matching problems are sequential. Sequential algorithms inherently run in multiple consecutive steps. The number of steps is usually proportional to the number of ports (N) or the number of edges of the graph (N^2). Therefore, these algorithms have serious scalability problems as the number of ports increases. Parallel matching algorithms, on the other hand, have a distributed structure and can be divided into multiple simultaneous tasks.

The most common parallel matching algorithms for scheduling can be implemented on 2N blocks (arbiters) working in parallel. Each arbiter is associated with one of the ingress or egress ports of the system, and makes one selection in every iteration of the algorithm. Each iteration of the algorithm consists of three steps (a simplified sketch of one iteration appears at the end of this subsection):

1. Request: Each unmatched input sends a request to every output for which it has a backlogged cell.
2. Grant: Each unmatched output arbiter selects one of the requests sent to it, and sends a grant to the corresponding input arbiter.
3. Accept: Each input arbiter selects one of the grants sent to it.

At the end of each iteration, a set of new connections is added to the matching. One of the main issues with parallel matching algorithms is arbiter synchronization. Synchronization happens when multiple output (grant) arbiters select the same input. In that case, only one of them will get the accept signal and the rest will remain unmatched.

Parallel Iterative Matching (PIM) [1] uses a random selection mechanism to avoid synchronization and improve performance. PIM's performance is not predictable because of its random nature. Furthermore, it can be unfair between contending connections [28]. The most common parallel matching algorithm is perhaps iSLIP [28], where the arbiters are modified round robin schedulers. The modifications were made to avoid synchronization, and hence to improve the performance of the scheduler. Many current implementations of input-buffered switch fabrics use iSLIP or its modifications.

The iSLIP scheduling algorithm has major unresolved problems. The performance and throughput of iSLIP are very sensitive to the traffic destination distribution. For non-uniform traffic distributions, iSLIP arbiters suffer from the synchronization problem, which negatively affects the switch fabric throughput. Moreover, since iSLIP uses simple round robin arbiters, it lacks a proper mechanism for QoS provisioning and for control of the performance.
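The following is a minimal sketch of one request/grant/accept iteration with round-robin arbiters, in the spirit of iSLIP. The pointer-update rule and data layout here are simplified assumptions for illustration, not a faithful reproduction of the published algorithm.

```python
def rga_iteration(backlog, grant_ptr, accept_ptr):
    """One request/grant/accept iteration over an NxN switch.
    backlog[i][j] is True if input i has a cell for output j;
    grant_ptr[j] / accept_ptr[i] are round-robin pointers."""
    N = len(backlog)
    # Request: input i requests every output j with backlog[i][j].
    requests = {j: [i for i in range(N) if backlog[i][j]] for j in range(N)}
    # Grant: each output grants the first requester at or after its pointer.
    grants = {}  # input -> list of granting outputs
    for j, reqs in requests.items():
        if reqs:
            g = min(reqs, key=lambda i: (i - grant_ptr[j]) % N)
            grants.setdefault(g, []).append(j)
    # Accept: each input accepts the first grant at or after its pointer.
    matching = []
    for i, gs in grants.items():
        a = min(gs, key=lambda j: (j - accept_ptr[i]) % N)
        matching.append((i, a))
        # Simplified pointer update (iSLIP conditions this on accepted
        # first-iteration grants): advance just past the matched ports.
        accept_ptr[i] = (a + 1) % N
        grant_ptr[a] = (i + 1) % N
    return matching

B = [[True, False], [True, True]]
print(rga_iteration(B, grant_ptr=[0, 0], accept_ptr=[0, 0]))  # [(0, 0), (1, 1)]
```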
1.3.4 Randomized Scheduling Algorithms

Scalability is one of the main concerns about input-buffered switch fabrics. In the scheduling algorithms that we have discussed so far, the objective is to find an appropriate matching for a bipartite graph. Some of these algorithms have good performance but are too complex for real time implementation in high speed switches. For instance, the MWM algorithm is proved to achieve 100% throughput, but its complexity is O(N^3). On the other hand, those matching algorithms that are less complex and appropriate for high speed switches lack performance. Parallel matching algorithms fall into the second category.

The basic idea behind randomized scheduling algorithms is, instead of finding a good matching, to select the best matching among an appropriate set of predetermined matchings. The best matching is the matching with the maximum weight. Note that the complexity of calculating the weight of a matching is O(N), and the complexity of finding the maximum weight among K candidates is O(K). Therefore, for a constant K, this algorithm has linear complexity in the switch size N. The key issue here is to have an appropriate set of candidate matchings. The first algorithm that comes to mind is:

Algorithm 1:
- At every scheduling time, pick matching R uniformly at random from all possible N! matchings.
- Use R as the schedule.

It can be proved that this algorithm does not deliver 100% throughput [19]. Tassiulas introduced the first stable (100% throughput) randomized scheme with linear complexity [36], as follows:

Algorithm 2 (Tassiulas):
- Let S(t) be the schedule (matching) used at time t.
- At time t+1, pick matching R(t+1) uniformly at random from all possible N! matchings.
- Let the schedule at time t+1, S(t+1), be the heavier of S(t) and R(t+1).

Tassiulas proved that for Bernoulli i.i.d. arrivals this algorithm delivers 100% throughput [36]. In fact, he showed that if at step 2 of the algorithm the probability of selecting the MWM is at least some δ > 0, then the matching is stable.

The candidate matching set in this scheme contains two matchings: one randomly selected matching and the matching that was used in the previous time slot. Intuitively, since the state of the switch (the number of backlogged cells) changes only slightly in every time slot, a good matching in the previous slot will remain a good matching for the next time slot. Although this randomized algorithm is stable, it does not have good delay performance. Paper [19] proposed several modifications to the basic randomized algorithm to improve its delay performance. One of the contributions of [19] is the introduction of self-randomized algorithms. The idea behind self-randomized scheduling algorithms is, instead of selecting a purely random matching, to use the arrival cell pattern as the source of randomization. SERENA is the self-randomized algorithm introduced in [19]:

Algorithm 3 (SERENA):
- Let S(t−1) be the schedule (matching) used at time t−1.
- Let A(t) = [a_ij(t)]_{N×N} denote the arrival graph, where a_ij(t) = 1 indicates an arrival and a_ij(t) = 0 indicates no arrival.
- Let the schedule at time t, S(t), be the matching that results from merging A(t) and S(t−1).

The merging procedure is discussed in chapter 4, where we propose several modifications to the basic self-randomized algorithm. In [19], it is proved that SERENA delivers 100% throughput for Bernoulli arrivals.
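To make Algorithm 2 above concrete, here is a minimal simulation sketch in which a matching is a permutation and its weight is the total backlog on its matched links. The weight function and random-permutation source follow the description above; the code organization is our own illustration.

```python
import random

def weight(matching, Q):
    """Weight of a matching = total backlog on its matched links;
    matching[i] is the output matched to input i."""
    return sum(Q[i][j] for i, j in enumerate(matching))

def tassiulas_step(S_prev, Q):
    """Algorithm 2: keep the heavier of the previous schedule and a
    uniformly random permutation (the candidate matching R)."""
    R = list(range(len(Q)))
    random.shuffle(R)
    return R if weight(R, Q) > weight(S_prev, Q) else S_prev

# Example: 3x3 switch, Q[i][j] = backlog from input i to output j.
Q = [[3, 0, 5], [2, 4, 0], [0, 1, 6]]
S = [0, 1, 2]
S = tassiulas_step(S, Q)
print(S, weight(S, Q))
```

Because the current schedule is never replaced by a lighter one, the schedule weight can only improve between arrivals, which is the intuition behind the stability result.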
1.3.5 Fluid Model Techniques

Most of the classical analytical results on the stability of input-buffered scheduling algorithms are based on stability theorems for Markov processes. Using these theorems, in [37], [29], [31] it is shown that input buffered switches with a maximum weighted matching scheduler can achieve 100% throughput. However, these results hold when the arrival process is independent identically distributed (i.i.d.), so that the system evolution can be modelled as a Markov process. It had been believed that the stability results could be extended to arbitrary arrival processes as long as no ingress or egress port is overloaded. Dai and Prabhakar in [13] used fluid model techniques to prove this for the first time. In fact, they proved that if the number of backlogged cells in each VOQ is used as the weight of the corresponding link, a maximum weighted matching scheduler can achieve 100% throughput for any arrival process that satisfies the following conditions:

1. It obeys the strong law of large numbers.
2. It does not overload any ingress or egress port.

The results of [13] are derived by considering the fluid model analogs of an input-buffered switch. The framework of fluid models has proved to be powerful in obtaining the maximum throughput region of input-buffered switches under very mild conditions on the input traffic. A general review of the stability analysis of stochastic networks using fluid models is given in [12]. Basically, in this framework, in order to prove that a switch delivers 100% throughput it is sufficient to prove that the corresponding fluid model is weakly stable. We will elaborate more on this in chapters 3 and 4. In chapter 3, we use the fluid model technique to prove that every matching algorithm in the MNCM class achieves 100% throughput. In chapter 4, we use the fluid model technique to prove that self-randomized scheduling algorithms are rate stable.

1.4 Contributions of Dissertation

1.4.1 Packetized Tracking Policies

Fluid and packetized policies are two broad classes of scheduling policies that are studied and considered in switching systems. In a fluid policy it is assumed that the link capacity can be divided between the queues, and each one of them can use a fraction of the capacity at any time. In contrast, in a packetized policy, only one of the queues can be served at any time. A fluid policy determines how much of the link capacity should be given to each one of the flows at any time, while a packetized policy determines which one of the queues should be the sole user of the link at any time.

One of the main issues in the design of integrated services networks is to provide performance guarantees to a broad range of applications. Application requirements are translated into quantitative network parameters. The most common performance measures are packet loss probability, throughput, delay, and jitter. The scheduling algorithms used in a switch have a direct impact on its throughput, delay and jitter characteristics. On the other hand, the network should also be capable of analyzing the amount of resources that each particular flow requires. It is therefore very important for the network designer to understand the effects a scheduling policy has on connection performance and on the usage of network resources.

In many cases, it is easier to perform the analysis and design of scheduling policies under the modelling assumption that the traffic arrives and is treated as a fluid, i.e., the realistic case where information is organized into packets is not taken into account [10], [11], [32], [33], [17]. Under the fluid policy, we assume that at every time instant arbitrary fractions of the link capacity can be shared among different applications.
Although in most practical situations this is an idealistic assumption, it enables us to analyze the effect of a scheduling policy on the network resources as well as the major performance parameters, and therefore to design scheduling policies more conveniently. One approach to the design of packetized policies is to first find an appropriate fluid policy, and then to derive a packetized policy that resembles or "tracks" the fluid policy in a certain sense. The existence of packetized tracking policies is a well established fact in the single link case. In fact, several tracking policies have been suggested and their performance and efficiency analyzed [32], [33], [17], [14], [21]. The existence of such tracking policies in input queueing switches is studied in chapter 2.

In our model, we consider an input queueing switch where every input and output port can serve 1 cell per time slot. In a fluid policy model, at every time slot every input (output) port can be connected to several output (input) ports; however, the total service rate of any port should not exceed its capacity. Under a packetized policy, every input (output) port can be connected to at most one output (input) port at every time slot, i.e., there is no speedup in the switch fabric. Under these circumstances, our objective is to find a packetized policy that tracks a given fluid policy.

A packetized policy tracks a given fluid policy perfectly if, under the packetized policy, every cell departs the system at the latest at the end of the time slot in which it departs under the fluid policy. For the special case of 2×2 switches, we prove that tracking policies always exist and we provide a non-anticipative tracking policy in chapter 2. We should clarify that a scheduling policy is non-anticipative if its decision at any time does not depend on future arrivals. For the general case, we use a counterexample for a 3×3 switch fabric to prove that a perfect tracking policy does not exist. However, a heuristic algorithm with good, but not perfect, tracking properties is proposed.

The heuristic algorithm is based on a weighted matching algorithm. Weighted matching algorithms are usually too complex to implement in hardware. Here, we employ two notions to reduce the complexity. The first concept is to do port based weighted matching rather than link weighted matching. Mekkittikul and McKeown [31] used a similar concept. They used the queue lengths as the weights in their work, and illustrated that they can achieve 100% throughput. Our weights reflect the amount of work by which the fluid policy is ahead of the tracking policy at each port, and our main objective is to be able to provide rate guarantees and high throughput at the same time.

The second notion that aids us in improving the performance and reducing the complexity is the concept of critical ports and links. Basically, criticality relates to the urgency with which a port or link needs to be scheduled in order to ensure proper tracking. We detect all critical links and remove all non-critical links that contend with the critical ones. In this way, the number of contending links, and consequently the complexity of the algorithm, is reduced.

1.4.2 The Maximum Node Contained Matchings

In chapter three, we use fluid model techniques to establish some new results for the throughput of input-buffered switches. In particular, we introduce a new class of deterministic maximal size matching algorithms that achieve 100% throughput.
Dai and Prabhakar [13] have shown that any maximal size matching algorithm with a speedup of 2 achieves 100% throughput. We introduce a class of maximal size matching algorithms that we call maximum node containing matching (MNCM) algorithms, and prove that they achieve 100% throughput with no speedup. We also introduce a new weighted matching algorithm, maximum first matching (MFM), with complexity O(N^2.5), that belongs to MNCM. MFM is, to the best of our knowledge, the lowest complexity deterministic algorithm that delivers 100% throughput. The only assumption on the input traffic is that it satisfies the strong law of large numbers. Besides throughput, average delay is the other key performance metric for input-buffered schedulers. We use simulation results to compare and study the delay performance of MFM.

The stability and throughput of input-buffered switches is a well studied problem. In [37], [29] it is proved that the maximum weighted matching (MWM) algorithm can achieve 100% throughput. In [29] the number of backlogged packets and the maximum delay of waiting packets in each VOQ are considered as two potential weight functions. In another work [31], Mekkittikul and McKeown considered the case where weights are associated with the ports rather than the links, and showed that the proposed algorithm, longest port first (LPF), achieves 100% throughput. The complexity of LPF is also O(N^3), even though for practical purposes it seems to be more favorable than MWM [30]. Stability of these algorithms is proven under the assumption of i.i.d. arrivals.

We extend some of the results of [13] by introducing maximum node containing matching (MNCM) algorithms. MNCM is a new class of maximal size matching scheduling algorithms that achieve 100% throughput with no speedup. The norm commonly used for stability analysis is the 2-norm (‖·‖_2). Here, we use the infinity norm (‖·‖_∞) instead, and focus on the matching algorithms that operate based on it. We prove that these matching schemes achieve 100% throughput too. We also introduce the maximum first matching (MFM) algorithm, which is in the MNCM class and therefore has 100% throughput. The MFM algorithm employs maximum size matching algorithms with O(N^2.5) complexity rather than maximum weighted matching algorithms with O(N^3) complexity.

We also introduce another maximal deterministic matching algorithm with O(N^2) complexity, the maximal sorted matching (MSM). The MSM algorithm is a trivial generalization of the iLPF algorithm introduced in [31]. MSM is not in the MNCM class, but for practical applications we think that it performs similarly to MFM. Even though we were not able to prove that MSM has 100% throughput, due to its similarity to MFM we think that for all practical purposes it achieves 100% throughput. This conjecture is in accordance with our simulation results, where the delay performance of MSM matches the performance of MFM. Since we have used the fluid model technique, the stability results hold under very general conditions on the arrival process. The only assumption on the input traffic is that it satisfies the strong law of large numbers. Recall that the results of [31] are for i.i.d. arrivals. Therefore, our results are the first stability results for port weighted matching algorithms under general arrival patterns.

For rate provisioning applications, we introduce the Maximum Size Unit Interval Matching (MSUIM) algorithm, which can provide rate guarantees and is not in MNCM. It is true that any scheduling algorithm in MNCM can also provide rate guarantees, since they are rate stable. However, MSUIM is easier to implement and we were able to bound its performance. In fact, it is proved that MSUIM can be
It is true that any scheduling algorithm in MNCM can also provide rate guarantees since they are rate stable. However, MSUIM is easier to implement and we were able to bound its performance. In fact, it is proved that MSUIM can be 31 used to track any feasible uid policy and it lags the uid policy by no more than N cells on every node. 1.4.3 Self-randomized Scheduling Algorithms and Rate Pro- visioning In chapter four, we consider self-randomized scheduling policies for input bufiered switches. Randomized scheduling policies are considered for scheduling in input- bufiered switches. A scheduling policy is said to be self-randomized if the ran- domized component (process) of the randomized scheduling policy is replaced by a pseudo-random process that is a function of the cell arrival process. We use uid model techniques to show that the self-randomized scheduling algorithms deliver 100% throughput. The only assumption on the arrival pattern is that it satisfles strong law of large numbers, and no input or output port is oversubscribed. We provide a general architecture for the design of self-randomized algorithms, and introduce two algorithms that consider number of backlogged cells as the weight function and three algorithms We then introduce concept of the max-min fair self- randomized scheduling algorithms. The idea here is to introduce self-randomized algorithms that can provide rate guarantees by sharing the switch bandwidth pro- portional to the assigned weights. We introduce three self-randomized scheduling algorithms, and prove that they are max-min fair. In order to study and compare the performance of the proposed scheduling algorithms, several simulations are carried out and their results are provided and discussed. In section 1.3.4 we mentioned that paper [19] introduced notion of self-randomized algorithms to enhance performance of the basic randomized algorithm [36]. The idea behind self-randomized scheduling algorithms is instead of selecting a pure 32 random matching, using the arrival cell process to generate and derive a pseudo- random process. In chapter four, we focus on the self-randomized scheduling algorithms and introduce some new algorithms that have better delay performance compared to other proposed algorithms. To that end, we introduce a general architecture for the self-randomized schedulers and propose two speciflc scheduling algorithms that use number of backlogged cells as the weight of the links. We use uid model techniques to establish some new results for the through- put of the randomized scheduling algorithms. More speciflcally, we prove that the original randomized algorithm [36], and all of the self-randomized scheduling algorithms that are introduced in chapter 4 delivers 100% throughput. The only assumptions on the input tra?c is that it satisfles SLLN, and is admissible. Rate provisioning is the other performance measure for scheduling algorithms that is overlooked in the context of randomized scheduling algorithms. Rate provi- sioning scheduling algorithms are very common for scheduling in output-bufiered switches, however for the input-bufiered schedulers with no speedup problem is more challenging, and there are few theoretical and practical results [6], [35]. Paper [38] introduces and studies a new token based max-min fair scheduling algorithm. 
Note that this scheduling algorithm was not originally proposed for input-buffered switches, but it is directly applicable to them as well. We can categorize it as a deterministic MWM scheduling algorithm in which the weight of a link (flow) is the number of tokens allocated to it; tokens are distributed among the links according to their max-min rates. We extend the idea of max-min fair scheduling algorithms and introduce the concept of max-min fair self-randomized scheduling algorithms. Three different self-randomized max-min fair algorithms are introduced. For each algorithm, it is proved that it delivers the max-min rate to all of the links.

Chapter 2

Rate Provisioning and Tracking Fluid Policies in Input-buffered Switches

The concept of tracking policies has proven to be a very useful technique in the design and analysis of scheduling algorithms for multiple flows sharing a single link [10], [11], [32], [33], [17]. In fact, Weighted Fair Queueing (WFQ) [32] and its extensions, such as Self-Clocked Fair Queueing (SCFQ) [21] and Worst-case Fair Weighted Fair Queueing (WF2Q) [2], are tracking policies for the Generalized Processor Sharing (GPS) fluid policy. The single link sharing scheduling model is applicable to an output queued switch fabric, where all cells reside in egress line cards and the scheduling problem is many-to-one among flows that share an egress port. However, this model does not extend to input-buffered switches, where the scheduling problem is many-to-many among flows residing in different ingress line cards and targeting arbitrary egress line cards.

We consider the problem of tracking fluid policies by packetized policies and extend it to input queueing switches. It is assumed that the speedup of the switch is one. One of the interesting applications of the tracking policy, in TDMA satellite switches, is elaborated. For the special case of 2x2 switches it is shown that a non-anticipative tracking policy always exists. It is found that, in general, non-anticipative policies do not exist for switches with more than 2 input and output ports. For the general case of NxN switches a heuristic tracking policy is provided. The heuristic algorithm is based on two notions, port tracking and critical links. These notions can be employed in the derivation of other heuristic tracking policies as well. Simulation results show the usefulness of the heuristic algorithm and the two basic concepts it relies on.

One way to design tracking policies is to use Combined Input Output Queueing (CIOQ) switches with limited speedup that match the output sequence of a purely output queueing switch. In fact, it is shown in [8] that a speedup of 2 is sufficient to resemble the output pattern of any output queueing switch. However, the scheduling algorithm proposed to do that is fairly complicated, and the arbiter still needs to receive information from the input ports of the switch with a speedup of N. Here, we consider only switches with no speedup, and look for low complexity algorithms.

In section 2.1, we review the concepts of fluid and tracking policies, and provide the feasibility condition for both cases. The problem of scheduling multi-periodic messages in TDMA-SS is explained and elaborated in section 2.2.
It is indicated that this problem is essentially a special case of the input queueing scheduling problem considered in this chapter. In section 2.3, we show that for 2x2 switches a tracking policy always exists, and we provide a non-anticipative algorithm to find it. In section 2.4, some useful ideas regarding the design of heuristic tracking policies are given. Based on these concepts a heuristic scheduling algorithm is proposed and used to implement a fixed-rate scheduler. The fixed-rate scheduler is simulated and used for scheduling in input-buffered switches. Results of the simulation are provided and the effect of different parameters is discussed.

2.1 Fluid and Packetized Tracking Policies

We consider input queueing switches that serve fixed size cells. Each input and output port has the capacity of serving 1 cell per time unit. Since queues exist only at the input ports, the latter assumption implies that traffic of at most 1 cell per unit of time can be transferred from the input ports to a given output port. We assume that time is slotted and the length of each slot is equal to the length of a cell. Slots are numbered 1, 2, .... Slot $k$ occupies the time interval $(k-1, k]$; time $k-1$ ($k$) is the beginning (end) of time slot $k$. Cells arrive at the beginning of each time slot.

Two broad classes of policies are considered, fluid and packetized policies. During time slot $k$ a fluid policy transmits $w_{ij}(k) \ge 0$ units of information from input port $i$ to output port $j$; $w_{ij}(k)$ is a nonnegative real number measured in units of cells. Since at most one unit of work can be transferred from a given input port to the output ports, and since no queueing is permitted at the output ports, the $w_{ij}(k)$'s must satisfy the following inequalities:

$$w_{ij}(k) \ge 0, \qquad \sum_{j=1}^{N} w_{ij}(k) \le 1 \quad \forall i \in \{1,\dots,N\}, \qquad \sum_{i=1}^{N} w_{ij}(k) \le 1 \quad \forall j \in \{1,\dots,N\}. \quad (2.1)$$

A packetized policy is based on the assumption that during a time slot an input port can transmit a single cell to any one of the output ports. Therefore, for a packetized policy, $S_{ij}(k)$, the number of cells transmitted from input port $i$ to output port $j$ during slot $k$, is either 0 (no cell transmission during slot $k$) or 1 (a single cell transmission during slot $k$). A packetized policy is feasible if at every time slot $k$ we have

$$\sum_{j=1}^{N} S_{ij}(k) \le 1 \quad \forall i \in \{1,\dots,N\}, \qquad \sum_{i=1}^{N} S_{ij}(k) \le 1 \quad \forall j \in \{1,\dots,N\}, \qquad S_{ij}(k) \in \{0, 1\}. \quad (2.2)$$

Note that the conditions in (2.2) imply that for any $k$ there can be at most a single 1 in each column and row of the matrix $[S_{ij}(k)]$; that is, $[S_{ij}(k)]$ is a sub-permutation matrix.

Usually fluid policies cannot be applied directly in a network, since mixing of traffic belonging to different cells is not allowed. However, as mentioned in the introduction, they are considered here because performance analysis and scheduling policy design are often more convenient for fluid policies. An approach to the design of packetized policies is to first design and analyze a fluid policy, and then implement a packetized policy that resembles, in a certain sense, the departure process of the fluid policy. Such a packetized policy is called a tracking policy. More precisely, for our purposes, we use the following definition:

Definition I: Given a fluid policy $\pi_f$, we say that a packetized policy is tracking $\pi_f$ if every cell departs under the packetized policy at the latest by the end of the time slot at which the same cell departs under the fluid policy.
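Conditions (2.1) and (2.2) are mechanical to verify. The sketch below is our own illustration (the function names are not from the text); it checks a fluid rate matrix $w(k)$ and a packetized schedule $S(k)$ against the two feasibility conditions.

```python
import numpy as np

def fluid_feasible(w, tol=1e-9):
    """Condition (2.1): w >= 0 and every row and column sums to at most 1."""
    w = np.asarray(w, dtype=float)
    return bool((w >= -tol).all()
                and (w.sum(axis=1) <= 1 + tol).all()
                and (w.sum(axis=0) <= 1 + tol).all())

def packetized_feasible(S):
    """Condition (2.2): S is a 0-1 sub-permutation matrix."""
    S = np.asarray(S)
    return bool(np.isin(S, (0, 1)).all()
                and (S.sum(axis=1) <= 1).all()
                and (S.sum(axis=0) <= 1).all())
```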
A basic question is whether tracking policies exist for a given fluid policy. This question is answered positively for the single link case, where different sessions share a single link [32], [17]. In that case, perhaps the best known fluid policy is the Generalized Processor Sharing (GPS) policy, and several tracking policies have been suggested for the single link case [32], [17], [21]. The concept of GPS can be appropriately extended to multi-input, multi-output input queueing switches. However, the existence of tracking policies for these switches is still an open question. In section 2.3, we study the special case of 2x2 switches. We will prove that for 2x2 switches, for every feasible fluid policy there exists a feasible packetized tracking policy. In fact, our proof is constructive and provides an algorithm to derive a tracking policy. Before discussing the 2x2 case, in the next section we present the problem of multi-periodic TDMA-SS scheduling and indicate how the problem could be trivially solved by the construction of tracking packetized policies.

2.2 Multi-periodic TDMA Satellite Switches

One of the potential applications of tracking policies is in the scheduling of TDMA Satellite Switches (TDMA-SS). The conventional method for the scheduling is the Inukai method [25], which is based on the assumption that all messages have the same period. The scheduling is done for a frame length equal to the period of the messages and is repeated periodically thereafter. Let $L$ be the maximum number of cells that can be serviced by an input/output port during one period. A set of messages is schedulable if for every port the total number of cells that should be serviced is no more than $L$. Inukai provided a scheduling algorithm for any set of schedulable messages.

The Inukai algorithm does not work appropriately when messages have different periods. Let message $m$ from input port $s_m$ to output port $d_m$ have period $p_m$. To apply the Inukai method the frame length should be set to the Least Common Multiple (LCM) of all message periods, say $L$. For each message $m$, $L/p_m$ unit length cells are scheduled in the frame, each associated with one period of the original message. Then we can use the Inukai method to allocate these cells inside the frame of length $L$. The problem is that there is no control over the placement of cells inside the frame in the Inukai method. Thus, it is possible that all cells attributed to a single periodic message are placed next to each other. Such an assignment suffers from high jitter. Moreover, the delay of a cell can be equal to $L$, which can be very large.

Suppose that the objective is to schedule every cell in the time frame of its period, so that every cell can tolerate a delay up to its period. The question then arises whether it is possible to provide a schedule under these constraints. A necessary condition for schedulability is that the utilization of every input port $i$ and output port $j$ should not be greater than unity, i.e.,

$$u_i = \sum_{m:\, s_m = i} \frac{1}{p_m} \le 1, \qquad u_j = \sum_{m:\, d_m = j} \frac{1}{p_m} \le 1.$$

If one considers fluid policies, then it is easy to provide a schedule provided that this utilization condition is satisfied. Specifically, consider the fluid policy that assigns the fixed service rate $1/p_m$ to every message $m$. Under this policy the switch starts servicing every cell immediately after its arrival and takes $p_m$ time units to complete its service. This means that the target deadlines are all met.
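Both the utilization test and the fixed-rate fluid policy are straightforward to compute from the message set. The sketch below is our own illustration (names are assumptions, not from the dissertation), using exact rational arithmetic to avoid rounding in the rate sums.

```python
from collections import defaultdict
from fractions import Fraction

def utilizations(messages):
    """Port utilizations u_i, u_j for periodic messages m = (s_m, d_m, p_m)."""
    u_in, u_out = defaultdict(Fraction), defaultdict(Fraction)
    for s, d, p in messages:
        u_in[s] += Fraction(1, p)
        u_out[d] += Fraction(1, p)
    return u_in, u_out

def necessary_condition(messages):
    """The necessary schedulability condition: no port utilization above 1."""
    u_in, u_out = utilizations(messages)
    return all(u <= 1 for u in u_in.values()) and \
           all(u <= 1 for u in u_out.values())

def fluid_rates(messages, N):
    """Fixed-rate fluid policy: w[i][j] aggregates the rates 1/p_m of all
    messages from input i to output j (constant over all slots k)."""
    w = [[Fraction(0)] * N for _ in range(N)]
    for s, d, p in messages:
        w[s][d] += Fraction(1, p)
    return w
```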
Therefore, if we can provide a packetized policy that tracks this fluid policy, then the packetized policy will satisfy the delay constraints as well. In [34] Philp and Liu conjectured that the above utilization condition is necessary and sufficient for schedulability under the specified delay constraints. Giles and Hajek [20] have proved this conjecture for a special case. In their model the messages are sorted by period, such that $p_1 \ge p_2 \ge \dots \ge p_M$, and for every two subsequent messages $p_m = k\,p_{m+1}$, where $k$ is an integer. Unfortunately, their algorithm does not work well in the general case. Bonuccelli and Clo [5] presented a counter-example for a 4x4 switch, which illustrates that the conjecture is in general not correct. In the following section, we show the existence of tracking policies for the special case of 2x2 switches; thus, the conjecture is proved for that special case. We will also provide a counter-example for a 3x3 switch.

2.3 The 2x2 switch

In this section, we consider a 2x2 input queueing switch and provide an algorithm for designing a packetized policy $\pi_p$ that tracks a given fluid policy $\pi_f$. We make the following assumption regarding the fluid policy.

Assumption I: The fluid policy is non-anticipative (its decisions do not depend on future arrivals) and such that the order in which cells with origin port $i$ and destination port $j$ complete service is not affected by new cell arrivals.

This assumption is similar to the one used in the design of tracking policies in the single server case [17]. Note that the assumption does not preclude the possibility that new cell arrivals affect the order in which some cells complete service. Consider the following example of a scheduling policy in a 2x2 switch (see Figure 2.1). Input port 1 employs a GPS scheduler to schedule cells destined to output ports 1, 2 with weights 1/4 and 3/4 respectively. The scheduler operates in a work-conserving fashion, serving all eligible cells (i.e., cells that have no transmission constraints at any of the output ports) according to their weights. Output port 2 uses a strict priority scheme to schedule cells from input ports 1 and 2; cells from input port 2 have higher priority. Assume now that at the beginning of time slot 1, two cells $p_1$, $p_2$ arrive at input port 1, destined to output ports 1 and 2 respectively. If no new arrival occurs at the beginning of slot 2, then the finishing times of $p_1$ and $p_2$ are 2 and 4/3 respectively, as shown in Figure 2.1 a). Assume next that at the beginning of time slot 2, cell $p_3$ arrives at input port 2, destined for output port 2 (see Figure 2.1 b)). Then in time slot 2, cell $p_3$ will be transferred from input port 2 to output port 2, since at output port 2 cell $p_3$ has higher priority than cell $p_2$. Since $p_2$ cannot be transmitted in slot 2, the GPS scheduler at input port 1 completes transmission of $p_1$ by time 7/4. The completion time of cell $p_2$ is now 9/4. We see that the order in which cells $p_1$ and $p_2$ complete service is reversed by the arrival of cell $p_3$. However, if for cells that go from the same input port to the same output port we use a policy such as a strict priority scheme, then the specified switch scheduling policy satisfies Assumption I. Examples of fluid policies that satisfy Assumption I are:

• Any non-anticipative fluid policy that serves cells with origin port $i$ and destination port $j$ in a First Come First Served (FCFS) manner.
• Any fluid policy that assigns fixed priorities to cells with origin port $i$ and destination port $j$.

Figure 2.1: Effect of cell $p_3$ on the completion times of $p_1$ and $p_2$.

• A policy that employs a general non-anticipative fluid scheduler at input port $i$ to provide transmission intervals to cells destined to different output ports, and another GPS scheduler to schedule, within the provided transmission intervals, cells at port $i$ destined to a particular output port $j$.

The main difficulty in the design of tracking policies for input-buffered switches arises from the fact that if one views an input-output port pair $(i,j)$ as a server, this server is not work-conserving, since it may be forced to idle at some slot $k$. This can happen if, for example, during slot $k$ a cell is transmitted from input port $i$ to another output port $j_1$ instead of the cell destined to output port $j$.

Below we approach the design of the tracking policy in two steps. First we compute a sequence of sub-permutation matrices $I_{ij}(k)$ that "tracks" the work performed by the fluid policy for every origin-destination pair; the meaning of tracking in this case is made precise below in (2.3). The interpretation of the matrix $I_{ij}(k)$ is the following: whenever $I_{ij}(k) = 1$, a cell with origin $i$ and destination $j$ may be transferred through the switch at slot $k$ (if there is such a cell in the queue). Next, we provide a packetized policy $\pi_p$ that decides which cells among those using a particular origin-destination pair are to be transmitted at the slots specified by $I_{ij}(k)$. Before we proceed we need the following definition.

Definition II: $\tau_{ij}(k)$ is the largest time slot less than or equal to $k$ at the beginning of which there is no work at port $i$ destined for port $j$ under $\pi_f$.

Let $I_{ij}(k)$ be a sequence of integer sub-permutation matrices with the following property:

$$\max_{\tau_{ij}(k) \le l \le k} \left\{ \left\lfloor \sum_{s=l}^{k} w_{ij}(s) \right\rfloor - \sum_{s=l}^{k} I_{ij}(s) \right\} \le 0, \qquad k = 1, 2, \dots \quad (2.3)$$

where $\lfloor x \rfloor$ denotes the integer part of $x$. The design of $I_{ij}(k)$ will be provided later in this section. Consider the example in Figure 2.2, where two sequences, $I^{(1)}_{ij}(k)$ and $I^{(2)}_{ij}(k)$, are provided; in Table 2.1, $\tau$ is shorthand for $\tau_{ij}(k)$ and $k = \tau + 4$.

Slot $l$                                           $\tau$   $\tau+1$   $\tau+2$   $\tau+3$   $\tau+4$
$w_{ij}(l)$                                        0        1/2        1/4        3/4        1/2
$\sum_{s=l}^{k} w_{ij}(s)$                         2        2          3/2        5/4        1/2
$\lfloor \sum_{s=l}^{k} w_{ij}(s) \rfloor$         2        2          1          1          0
$\sum_{s=l}^{k} I^{(1)}_{ij}(s)$                   2        2          2          1          1
$\sum_{s=l}^{k} I^{(2)}_{ij}(s)$                   2        2          1          0          0

Table 2.1: Example of sub-permutation matrix sequences

From Table 2.1 we see that the sequence $I^{(1)}_{ij}(k)$ satisfies (2.3), while $I^{(2)}_{ij}(k)$ does not, since

$$\left\lfloor \sum_{s=\tau_{ij}(k)+3}^{k} w_{ij}(s) \right\rfloor - \sum_{s=\tau_{ij}(k)+3}^{k} I^{(2)}_{ij}(s) = 1.$$

Note that $I_{ij}(k)$ specifies whether in a slot there may be a transfer of a cell between ports $i$ and $j$ (if $I_{ij}(k) = 1$ and there is work at input port $i$ at time $k-1$ destined for output port $j$), but it does not specify which cell from the corresponding queue will be chosen for transfer. We now define a packetized policy $\pi_p$ that specifies the cell to be chosen for transfer in such a way that the fluid policy $\pi_f$ is tracked.

Definition III: $\pi_p$ is the packetized policy that, whenever $I_{ij}(k) = 1$ and there is work at input port $i$ at time $k-1$ destined for output port $j$, transfers the cell that completes earliest under $\pi_f$.

Provided that $I_{ij}(k)$ can be designed in a non-anticipative fashion, $\pi_p$ is also non-anticipative, since by Assumption I one can decide the order of completion times of cells with the same origin-destination pair without knowledge of future arrivals.

Figure 2.2: Cell transfer from input port $i$ to output port $j$.
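A slot of the policy $\pi_p$ of Definition III can be sketched as follows. This is our own pseudo-implementation, in which each queued cell is assumed to carry the finishing time that the fluid policy assigns it (computable without knowledge of future arrivals under Assumption I).

```python
def pi_p_slot(I, queues):
    """One slot of the packetized policy pi_p (Definition III).

    I      -- the 0-1 sub-permutation matrix for this slot
    queues -- queues[i][j]: list of cells of pair (i, j); each cell is a
              dict carrying its fluid finishing time under 'fluid_finish'
    Returns the list of (i, j, cell) transfers performed in the slot.
    """
    sent = []
    N = len(I)
    for i in range(N):
        for j in range(N):
            if I[i][j] == 1 and queues[i][j]:
                # Transfer the cell of pair (i, j) that completes earliest
                # under the fluid policy.
                cell = min(queues[i][j], key=lambda c: c['fluid_finish'])
                queues[i][j].remove(cell)
                sent.append((i, j, cell))
    return sent
```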
Note that $\pi_p$ acts in the same manner as the tracking policy in the single server case [17], except that it first checks, based on $I_{ij}(k)$, whether a slot is eligible for transmission of cells with origin-destination pair $(i,j)$. On the other hand, the specified policy can be considered a generalization of the policy for the single-server case. Indeed, specializing to the single-server case, we define $I_{ij}(k) = 1$ whenever there is backlog in the system at time $k-1$ under $\pi_f$ and $I_{ij}(k) = 0$ otherwise. Then (2.3) is obviously satisfied, and the resulting policy is identical to the tracking policy for the single server case proposed in [17].

The next theorem shows that $\pi_p$ tracks $\pi_f$. In the following it is assumed that the system is empty at time 0.

Theorem 2.3.1 Every cell leaves the switch under $\pi_p$ at the latest by the end of the time slot at which the same cell leaves the switch under $\pi_f$.

Proof: Let $b_m$ be the $m$th time slot at the beginning of which there is no work at input port $i$ destined for output port $j$ under $\pi_f$, while at the end of $b_m$ there is such work. In other words, $b_m$ is the beginning of the $m$th busy period under $\pi_f$ for the server related to the pair $(i,j)$. Let also $e_m$ be the end of the $m$th busy period under $\pi_f$, that is, the first time slot after $b_m$ at the beginning of which there is work at input port $i$ destined for output port $j$ under $\pi_f$, while at the end of $e_m$ there is no such work.

The theorem is true until time slot $b_1$. Assume that the theorem is true for all cells that are transmitted up to time slot $b_m$ under $\pi_f$. Notice that then, according to the statement of the theorem, the same cells have been transmitted up to time slot $b_m$ under $\pi_p$. Next we will show that the theorem holds for all cells that arrive and are transmitted from time $b_m$ to $e_m$, and therefore that the theorem holds up to time slot $b_{m+1}$.

Let $p_n$ be the $n$th cell with origin port $i$ and destination port $j$ to complete transmission in the interval $(b_m, e_m]$ under $\pi_p$. Let $f_n$ ($\hat f_n$) be the finishing time of $p_n$ under $\pi_p$ (under $\pi_f$). We will show that $f_n \le \lceil \hat f_n \rceil$, where $\lceil x \rceil$ denotes the smallest integer larger than or equal to $x$. For the proof, assuming the contrary, i.e.,

$$f_n > \lceil \hat f_n \rceil, \quad \text{or equivalently} \quad f_n - 1 \ge \lceil \hat f_n \rceil, \quad (2.4)$$

we will arrive at a contradiction. Consider two cases.

Case 1. Suppose that for all cells $p_l$, $l < n$, it holds that $\hat f_l \le \hat f_n$ (see Figure 2.3). Then cells $p_1, p_2, \dots, p_{n-1}$ leave before cell $p_n$ under both $\pi_p$ and $\pi_f$. Let $\sigma \ge b_m$ be the last slot before $f_n$ such that $I_{ij}(\sigma) = 1$ and there is no cell to be transferred from port $i$ to port $j$ under $\pi_p$ at time $\sigma - 1$. If there is no such slot, set $\sigma = b_m$. Let also $p_l, p_{l+1}, \dots, p_n$ be the cells that are transmitted in the interval $(\sigma, f_n]$ under $\pi_p$. Note that all these cells must have arrived at or after time $\sigma$: either $\sigma = b_m$ and the statement is true by definition, or $I_{ij}(\sigma) = 1$, and therefore, if one of these cells were in the system at the beginning of slot $\sigma$, it would have been transmitted in that slot. Therefore, we have the following important properties:

• Cells $p_l, p_{l+1}, \dots, p_n$ are transmitted in slots $\sigma + 1$ to $f_n$ under $\pi_p$ and in slots $\sigma + 1$ to $\hat f_n$ under $\pi_f$.

• Whenever $I_{ij}(k) = 1$, $\sigma + 1 \le k \le f_n$, one of the cells $p_l, p_{l+1}, \dots, p_n$ is transferred from port $i$ to port $j$ under $\pi_p$.

Figure 2.3: Arrangement of cells for Case 1 in the proof of Theorem 2.3.1.

Therefore,

$$\sum_{s=\sigma+1}^{\lceil \hat f_n \rceil} w_{ij}(s) \ge \sum_{s=\sigma+1}^{f_n} I_{ij}(s) \quad (2.5a)$$
$$= \sum_{s=\sigma+1}^{f_n - 1} I_{ij}(s) + 1 \quad (2.5b)$$
$$\ge \sum_{s=\sigma+1}^{\lceil \hat f_n \rceil} I_{ij}(s) + 1 \quad (2.5c)$$
$$\ge \left\lfloor \sum_{s=\sigma+1}^{\lceil \hat f_n \rceil} w_{ij}(s) \right\rfloor + 1. \quad (2.5d)$$
Equality (2.5b) follows from the fact that, since $p_n$ was transmitted in slot $f_n$, $I_{ij}(f_n) = 1$; inequality (2.5c) follows from (2.4); and inequality (2.5d) from (2.3). Finally, we have the contradiction

$$\sum_{s=\sigma+1}^{\lceil \hat f_n \rceil} w_{ij}(s) \ge \left\lfloor \sum_{s=\sigma+1}^{\lceil \hat f_n \rceil} w_{ij}(s) \right\rfloor + 1.$$

Case 2. Suppose there is a cell $p_l$, $l < n$, such that $\hat f_n < \hat f_l$, i.e., cell $p_l$ leaves earlier than cell $p_n$ under $\pi_p$ and later than cell $p_n$ under $\pi_f$ (see Figure 2.4). Assume that $p_l$ is the last cell before $p_n$ with this property. Then cells $p_{l+1}, \dots, p_n$ leave earlier than, or at the same time as, cell $p_n$ under both policies. This implies that cells $p_{l+1}, \dots, p_n$ must have arrived at the earliest at the end of slot $f_l$: otherwise, according to the definition of $\pi_p$ and Assumption I, if one of the cells $p_{l+1}, \dots, p_n$ were in the system at the beginning of slot $f_l$, it would have been transmitted earlier than $p_l$, since its finishing time under $\pi_f$ is smaller than the finishing time of $p_l$. Therefore, we conclude that cells $p_{l+1}, \dots, p_n$ arrive and are transmitted in the interval $[f_l, f_n]$ under $\pi_p$ and in the interval $[f_l, \hat f_n]$ under $\pi_f$. Using again the argument in Case 1, we arrive at a contradiction.

Figure 2.4: Arrangement of cells for Case 2 in the proof of Theorem 2.3.1.

It remains to show that the matrices $I_{ij}(k)$ can be constructed in a non-anticipative manner, based on $\pi_f$. We will in fact construct matrices that in addition to (2.3) have the following property.

Property I. The following inequalities hold for all $k$:

$$\sum_{j=1}^{2} w_{ij}(k) \le \sum_{j=1}^{2} I_{ij}(k), \quad i = 1, 2; \qquad \sum_{i=1}^{2} w_{ij}(k) \le \sum_{i=1}^{2} I_{ij}(k), \quad j = 1, 2.$$

Define next $I_{ij}(k)$ recursively as follows:

$$I_{ij}(k+1) = \max_{\tau_{ij}(k+1) \le l \le k+1} \left\{ \left\lfloor \sum_{s=l}^{k+1} w_{ij}(s) \right\rfloor - \sum_{s=l}^{k} I_{ij}(s), \; 0 \right\}, \qquad k \ge 0, \quad (2.7)$$

where the convention $\sum_{s=n}^{m} = 0$ for $n > m$ is used. Note that since $\pi_f$ is non-anticipative, $w_{ij}(k+1)$ can be computed at time $k$ based only on the past history of the system. Since $w_{ij}(s)$ and $I_{ij}(s)$, $s \le k$, are also known at time $k$, $I_{ij}(k+1)$ can indeed be computed at time $k$. In the next lemma we show that the matrix $I_{ij}(k+1)$ computed using (2.7), with a slight modification, satisfies (2.3).

Lemma 2.3.2 At slot $k+1$, compute $I_{ij}(k+1)$ using (2.7). If it turns out that some row or column of $[I_{ij}(k+1)]$ contains only zeros, then one of the elements of this row or column can be redefined to one, so that the resulting matrix remains a sub-permutation matrix. The matrix $[I_{ij}(k+1)]$ so defined satisfies (2.3) and Property I.

Proof: Assume that $I_{ij}(k)$ satisfies (2.3) and Property I up to slot $k$. We show that the same holds for slot $k+1$. First, we have to show that $[I_{ij}(k+1)]$ is an integer sub-permutation matrix, i.e., $I_{ij}(k+1)$ takes values 0 or 1, and

$$\sum_{j=1}^{2} I_{ij}(k+1) \le 1, \quad i = 1, 2; \qquad \sum_{i=1}^{2} I_{ij}(k+1) \le 1, \quad j = 1, 2. \quad (2.8)$$

To show that $I_{ij}(k+1)$ takes values 0 or 1, notice that

$$0 \le I_{ij}(k+1) \quad (2.9a)$$
$$\le \max_{\tau_{ij}(k+1) \le l \le k+1} \left\{ \left\lfloor 1 + \sum_{s=l}^{k} w_{ij}(s) \right\rfloor - \sum_{s=l}^{k} I_{ij}(s), \; 0 \right\} \quad (2.9b)$$
$$= \max_{\tau_{ij}(k+1) \le l \le k+1} \left\{ 1 + \left\lfloor \sum_{s=l}^{k} w_{ij}(s) \right\rfloor - \sum_{s=l}^{k} I_{ij}(s), \; 0 \right\}. \quad (2.9c)$$

Inequality (2.9b) follows from the fact that $w_{ij}(k+1) \le 1$. Suppose $\tau_{ij}(k+1) = \tau_{ij}(k)$. Then from (2.3) we conclude that

$$\left\lfloor \sum_{s=l}^{k} w_{ij}(s) \right\rfloor - \sum_{s=l}^{k} I_{ij}(s) \le 0, \qquad \tau_{ij}(k+1) \le l \le k,$$

and hence

$$\max_{\tau_{ij}(k+1) \le l \le k+1} \left\{ 1 + \left\lfloor \sum_{s=l}^{k} w_{ij}(s) \right\rfloor - \sum_{s=l}^{k} I_{ij}(s), \; 0 \right\} \le 1. \quad (2.10)$$

Taking into account (2.9c) and the fact that $I_{ij}(k+1)$ is integer-valued, we conclude that $I_{ij}(k+1)$ takes the value 0 or 1. Suppose now that $\tau_{ij}(k+1) \ne \tau_{ij}(k)$. Then, by the definition of $\tau_{ij}(k)$, we conclude that $\tau_{ij}(k+1) = k+1$ and $w_{ij}(k+1) = 0$, hence it follows from (2.7) that $I_{ij}(k+1) = 0$.
In order to show (2.8), assume that for, say, row 1 we have $I_{11}(k+1) = I_{12}(k+1) = 1$. Then, by the definition (2.7), there must be $\tau_1 \ge \tau_{11}(k+1)$ and $\tau_2 \ge \tau_{12}(k+1)$ such that

$$\left\lfloor \sum_{s=\tau_1}^{k+1} w_{11}(s) \right\rfloor - \sum_{s=\tau_1}^{k} I_{11}(s) = 1, \qquad \left\lfloor \sum_{s=\tau_2}^{k+1} w_{12}(s) \right\rfloor - \sum_{s=\tau_2}^{k} I_{12}(s) = 1.$$

Assume without loss of generality that $\tau_1 \ge \tau_2$. Then, adding the previous equations, we have

$$2 \le \left\lfloor \sum_{j=1}^{2} w_{1j}(k+1) + \sum_{s=\tau_1}^{k} \sum_{j=1}^{2} w_{1j}(s) + \sum_{s=\tau_2}^{\tau_1 - 1} w_{12}(s) \right\rfloor - \sum_{s=\tau_1}^{k} \sum_{j=1}^{2} I_{1j}(s) - \sum_{s=\tau_2}^{\tau_1 - 1} I_{12}(s) \quad (2.11a)$$

$$\le 1 + \left\lfloor \sum_{s=\tau_1}^{k} \sum_{j=1}^{2} w_{1j}(s) + \sum_{s=\tau_2}^{\tau_1 - 1} w_{12}(s) \right\rfloor - \sum_{s=\tau_1}^{k} \sum_{j=1}^{2} I_{1j}(s) - \sum_{s=\tau_2}^{\tau_1 - 1} I_{12}(s). \quad (2.11b)$$

Inequality (2.11a) follows from the fact that $\lfloor x \rfloor + \lfloor y \rfloor \le \lfloor x + y \rfloor$, and inequality (2.11b) from the fact that $\sum_{j=1}^{2} w_{1j}(k+1) \le 1$. Hence, taking also into account that the $I_{ij}(s)$ are integers, we have

$$1 \le \left\lfloor \sum_{s=\tau_1}^{k} \sum_{j=1}^{2} w_{1j}(s) - \sum_{s=\tau_1}^{k} \sum_{j=1}^{2} I_{1j}(s) + \sum_{s=\tau_2}^{\tau_1 - 1} w_{12}(s) - \sum_{s=\tau_2}^{\tau_1 - 1} I_{12}(s) \right\rfloor.$$

Using Property I we conclude that

$$\sum_{s=\tau_1}^{k} \sum_{j=1}^{2} w_{1j}(s) - \sum_{s=\tau_1}^{k} \sum_{j=1}^{2} I_{1j}(s) \le 0,$$

and hence

$$1 \le \left\lfloor \sum_{s=\tau_2}^{\tau_1 - 1} w_{12}(s) - \sum_{s=\tau_2}^{\tau_1 - 1} I_{12}(s) \right\rfloor. \quad (2.13)$$

If $\tau_2 = \tau_1$ then $\left\lfloor \sum_{s=\tau_2}^{\tau_1 - 1} w_{12}(s) \right\rfloor - \sum_{s=\tau_2}^{\tau_1 - 1} I_{12}(s) = 0$ (the sums are empty), which contradicts (2.13). If $\tau_2 < \tau_1$, then necessarily $\tau_{12}(k+1) = \tau_{12}(k) \le \tau_2 < \tau_1$, and since $I_{ij}(s)$ satisfies (2.3) for $s \le k$, we have $\left\lfloor \sum_{s=\tau_2}^{\tau_1 - 1} w_{12}(s) \right\rfloor - \sum_{s=\tau_2}^{\tau_1 - 1} I_{12}(s) \le 0$, which again contradicts (2.13).

We also need to ensure that $I_{ij}(k+1)$ satisfies Property I. If $I_{ij}(k+1) = 1$, then clearly Property I holds for column $j$ and row $i$. Assume next that for a column or a row, say column 1, it is computed based on (2.7) that $I_{11}(k+1) = I_{21}(k+1) = 0$. In order to ensure that Property I holds for $k+1$, we claim that we can redefine one of $I_{11}(k+1)$, $I_{21}(k+1)$ to 1 without affecting the sub-permutation property of the matrix. To see this, notice that since $I_{ij}(k+1)$ takes values 0 or 1 and satisfies (2.8), in column 2 there can be at most a single 1, say in position $(1,2)$. We can therefore set $I_{21}(k+1) = 1$ in order to ensure that Property I holds for column 1. Proceeding in this way, we can redefine some of the $I_{ij}(k+1)$ if necessary, in order to ensure that Property I holds. Notice also that with this redefinition, the resulting matrix $[I_{ij}(k+1)]$ still satisfies (2.3).

It remains to show that $I_{ij}(s)$ satisfies (2.3) for $s = k+1$. Assume first that $I_{ij}(k+1)$ has not been redefined. Since by the definition (2.7)

$$I_{ij}(k+1) \ge \left\lfloor \sum_{s=l}^{k+1} w_{ij}(s) \right\rfloor - \sum_{s=l}^{k} I_{ij}(s), \qquad \tau_{ij}(k+1) \le l \le k+1,$$

we easily conclude that

$$\max_{\tau_{ij}(k+1) \le l \le k+1} \left\{ \left\lfloor \sum_{s=l}^{k+1} w_{ij}(s) \right\rfloor - \sum_{s=l}^{k+1} I_{ij}(s) \right\} \le 0.$$

If any of the $I_{ij}(k+1)$ needs to be redefined, the redefinition only increases the value of $I_{ij}(k+1)$, and hence (2.3) still holds.

2.4 Heuristic Algorithms

Let $\pi_f$ be a feasible fluid policy that at every time slot $k$ specifies the appropriate fluid scheduling matrix $w(k)$. We showed in the previous section that as long as $\pi_f$ satisfies Assumption I, a non-anticipative tracking packetized policy can be designed for a 2x2 switch. Unfortunately, the approach used in the 2x2 case does not work for the general case of NxN switches. Recall that the validity of relation (2.3) was essential in the proof of Theorem 2.3.1. If we could construct sub-permutation matrices $I_{ij}(k)$ that satisfy (2.3) for an NxN switch, then we could also construct a tracking policy for the general case. However, in Lemma 2.3.2 we heavily used the properties of a 2x2 switch in order to prove relation (2.3). In the 2x2 case, there are only two possible permutation matrices to consider, and the simple construction in (2.7) is sufficient to obtain appropriate matrices $I_{ij}(k)$.
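For the 2x2 case, the construction of Lemma 2.3.2 translates almost literally into code. The sketch below is our own; it assumes the histories of $w_{ij}$ and $I_{ij}$ and the value $\tau_{ij}(k+1)$ are supplied by the caller (tracking $\tau$ requires simulating the fluid queues), and it applies (2.7) followed by the row/column fix-up of the lemma.

```python
import math

def next_I_entry(w_hist, I_hist, tau, w_next):
    """Recursion (2.7) for one pair (i, j) at slot k + 1.

    w_hist, I_hist -- values for slots 1..k (lists)
    tau            -- tau_ij(k+1), assumed known
    w_next         -- w_ij(k+1), computable at time k because pi_f is
                      non-anticipative
    """
    k = len(w_hist)
    w = w_hist + [w_next]                       # slots 1..k+1
    best = 0
    for l in range(tau, k + 2):                 # tau <= l <= k+1
        # floor(sum_{s=l}^{k+1} w(s)); exact rationals (Fraction) could
        # replace floats here to avoid rounding in long horizons.
        lhs = math.floor(sum(w[l - 1:k + 1]))
        rhs = sum(I_hist[l - 1:k])              # sum_{s=l}^{k} I(s)
        best = max(best, lhs - rhs)
    return best                                 # the lemma shows this is 0 or 1

def fix_up(I):
    """Lemma 2.3.2 fix-up for a 2x2 sub-permutation matrix: give every
    all-zero row a 1 in a column that holds no 1 elsewhere (for a 2x2
    matrix this also leaves no all-zero column)."""
    for i in (0, 1):
        if I[i][0] == 0 and I[i][1] == 0:
            j = 0 if I[1 - i][0] == 0 else 1
            I[i][j] = 1
    return I
```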
For $N \ge 3$, however, this construction does not always work. In fact, as we will see next, tracking policies do not always exist for $N \ge 3$. Bonuccelli and Clo present an example [5] for 4x4 switches with fixed rates, which shows that even an anticipative tracking policy may not exist for $N \ge 4$. Here we provide a different example, for a 3x3 switch, where a non-anticipative tracking policy does not exist.

Example: Consider a 3x3 switch. Suppose that the serving discipline of every link under the fluid policy is determined by the number of cells in each buffer and the priority of the cells, as follows.

1. If less than one cell is queued on every link, all backlogged links are served at equal rates. If there are links for which more than one cell is queued, all these links, and only these links, are served at equal rates.

2. On each link, higher priority cells are served first.

Suppose that cells $p_1, \dots, p_9$ arrive at the beginning of the first time slot. The buffered cells of the fluid and packetized policies are depicted in Figure 2.5, which shows the buffering at the beginning of the first five time slots. At every input there are three parallel buffers (virtual queues) corresponding to the three outputs. As illustrated, there are 9 cells, one in each virtual queue, at time 1. These cells have distinct input/output pairs and all have the same priority. Let the matrix $Q(k)$ specify the buffered cells under the packetized policy at time $k$; that is, element $q_{ij}(k)$ of $Q(k)$ specifies the set of cells that are buffered at input port $i$ and destined for output port $j$ under the packetized policy. Assume that

$$Q(1) = \begin{bmatrix} p_1 & p_2 & p_3 \\ p_4 & p_5 & p_6 \\ p_7 & p_8 & p_9 \end{bmatrix}.$$

According to rule 1 above, the service matrices of the fluid policy will be

$$w(1) = w(2) = \begin{bmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{bmatrix}.$$

Without loss of generality, assume that the tracking policy selects the following two permutation matrices for the first two time slots:

$$S(1) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \qquad S(2) = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}.$$

Now assume that at the beginning of time slot 3, six new cells $p_{10}, \dots, p_{15}$ arrive, so that the backlogged cells under the packetized policy at the beginning of slot 3 are

$$Q(3) = \begin{bmatrix} p_{10} & p_{11} & p_3 \\ p_4 & p_{12} & p_{13} \\ p_{14} & p_8 & p_{15} \end{bmatrix}.$$

Hence, at time slot 3 there is one cell in every queue under the packetized policy. The situation is different for the fluid policy: 1/3 of each of the previously arrived cells remains, and there are six new arrivals as well. Assume that $p_{10}, \dots, p_{15}$ have equal priority, higher than the priority of the previous cells; hence, according to rule 2, they are placed ahead of the previous arrivals in the fluid policy queues. Based on rule 1, the serving rates of the fluid policy become

$$w(3) = \begin{bmatrix} 1/2 & 1/2 & 0 \\ 0 & 1/2 & 1/2 \\ 1/2 & 0 & 1/2 \end{bmatrix}.$$

In order to ensure proper tracking, the tracking policy should also serve a feasible set of these high priority cells. For instance, let

$$S(3) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}.$$

At the beginning of time slot 4, assume that six new cells arrive, all with low priority, so that the queue matrix becomes

$$Q(4) = \begin{bmatrix} 0 & p_{11}, p_{16} & p_3, p_{17} \\ p_4, p_{18} & 0 & p_{13}, p_{19} \\ p_{14}, p_{20} & p_8, p_{21} & 0 \end{bmatrix}.$$

Based on rule 1, the fluid policy selects the serving rate matrix

$$w(4) = \begin{bmatrix} 0 & 1/2 & 1/2 \\ 1/2 & 0 & 1/2 \\ 1/2 & 1/2 & 0 \end{bmatrix}.$$

Therefore, at the beginning of time slot 5, the six cells with indices (3, 4, 8, 11, 13, 14) are fully served under the fluid policy.
However, by that time the packetized policy can have served at most three of them. Assume that the tracking policy serves cells $p_{11}, p_{13}, p_{14}$, leaving cells $p_3, p_4, p_8$ unserved. Similarly, it can be shown that for any other choice of $S(3)$, there is another possible set of new arrivals that makes it impossible for the packetized policy to track the fluid policy.

Figure 2.5: The backlogged cells under the fluid and packetized tracking policies.

Since, as the previous example shows, it is impossible to construct non-anticipative tracking policies in the general case, we are motivated to seek heuristic algorithms with good, though not perfect, tracking properties. The design of the heuristic relies on two main concepts, port based tracking and critical links, which are discussed below. We design a simple tracking policy based on these concepts and illustrate its performance using simulation results.

2.4.1 Port Based Tracking

One way to implement a packetized tracking policy is based on finding optimal weighted matchings in bipartite graphs. At slot $k$, a weight $v_{ij}(k)$ is associated with each input-output port pair $(i,j)$. This weight represents the amount of work on pair $(i,j)$ by which the fluid policy is ahead of the tracking policy up to slot $k$; that is,

$$v_{ij}(k) = \max\left( \sum_{l=1}^{k} \bigl( w_{ij}(l) - S_{ij}(l) \bigr), \; 0 \right).$$

We call these weights the tracking weights. We view the switch as a bipartite graph whose nodes are the input and output ports of the switch. The weight of link (input-output port pair) $(i,j)$ is $v_{ij}(k)$. The weight of a node (port) is the sum of the weights of the links that emanate from or terminate at that node. That is, for an input port $i$ and an output port $j$ we have, respectively,

$$u_i(k) = \sum_{j=1}^{N} v_{ij}(k), \quad (2.14)$$

$$u_j(k) = \sum_{i=1}^{N} v_{ij}(k). \quad (2.15)$$

Each sub-permutation matrix $S(k)$ defining the tracking policy at slot $k$ corresponds to a matching in the bipartite graph. The selection of the appropriate matching can be based on the link and node weights introduced above. Ideally, the matchings should be chosen so that all these weights remain close to zero. Bipartite matching algorithms have been used extensively in switch scheduling, and they are mostly based on maximum link (edge) weighted matching or on maximum size matching (that is, a matching that maximizes the number of links included in the matching). Maximum link weighted matching algorithms are complex and computationally intensive, while maximum size matching algorithms often have poor performance. We concentrate here on algorithms based on optimal node weight related matchings [31]. As will be seen, algorithms based on these matchings have good performance, and they are simpler for hardware implementation.
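The tracking and node weights are simple running statistics over the fluid rates and the matchings chosen so far; the following sketch (illustrative, with names of our choosing) computes them directly from the two histories.

```python
import numpy as np

def tracking_weights(w_hist, S_hist):
    """Tracking weights v_ij(k): cumulative fluid-minus-packetized work,
    clipped at zero, per the definition above."""
    diff = sum(np.asarray(w) - np.asarray(S, dtype=float)
               for w, S in zip(w_hist, S_hist))
    return np.maximum(diff, 0.0)

def node_weights(v):
    """Port weights (2.14)-(2.15): u_i(k) for inputs are row sums of v,
    u_j(k) for outputs are column sums."""
    return v.sum(axis=1), v.sum(axis=0)
```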
Two possible solutions are matchings that satisfy the following optimization criteria.

Maximum node weight sum: With this criterion, the matching whose node weight sum is maximum is chosen. Hence, at every step it is attempted to find the matching whose node weight sum "lags" the most behind the desired schedule. Since cells will be transmitted on the links of this matching, its "cost" will be reduced in the next time slot.

Maximum lexicographic order of node weights: In this approach, with each matching we associate a binary vector in which each digit corresponds to one node. Nodes are ordered according to their weights; nodes included in the matching are assigned bit 1, while nodes not included in the matching are assigned bit 0. We then select the matching whose vector is maximum in the lexicographic order (min-max fair) [4, Section 6.5.2]. In other words, if we consider the bit sequence as the binary representation of a number, with the bit associated with the greatest weight node as the MSB, the maximum lexicographic order matching is the matching with the greatest binary representation. In essence, we cannot add any node to the maximum lexicographic order matching, or replace any of its nodes with a node of greater weight. This approach attempts to select the matching whose nodes individually have large weights and hence are lagging the worst behind the desired schedule.

It can be shown that the two criteria above are equivalent; this is due to the special structure of bipartite graphs. In the next lemma we prove the equivalence of the two criteria.

Lemma 2.4.1 Any matching that maximizes the node weight sum maximizes the lexicographic order of node weights, and vice versa.

Proof: Let $M_1$ be a matching that maximizes the node weight sum and $M_2$ a matching that maximizes the lexicographic order of node weights. Let also $\mathcal{M}_1$, $\mathcal{M}_2$ be their respective sets of nodes. Without loss of generality, we show that for any input node $i_0 \in \mathcal{M}_1 \setminus \mathcal{M}_2$ there is a node $i_1 \in \mathcal{M}_2 \setminus \mathcal{M}_1$ such that $u_{i_0}(k) = u_{i_1}(k)$, and vice versa. This implies that the two matchings have the same node weight sum, and that they are equal in the lexicographic order.

Consider the graph $G$ that consists of the links that belong to one and only one of the two matchings. Note that the maximum degree of a vertex in $G$ is two. Let $i_0 \in \mathcal{M}_1 \setminus \mathcal{M}_2$. Then the degree of node $i_0$ in $G$ is one. Therefore, there is an (undirected) path in $G$ that starts from node $i_0$ and ends at a node $i_1$ with degree one. We concentrate on this path. The number of nodes in this path is either even or odd. If it is even, then it follows from the definition of $G$ that the last node in the path belongs to $\mathcal{M}_1$, while all intermediate nodes belong to both matchings. Thus, if we replace the alternating set of links in the path belonging to $M_2$ with those belonging to $M_1$, two more vertices will be included in $M_2$, and this contradicts the optimality assumption on $M_2$. Therefore, this case is impossible. If the number of nodes in the path is odd, then $i_1 \in \mathcal{M}_2 \setminus \mathcal{M}_1$. In this case, we necessarily have $u_{i_1}(k) = u_{i_0}(k)$. Indeed, if $u_{i_1}(k) < u_{i_0}(k)$, then as in the previous paragraph we can construct a matching that is better in the lexicographic order than $M_2$. If, on the other hand, $u_{i_1}(k) > u_{i_0}(k)$, then we can similarly construct a matching that is better than $M_1$ in the node weight sum criterion.

We call an optimal matching based on the above criteria a Maximum Node Matching (MNM). Notice that MNM is different from the conventional maximum weighted matching, since the latter maximizes the sum of the weights of the links involved in the matching, while MNM maximizes the sum of the weights of the nodes involved in the matching. The difference is illustrated in Figure 2.6.

Figure 2.6: Example of node and link weighted matchings in bipartite graphs. (The three panels show the bipartite graph with node and link weights, the maximum link weighted matching, and the maximum node weighted matching.)

Next, we need an algorithm for finding a Maximum Node Matching of a bipartite graph.
Note first that an MNM is a maximum size matching. To see this, assume that the algorithm employed to find the maximum size matching is the augmenting path algorithm for the associated maximum flow problem in an extended network [9]. This algorithm starts with an initial matching (a flow on the extended network) and at each iteration finds a flow augmenting path and a new flow in the extended network. Observe that at each iteration, due to the fact that the graph is bipartite, all the nodes of the original matching are still nodes of the new matching. Hence, assuming that the algorithm uses the MNM as its initial matching, the final matching includes all MNM nodes. On the other hand, the final matching cannot include additional nodes, since then its node weight sum would be larger than that of the MNM. Therefore, an MNM must be a maximum size matching.

Assume now that we find a maximum size matching $M_1$. We show next how to obtain an MNM, $M^*$, from $M_1$. Given any matching $M$, define an "alternating path" to be a path in the bipartite graph such that of any two consecutive links in the path, one belongs to $M$ and the other does not. If $M$ is a maximum size matching, then it is easy to see that any alternating path such that one of its endpoints does not belong to $M$ must contain an even number of links. Otherwise, we would have found an augmenting path, i.e., a path in which the number of links not in the matching is one more than the number of links in the matching; in that case we could replace the links that are in the matching with those that are not, and end up with a larger matching, contradicting the assumption that the original matching is of maximum size. Hence, the other endpoint of such a path belongs to $M$. The algorithm for finding an MNM is based on the following lemma.

Lemma 2.4.2 A maximum size matching $M_1$ is an MNM if and only if, for any alternating path with endpoints $i$ belonging to $M_1$ and $j$ not belonging to $M_1$, we have $u_i(k) \ge u_j(k)$.

Proof: Let $M_1$ be an MNM. If there is an alternating path such that $u_i(k) < u_j(k)$, then we can replace the links of $M_1$ in the path with the links in the path not belonging to $M_1$. The resulting matching will have a larger node weight sum than $M_1$, which contradicts the assumption that $M_1$ is an MNM.

Assume now that for any alternating path with endpoint $i \in M_1$ we have $u_i(k) \ge u_j(k)$. Let $M^*$ be an MNM. Let $G$ be the graph whose links are those that belong to one and only one of $M_1$ and $M^*$. Since both $M^*$ and $M_1$ are maximum size matchings, using arguments similar to those in the proof of Lemma 2.4.1, it can be seen that $G$ contains only alternating cycles and alternating paths with an even number of links. If $G$ contains only cycles, then $M_1$ has the same nodes as $M^*$ and hence the same node weight sum. Assume now that there is a path in $G$ such that one of its end nodes, $i$, belongs to $M_1$ and the other, $j$, to $M^*$. Since $M^*$ is an MNM, we must have $u_j(k) \ge u_i(k)$; since for any alternating path we have $u_i(k) \ge u_j(k)$, we conclude that $u_i(k) = u_j(k)$. Therefore the node weights of $M_1$ equal the node weights of $M^*$ in the lexicographic order, and hence the two matchings again have the same node weight sum.

Based on Lemma 2.4.2 we have the following algorithm for finding an MNM.

Matching Algorithm:

1. Find an initial maximum size matching $M_1$.

2. Mark all vertices not in $M_1$ as unexplored.
3. Pick an unexplored node $i$ and search the graph for all possible endpoints $j$ (belonging necessarily to $M_1$) of alternating paths originating from $i$. Let $j_{\min}$ be the discovered endpoint with minimum node weight.

4. If $u_{j_{\min}}(k) < u_i(k)$, then replace in $M_1$ the links of the corresponding path belonging to $M_1$ with the links of the path not belonging to $M_1$.

5. Remove node $i$ from the unexplored set.

6. If the set of unexplored nodes is empty, then $M_1$ is an MNM and the process ends. Otherwise, go to step 3.

The algorithm given above is iterative. There are at most $2N$ iterations, since one node is explored in each iteration. The complexity of each iteration is $O(N^2)$, since we have to search the graph and there are at most $N^2$ links in the graph. Therefore, the complexity of this algorithm is $O(N^3)$.
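The Matching Algorithm above can be sketched in code as follows. This is our own illustration, under assumed names and a port encoding of our choosing; it takes an initial maximum size matching and the node weights as inputs, and performs the alternating-path swaps of steps 3 and 4 with a breadth-first search.

```python
from collections import deque

def refine_to_mnm(edges, match, u):
    """Refine a maximum size matching into an MNM (Matching Algorithm).

    edges -- set of links (i, j) of the bipartite graph
    match -- dict with match['in', i] = ('out', j) and the symmetric entry
             for every matched pair; unmatched ports are absent
    u     -- dict mapping every port ('in', i) or ('out', j) to its weight
    """
    def neighbors(node):
        side, x = node
        return ([('out', j) for (i, j) in edges if i == x] if side == 'in'
                else [('in', i) for (i, j) in edges if j == x])

    nodes = {('in', i) for i, _ in edges} | {('out', j) for _, j in edges}
    for start in sorted(nodes - set(match)):   # the unexplored nodes
        # Breadth-first search over alternating paths from `start`:
        # a non-matching edge out, then the matching edge back, and so on.
        parent, endpoints = {start: None}, []
        queue = deque([start])
        while queue:
            a = queue.popleft()
            for b in neighbors(a):
                if b in parent:
                    continue
                c = match.get(b)
                if c is None:
                    continue   # would be an augmenting path; cannot occur
                parent[b], parent[c] = a, b
                endpoints.append(c)  # endpoint reached via a matching edge
                queue.append(c)
        if not endpoints:
            continue
        j_min = min(endpoints, key=lambda x: u[x])
        if u[j_min] < u[start]:
            # Swap matching and non-matching edges along the path,
            # so that `start` enters the matching and j_min leaves it.
            node = j_min
            del match[j_min]
            while node != start:
                b, a = parent[node], parent[parent[node]]
                match[a], match[b] = b, a
                node = a
    return match
```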
So far we have assumed that all lagging links (those with positive weights) are included in the bipartite graph, and that the weight of every node is the sum of the weights of the links that emanate from it. In the next section, we discuss an enhancing mechanism that enables us to modify the node weights and to exclude some of the graph links, so that the matching algorithm can arrive at better assignment configurations.

2.4.2 Critical Ports and Links

A critical port is a port for which a cell should be scheduled in the next time slot in order not to miss a deadline in the future. As an example, suppose that we are at the beginning of the $k$th time slot. Assume that there are two cells, one that needs to be transmitted from node $i$ to node $j_1$ and the other from node $i$ to node $j_2$, and that both have deadline $k+2$. Note that if we do not schedule either of these cells in the $k$th time slot, no deadline is missed in that slot. However, we will definitely miss a deadline in the subsequent time slot, $k+1$. We say that node $i$ is a critical node, and that links $(i, j_1)$ and $(i, j_2)$ are the associated critical links.

In general, a sufficient condition for a port to be critical at time $k$ is to have at least $p$ cells with deadlines less than or equal to $k+p$. Note that this is only a sufficient condition: there might be critical ports that cannot be detected well in advance using this criterion. However, for simplicity we concentrate on nodes that are critical according to the criterion defined above. Our goal is to detect critical nodes and increase their chances of being scheduled.

In the case of the tracking policy, the deadlines of cells are implicitly given, and are equal to the end of the time slot at which the cell departs the switch under the fluid policy. We may not know the deadlines in advance, since the future rate of every link under the fluid policy depends on future arrivals, which in general are not known. Nevertheless, we can maintain an approximate deadline for every cell based on the backlogged traffic or the average arrival rates of the links.

The next issue is to set an appropriate inspection horizon. Suppose that we are at time $k$. To detect the critical links, we have to account for the cells that should be scheduled in the next $p$ time slots. We call $p$ the "inspection horizon". There is a trade-off involved here: increasing the inspection horizon helps us detect more critical links, but it increases the complexity of the algorithm as well.

After detecting a critical port, we know that we have to schedule one of the critical links associated with that port; otherwise we will miss a deadline. To give priority to such a node, we increase its weight by a constant, so that its weight exceeds the weights of all non-critical nodes. These nodes are therefore prioritized by the scheduler. To make sure that one of its critical links is scheduled, we remove all non-critical links that have a critical node as an endpoint. The algorithm for detecting critical nodes, a sketch of which follows this subsection, can be described as follows.

Critical Node Detecting Algorithm:

1. At every time slot $k$, set $p = 1$.

2. For each node $i$, find the number of cells that have to be sent in the next $p$ time slots. These are the cells with deadlines at most $k+p$.

3. If for some node $i$ the number of cells that should be sent in the next $p$ time slots is larger than or equal to $p$, then $i$ is critical. Moreover, the links emanating from $i$ over which at least one cell should be sent in the next $p$ time slots are critical links.

4. Increment $p$ and go back to step 2 if $p \le L_{\max}$ ($L_{\max}$ is the inspection horizon).

The computationally expensive part of this algorithm is step 2, where we need to estimate the work done by the fluid policy in the next $p$ time slots. In general the estimate depends on future arrivals and cannot be computed exactly. An approximate value can be obtained by assuming that there are no new arrivals in the system. There are cases, of course, as in the example described in Section 2.2, where the design is done off-line and future arrivals can be anticipated.

The scheduling process of a switch can be divided into two stages. In the first stage, the weights of the ports are calculated and the criticality of the ports is investigated. Once the service rates of the fluid policy are computed, the rest of the computations for the different ports of the switch can be done in parallel, and no interaction between them is necessary. In the second stage, the computed node weights and the eligible links for every node are provided to the matching algorithm.

Finally, we give the scheduling algorithm that is based on the algorithms described above.

Scheduling Algorithm:

1. At every time slot $k$ do the following steps.

2. Calculate the node weights using (2.14), (2.15).

3. Insert all links with positive weights into the eligible links set.

4. Check for critical nodes and their associated critical links.

5. Increase the weight of every critical node to a value $C$, where $C$ is larger than the weights of all non-critical nodes. Remove all non-critical links of the critical nodes from the eligible links set.

6. Find the maximum node weighted matching. The result of the matching is the schedule for time slot $k$.

If several sessions transmit cells over the same input/output pair, the scheduling algorithm does not specify which of them should be scheduled in the assigned slot. The scheduler considers all these sessions as one aggregated session and works with the aggregate rate. Once a slot is assigned to an input/output pair, it is the responsibility of a local scheduler maintained at the input port to assign the slot to one of the individual sessions. In principle, any single link sharing algorithm may be used as the local scheduler; here we use an EDF scheduler. This hierarchical approach improves the scalability of the scheduler. Note that, apart from the computations that may be needed to calculate rates under the fluid policy, the complexity of the scheduler does not depend on the number of sessions between input/output pairs, since they are all treated as one aggregated session. The proposed scheduling algorithm is used in the next section to schedule fixed rate sessions, and its performance is evaluated through simulations.
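The critical node detection loop announced above can be sketched as follows; this is our own illustration (names and the port encoding are assumptions), and how the per-cell deadlines are estimated from the fluid policy is left to the caller, as discussed in the text.

```python
def detect_critical(deadlines, k, L_max):
    """Sketch of the Critical Node Detecting Algorithm.

    deadlines -- deadlines[(i, j)]: list of (approximate) deadlines of the
                 cells queued on link (i, j)
    k         -- current time slot
    L_max     -- inspection horizon
    Returns the sets of critical ports and critical links.
    """
    crit_ports, crit_links = set(), set()
    for p in range(1, L_max + 1):
        # Step 2: cells that must be sent within the next p slots, per port.
        due = {}
        for (i, j), dls in deadlines.items():
            c = sum(1 for d in dls if d <= k + p)
            if c > 0:
                due.setdefault(('in', i), []).append(((i, j), c))
                due.setdefault(('out', j), []).append(((i, j), c))
        # Step 3: a port with at least p cells due within p slots is
        # critical, and its links with due cells are critical links.
        for port, items in due.items():
            if sum(c for _, c in items) >= p:
                crit_ports.add(port)
                crit_links.update(link for link, _ in items)
    return crit_ports, crit_links
```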
2.4.3 Fix Rate Scheduler Simulation

We consider multiple fixed rate sessions arriving at the input ports of a switch. These sessions are similar to the periodic sessions introduced in Section 2.2. Each session $m$ has an integer period $p_m$, and in any interval $[kp_m, (k+1)p_m - 1]$ one slot should be assigned to session $m$. If the slot is not assigned, we assume that one cell of that session is discarded. In effect, we are assuming that real time sessions have strict deadlines but can tolerate some cell loss. Moreover, the total capacity dedicated to real time sessions is less than the capacity of the switch. This does not necessarily mean that some of the capacity is wasted, since the remaining capacity could be dedicated to non-real time traffic. In fact, as will be seen, the capacity we consider allocated to real-time traffic in our simulations is much higher than the capacity normally assigned in today's networks.

The main input parameters to the simulation are the maximum port utilization and the minimum overall utilization of the switch ports, $u_M$ and $u_m$ respectively, and the switch size $N$. We refer to the first two parameters as the utilization pair $(u_M, u_m)$. In the first set of experiments, the inspection horizon $L_{\max}$ is an input and its effect is studied, while in the remaining experiments it is set to a constant value.

The sessions are generated as follows. A uniform random number generator is used to select the input and output ports of every session. The rate of a session is selected uniformly in the range $[1/1024, 68/1024]$, so that the period of a session is from 15 to 1024 slots. If the selected rate is such that the load of one of the ports would exceed the maximum port utilization, the rate is reduced so that the overall load of that port equals the maximum load. The period of the session is then set to the ceiling of the inverse of the resulting rate. The above process is repeated 10000 times. This does not mean that there are 10000 sessions in each session set, since in some of the attempts either the input or the output port is already fully loaded. Once a set of sessions is generated, it is accepted if the resulting average utilization of the switch ports exceeds $u_m$; otherwise it is discarded and another session set is formed. The minimum session rate is set to $1/1024$ because, for each session set, the simulation runs for 1024 time steps.

One of the main advantages of the heuristic algorithm is that its complexity is not a function of the number of sessions. The rates of all individual sessions with the same input and output are added up, and the arbiter treats them as one aggregated session. Once a slot is assigned to a link, a local scheduler selects the session that is going to use that slot; in our case, we simply use an EDF scheduler for this purpose. If no slot is assigned to a session during one of its period intervals, we assume that one of its cells is discarded.

To study the effect of the different parameters, several experiments are carried out. Each experiment is specified by the values selected for the switch size, the utilization pair, and the inspection horizon. For each experiment 100 session sets are generated, and for each session set 1024 time steps of simulation are performed. For every session, the percentage of discarded cells is calculated. The performance measures are the percentage of sessions with no discarded cells (0% loss ratio) and the percentage of sessions whose loss ratio is at most 10% (10% loss ratio).
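The session-generation procedure just described can be sketched as follows; this is our own reading of the procedure (names and the retry handling are assumptions), with the acceptance loop over discarded session sets left to the caller.

```python
import math, random

def generate_session_set(N, u_M, u_m, attempts=10000, seed=0):
    """Generate one candidate session set per the procedure above."""
    rng = random.Random(seed)
    load_in, load_out = [0.0] * N, [0.0] * N
    sessions = []
    for _ in range(attempts):
        i, j = rng.randrange(N), rng.randrange(N)
        rate = rng.uniform(1 / 1024, 68 / 1024)
        # Reduce the rate so that neither port exceeds u_M.
        rate = min(rate, u_M - load_in[i], u_M - load_out[j])
        if rate <= 0:
            continue                    # one of the ports is fully loaded
        period = math.ceil(1 / rate)
        load_in[i] += 1 / period
        load_out[j] += 1 / period
        sessions.append((i, j, period))
    avg_util = (sum(load_in) + sum(load_out)) / (2 * N)
    # Accept only if the average port utilization reaches u_m.
    return sessions if avg_util >= u_m else None
```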
Three different aspects of the algorithm are studied: the inspection horizon, the switch size, and the utilization pair.

Inspection Horizon

We introduced the concept of critical links as a way to detect, and increase the scheduling chances of, the links and ports that are most urgent to schedule. Obviously this increases the complexity of the scheduling algorithm; the additional computational load is a function of the selected inspection horizon. In this first series of experiments, we study the effectiveness of this procedure and appropriate values for the inspection horizon. The switch size is set to 32 and the utilization pair to (0.85, 0.8). The results are given in Table 2.2. We can deduce that the detection of critical links improves the capacity and reduces the percentage of non-perfectly scheduled sessions by about 5%. Notice that for $L_{\max} = 0$ (no check), a fraction 0.068 of the sessions have discarded cells, while for $L_{\max} = 1$ this reduces to 0.027.

L_max   0% Ratio   10% Ratio
0       0.9322     0.9996
1       0.9732     0.9997
2       0.9786     0.9997
3       0.9797     0.9997
4       0.9803     0.9997
5       0.9800     0.9997

Table 2.2: Performance for different inspection horizons (U = (0.85, 0.8), N = 32)

Switch Size

In this series of experiments, the inspection horizon is fixed to 5 and the utilization pair is (0.85, 0.8). Most of the heuristic algorithms provided for input queueing switches fail to give satisfactory results for moderate size switches [34]. The results of our simulation are given in Table 2.3. We observe some degree of performance degradation as the switch size increases; however, in all cases the 0% ratio is around 0.98, and if we decrease the network load we get even better results. This is a very important feature of the algorithm, since it is vital for the algorithm to perform well for larger switch sizes.

N    0% Ratio   10% Ratio
8    0.99       0.9999
16   0.98       0.9997
32   0.98       0.9997
64   0.98       0.9997

Table 2.3: Performance for different switch sizes (U = (0.85, 0.8))

Utilization

In this series of experiments the effect of utilization, i.e., switch load, is investigated. The switch size is set to 32, which is a moderate size switch. The results are given in Table 2.4. As expected, the performance of the system degrades with utilization. However, with the exception of the utilization pair (0.95, 0.9), which is very high for a realistic system, the percentage of sessions without any cell loss is above 98%.

U             0% Ratio   10% Ratio
(0.55, 0.5)   0.996      1
(0.65, 0.6)   0.993      0.9999
(0.75, 0.7)   0.989      0.9999
(0.85, 0.8)   0.980      0.9997
(0.95, 0.9)   0.952      0.9986

Table 2.4: Performance for different utilization pairs (N = 32)

2.5 Summary and Conclusion

In this chapter, the notions of fluid policies and tracking policies were extended to NxN switches. These concepts are useful in the design of high speed input queued switches, where they can aid in the development of scheduling policies that provide guaranteed service to different applications. The existence of an ideal tracking policy was proved for the special case of 2x2 switches. For the general case, it was shown that perfect tracking policies do not exist; however, a heuristic tracking policy was provided. The design of tracking policies for a general NxN switch is still an open question. The examples in Section 2.4 show that such tracking policies cannot be designed without further constraints on the arrivals or on the policies themselves.
This fact, together with the complexity that an ideal tracking policy entails, justifies the need for less complicated heuristic tracking policies with good performance. The proposed heuristic algorithm is based on two useful notions, the Maximum Node Matching and critical nodes. The scheduling is done in a hierarchical fashion: first the global scheduler selects the input-output pairs on which cells may be transmitted in a particular slot, and then a local scheduler assigns the slot to one of the sessions sharing the input-output pair. This approach makes the scheduler scalable in terms of the number of sessions. The simulation results are promising and illustrate that the algorithm can be useful in high speed networks and satellite switches, where not only throughput but also delay and jitter guarantees are desirable.

Chapter 3

MNCM: A class of efficient maximal size matching scheduling algorithms with no speedup

In this chapter, we use the fluid model technique to establish some new results for the throughput of input-buffered switches. In particular, we introduce a new class of deterministic maximal size matching algorithms that achieves 100% throughput. Dai and Prabhakar [13] have shown that any maximal size matching algorithm with a speedup of 2 achieves 100% throughput. We introduce a class of maximal size matching algorithms, which we call the maximum node containing matching (MNCM) algorithms, and prove that they achieve 100% throughput with no speedup. We also introduce a new weighted matching algorithm, maximum first matching (MFM), with complexity $O(N^{2.5})$, that belongs to MNCM. The MFM algorithm, to the best of our knowledge, is the lowest complexity deterministic algorithm that delivers 100% throughput. The only assumption on the input traffic is that it satisfies the strong law of large numbers.

Stability and throughput of input-buffered switches is a well studied problem. In the papers [37], [29] it is proved that the maximum weighted matching (MWM) algorithm can achieve 100% throughput. In [29] the number of backlogged packets and the maximum delay of waiting packets in each VOQ are considered as two potential weight functions. In another work [31], Mekkittikul and McKeown considered the case where weights are associated with the ports rather than the links and showed that the proposed algorithm, longest port first (LPF), achieves 100% throughput. The complexity of the LPF algorithm is also $O(N^3)$, even though for practical purposes it seems more favorable than MWM [30]. Stability of these algorithms is proven under the assumption of i.i.d. arrivals.

In chapter 2 we introduced MNM, a heuristic packetized tracking policy. In essence, the link weight function that we used there specifies how far the packetized policy is behind the fluid policy in serving a link. Intuitively, we can view the serving rate of the fluid policy as a virtual arrival process, and the weight function as the virtual backlog of the VOQs when the packetized tracking policy is used. Therefore, MNM is similar to the LPF algorithm; the only difference is that MNM uses the virtual backlog as the weight function, while LPF uses the real number of backlogged cells. In order to prove that a packetized policy based on the MNM algorithm can effectively track a fluid policy, we have to show that the virtual backlog process is stable and remains bounded. Recall that the virtual backlog process measures how much the packetized policy lags the fluid policy.
If the virtual backlog function is bounded, i.e., the MNM is rate stable, then the packetized policy effectively tracks the fluid policy. The stability result of [31] is not applicable under these conditions, since those results are for Bernoulli i.i.d. arrivals, whereas the virtual arrival process is by definition not an i.i.d. process.

Dai and Prabhakar [13] used the fluid model technique to prove that maximum link weighted matching achieves 100% throughput for a very general set of input traffic patterns. The only assumption on the input traffic is that it satisfies the strong law of large numbers and does not over-subscribe any input or output port. This result motivated us to use the same technique for the LPF and MNM algorithms. In this chapter, we extend some of the results of [13] by proving that the LPF algorithm is also rate stable. In fact, our result is more general than that: we introduce the maximum node containing matching (MNCM) algorithms, a new class of maximal size matching scheduling algorithms that achieves 100% throughput with no speedup. The LPF and MNM algorithms are both in the MNCM.

The norm that is commonly used for stability analysis is the 2-norm ($\|\cdot\|_2$). In our stability proof for the MNCM, we use the infinity norm ($\|\cdot\|_\infty$) instead, and focus on the matching algorithms that function based on it. We also introduce the maximum first matching (MFM) algorithm, which is in the MNCM class and therefore has 100% throughput. The MFM algorithm employs maximum size matching algorithms with $O(N^{2.5})$ complexity rather than maximum weighted matching algorithms with $O(N^3)$ complexity. Recall that both MNM and LPF have $O(N^3)$ complexity.

We also introduce another maximal deterministic matching algorithm with $O(N^2)$ complexity, the maximal sorted matching (MSM). MSM is basically a straightforward generalization of the iLPF algorithm introduced in [31]. MSM is not in the MNCM class, but for practical applications we believe it performs similarly to the MFM. Even though we were not able to prove that the MSM has 100% throughput, due to its similarity to MFM we believe that for all practical purposes it achieves 100% throughput. This conjecture is in accordance with our simulation results, where the delay performance of the MSM matches that of the MFM.

For the specific application of rate provisioning, we introduce the Maximum Size Unit Interval Matching (MSUIM) algorithm, which can provide rate guarantees to all connections in a switch fabric. MSUIM is not in the MNCM class, but it is based on the same concepts. In fact, we will show that the number of cells served under the MSUIM is at most N units behind an arbitrary feasible fluid policy.

This chapter is organized as follows. In Section 3.1, we introduce our notation and model. In Section 3.2, we review the link based model of the switch and extend the link based fluid model proposed in [13] to a port based model. In Section 3.3, we prove the main result of the chapter, the stability of the MNCM class of matchings. In Section 3.4, we present the MFM and MSM algorithms and elaborate on their complexity. In Section 3.5, we introduce the MSUIM algorithm and prove that it can provide rate guarantees. We conclude this chapter in Section 3.6 with some simulation results that demonstrate the delay performance of the proposed matching algorithms.

3.1 Model and Definition

We consider input queued switches that serve fixed size packets (cells). Each input and output has the capacity of serving 1 cell per unit time.
Since queues exist only at the input ports, the latter assumption implies that traffic of at most 1 cell per unit time can be transferred from the input ports to a given output port. To avoid HOL blocking, we consider that the buffer at an input is partitioned into N virtual output queues. The scheduling policy is basically a matching algorithm m that, based on the state of the switch, selects a matching between the inputs and outputs in every time slot. If input i is matched to output j and the corresponding VOQ is not empty, a packet is sent from input i to output j.

A matching can be represented by a permutation matrix $\pi$. Input ports are represented by the rows and output ports by the columns of this matrix; input i is matched to output j if and only if $\pi_{ij} = 1$. We assume that cells arrive at the switch at the beginning of a time slot and depart the switch at the end of a time slot. A packet that has arrived at the beginning of time slot n can be scheduled in the same time slot and depart the switch at the end of time slot n.

Let $A_{ij}(n)$ be the number of packets that arrived at input i and are destined for output j up to time n. We assume that there are no arrivals before time 0, i.e., $A_{ij}(0) = 0$. The arrival processes $\{A_{ij}(\cdot),\ i,j = 1,\ldots,N\}$ satisfy the strong law of large numbers, that is, with probability one,

$$\lim_{n\to\infty} \frac{A_{ij}(n)}{n} = \lambda_{ij}, \qquad i,j = 1,\ldots,N. \tag{3.1}$$

$\lambda_{ij}$ is the arrival rate of cells going from input i to output j. Similarly, we denote the number of cells departed from input i to output j up to time n by $D_{ij}(n)$. We consider the following definition of stability.

Definition 3.1.1 A switch operating under a scheduling algorithm is rate stable if, with probability one,

$$\lim_{n\to\infty} \frac{D_{ij}(n)}{n} = \lambda_{ij}, \qquad i,j = 1,\ldots,N, \tag{3.2}$$

for any arrival process that satisfies (3.1).

Definition 3.1.2 A scheduling algorithm is efficient if (3.2) holds for any arrival process satisfying the feasibility conditions

$$\sum_{i=1}^{N} \lambda_{ij} \le 1, \qquad \sum_{j=1}^{N} \lambda_{ij} \le 1. \tag{3.3}$$

Let $Z_{ij}(n)$ be the number of backlogged packets in $VOQ_{ij}$ at time n, hence

$$Z_{ij}(n) = A_{ij}(n) - D_{ij}(n). \tag{3.4}$$

We will introduce and analyze some scheduling algorithms that work based on port level parameters. Port level parameters are defined to be the aggregate (summation) of their corresponding link level parameters. We number the ports of the switch from 1 to 2N, where the index sets $\{1,\ldots,N\}$ and $\{N+1,\ldots,2N\}$ correspond to input and output ports respectively. Let $(i,j)$, $i,j \in \{1,\ldots,N\}$, denote one of the switch links and $k \in \{1,\ldots,2N\}$ one of the switch ports. We say link $(i,j)$ goes to port k, written $(i,j) \to k$, if k and either i or j are associated with the same physical port, i.e., either $k = i$ or $k = j + N$. We define the port arrival process $H(n) = \{H_k(n)\}$, the departure process $E(n) = \{E_k(n)\}$, the backlog process $B(n) = \{B_k(n)\}$, and the port arrival rate $r = \{r_k\}$, $k = 1,\ldots,2N$, as

$$H_k(n) = \sum_{(i,j):(i,j)\to k} A_{ij}(n), \quad E_k(n) = \sum_{(i,j):(i,j)\to k} D_{ij}(n), \quad B_k(n) = \sum_{(i,j):(i,j)\to k} Z_{ij}(n), \quad r_k = \sum_{(i,j):(i,j)\to k} \lambda_{ij}. \tag{3.5}$$

In the next section, we provide the equations that govern the evolution of the port level parameters.
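To make the port-level bookkeeping of (3.5) concrete, the following Python fragment aggregates a matrix of link backlogs into the 2N port weights. It is a minimal sketch; the NumPy array layout and the function name are our own illustrative choices, not part of the dissertation.

```python
import numpy as np

def port_parameters(Z):
    """Aggregate an N x N matrix of link-level backlogs Z[i][j] into the
    2N port-level weights B_k of equation (3.5): ports 1..N are inputs
    (row sums), ports N+1..2N are outputs (column sums)."""
    Z = np.asarray(Z)
    n = Z.shape[0]
    B = np.empty(2 * n)
    B[:n] = Z.sum(axis=1)   # input port k = i collects all links (i, .)
    B[n:] = Z.sum(axis=0)   # output port k = j + N collects all links (., j)
    return B
```

For example, `port_parameters([[2, 0], [1, 3]])` returns `[2, 4, 3, 3]`: row sums for the two input ports followed by column sums for the two output ports.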
3.2 Link and Port Fluid Models

The fluid model for the links of an input buffered switch is given in [13]. We first review those relations and then extend them to derive the port based fluid model. Consider an input-buffered switch that employs scheduling algorithm m, and let $T^m_\pi(n)$ be the total time that permutation matrix $\pi$ has been used up to time n. The following equations describe the link level discrete dynamics of the switch, for $n \ge 0$ and $i,j = 1,\ldots,N$:

$$Z_{ij}(n) = Z_{ij}(0) + A_{ij}(n) - D_{ij}(n),$$
$$D_{ij}(n) = \sum_{\pi\in\Pi} \sum_{l=1}^{n} \pi_{ij}\,1_{\{Z_{ij}(l)>0\}}\left(T^m_\pi(l) - T^m_\pi(l-1)\right),$$
$$\sum_{\pi\in\Pi} T^m_\pi(n) = n, \tag{3.6}$$

where $\Pi$ is the set of all $N \times N$ permutation matrices. The first equation describes the basic relation between the arrival, departure and backlog processes. The second equation counts the total number of departures from input i to output j by counting the number of times that a permutation matrix with a one in position $(i,j)$ is used while there was a backlogged packet in the corresponding VOQ.

The port dynamics of a switch can be derived from the link dynamics. Every port dynamic, $k = 1,\ldots,2N$, is basically the summation of its corresponding link dynamics:

$$B_k(n) = B_k(0) + H_k(n) - E_k(n),$$
$$E_k(n) = \sum_{\pi\in\Pi} \sum_{l=1}^{n} \sum_{(i,j):(i,j)\to k} \pi_{ij}\,1_{\{Z_{ij}(l)>0\}}\left(T^m_\pi(l) - T^m_\pi(l-1)\right),$$
$$\sum_{\pi\in\Pi} T^m_\pi(n) = n. \tag{3.7}$$

Now we describe a deterministic continuous fluid model of a switch operating under some matching algorithm m. The link level fluid model used in [13] is

$$Z_{ij}(t) = Z_{ij}(0) + \lambda_{ij} t - D_{ij}(t) \ge 0,$$
$$\dot D_{ij}(t) = \sum_{\pi\in\Pi} \pi_{ij} \dot T^m_\pi(t) \ge 0 \quad \text{if } Z_{ij}(t) > 0,$$
$$\sum_{\pi\in\Pi} T^m_\pi(t) = t. \tag{3.8}$$

We can derive the port level fluid model of the switch from relations (3.8) and (3.5), or directly from (3.7):

$$B_k(t) = B_k(0) + r_k t - E_k(t) \ge 0,$$
$$\dot E_k(t) = \sum_{\pi\in\Pi} \sum_{(i,j):Z_{ij}(t)>0,\ (i,j)\to k} \pi_{ij} \dot T^m_\pi(t) \ge 0,$$
$$\sum_{\pi\in\Pi} T^m_\pi(t) = t. \tag{3.9}$$

Note that (3.9) is not an independent, complete set of equations describing the dynamic behavior of the port level fluid model, since the second equation still uses link level parameters ($Z_{ij}(t)$).

The matching algorithm used for scheduling provides additional fluid model equations. For instance, if we use the conventional maximum weighted matching algorithm and consider the number of backlogged cells of every link, $Z_{ij}(t)$, as its weight, there is one additional fluid equation for every matching $\pi$ [13]:

$$\dot T^m_\pi(t) = 0 \quad \text{if } \langle \pi, Z(t)\rangle < \langle \pi', Z(t)\rangle \text{ for some } \pi' \in \Pi. \tag{3.10}$$

In other words, at time t, under the maximum weight matching algorithm, a matching $\pi$ that has less weight than another matching $\pi'$ is not employed.

Here, we consider weighted matching algorithms that function based on port weights rather than link weights. The weight of a port is the total weight of all links connected to it. The LPF [31] is one of these algorithms, where the backlogged traffic at every port is taken as the weight of that node. The maximum node matching (MNM) [35] that we introduced in chapter 2 is another example, where the weights are the amount of service that the scheduler owes every link according to the reserved rate of that port. The only difference between the MNM and LPF is in the link weight function; they have the same complexity. We proved that the matching obtained under the MNM is min-max fair and has the maximum lexicographic ordering. In other words, it is impossible to add a new port to the matching without removing a higher weight node from it. This is also true for the LPF with its corresponding weight vectors. This property can be used to write an additional fluid equation for the LPF; however, we consider a generalized class of matching algorithms, the MNCM, that includes the LPF, and derive some interesting results for this general class. The significance of the MNCM class becomes clearer later, when we introduce a deterministic matching algorithm in the MNCM that has lower complexity and very good delay performance.
First we need to define the MNCM.

Definition 3.2.1 A maximal size matching algorithm belongs to the MNCM class if and only if it always uses a matching m that contains all nodes whose weight is greater than $(1-1/N)B_{max}$, where $B_{max}$ is the maximum weight.

For now, we assume that the MNCM class is non-empty, i.e., for any combination of the link weight values there exists at least one matching that is in the MNCM. Under this assumption, the LPF turns out to be in the MNCM. Recall that the LPF is a min-max matching algorithm. This means that if a node i is not in the LPF matching, there is no matching m that contains node i together with all nodes that are in the LPF matching and have weight larger than i. In order to prove that the LPF is in the MNCM, we assume that it is not, and reach a contradiction. Assume that the LPF is not in the MNCM; then there must be a node i that is not in the LPF matching but whose weight is greater than $(1-1/N)B_{max}$. On the other hand, we know that there exists a matching m in the MNCM that contains node i and all other nodes whose weight is larger than that of i. This contradicts the min-max property of the LPF, and hence the LPF must be in the MNCM. In the next section, we prove that any MNCM scheduler is efficient, which immediately proves that the LPF is efficient. Moreover, we show that there are some less complex algorithms in the MNCM class too.

Recall that the matching algorithm used for scheduling results in additional fluid model equations. For any matching in the MNCM, the following additional fluid equation holds:

$$\sum_{\pi\in\Pi} \Big(\sum_{(i,j):Z_{ij}(t)>0,\ (i,j)\to k} \pi_{ij}\Big) \dot T^m_\pi(t) = 1 \quad \text{if } B_k(t) > (1-1/N)B_{max}. \tag{3.11}$$

Equation (3.11) says that all ports whose weight is greater than $(1-1/N)B_{max}$ are fully served under an MNCM policy. In fact, we are restating the definition of the MNCM policy here. We can do that because the threshold value is a linear function of the node weights. For instance, if the scheduling policy were to include all nodes whose weight is greater than $B_{max} - 1$, we could not use the same criterion in the corresponding fluid model, since that threshold is not a linear function of the node weights.

In order to make this clear, it is helpful to go over some basic concepts of the fluid model technique and show how the fluid model variables such as $B(t)$ are derived from the original queueing model parameters such as $B(n)$ [12]. Starting from the discrete process $B(n)$, we define its continuous time extension $\hat B(t)$ by linear interpolation. The fluid model focuses on the behavior of

$$\lim_{k\to\infty} \frac{\hat B(kt)}{k} = B(t), \tag{3.12}$$

which is called the fluid limit of the queue length vector. Notice that if the weight of a node i, $B_i(n)$, is greater than $(1-1/N)B_{max}(n)$, then its fluid limit counterpart $B_i(t)$ is also greater than $(1-1/N)B_{max}(t)$. In contrast, if $B_i(n) > B_{max}(n) - 1$, we cannot conclude that $B_i(t) > B_{max}(t) - 1$.

Before ending this section we need to define the function $f(B(t))$, a function from $\mathbb{R}^{2N}$ to $\mathbb{R}$:

$$f(B(t)) = \max\{B_1(t),\ldots,B_{2N}(t)\}. \tag{3.13}$$

This function plays a role similar to a Lyapunov function in the stability proofs. Under a fluid model, $B(t)$ is differentiable, and consequently $f(B(t))$ is continuous but not necessarily differentiable. In [13], the 2-norm of the weight vector is used in the definition of the function f, so f is differentiable; here we use the infinity norm, and the function f is no longer differentiable.
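Definition 3.2.1 is easy to test mechanically. The sketch below checks whether a candidate matching is MNCM-compliant for a given backlog matrix; the dict/list data representation is an assumption made for illustration, not an interface from the text.

```python
def is_mncm_matching(matching, Z):
    """Check Definition 3.2.1 for `matching` (a dict input -> output)
    against the backlog matrix Z: every port whose weight exceeds
    (1 - 1/N) * Bmax must be covered by the matching."""
    n = len(Z)
    in_w = [sum(Z[i]) for i in range(n)]                  # input port weights
    out_w = [sum(Z[i][j] for i in range(n)) for j in range(n)]
    thresh = (1.0 - 1.0 / n) * max(in_w + out_w)
    matched_in, matched_out = set(matching), set(matching.values())
    ok_in = all(i in matched_in for i in range(n) if in_w[i] > thresh)
    ok_out = all(j in matched_out for j in range(n) if out_w[j] > thresh)
    return ok_in and ok_out
```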
3.3 Stability Results

Our main objective here is to prove the following theorem.

Theorem 3.3.1 A switch operating under an MNCM matching algorithm is efficient.

To prove Theorem 3.3.1, we use the following theorem, which is proved in [13].

Theorem 3.3.2 A switch operating under a matching algorithm is rate stable if the corresponding fluid model is weakly stable. A fluid model is weakly stable if for every fluid model solution $(E,T,B)$ with $B(0) = 0$, $B(t) = 0$ for all $t \ge 0$.

Intuitively, from equation (3.12) we can conclude that in the fluid model $B(0) = 0$ whenever there is a finite number of customers in the system at time 0. Moreover, if the length of the queues $B(n)$ stays finite with probability one, the fluid limit variable $B(t)$ remains 0. This is consistent with the definition of a weakly stable fluid model.

Proof of Theorem 3.3.1: Let $(E,T,B)$ be a solution to equations (3.9), (3.11) with $B(0) = 0$. In order to prove that $B(t)$ remains zero for $t > 0$, it suffices to prove that $f(B(t))$ remains zero. Therefore, it is sufficient to show that whenever $f(B(t)) > 0$, it is strictly decreasing.

Suppose that $f(B(t_0)) = L > 0$. Due to the continuity of $f(B(t))$, there exists $\delta_0 > 0$ such that every node that has the maximum weight for some $t \in [t_0, t_0+\delta_0]$ has its weight in the interval $((1-1/N)f(B(t)), f(B(t))]$ for all $t \in [t_0, t_0+\delta_0]$. For example, suppose that node $n_{max}(t_0)$ has the maximum weight at time $t_0$. It is possible that this node does not have the maximum weight in the neighborhood of $t_0$; however, due to continuity, its weight remains in the interval $((1-1/N)f(B(t)), f(B(t))]$ for some time interval $[t_0, t_0+\delta]$.

Let $N(t_0, t_0+\delta_0)$ be the set of nodes whose weight is maximum for some t in $[t_0, t_0+\delta_0]$. By the definition of the MNCM and the way $\delta_0$ is chosen, all nodes in $N(t_0, t_0+\delta_0)$ are fully utilized for all t in this interval, since their weight is always in the region $((1-1/N)f(B(t)), f(B(t))]$. Therefore, their backlog functions are decreasing, and consequently $f(B(t))$, which is the maximum weight over the nodes in $N(t_0, t_0+\delta_0)$, is decreasing for all $t \in [t_0, t_0+\delta_0]$ as well.

Using the same argument, given any initial non-zero backlog value at time $t_0$, it is possible to partition the time line into intervals $[t_k, t_{k+1}]$, $k = 0,1,\ldots$ (with $t_{k+1} - t_k = \delta_k > 0$), such that $f(B(t))$ is either zero or strictly decreasing in every interval. Therefore, we have proved that $B(t) = 0$ and the fluid model is weakly stable. Using Theorem 3.3.2, we conclude that the MNCM policies are efficient. The novelty of our approach is to consider the maximum norm in the definition of the function $f(\cdot)$; in previous approaches, usually a 2-norm function was used.

Our proof is not complete yet. We have to prove that there is always a matching $m(t)$ that belongs to the MNCM. The matching $m(t)$ is in the MNCM if and only if it contains all nodes whose weight is in the interval $((1-1/N)f(B(t)), f(B(t))]$. The existence of such a matching $m(t)$ is proved in the following lemma.

Lemma 3.3.3 Consider an $N \times N$ input buffered switch fabric. For all values of $f(B(t)) > 0$, all nodes whose weight is in $((1-1/N)f(B(t)), f(B(t))]$ can be included in a matching $m(t)$.

Proof of Lemma 3.3.3: Without loss of generality, assume that there is an input node whose weight is in the $\varepsilon$-neighborhood of $f(B(t))$ and that cannot be included in the matching. We show that this forces $\varepsilon \ge (1/N)f(B(t))$.
Let A be the set of all input nodes whose weight is in the $\varepsilon$-neighborhood of $f(B(t))$, and let E be the set of all output nodes. By Hall's theorem¹, since there is no matching between A and E that contains all nodes in A, we can conclude that there is a subset S of nodes in A whose cardinality is greater than that of its neighbor set $N_S$:

$$|N_S| \le |S| - 1. \tag{3.14}$$

Note that $N_S$ is the set of all output nodes that have a common link with at least one node in S. Let $W(S)$ be the total weight of all nodes in S; by definition we have

$$(f(B(t)) - \varepsilon)|S| \le W(S) \le f(B(t))|S|. \tag{3.15}$$

Since $N_S$ is connected to all links connected to S,

$$W(N_S) \ge W(S). \tag{3.16}$$

From relations (3.15) and (3.16),

$$W(N_S) \ge (f(B(t)) - \varepsilon)|S|. \tag{3.17}$$

Therefore, there is a node $k \in N_S$ such that

$$u_k(t) \ge \frac{W(N_S)}{|N_S|} \ge \frac{(f(B(t)) - \varepsilon)|S|}{|N_S|} \ge \frac{(f(B(t)) - \varepsilon)|S|}{|S| - 1} \ge \frac{(f(B(t)) - \varepsilon)N}{N-1} = \frac{N}{N-1}\left(\frac{f(B(t))}{N} - \varepsilon\right) + f(B(t)). \tag{3.18}$$

The second inequality is from (3.17), the third from (3.14), and the fourth is due to the fact that $|S| \le N$. If $\varepsilon < f(B(t))/N$, from (3.18) we have

$$u_k(t) > f(B(t)). \tag{3.19}$$

This contradicts the definition of $f(B(t))$; hence there always exists a matching m that contains all nodes whose weight is in the $\varepsilon$-neighborhood of $f(B(t))$ for $\varepsilon < f(B(t))/N$.

¹Hall's Theorem: There exists a matching that contains all input nodes A in a bipartite graph if and only if $|S| \le |N_S|$ for every subset $S \subseteq A$, where $N_S$ is the set of neighbor nodes of set S.

3.3.1 Comparison of the MNCM and LPF

It is instructive to compare the MNCM with the LPF. We know that the LPF is min-max fair, i.e., the weight of the largest-weight node that is not in the matching is minimized. Therefore, the LPF not only contains all nodes whose weight is in the interval $((1-1/N)f(B(t)), f(B(t))]$, but also tries to contain all other nodes, always giving priority to a node with larger weight. In contrast, the MNCM only focuses on the nodes whose weight is in the interval $((1-1/N)f(B(t)), f(B(t))]$ and neglects the others. In fact, for the MNCM we have identified a group of nodes whose weight is larger than the others, and we know that it is possible to contain all of them in a matching. Moreover, we have proved that for stability it is sufficient to contain only these nodes; we really do not care about the other nodes. Intuitively, it makes sense that for stability the only concern is the nodes whose weight is close to the maximum value. This fact should enable us to come up with rate stable algorithms that have lower complexity than the LPF.

We can also view the MNCM as a quantized version of the LPF. Consider a quantizer for the node weights that has two levels, $f(B(t))$ and 0. All weight values above $(1-1/N)f(B(t))$ are quantized to $f(B(t))$ and all other weights are quantized to 0. It should be clear that if we use the LPF with the quantized weights, the result is an MNCM scheduler. In the next section, we introduce a matching algorithm that is in the MNCM and has lower complexity than all previously suggested deterministic algorithms, including the LPF.

3.4 Maximum First Matching (MFM)

In the previous section, we proved that the MNCM algorithms are efficient. In the MFM algorithm, we find two matchings that contain, respectively, all input and all output nodes whose weight is in the interval $((1-1/N)f(B(t)), f(B(t))]$. We call these nodes the marked nodes. Note that the existence of such matchings is proved in Lemma 3.3.3.
The LPF algorithm introduced in [31] is an example of an MNCM algorithm. LPF works on non-matched nodes one by one, starting with the node that has the highest weight. To find the matching, for each of the nodes we need to search all $N^2$ links of the bipartite graph. Since there are 2N nodes, the complexity of the LPF turns out to be $O(N^3)$. Basically, this algorithm is a modification of the Edmonds-Karp max-flow algorithm [15] and its complexity is the same. The Hopcroft-Karp maximum size matching algorithm [23] has $O(N^{2.5})$ complexity. In that algorithm the nodes are introduced into the matching simultaneously; this is not possible in the LPF algorithm.

In the MFM algorithm, we replace the maximum weighted matching algorithm with a limited number of maximum size matching computations. In this way, we can use the Hopcroft-Karp algorithm to obtain the maximum size matching in each step and reduce the complexity of the algorithm. We can do this because our objective is no longer to find the maximum weighted matching, but to have a matching that contains all nodes with close-to-maximum weight. The details of the algorithm are as follows (the complexity of each step is given at its end).

Algorithm 3.4.1 (MFM):

1. Sort all input and output nodes according to their weight, and mark all nodes whose weight is greater than $(1-1/N)f(B(t))$ ($O(N\log N)$).

2. Find a matching $M_1$ that contains all marked input nodes ($O(N^{2.5})$).

3. Find a matching $M_2$ that contains all marked output nodes ($O(N^{2.5})$).

4. Combine $M_1$ and $M_2$ to get a merged matching containing all marked nodes in $M_1 \cup M_2$ ($O(N)$).

5. Perform a simple sorted maximal size matching algorithm on the rest of the nodes not in the matching ($O(N^2)$).

6. Combine the matchings of steps 4 and 5; the result is the MFM matching.

In step 1, all nodes are sorted according to their weights. Since there are 2N nodes, the complexity of this step is $O(N\log N)$. In step 2, we consider all marked input nodes together with all output nodes and find a maximum size matching, $M_1$, in the corresponding graph. Note that in Lemma 3.3.3 we concluded that such a matching always exists. The complexity of this step is $O(N^{2.5})$, which is the complexity of finding a maximum size matching. Step 3 is similar to step 2, but here we find a matching $M_2$ that covers all marked output nodes. In step 4, we combine the two matchings to find a matching that covers the input nodes covered in $M_1$ and the output nodes covered in $M_2$.

It is useful to elaborate more on step 4. Consider the bipartite graph that contains only those links that are in $M_1$ or $M_2$ ($M_1 \cup M_2$). Our objective is to find a matching in the bipartite graph $M_1 \cup M_2$ that contains all of the marked nodes. The maximum degree of a node in the combined graph is 2, since there are at most two links connected to every node (one from $M_1$ and one from $M_2$). This bipartite graph can be divided into disjoint sub-graphs, and in each subgraph we have to obtain an optimal matching. The subgraphs can be classified into four classes, and we explain the process of obtaining a matching for each of them.

[Figure 3.1: Four possible forms of sub-graphs in a combined matching. Links of $M_1$ are shown with solid lines and $M_2$ with dashed lines. Marked nodes are shown as black nodes.]

In general, we have to select a group of links, belonging to either $M_1$ or $M_2$, that constructs a matching in the sub-graph, as follows (Fig. 3.1):

1. Single link subgraph: If there is an isolated link connecting two nodes, include that link in the matching.
2. Cycle subgraph: Since the graph is bipartite, cycles have an even number of links that alternately belong to $M_1$ and $M_2$. We can select either set of links for the matching, since both cover all nodes in the cycle.

3. Path subgraph with an odd number of links: Here we have an alternating path, which is an augmenting path in matching terminology. One set of links, belonging to either $M_1$ or $M_2$, has one more element; that set covers all the nodes in the sub-graph and thus should be selected for the matching. Both end nodes of the path belong to the same matching, one on the input side and the other on the output side. The set of alternating links connected to the end nodes should be included in the matching.

4. Path subgraph with an even number of links: Without loss of generality, assume such a path starts from an input node. Since there is an even number of links, the path also ends at an input node. One of these end nodes belongs to $M_1$ and is therefore marked; the other belongs to $M_2$ and is not marked (because it is not in $M_1$). If we select the links that belong to $M_1$, only the unmarked end node is excluded, which is not important. Therefore, we can employ the following general rule for paths: if a path starts from an input (output) node that is marked, include the set of alternating links that contains that marked node.

Therefore, we take the following sub-steps in step 4 to merge $M_1$ and $M_2$.

Algorithm 3.4.2 (Merge (step 4)):

4.1 Search for an input or output marked node i that is not selected yet and is connected to only one link in $M_1 \cup M_2$. If such a node does not exist, go to step 4.4.

4.2 In the path that originates from node i, select the set of alternating links that contains the link connected to node i, and include it in the matching. Discard the other alternating set of links.

4.3 Flag all nodes in the path covered in 4.2 as searched and go back to step 4.1.

4.4 Search for an input or output node i that is not selected yet and is connected to two links in $M_1 \cup M_2$. If such a node does not exist, the desired matching is found and the search is finished.

4.5 In the cycle containing node i, select one set of alternating links (arbitrarily), include them in the matching, and discard the other set.

4.6 Flag all nodes in the cycle covered in 4.5 as searched and go back to step 4.4.

In the merge algorithm, steps 4.1 to 4.3 search for paths and steps 4.4 to 4.6 search for cycles. To determine the complexity of step 4, notice that in the merge algorithm described above we are searching the bipartite graph $M_1 \cup M_2$. This graph has at most 2N links and 2N nodes, and therefore the search complexity is $O(N)$. The resulting matching covers all marked nodes.

In step 5, we perform a sequential maximal matching on the nodes that are not yet in the matching. Starting from the node with the maximum weight, we scan all its neighbors, beginning with the one with the maximum weight. If we find a neighbor that is not in the matching, we match the two nodes and add the corresponding link to the matching list. The number of nodes that we have to scan in this step is less than 2N, and for each one we have to check at most N neighbors. This has a complexity of $O(N^2)$.
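The merge of step 4 is easy to get wrong at the path/cycle boundaries, so a compact sketch may help. The Python function below is a minimal illustration of Algorithm 3.4.2, assuming matchings are given as input-to-output dictionaries and marked nodes as a set of ('in', i)/('out', j) labels; the representation and helper names are our own, not the dissertation's.

```python
def merge_matchings(m1, m2, marked):
    """Merge matchings m1, m2 (dicts: input -> output) into one matching
    on the union graph that covers every node in `marked`."""
    e1, e2 = set(m1.items()), set(m2.items())
    result = e1 & e2                    # shared links: isolated single links
    e1, e2 = e1 - result, e2 - result
    adj = {}                            # node -> {matching tag: (i, j)}
    for tag, edges in ((1, e1), (2, e2)):
        for i, j in edges:
            adj.setdefault(('in', i), {})[tag] = (i, j)
            adj.setdefault(('out', j), {})[tag] = (i, j)
    seen = set()

    def walk(start, tag):
        # Collect the alternating path/cycle from `start`, split by matching.
        sides, node = {1: set(), 2: set()}, start
        while tag in adj.get(node, {}) and adj[node][tag] not in seen:
            i, j = adj[node][tag]
            seen.add((i, j))
            sides[tag].add((i, j))
            node = ('out', j) if node == ('in', i) else ('in', i)
            tag = 3 - tag               # alternate between M1 and M2
        return sides

    def covered(edges):
        return {n for i, j in edges for n in (('in', i), ('out', j))}

    def pick(sides):
        # Keep the alternating side that covers this component's marked
        # nodes, preferring the larger side (the rule for odd paths and
        # cycles); even paths fall back to the marked endpoint's side.
        comp_marked = (covered(sides[1]) | covered(sides[2])) & marked
        a, b = sorted((sides[1], sides[2]), key=len, reverse=True)
        return a if comp_marked <= covered(a) else b

    for node, tags in adj.items():      # paths: start at degree-1 nodes
        if len(tags) == 1:
            (tag,) = tags
            if adj[node][tag] not in seen:
                result |= pick(walk(node, tag))
    for node, tags in adj.items():      # what remains are cycles
        for tag in tags:
            if adj[node][tag] not in seen:
                result |= pick(walk(node, tag))
    return dict(result)
```

Note that only the path components depend on which endpoints are marked; for cycles either alternating set covers every node, which is why the coverage test above can default to an arbitrary side.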
The MSM is basically very 96 similar to iLPF algorithm that is introduced in [31]. Similar to the MFM, MSM works on a sorted list of the nodes, however it does not treat the marked nodes separately to ensure that they are contained into the matching. The MSM scans all nodes starting with the one with maximum weight and going down the sorted list, and tries to include the scanned (primary) nodes into the matching. At every step it scans all neighbor nodes (starting from the neighbor node with maximum weight) of the primary node until it flnds a free (not matched) node or until it scans all neighbors. If it flnds a free neighbor, it considers it as the secondary node and match the primary and secondary nodes and add the connecting link to the matching. This is essentially similar to the step 5 of the MFM and results in an O(N2) complexity algorithm. Algorithm 3.4.3 (MSM): 1. Sort all input and output nodes according to their weight O(N logN). 2. Perform a simple sorted maximal matching algorithm on all of the nodes (O(N2)). ? Although we can not prove that MSM achieves 100% throughput, in simulations its performance was identical to MFM. We review some of the simulation results in the next section. 97 3.5 Rate Provisioning and the MSUIM Schedul- ing Algorithm In this section, we introduce the Maximum Size Unit Interval Matching (MSUIM) scheduling algorithm that can provide rate guarantees to the backlogged connec- tions. This result is useful by itself and is also insightful for understanding the stochastic stability results of the MNCM. Even though the MSUIM is not in the MNCM, it is based on the same concepts and for the purpose of rate guarantees it appears to be more suitable. Consider that a variable guaranteed rate matrix G(n) is given. Our objective is to provide the reserved rates gij(n) to the cells going from the input i to the output j, whenever there are backlogged cells for that connection. We consider a credit based scheduling algorithm, where vij(n), credit of the link (i;j), re ects how much service we owe to that link at time n. When a link (i;j) is backlogged, its credit increments by its guaranteed rate gij(n) and decrements by one if that link is scheduled at the time slot n. Furthermore, for simplicity we assume that all links are always backlogged and a link can not be scheduled if its credit is less than one. Therefore, the credit matrix evolves according to the following linear relation, V (n + 1) = V (n) + G(n)?S(n): (3.20) S(n) is the scheduling matrix at time n and sij = 1 if link (i;j) is scheduled at time n and is zero otherwise. To avoid confusion with the number of backlogged cells, we use uk(n) k = 1;??? ;2N to show the node weight in this section: 98 uk(n) = P (i;j):(i;j)!k vij(n): (3.21) Now, we deflne the MSUIM scheduling algorithms that we use in this section. Note that weight of a node is sum of the credits of links connected to it. Deflnition 3.5.1 A maximal size matching algorithm belongs to the MSUIM class if and only if it always uses a matching m that contains all nodes that their weight are greater than umax ?1, where umax is the maximum weight. Therefore, for the MSUIM, we have to flnd the maximum weight of all nodes umax and mark all nodes that their weight is in the interval [umax ?1;umax]. Next, we have to flnd a matching that contains maximum number of marked nodes. It is not always possible to contain all marked nodes in the MSUIM but as we proved in lemma 3.3.3 for large enough maximum weights this becomes possible. 
The guaranteed rate matrix is feasible if

$$\sum_{i=1}^{N} g_{ij} \le 1, \quad j = 1,\ldots,N; \qquad \sum_{j=1}^{N} g_{ij} \le 1, \quad i = 1,\ldots,N. \tag{3.22}$$

We will show that if the guaranteed rate matrix is inside the feasibility region, the credit matrix remains bounded below N+1; hence the algorithm can provide variable rate guarantees. We define the function $f(n)$ to be equal to the maximum node weight,

$$f(n) = \max\{u_1(n),\ldots,u_{2N}(n)\}. \tag{3.23}$$

Theorem 3.5.2 Consider an $N \times N$ input-buffered switch with all virtual output queues backlogged and a given variable guaranteed rate matrix $G(n)$ inside the feasibility region. If the scheduling policy is MSUIM and the credits of the links are used as the weight function, then the credits of all links remain bounded below N+1.

Proof of Theorem 3.5.2: We show that for $f(n) > N$ we have $f(n+1) < f(n)$. Due to the feasibility of the matrix $G(n)$, the weight of a node cannot increase by more than one per cell time, and since $f(0) = 0$, the maximum weight of a node is always less than N+1. Therefore, the maximum credit of a node is always less than N+1 and the algorithm can provide rate guarantees.

The proof is very similar to Lemma 3.3.3. Let $f(n) > N$; we show that there exists a matching that contains all marked nodes, i.e., all nodes whose weight is in $[f(n)-1, f(n)]$. Suppose that this is not true, and, without loss of generality, that there is a marked input node that cannot be included in the matching. By Hall's theorem, this means that there is a subset S of the marked input nodes whose cardinality is greater than that of its neighbor set $N_S$:

$$|N_S| \le |S| - 1. \tag{3.24}$$

Let $W(S)$ be the total weight of all nodes in S; by definition we have

$$(f(n)-1)|S| \le W(S) \le f(n)|S|. \tag{3.25}$$

Since $N_S$ is connected to all links connected to S,

$$W(N_S) \ge W(S). \tag{3.26}$$

From relations (3.25) and (3.26),

$$W(N_S) \ge (f(n)-1)|S|. \tag{3.27}$$

Therefore, there is a node $k \in N_S$ such that

$$u_k(n) \ge \frac{W(N_S)}{|N_S|} \ge \frac{(f(n)-1)|S|}{|N_S|} \ge \frac{(f(n)-1)|S|}{|S|-1} \ge \frac{(f(n)-1)N}{N-1} = \frac{N}{N-1}\left(\frac{f(n)}{N} - 1\right) + f(n) > f(n). \tag{3.28}$$

This contradicts the definition of $f(n)$. Hence, for $f(n) > N$ all marked nodes can be included in the matching. Therefore, if k is a marked node,

$$u_k(n+1) = u_k(n) - 1 + \sum_{(i,j)\to k} g_{ij} < u_k(n) \le f(n), \tag{3.29}$$

and if k is not a marked node ($u_k(n) < f(n)-1$),

$$u_k(n+1) \le u_k(n) + \sum_{(i,j)\to k} g_{ij} < u_k(n) + 1 < f(n). \tag{3.30}$$

From relations (3.29) and (3.30) we can conclude that

$$f(n+1) < f(n). \tag{3.31}$$

Therefore, if $f(n) \le N$ we know that $f(n+1) < N+1$ by definition, and if $N < f(n) < N+1$, from (3.31) we can conclude that $f(n+1) < f(n) < N+1$; therefore $f(n)$ never exceeds N+1.

It should be clear that we can reach the same result with a scheduling algorithm in MNCM, since for $f(n) \ge N$ the corresponding interval of the MNCM covers the interval of the MSUIM.

3.6 Simulations

In this section we present our simulation results. The main objective is to study the delay performance of the proposed scheduling algorithms. For practical purposes it is not sufficient for a scheduling algorithm to have 100% throughput; it is essential to compare and study the delay performance of different scheduling algorithms as well.

We consider a $32\times 32$ input-buffered switch. For each experiment the average throughput $\rho$ of the switch is given. A random $32\times 32$ rate matrix is generated such that the aggregate rate of every input (row sum) and output (column sum) is $\rho$. The rate matrix is generated iteratively: in each iteration a new flow between two randomly selected input and output nodes is generated.
After selecting the input node i and output node j of a flow, the maximum allowable rate (MAR) for that flow is set to the minimum of:

- the maximum flow rate, which is set to 0.1 in our experiments;
- the difference between $\rho$ and the aggregate rate of input port i (the aggregate rate of i is the sum of all elements of the rate matrix in row i);
- the difference between $\rho$ and the aggregate rate of output port j (the aggregate rate of j is the sum of all elements of the rate matrix in column j).

Next, a uniform random number between zero and MAR is generated as the rate of that flow and added to element $(i,j)$ of the rate matrix. Basically, MAR is set so that the aggregate rate of each port does not exceed $\rho$. This procedure is repeated until the aggregate rates of all ports become very close to $\rho$. In this way, we are able to create a non-uniform rate matrix. This matrix is used to generate i.i.d. Bernoulli arrival patterns for all of the connections.
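The rate-matrix construction described above can be summarized in a few lines of Python. This is a sketch under our own parameter names (`max_flow`, `tol`); the stopping tolerance is an assumption, since the text only says the aggregate rates become "very close" to $\rho$.

```python
import numpy as np

def random_rate_matrix(n=32, rho=0.8, max_flow=0.1, tol=1e-3, rng=None):
    """Generate a non-uniform n x n rate matrix whose row and column
    sums approach rho, by iteratively inserting random flows."""
    if rng is None:
        rng = np.random.default_rng()
    R = np.zeros((n, n))
    while True:
        row, col = R.sum(axis=1), R.sum(axis=0)
        if row.min() > rho - tol and col.min() > rho - tol:
            return R
        i, j = rng.integers(n), rng.integers(n)
        # Maximum allowable rate: cap by the per-flow limit and by the
        # residual capacity of the chosen input and output ports.
        mar = min(max_flow, rho - row[i], rho - col[j])
        if mar > 0:
            R[i, j] += rng.uniform(0.0, mar)
```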
We have studied four different systems: three input-buffered switches with the matching algorithms LPF, MFM, and MSM, and an output buffered switch. The first three systems were simulated; the output-buffered system was modelled as an M/D/1 system. We have elaborated on the first three and proved here that LPF and MFM achieve 100% throughput; the output-buffered system is mainly included as a benchmark.

The main performance measure considered here is delay versus throughput, which is plotted for all of the scheduling algorithms in Fig. 3.2. For each scheduler the simulation is stopped when the throughput reaches 1. As expected, all of the systems achieve 100% throughput. The delay performance of LPF is slightly better than MFM and MSM, but recall that it is more complex as well. The performance of MFM and MSM is indistinguishable. Although we have not been able to prove the stability of MSM, the simulations reveal that for practical purposes it functions as a 100% throughput algorithm with suitable delay performance. Recall that the MSM complexity is $O(N^2)$, which is lower than the other proposed algorithms.

[Figure 3.2: Average delay vs. throughput for the MFM, MSM, LPF, and OQ systems. The output queueing system is modelled as an M/D/1 system.]

3.7 Summary

In this chapter, a new class of weighted matching algorithms for scheduling in input-buffered switches is introduced. Using fluid model techniques, we were able to prove that they achieve 100% throughput. Basically, it is shown that to achieve 100% throughput it suffices to include only the nodes whose weight is within $u_{max}/N$ of the maximum weight $u_{max}$, i.e., above the threshold $(1-1/N)u_{max}$. This result can lead to the development of less complex and more efficient scheduling algorithms for input-buffered switches. Two particular matching algorithms that can be considered for practical systems, the MSM and MFM, are also introduced; the MFM is proven to achieve 100% throughput. Furthermore, the MSUIM scheduling algorithm is proposed, and it is proved that it can provide rate guarantees and that its maximum service deviation from the guaranteed rate is bounded by N.

Chapter 4

Analysis and design of self-randomized max-min fair schedulers

In this chapter, we consider self-randomized scheduling policies for input buffered switches. A scheduling policy is said to be self-randomized if the randomized component (process) of a randomized policy is replaced by a pseudo-random process that is a function of the cell arrival process. We use fluid model techniques to show that the self-randomized scheduling algorithms deliver 100% throughput. The only assumption on the arrival pattern is that it satisfies the strong law of large numbers and that no input or output port is oversubscribed. We provide a general architecture for the design of self-randomized algorithms and introduce two algorithms that use the number of backlogged cells as the weight function. We then introduce the concept of max-min fair self-randomized scheduling algorithms. The idea is to develop self-randomized algorithms that can provide QoS by sharing the switch bandwidth proportionally to assigned weights. We introduce three such self-randomized scheduling algorithms and prove that they are max-min fair. In order to study and compare the performance of the proposed scheduling algorithms, several simulations are carried out and their results are provided and discussed.

In chapters 2 and 3, we introduced and elaborated on deterministic scheduling algorithms for input-buffered switch fabrics. Our objective was to develop new algorithms that achieve 100% throughput, can provision rate, and have low complexity. For instance, the MNCM class of matchings introduced in chapter 3 achieves 100% throughput. Moreover, if we use the link weight function introduced in chapter 2 in equations 2.14 and 2.15, MNCM can track an appropriate fluid policy and thus provision rate. We also introduced the MFM algorithm in the MNCM class with $O(N^{2.5})$ complexity, and the MSUIM algorithm, which is not in MNCM but can provide rate guarantees. Even though MFM is the lowest complexity deterministic scheduling algorithm that achieves 100% throughput, it is still desirable to develop algorithms with lower complexity.

Tassiulas introduced a new class of randomized scheduling algorithms in [36]. These algorithms have linear complexity $O(N)$ and at the same time achieve 100% throughput. The basic idea is, instead of computing the scheduling matching, to select it from a set of candidate matchings. A simple randomized procedure is used to obtain or update the candidate set in every time slot. A completely randomized set results in a very low performance system. The candidate matching set in [36] contains two matchings: one randomly selected matching and the matching that was used in the previous time slot. The candidate matching with the higher total link weight is selected, where the weight of a link is its number of backlogged cells. The stability proof is for i.i.d. arrivals. Intuitively, since the state of the switch (the number of backlogged cells) changes only slightly in every time slot, a good matching in the previous time slot is also a good matching for the next time slot. Although this randomized algorithm is stable, it does not have a good delay performance. Paper [19] proposed several modifications to the basic randomized algorithm to improve its delay performance. One of the contributions of [19] is the introduction of self-randomized algorithms, in which the randomized component of the policy is replaced by a pseudo-random process that is a function of the cell arrival process. In this chapter, we focus on self-randomized scheduling algorithms and introduce some new algorithms that have better delay performance than the previously proposed ones.
To that end, we introduce a general architecture for self-randomized schedulers and propose two specific scheduling algorithms that use the number of backlogged cells as the weight of the links. We use fluid model techniques to establish some new results for the throughput of randomized scheduling algorithms. More specifically, we prove that the original randomized algorithm of [36] and all of the self-randomized scheduling algorithms introduced in [19] deliver 100% throughput. The only assumptions on the input traffic are that it satisfies the SLLN and is admissible. To the best of our knowledge, this is the first time that the fluid model has been applied to randomized scheduling algorithms; all previous stability results were confined to the i.i.d. assumption.

None of the previously introduced randomized scheduling algorithms is capable of QoS provisioning. We employ the notion of max-min fairness to develop self-randomized schedulers that can provision rate. The basic idea behind a max-min fair scheduler is to allocate bandwidth among flows proportionally to their weights; if a flow cannot utilize its bandwidth because of constraints elsewhere in the network, the residual bandwidth is distributed proportionally among the others [22]. Tassiulas and Sarkar [38] have recently introduced a max-min fair scheduler that was not originally proposed for input-buffered switches, but is applicable to them too. We can categorize this scheduling algorithm as a deterministic MWM scheduling algorithm in which the weight of a link (flow) is the number of tokens allocated to it; tokens are distributed among the links at their max-min rates. We extend this idea and introduce the concept of max-min fair self-randomized scheduling algorithms. Three different self-randomized max-min fair algorithms are introduced in this chapter. For each algorithm, it is proved that it delivers the max-min rate to all of the links. Furthermore, their delay performance is studied and compared using simulation results.

The rest of this chapter is organized as follows. Section 4.1 defines and introduces the basic concepts used in the rest of the chapter. Section 4.2 contains the proof of the main stability result for randomized scheduling algorithms. Section 4.3 introduces the general architecture adopted for self-randomized schedulers in this chapter and elaborates on some of its basic features. Section 4.4 introduces two scheduling algorithms based on the backlog weight function and provides analytical as well as simulation results for them. Section 4.5 contains theoretical and simulation results for the max-min fair randomized scheduling algorithms.

4.1 Models and Definitions

The basic assumptions and model are similar to those defined in Section 3.1, so we do not repeat them here. We start with the discrete dynamic relations that govern the behavior of an input buffered switch, adopting the notation used in [13]. Consider an input-buffered switch that employs scheduling algorithm m, and let $T^m_\pi(n)$ be the total time that permutation $\pi$ has been used up to time n. The following equations describe the switch dynamics:

$$Z_{ij}(n) = Z_{ij}(0) + A_{ij}(n) - D_{ij}(n),$$
$$D_{ij}(n) = \sum_{\pi\in\Pi} \sum_{l=1}^{n} \pi_{ij}\,1_{\{Z_{ij}(l)>0\}}\left(T^m_\pi(l) - T^m_\pi(l-1)\right),$$
$$\sum_{\pi\in\Pi} T^m_\pi(n) = n. \tag{4.1}$$

The first equation describes the basic relation between the arrival, departure and backlog processes.
The second equation counts the total number of departures by counting the number of times that a permutation matrix with a one in position $(i,j)$ is used while there was a backlogged cell in that VOQ. The matching algorithm that is used adds more equations to (4.1). For instance, if we use the conventional MWM with the number of backlogged cells as the link weight, there is one more equation for every matching $\pi$ [13]:

$$T^{MWM}_\pi(n) - T^{MWM}_\pi(n-1) = 0 \quad \text{if } \langle \pi, Z(n)\rangle < \langle \pi', Z(n)\rangle \text{ for some } \pi' \in \Pi. \tag{4.2}$$

The above equation says that under the maximum weight matching algorithm, a matching $\pi$ that has less total weight than another matching $\pi'$ is not used.

4.1.1 Randomized Scheduling Algorithms and Fluid Model

Consider the following scheduling algorithm.

Algorithm 4.1.1 (Tass) [36]:

- Let $S(n)$ be the schedule used at time n.
- At time n+1 choose a matching $R(n+1)$ uniformly at random from the set of all permutation matrices $\pi \in \Pi$.
- $S(n+1) = \arg\max_{s\in\{S(n),\,R(n+1)\}} \langle s, Z(n+1)\rangle$.

At every time slot, a random matching is generated and its weight is compared to the weight of the matching used in the previous time slot. The matching with the higher weight is selected for scheduling in the present time slot. Paper [36] proves that the above algorithm is stable under any Bernoulli i.i.d. admissible input. This proof is the basis for all subsequent stability proofs for other randomized scheduling algorithms. Here, we generalize the stability proof to any admissible arrival process; in other words, we prove that the TASS scheduling algorithm is efficient. To that end, we introduce the Basic Randomized (BR) scheduling algorithm.

Algorithm 4.1.2 (BR):

- Let $S(n)$ be the schedule used at time n.
- At time n+1 choose a matching $R(n+1)$ uniformly at random from the set of all permutation matrices $\pi \in \Pi$.
- $S(n+1) = R(n+1)$ if $R(n+1)$ is an MWM for $Z(n+1)$; otherwise $S(n+1) = S(n)$.

The BR algorithm does not have any practical merit, since its performance is worse than MWM while its complexity is higher; note that in the third step of the BR algorithm we need to determine the MWM schedule. The BR algorithm is used only to prove the basic stability results for randomized scheduling algorithms.
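For concreteness, one slot of Algorithm 4.1.1 can be written as follows; representing matchings as 0/1 permutation matrices is our illustrative choice.

```python
import numpy as np

def tass_step(S_prev, Z, rng):
    """One slot of Algorithm 4.1.1 (Tass): draw a uniformly random
    permutation matching and keep it only if its backlog weight is at
    least that of the matching used in the previous slot."""
    n = Z.shape[0]
    perm = rng.permutation(n)           # random matching R(n+1)
    R = np.zeros_like(S_prev)
    R[np.arange(n), perm] = 1
    # <s, Z> is the total backlog weight of matching s.
    return R if (R * Z).sum() >= (S_prev * Z).sum() else S_prev
```

A caller would keep the returned matrix as `S_prev` for the next slot, e.g. starting from the identity matching and `rng = np.random.default_rng()`.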
Let $\phi(n)$ be a binary random variable such that

$$\phi(n) = \begin{cases} 1 & \text{if } R(n) \text{ is an MWM for } Z(n), \\ 0 & \text{otherwise.} \end{cases} \tag{4.3}$$

Let $t^m_\pi(n) = T^m_\pi(n) - T^m_\pi(n-1)$. We can add the following equation to (4.1) for the BR algorithm:

$$t^{BR}_\pi(n) = \phi(n)\, t^{MWM}_\pi(n) + (1-\phi(n))\, t^{BR}_\pi(n-1). \tag{4.4}$$

Equation (4.4) is a difference equation that needs an initial condition at time 0 to fully characterize the BR scheduling policy. We assume that the probability of scheduling a link at time 0 is at least equal to the probability of a cell arrival on that link,

$$\mathrm{Prob}[\pi_{ij} = 1] \ge \lambda_{ij}. \tag{4.5}$$

Next, we review a deterministic continuous fluid model of a switch operating under some matching algorithm m. We use the same model as before in (3.8), originally given in [13]:

$$Z_{ij}(t) = Z_{ij}(0) + \lambda_{ij} t - D_{ij}(t) \ge 0,$$
$$\dot D_{ij}(t) = \sum_{\pi\in\Pi} \pi_{ij} \dot T^m_\pi(t) \quad \text{if } Z_{ij}(t) > 0,$$
$$\sum_{\pi\in\Pi} T^m_\pi(t) = t. \tag{4.6}$$

In order to translate (4.4) into the fluid model, we first write it in the following form:

$$(1-\phi(n))\left(t^{BR}_\pi(n) - t^{BR}_\pi(n-1)\right) = \phi(n)\left(t^{MWM}_\pi(n) - t^{BR}_\pi(n)\right). \tag{4.7}$$

Using relations (4.7) and (4.5), and assuming $\mathrm{Prob}[\phi(n) = 1] = \delta > 0$, we can add the following fluid equations for the BR algorithm:

$$\dot t^{BR}_\pi(t) = \frac{\delta}{1-\delta}\left(t^{MWM}_\pi(t) - t^{BR}_\pi(t)\right), \qquad \sum_{\pi\in\Pi} \pi_{ij}\, t^{BR}_\pi(0) \ge \lambda_{ij}. \tag{4.8}$$

Adding an equation to the fluid model is related to fluid limits and is discussed in [12]. Basically, the time difference of $t^{BR}_\pi(n)$ is turned into its derivative, and the expectation of the random process $\phi(n)$ replaces $\phi(n)$ itself. Note that we assume that $\phi(n)$ satisfies the strong law of large numbers. Recall that $t^m_\pi(n) = T^m_\pi(n) - T^m_\pi(n-1)$, and consequently in the fluid model we have $t^m_\pi(t) = \dot T^m_\pi(t)$. Using (4.6) and (4.8), we can conclude

$$\ddot D^{BR}_{ij}(t) = \frac{\delta}{1-\delta}\left(\dot D^{MWM}_{ij}(t) - \dot D^{BR}_{ij}(t)\right), \qquad \dot D^{BR}_{ij}(0) - \lambda_{ij} \ge 0. \tag{4.9}$$

Relation (4.9) provides the additional equations that characterize the BR scheduling algorithm fluid model.

4.2 Stability of Randomized Schedulers

In this section, we use fluid model calculus to prove the stability of randomized scheduling algorithms. Our main objective is to prove the following theorem.

Theorem 4.2.1 The Basic Randomized scheduling algorithm is efficient.

To prove Theorem 4.2.1, we use the following theorem, which is proved in [13] and which we also used in chapter 3.

Theorem 4.2.2 A switch operating under a matching algorithm is rate stable if the corresponding fluid model is weakly stable.

Recall that a fluid model is weakly stable if for every fluid model solution $(D,T,Z)$ with $Z(0) = 0$, $Z(t) = 0$ for all $t \ge 0$. In [13] it is proved that the maximum weighted matching algorithm is rate stable, by showing that for the function $V(t) = \langle Z(t), Z(t)\rangle$, $\dot V(t) = 2\langle \dot Z(t), Z(t)\rangle \le 0$ for $t \ge 0$. Therefore, if $Z(0) = 0$, we can conclude that $Z(t) = 0$ for $t \ge 0$, and consequently the MWM algorithm is rate stable. Since the MWM is rate stable, for $Z(0) = 0$,

$$\dot D^{MWM}_{ij}(t) = \lambda_{ij}, \qquad t \ge 0. \tag{4.10}$$

We use equation (4.10) and the relation between the MWM and BR scheduling algorithms formulated in (4.9) to prove Theorem 4.2.1.

Proof of Theorem 4.2.1: We prove that starting with $Z(0) = 0$, $\dot D^{BR}_{ij}(t) = \lambda_{ij}$ for all $t \ge 0$ is the only solution, and therefore $Z(t) = 0$ for $t \ge 0$. To that end, we rewrite the main differential equations and initial conditions that describe the evolution of $D^{BR}(t)$:

$$\ddot D^{BR}_{ij}(t) = \frac{\delta}{1-\delta}\left(\dot D^{MWM}_{ij}(t) - \dot D^{BR}_{ij}(t)\right), \qquad \dot D^{BR}_{ij}(0) = \lambda_{ij}, \qquad \dot D^{MWM}_{ij}(t) = \lambda_{ij} \text{ if } Z(t) = 0. \tag{4.11}$$

The first equation is relation (4.9) repeated. The second equation is the result of (4.9) and the fact that the departure rate at time zero is bounded by the arrival rate. Finally, the last relation follows from the rate stability of MWM. It is easy to verify that $\dot D^{BR}_{ij}(t) = \dot D^{MWM}_{ij}(t) = \lambda_{ij}$ is the only solution for $t \ge 0$ that satisfies all the equations above, and therefore for all fluid model solutions we have $Z(t) = 0$.

Next, we use the stability result proved in Theorem 4.2.1 to prove that the original randomized algorithm 4.1.1 is also rate stable.

Lemma 4.2.3 The TASS scheduling algorithm is efficient.

Proof of Lemma 4.2.3: The proof is very similar to the proof of Theorem 4.2.1, so we skip the details. First, we need to derive the additional fluid model relations that describe the TASS algorithm. Suppose that $H = \{m_1,\ldots,m_{k(n)}\}$ is the set of matchings with higher weight than $m^{TASS}(n-1)$ at time n, where $m^{TASS}(n-1)$ is the matching used at time $n-1$. Let $\phi_m(n)$ be the binary variable that is one if we randomly select matching m and use it at time n, and let $P_m$ be the policy that always uses matching m. We can write

$$t^{TASS}_\pi(n) = \phi_\pi(n)\, t^{P_\pi}_\pi(n) + \Big(1 - \sum_{\pi'\in\Pi} \phi_{\pi'}(n)\Big)\, t^{TASS}_\pi(n-1). \tag{4.12}$$

Next, we sum over all matchings $\pi$:

$$\sum_{\pi\in\Pi} t^{TASS}_\pi(n) = \sum_{\pi\in\Pi} \phi_\pi(n)\, t^{P_\pi}_\pi(n) + \Big(1 - \sum_{\pi'\in\Pi} \phi_{\pi'}(n)\Big) \sum_{\pi\in\Pi} t^{TASS}_\pi(n-1). \tag{4.13}$$

In order to derive the fluid model equivalent of (4.13), we first have to characterize $\phi_\pi(n)$: $\mathrm{Prob}[\phi_\pi(n) = 1] = \delta > 0$ if $\pi \in H$, and it is zero otherwise. Similarly to the BR algorithm case in relation (4.8), we can derive the fluid model equation for the TASS algorithm:
$$(1 - k(t)\delta) \sum_{\pi\in\Pi} \dot t^{TASS}_\pi(t) = \delta \sum_{\pi\in H} t^{P_\pi}_\pi(t) - k(t)\,\delta \sum_{\pi\in\Pi} t^{TASS}_\pi(t), \tag{4.14}$$

where $k(t)$ is the number of matchings that have a higher weight than the matching used at time t. Similarly to the BR algorithm, for the TASS algorithm the following differential equation can be added to the basic fluid model equations:

$$\ddot D^{TASS}_{ij}(t) = \frac{\delta}{1-k(t)\delta}\left(\dot D^{P_{m_1(t)}}_{ij}(t) + \cdots + \dot D^{P_{m_{k(t)}(t)}}_{ij}(t) - k(t)\,\dot D^{TASS}_{ij}(t)\right), \qquad \dot D^{TASS}_{ij}(0) = \lambda_{ij}. \tag{4.15}$$

$\dot D^{P_{m_1(t)}}_{ij}(t),\ldots,\dot D^{P_{m_{k(t)}(t)}}_{ij}(t)$ correspond to the matchings whose weight is larger than that of the matching used by TASS. Substituting the first two equations of (4.11) with (4.15) and following the same argument as in Theorem 4.2.1, we can show that $\dot D^{TASS}_{ij}(t) = \dot D^{MWM}_{ij}(t) = \lambda_{ij}$ is the only solution that satisfies the fluid model differential equation. Therefore, the TASS algorithm is also efficient.

The methodology used in Lemma 4.2.3 is quite general and can be used to prove the stability of the randomized algorithms introduced in the following sections.

4.3 General Architecture

In this section, we introduce the general architecture that is used in all of the self-randomized scheduling algorithms of this chapter. The scheduler structure is shown in Figure 4.1 and is divided into two stages.

[Figure 4.1: The generic block diagram adopted for self-randomized schedulers: a memory bank of K matchings M(0), ..., M(K-1) feeds a maximum weighted matching selector driven by the link weights; a merge stage then combines the selected candidate with the output of the self-randomized matching computer, which is driven by the arrival (or virtual arrival) graph, to produce the new matching.]

The first stage has a memory bank that buffers the K previously used matchings. It also contains the circuitry that selects a candidate matching out of the K matchings: the selection module computes the weights of all K contending matchings in parallel and selects the one with the highest weight, where the weight of a matching is the summation of its link weights. The candidate matching is passed to the second stage.

Randomized scheduling algorithms usually store and process only the matching used in the previous time slot; we extend this idea by taking into account the matchings used in the K previous time slots. The memory, or state, notion plays a critical role in the stability of randomized schedulers. Saving the previous matching and using it as a new candidate is the main reason for the stability of the Tassiulas algorithm, and for that reason it has been retained in all proposed self-randomized algorithms. The concept of maintaining K previous matchings is also in accordance with some of the proposed deterministic schemes. It is proved in [6] that fixed serving rates can be obtained using a periodic sequence of matching matrices for scheduling: the rate matrix is decomposed into a sequence of permutation matrices that is repeated periodically. Based on the same result, paper [24] proposed a rate provisioning scheme that uses a weighted fair queueing scheduler to select one matching from a fixed set of matchings. Increasing the memory order of the system enables us to reuse the same set of matchings and can improve the performance, especially if we intend to do rate provisioning.

The main functions of the second stage are to generate a matching in the self-randomized block and to merge that matching with the candidate matching from the first stage. The result of the merging block is the scheduling matching for that time slot. This matching is also passed back to the first stage and is stored in the memory bank for future use in subsequent time slots.
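The division of labor between the two stages can be summarized in the following skeleton. The hooks `self_randomized` and `merge` stand in for the second-stage blocks described in the text and are assumptions of this sketch, not defined interfaces.

```python
import numpy as np

def schedule_slot(bank, Z, self_randomized, merge):
    """Skeleton of the two-stage self-randomized scheduler of Fig. 4.1.
    `bank` is a list of the K previously used matchings (0/1 matrices)
    and Z is the link weight matrix."""
    # Stage 1: weight all K stored matchings (in hardware, in parallel)
    # and pick the heaviest candidate.
    weights = [(m * Z).sum() for m in bank]
    candidate = bank[int(np.argmax(weights))]
    # Stage 2: merge the candidate with the self-randomized matching.
    new_matching = merge(candidate, self_randomized(Z), Z)
    # Feed the result back into the memory bank for subsequent slots.
    bank.pop(0)
    bank.append(new_matching)
    return new_matching
```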
The overall quality and performance of the scheduler depends very much on the procedure used to generate the self-randomized matching. The self-randomized matching is generated based on the cell arrival pattern, and/or on a virtual cell (token) arrival pattern for QoS provisioning. We use some simple techniques for that purpose that will be discussed later. In this architecture, scheduling encompasses weight calculations for K matchings in the first stage and one merge in the second stage. This structure is based on the fact that the weight of a matching can be calculated simply in hardware, whereas merging two matchings is not a simple scalable task for hardware.

The self-randomized scheduling algorithms introduced in the rest of this chapter are based on the general architecture described here. Each of these algorithms is completely defined by two main characteristics:

1. the weight function;

2. the self-randomized matching generator.

The weight function specifies the fundamental features and objectives of the algorithm, while the self-randomized matching generator determines how well and how fast the scheduler responds and adapts to changes in the arrival pattern.

4.4 Backlogged Weight Function

In this section, we introduce B1 and B2, two self-randomized scheduling algorithms whose weight function is the number of backlogged cells. This is the most common weight function for scheduling in switches [29] [31] [36], even though it has its own shortcomings. From a theoretical standpoint, most of the stability results were originally derived for the backlog weight function. From a practical standpoint, it is simple to compute and update the link weights. Moreover, in terms of average delay and number of backlogged cells, this weight function provides good performance. However, it has its own drawbacks: first, it is not starvation free, i.e., it is possible that some finite number of cells experience infinite delays [29]; second, it lacks a mechanism to provide QoS. In Section 4.5, we introduce an alternative weight function that resolves these problems.

4.4.1 B1: Self-Randomized Based on Instantaneous Arrivals

B1 is similar to the SERENA algorithm introduced in [19]. We use it as a benchmark and as a means to discuss some of the features of self-randomized algorithms. As mentioned before, the number of backlogged cells is the weight function of this scheme. The arrival graph is the input to the self-randomized matching computer block. The arrival graph at time t is a bipartite graph that has a link between nodes i and j if there is an arrival at time t at input i destined for output j. The self-randomized matching generator extracts a matching out of the arrival graph, and the extracted matching is merged with the candidate matching from the first stage (Fig. 4.1). The arrival pattern is not itself a matching, since there can be multiple arrivals with the same destination. The self-randomized matching block extracts a matching from this graph by selecting the highest weight link for every output node. This process can be done in parallel for all output ports; the exact same procedure is used in SERENA [19].

B1 characterization:

- Weight function: buffered cells for each link.
- Self-randomized generator: works based on the recent cell arrival graph.

B1 is a simple extension of the SERENA algorithm that uses K previous matchings rather than only one previous matching.
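A minimal sketch of the B1 extraction step follows. Because each input receives at most one cell per slot, resolving contention at the outputs already yields a matching; the list-of-pairs arrival representation is our assumption.

```python
def b1_self_randomized(arrivals, Z):
    """Extract the B1 self-randomized matching from the arrival graph:
    for every output, keep the arriving link whose VOQ backlog Z[i][j]
    is largest.  `arrivals` is a list of (input, output) pairs for the
    current slot."""
    best = {}                                  # output -> chosen input
    for i, j in arrivals:
        if j not in best or Z[i][j] > Z[best[j]][j]:
            best[j] = i
    return {i: j for j, i in best.items()}     # matching as input -> output
```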
4.4.2 B2: Self-Randomized Based on Total Arrivals

In B1, every new arrival gets one chance to be included in the self-randomized matching. A link may lose its chance either to other new arrivals while the self-randomized matching is generated, or in the merging phase; if it does, it must wait for another new arrival to get another opportunity. Therefore, even under low utilization, some cells may incur high delays. In fact, the situation is worse for very low arrival rates, since the average waiting time until the next arrival is higher. B2 is intended to resolve this problem.

B2 uses the total arrival graph as the source for extracting the self-randomized matching. The total arrival graph has a link for every previous cell arrival that has not yet been scheduled, and it evolves by link insertion and ejection rules applied in every cell time:

- Insertion: any link with a new arrival that is not in the total arrival graph is added to it.
- Ejection: links that are scheduled are removed from the total arrival graph.

At every cell time a matching is extracted from the total arrival graph; the extraction process is a straightforward extension of the one used in B1. As in B1, for every output port the link with the largest weight is selected from the total arrival graph. In B2 there may then be contention among the selected links at the input ports; in a second step, if multiple selected links connect to the same input port, the one with the highest weight is kept. The outcome of this two-step selection is a matching that serves as the self-randomized matching (see the sketch at the end of this subsection).

Only the links included in the final scheduling matching are ejected from the total arrival graph. Therefore, a newly arrived link is continuously considered for the self-randomized matching until it is scheduled. We expect the delay performance of B2 on links with low arrival rates to be better than that of B1, because such links get more chances to be scheduled.

B2 characterization:

- Weight function: buffered cells for each link.
- Self-randomized generator: works on the total cell arrival graph.

We end this section by stating the stability result for the B1 and B2 algorithms introduced here.

Corollary 4.4.1 The B1 and B2 schedulers are efficient.

Proof of Corollary 4.4.1: B1 and B2 can both be considered special cases of the TASS scheduling algorithm. Note that in B1 and B2, either the previous time slot's matching is selected or another randomly selected matching with a higher weight is selected. The mechanism for selecting the randomized matching is more complicated, but there is still a positive probability $\delta > 0$ that the MWM is selected. Therefore, the basic properties of the TASS algorithm used in the proof of Lemma 4.2.3 are maintained in these algorithms, and by the same reasoning we can prove that they are efficient.
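A minimal sketch of B2's bookkeeping follows, assuming `total_graph` is kept as a set of (input, output) links, `arrivals` holds the new arrival links, `scheduled` is the final scheduling matching of the previous slot, and `Q` holds the per-link backlogs; the identifiers are ours, chosen for illustration.

```python
def b2_self_randomized(total_graph, arrivals, scheduled, Q):
    """One slot of B2: insertion of new-arrival links, ejection of links
    scheduled in the final matching, then the two-step (per-output, then
    per-input) selection described above."""
    total_graph |= set(arrivals)           # insertion rule
    total_graph -= set(scheduled.items())  # ejection rule
    # Step 1: for every output, keep the heaviest incoming link.
    per_out = {}
    for i, j in total_graph:
        if j not in per_out or Q[i][j] > Q[per_out[j]][j]:
            per_out[j] = i
    # Step 2: resolve input contention, keeping the heaviest link per input.
    per_in = {}
    for j, i in per_out.items():
        if i not in per_in or Q[i][j] > Q[i][per_in[i]]:
            per_in[i] = j
    return per_in                          # the self-randomized matching
```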
4.4.3 Simulation Results

In this section, we use simulation results to study and discuss the delay performance of the B1 and B2 scheduling algorithms. Simulations are done for a $16\times16$ input-buffered switch, and the memory length $K$ for the buffered matchings is set to 1, 16, and 32. The arrival processes on all links are i.i.d. Bernoulli, and the arrival rate for the link between input $i$ and output $j$ is $\lambda_{ij}$. Three different load distribution models among the links are studied: uniform, diagonal, and log-diagonal. The simplest one is uniform, where the total load $\rho$ of every input port is uniformly distributed among its outgoing links. The diagonal is the most compact form, where $\lambda_{ii} = 2\rho/3$ and $\lambda_{i,|i+1|} = \rho/3$ for $i = 1, \ldots, N$, and 0 for all other links, where $|i| = (i \bmod N)$; hence there are only two non-zero elements in every row. In the log-diagonal model, $\lambda_{i,|j|} = 2\lambda_{i,|j+1|}$ for $i = 1, \ldots, N$ and $j = i, \ldots, N+i$.

The simulation results for B1 are shown in Figures 4.2, 4.3, and 4.4 for the diagonal, log-diagonal, and uniform destination distributions, respectively. In all cases, increasing the memory size $K$ improves the delay performance of the switch. The improvement is most significant for the uniform distribution: the uniform distribution matrix has the maximum entropy, and the correlation between subsequent arrivals is minimal. Therefore, at low arrival rates the previous matching is usually not a good candidate for the present time slot, and having $K$ alternative matchings significantly increases the probability of selecting a good matching.

Another interesting, and undesirable, phenomenon is the large delays at very low rates. This is to some extent due to the self-randomized matching generator that is used. New links are introduced to the system through the self-randomized graph, and a link is introduced to the self-randomized graph only when there is a new arrival on that link. Therefore, if a link is not included in the self-randomized matching after its cell arrival, it must wait for the next cell arrival to get another chance. The excessive delay at low utilizations shows up more dramatically for the log-diagonal and uniform load distributions, since under these distributions some links have very low arrival rates. This behavior is not apparent at first glance in the results of [19], since the authors reported and plotted average queue length rather than average delay. Even though delay and queue length are directly related by Little's theorem, delay is more tangible, common, and relevant for practical applications.

The same set of simulations is repeated for B2 in Figures 4.5, 4.6, and 4.7 for the diagonal, log-diagonal, and uniform load distributions, respectively. For diagonal traffic the results are very similar to the B1 algorithm: in both cases, increasing the matching memory size considerably improves the delay performance. For the log-diagonal distribution, B2 outperforms B1 at low utilizations: the average delay for B1 is around 40 cell times, whereas it is around 20 for B2. There is still a minimum in the delay curve of B2, but its intensity is decreased. Recall that in B2 a link with a new arrival is continuously presented to the self-randomized block until it is scheduled; therefore the scheduling delay of new arrivals is in general smaller than in B1. This especially improves the delay performance of the very-low-rate links.

[Figure 4.2: Mean delay for B1 under diagonal traffic (mean delay vs. normalized load for memory sizes 1, 16, and 32).]

For uniform traffic, the advantage of employing memory for the matchings is considerable, and the delay performance is clearly enhanced by increasing the memory size from 1 to 16 and from 16 to 32. In summary, we can conclude that B2 outperforms B1 in cases where the traffic load is distributed among a larger number of ports, and especially when some links have very low rates.
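For reference, the three load matrices can be generated as follows. This is a sketch under our own reading of the definitions above; in particular, the log-diagonal normalization is our interpretation of the rule $\lambda_{i,|j|} = 2\lambda_{i,|j+1|}$ with every row summing to $\rho$.

```python
import numpy as np

def load_matrix(N: int, rho: float, model: str) -> np.ndarray:
    """Arrival-rate matrices for the three simulated load models
    (index arithmetic is modulo N)."""
    lam = np.zeros((N, N))
    for i in range(N):
        if model == "uniform":         # rho spread evenly over the row
            lam[i, :] = rho / N
        elif model == "diagonal":      # two non-zero entries per row
            lam[i, i] = 2 * rho / 3
            lam[i, (i + 1) % N] = rho / 3
        elif model == "log-diagonal":  # each entry twice the next one
            w = 2.0 ** np.arange(N - 1, -1, -1)
            lam[i, :] = rho * np.roll(w / w.sum(), i)
    return lam
```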
The advantage of introducing memory for previous matchings is clear: it improves the performance, since we have more alternative matchings to select from.

[Figure 4.3: Mean delay for B1 under log-diagonal traffic.]
[Figure 4.4: Mean delay for B1 under uniform traffic.]
[Figure 4.5: Mean delay for B2 under diagonal traffic.]
[Figure 4.6: Mean delay for B2 under log-diagonal traffic.]
[Figure 4.7: Mean delay for B2 under uniform traffic.]

4.5 Max-min fair scheduling weight function

In this section, we introduce a new class of self-randomized scheduling algorithms that use a set of token values as the link weight function. A token-based weight function enables us to allocate bandwidth among different flows based on their reserved rates. In the context of prioritized max-min fair scheduling, each flow $i$ has a normalized reserved rate, or priority, $w_i$: the greater the reserved rate of a flow, the better the service it should receive, subject to bandwidth availability. A bandwidth allocation is max-min fair if it is not possible to increase the bandwidth of any flow $i$ without hurting another flow $j$ for which $g_j(n)/w_j \le g_i(n)/w_i$, where $g_i(n)$ is the bandwidth allocated to flow $i$ at time $n$.

Tassiulas and Sarkar [38] have proposed a token-based deterministic max-min fair scheduling algorithm that works on top of the MWM algorithm. The weight of every flow depends on an estimate of the flow's max-min fair bandwidth, the service previously received by the flow, and the flow's traffic demand. The proposed algorithm achieves the max-min fair objective for any arbitrary set of flows and graphs; more specifically, their result is general and valid for an arbitrary network graph topology and a set of flows traversing arbitrary paths in the network. We will use and extend the same concepts in the context of self-randomized scheduling algorithms, and introduce a new class of randomized scheduling algorithms with rate provisioning capabilities for input-buffered switches.

Rate provisioning is one of the important requirements for switch fabric scheduling algorithms that has not previously been considered in the context of randomized scheduling algorithms. Token-based weight functions enable us to divide the available bandwidth among backlogged queues according to the reserved rates of the links in a max-min fair fashion. In this way, we can provision the serving rates of the links and guarantee their quality of service.

The first question to answer is the feasibility of a set of normalized reserved rates $w_i$. For input-buffered switches, a necessary condition for feasibility is that the sum of the normalized reserved rates of all flows sharing a node not exceed 1. It is shown in [38] that this condition is also sufficient when the network graph is bipartite, which is the case for input-buffered switches.
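The max-min fair rates themselves can be computed offline by the standard progressive-filling procedure. The sketch below is not the scheduling algorithm of [38]; it is a plain computation of the rates $g_{ij}$ for a switch in which every input and output node has unit capacity, with `w[i][j]` the normalized reserved rate of flow $(i,j)$ (all names are ours).

```python
def maxmin_rates(w, eps=1e-9):
    """Weighted progressive filling: grow all unfrozen flows in proportion
    to their weights until some node saturates, then freeze the flows
    through that (bottleneck) node, and repeat."""
    N = len(w)
    flows = [(i, j) for i in range(N) for j in range(N) if w[i][j] > 0]
    g = {f: 0.0 for f in flows}
    active = set(flows)
    cap = {("in", i): 1.0 for i in range(N)}
    cap.update({("out", j): 1.0 for j in range(N)})

    def nodes_of(f):
        i, j = f
        return (("in", i), ("out", j))

    while active:
        # Total weight of still-growing flows at each node.
        demand = {n: 0.0 for n in cap}
        for f in active:
            for n in nodes_of(f):
                demand[n] += w[f[0]][f[1]]
        # Largest common growth factor before some node saturates.
        t = min(cap[n] / demand[n] for n in cap if demand[n] > 0)
        for f in active:
            g[f] += t * w[f[0]][f[1]]
        for n in cap:
            cap[n] -= t * demand[n]
        # Freeze flows that touch a saturated node.
        active = {f for f in active if all(cap[n] > eps for n in nodes_of(f))}
    return g
```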
The original algorithm proposed in [38] distributes tokens among flows according to the following procedure.

Algorithm 4.5.1 (Deterministic Max-min Fair):

1. Each node allocates service tokens to the links connected to it in a weighted round-robin (WRR) fashion. The service token allocation mechanism is described later.
2. For each link, the total number of tokens allocated to it at each of its two end nodes is maintained. If the link is backlogged, the minimum of the two token buckets is the link's weight used in the matching; for non-backlogged links the weight is zero.
3. Whenever a link is scheduled and one cell is served, one token is deducted from each of its two token buckets.

For the deterministic max-min fair scheduling algorithm it is proved in [38] that in any interval of length $t$, the total number of served cells of a link $(i,j)$ differs from $g_{ij}t$ by at most a constant $\beta_{ij}$. Note that these results rest on the assumption that

\[
g_{ij} \le \lambda_{ij}. \tag{4.16}
\]

For the randomized max-min scheduling algorithms we assume that (4.16) holds, and we adopt the following definition of a randomized max-min fair scheduling algorithm.

Definition 4.5.2 A randomized scheduling algorithm is max-min fair if, with probability one,

\[
\lim_{n\to\infty} \frac{D_{ij}(n)}{n} = g_{ij}, \qquad i,j = 1,\ldots,N \tag{4.17}
\]

for any arrival process that satisfies (4.16).

We will introduce three different self-randomized max-min fair scheduling algorithms based on three different token allocation methodologies. The proposed token allocation mechanisms are slightly different from the original scheme of [38]. The same general architecture given in Fig. 4.1 is employed here too: link weights are tokens, determined by the procedure given above, and the different algorithms are characterized by their token allocation mechanism and by the self-randomized matching generator scheme used in the second stage of the general architecture.

4.5.1 A1: Self-Randomized Based on Instantaneous Token and Cell Arrivals

The service allocation mechanism in [38] uses a WRR scheduler to select one of the eligible flows to receive a token. A link between nodes $i$ and $j$ is eligible to receive a token at node $i$ if it meets the following eligibility conditions:

1. Queue condition: the number of queued cells of the link exceeds its number of tokens.
2. Token condition: the tokens of link $(i,j)$ at node $i$ do not exceed its tokens at node $j$ by more than a constant $W$.

In A1, we check only the token condition for eligibility: once a flow becomes backlogged, it is eligible to acquire tokens at node $i$ whenever the token condition is met. However, when all cells of a link have been scheduled, its tokens are reset to zero.

In the original scheme, due to the queue condition, tokens always lag the number of backlogged cells. This can result in starvation or excessive delay for some cells. For instance, consider a link with a high reserved rate; intuitively, we expect the cells of that link to experience low delay. However, regardless of the flow's reserved rate, if there is only one cell in the queue of that link its weight does not exceed 1, and the cell can experience a very high delay. In contrast, in A1 such a link accumulates tokens at its max-min fair rate even if it has only one buffered cell; hence the weights of backlogged links keep increasing and they are ultimately scheduled.

For a full characterization of A1 we must also specify the self-randomized generator mechanism. The self-randomized scheme used in A1 is based on instantaneous cell and token arrivals: it is similar to B1, but it works on the arrivals of both cells and tokens. A sketch of the token bookkeeping follows.
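The token bookkeeping of Algorithm 4.5.1, together with A1's token-only eligibility test, can be sketched as follows. This is an illustrative sketch: `T_in` and `T_out` denote the buckets of link $(i,j)$ at its input and output node, and the symmetric eligibility test at the output node is analogous and omitted.

```python
class TokenBuckets:
    """Per-link token buckets with A1's eligibility rule."""
    def __init__(self, N: int, W: int):
        self.W = W  # token-condition threshold
        self.T_in = [[0] * N for _ in range(N)]
        self.T_out = [[0] * N for _ in range(N)]

    def eligible_at_input(self, i: int, j: int) -> bool:
        # A1 tests only the token condition (no queue condition).
        return self.T_in[i][j] - self.T_out[i][j] <= self.W

    def weight(self, i: int, j: int, backlogged: bool) -> int:
        # Link weight: the minimum of the two buckets, zero if idle.
        return min(self.T_in[i][j], self.T_out[i][j]) if backlogged else 0

    def serve(self, i: int, j: int):
        # One cell served: deduct one token from each bucket.
        self.T_in[i][j] = max(0, self.T_in[i][j] - 1)
        self.T_out[i][j] = max(0, self.T_out[i][j] - 1)
```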
By adding token arrivals, every link gets more chances to be included in the self-randomized matching, and the higher the rate of the link, the higher its chance. This enhances the delay performance of the scheduling algorithm, as illustrated in the simulation results later.

A1 characterization:

- Weight function: number of tokens for each link, with the token condition as the eligibility test.
- Self-randomized generator: works on the most recent token and cell arrival graph.

To prove that A1 is max-min fair, we first prove that A1 allocates enough tokens to the links. The following lemma shows that the number of tokens of a link differs from its max-min fair share by a constant.

Lemma 4.5.3 If the threshold $W$ is sufficiently large, then the number of tokens generated at each end of a link $(i,j)$ in any interval of length $t$ differs from $g_{ij}t$ by at most a constant $\beta_{ij}$, where $g_{ij}$ is the max-min fair rate of $(i,j)$ and $\beta_{ij}$ is a constant that does not depend on the length of the interval.

Proof of Lemma 4.5.3: We prove the result under the assumption that all queues have a packet to transmit at all times; the generalization to the case where they may not is straightforward. Recall that the token allocation scheme is deterministic and very similar to the original Algorithm 4.5.1. The same result is proved in [38] for the original algorithm, where both the token and the queue conditions are tested for eligibility, whereas A1 uses only the token condition. Note that if the number of backlogged cells in each queue is arbitrarily large, the queue condition has no impact on the token generation process of the original algorithm. Therefore, A1 can be considered a special case of the original Algorithm 4.5.1 in which the number of backlogged cells of every queue exceeds its number of tokens, and this lemma can be considered a special case of the more general result proved in Theorem 3 of [38].

Theorem 4.5.4 A1 is a randomized (self-randomized) max-min fair scheduling algorithm.

Proof of Theorem 4.5.4: In Lemma 4.5.3, we proved that tokens are generated at the max-min fair rate. Moreover, whenever a link that has tokens is scheduled, at least one of its tokens is consumed. Therefore, we can view the tokens as virtual cells that arrive at the max-min rate and are served (consumed) whenever the link is scheduled. Viewed this way, A1 functions as a self-randomized scheduling algorithm with a virtual-cell (token) backlogged weight function. Since A1 serves at least one virtual cell whenever a link is scheduled, the token serving rate of a link under A1 is at least as large as its scheduling rate. This parallels B1, except that B1 always serves at most one cell when a link is scheduled. Since B1 is rate stable (Corollary 4.4.1), we conclude that A1 is rate stable with respect to token arrivals. Hence,

\[
\lim_{n\to\infty} \frac{D_{ij}(n)}{n} \ge g_{ij}, \qquad i,j = 1,\ldots,N. \tag{4.18}
\]

So far we have proved that the serving rate of every flow is at least its max-min rate; we still need to prove that it is exactly equal to the max-min rate. Suppose that for a link $(i,j)$ the serving rate is higher than its max-min rate. By definition, every link $(i,j)$ has at least one bottleneck node [38]: node $i$ (respectively $j$) is a bottleneck for link $(i,j)$ if the sum of the max-min rates of all links connected to that node is 1, and the normalized share $g_{ij}/w_{ij}$ of link $(i,j)$ is maximum among all links connected to $i$ (respectively $j$).
Without loss of generality, assume that $i$ is the bottleneck node of link $(i,j)$. If $(i,j)$ is served at a rate higher than $g_{ij}$, then by the definition of a bottleneck node there is another link $(i,j')$ that must be served at a rate lower than its max-min rate, which contradicts (4.18) for $(i,j')$. Therefore every link is served at exactly its max-min rate.

4.5.2 A2: Self-Randomized Based on Total Token and Cell Arrivals

Like A1, A2 uses the number of tokens as the link weight function, and tokens are generated and distributed as in A1. For the self-randomized generator block, however, we use the same idea as in B2. In B2, we introduced the concept of the total cell arrival graph, whose purpose was to persistently present new arrivals to the self-randomized block until they are scheduled; in general, this expedites the scheduling of a new arrival, especially at low rates. We employ the same idea in A2 by maintaining a total cell and token arrival graph in the self-randomized matching computer block. The self-randomized matching generator extracts a matching from the total cell and token graph at every time slot. Like the total arrival graph of B2, the total cell and token arrival graph evolves by insertion and ejection rules:

- Insertion: on a new cell or token arrival, if the link is not in the total cell-token graph, it is inserted into the graph.
- Ejection: all scheduled links are removed from the total cell-token arrival graph.

A2 outperforms A1 when the utilization is low, because it expedites the scheduling of low-rate links. At high utilization, however, both algorithms perform poorly, for two intuitive reasons.

The first reason concerns the total cell and token arrival graph. The main purpose of the total graph is to provide a small set of candidate links for the self-randomized block. At high utilization the probability that a link is backlogged is very high, so a link is inserted into the total graph as soon as it acquires a token. This results in a heavily populated total graph that no longer serves its purpose of providing a good, small set of candidate links for scheduling.

The second reason concerns the weight function, which sets the scheduling priority of the links. At high rates it is vital to give high priority to congested links with a large number of backlogged cells. The weight function of A1 and A2 is based purely on the max-min rate, which does not necessarily reflect the actual number of backlogged cells. These problems are addressed in A4.

4.5.3 A4: Self-Randomized Based on Total Link Weight and Cell Arrivals

A4 is introduced to compensate for some of the problems observed in A1 and A2. The weight function of A4 differs from that of A1 and A2: in A4, the weight of a link does not exceed its queue length by more than a constant $B$. The number of tokens still grows regardless of the queue length, but it is not reflected directly in the link weight; instead, as soon as a new cell arrives, the weight of the link increases if there are enough tokens in both node buckets. Note that this is slightly different from the algorithm proposed in [38], where the number of tokens generated at a node for a connected link was limited to the number of backlogged cells.
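In code, A4's capped weight is a one-line modification of the token-based weight; the sketch below assumes the bucket counters of the earlier `TokenBuckets` sketch, with `B` the constant bound described above.

```python
def a4_weight(tokens_in: int, tokens_out: int, queue_len: int, B: int) -> int:
    """A4's link weight: the token count of Algorithm 4.5.1, capped so the
    weight never exceeds the queue length by more than B. Tokens keep
    accruing at the max-min rate, but only up to queue_len + B of them
    are reflected in the weight."""
    return min(tokens_in, tokens_out, queue_len + B)
```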
As with A1, we prove that A4 allocates tokens to the links according to their max-min rates. The following lemma shows that the number of tokens of a link differs from its max-min fair share by a constant.

Lemma 4.5.5 Under A4, for an arbitrary threshold $B$ and a sufficiently large threshold $W$, the number of tokens allocated to a link $(i,j)$, $F_{ij}(n)$, satisfies the asymptotic relation

\[
\lim_{n\to\infty} \frac{F_{ij}(n)}{n} = g_{ij}, \qquad i,j = 1,\ldots,N. \tag{4.19}
\]

Proof of Lemma 4.5.5: A4 uses the same token generation mechanism as A1 and A2. Hence, by Lemma 4.5.3, in every time interval $n$ the total number of tokens generated at each end node of link $(i,j)$ differs from $g_{ij}n$ by a constant $\beta_{ij}$. Under A4, however, $F_{ij}(n)$, the total number of tokens assigned to the weight of a link, is at most the total number of arrived cells plus $B$:

\[
F_{ij}(n) \le A_{ij}(n) + B, \qquad i,j = 1,\ldots,N. \tag{4.20}
\]

Hence,

\[
\limsup_{n\to\infty} \frac{F_{ij}(n)}{n} \le \lambda_{ij}, \qquad i,j = 1,\ldots,N. \tag{4.21}
\]

By (4.16), the number of tokens generated at the end nodes of a link is asymptotically no greater than the number of arrived cells. Therefore, as $n$ tends to infinity, the weight of a link is limited by the number of generated tokens and not by the arrival rate, i.e.,

\[
\lim_{n\to\infty} \frac{F_{ij}(n)}{n} = g_{ij}, \qquad i,j = 1,\ldots,N. \tag{4.22}
\]

Besides the weight function, A4 also uses a slightly different mechanism to generate the self-randomized matching. Recall that A2 derives the self-randomized matching from the total cell and token arrival graph, and that, as explained in the previous section, at high rates this graph becomes so dense that it no longer provides a good set of candidate links. In A4, instead of the total cell and token arrival graph, we use the total cell arrival and weight increase graph: a link is inserted into the graph when there is a new arrival or an increase in the weight of that link. Consequently, even at very high rates, links are not inserted into the self-randomized computing block excessively.

A4 characterization:

- Weight function: tokens are generated for a link at each end node, with the token condition as the eligibility test; the link weight is the minimum of its tokens and the number of backlogged cells plus a constant $B$.
- Self-randomized generator: works on the total cell arrival and link weight increase graph.

Before ending this section, we state that A4 is a max-min fair randomized (self-randomized) scheduling algorithm.

Theorem 4.5.6 A4 is a randomized (self-randomized) max-min fair scheduling algorithm.

Proof of Theorem 4.5.6: The proof is very similar to that of Theorem 4.5.4, so we skip the details. The key point is Lemma 4.5.5, where we proved that the link weight evolves at the max-min fair rate. On the other hand, whenever a link that has tokens is scheduled, at least one of its tokens is consumed and the weight of the link is decremented by one. Hence the link weight can again be viewed as a virtual-cell arrival process, and the reasoning given for A1 in Theorem 4.5.4 applies here as well.

4.5.4 Simulation Results

In this section, we present two sets of simulations. The first set presents the delay performance of the three token-based self-randomized schedulers introduced here. In the second set, we demonstrate that A4 serves links according to their max-min fair rates and, furthermore, adapts very quickly to changes in the max-min rates.

Delay Performance

We start with the A1 scheduler. The experimental setup is the same as for the backlogged-weight-function schedulers. Results are given in Figures 4.8, 4.9, and 4.10. At low to moderate utilizations, the delay performance for the diagonal and uniform load distributions is better than B1; A1 performs especially better at low utilizations.
The undesirable phenomenon of high delays at very low rates is also eliminated for uniform and diagonal traffic. The improvement at low rates is due to the fact that links are inserted into the candidate set for the self-randomized matching on both token and cell arrivals; hence low-rate links are considered more often. Recall that the weight function of A1 is customized to provide max-min fairness, not good delay performance; the good delay performance is achieved as a side product.

In A2 the input graph to the self-randomized computer is persistent, so we expect better delay performance, especially at low utilization factors.

[Figure 4.8: Mean delay for A1 under diagonal traffic.]
[Figure 4.9: Mean delay for A1 under log-diagonal traffic.]
[Figure 4.10: Mean delay for A1 under uniform traffic.]

Simulation results are given in Figures 4.11, 4.12, and 4.13. In all cases we see better overall performance, especially at low rates. For the first time, even for log-diagonal traffic, the delay-throughput curve is monotonically increasing, which is the expected, desirable shape. The drawback of A2 appears at higher utilizations, where B1 and B2 outperform it.

In A4, the link weight exceeds the number of backlogged cells by no more than a constant $B$. In this way, we intended to capture the desirable features of both schemes: good delay performance over the whole utilization range while maintaining the max-min fairness property. A4 has the best overall delay performance for all examined load distributions over the entire utilization range. The difference is best illustrated for log-diagonal traffic at low utilization: all the other algorithms presented here have poor delay performance at low utilizations, with the average delay decreasing up to some point until it reaches a minimum, whereas the throughput-delay curve of A4 is always increasing.

[Figure 4.11: Mean delay for A2 under diagonal traffic.]
[Figure 4.12: Mean delay for A2 under log-diagonal traffic.]
[Figure 4.13: Mean delay for A2 under uniform traffic.]

The adopted weight function has a positive effect on performance. In token-based schemes, link weights increase at a rate proportional to the max-min rates of the links. The weight function therefore operates proactively, before cells accumulate in the buffers: intuitively, links with higher reserved rates are served faster even if they are not the most congested. This means flows are allowed to accumulate tokens even before cells arrive, and consequently they experience lower delays and buffer occupancy. However, for stability purposes, we need to maintain a balance and take the current buffer lengths into account.
This becomes more crucial at higher utilizations, as the number of congested links increases. In A4 the link weight cannot exceed the number of buffered cells by more than a constant; in this way, A4 strikes a balance between the two weight functions. At low rates the link weight effectively equals the number of tokens, whereas at higher rates it reflects the number of backlogged cells.

[Figure 4.14: Mean delay for A4 under diagonal traffic.]

This weight function gives A4 a dual behavior: it works on the reserved rates and provides max-min fairness, and at the same time it takes buffer lengths into account. Interestingly, the dual behavior of A4 is also reflected in its delay performance curves. For instance, for uniform traffic the slope of the delay curve changes considerably and flattens around 0.7 utilization, as if the delay curve were a combination of two different curves: one for lower rates based on the token weight function, and another for higher rates based on the backlogged-cells weight function.

The delay performance of all the proposed schemes is compared in Figures 4.17, 4.18, and 4.19. Generally, B1 and B2 perform better at high utilization, and A1 and A2 at low utilization. A4, due to its dual behavior, follows the curves of A1 and A2 at low utilization and those of B1 and B2 at high utilization; in fact, the performance of A4 is even better than A1 and A2 at low utilization. The difference is greatest for log-diagonal traffic, where A4 is the only algorithm without an intermediate minimum in its delay curve. Recall that the log-diagonal load distribution is, in a sense, the most non-uniform of the examined distributions: it has the lowest non-zero arrival rate on some links and, at the same time, the highest rate on others. The dual behavior of A4 enables it to perform well for both low-rate and high-rate links.

[Figure 4.15: Mean delay for A4 under log-diagonal traffic.]
[Figure 4.16: Mean delay for A4 under uniform traffic.]
[Figure 4.17: Mean delay for the different schemes under diagonal traffic.]

Max-min fairness simulation for A4

In this section, we study the max-min fairness behavior of the proposed algorithm. We consider a $4\times4$ input-buffered switch fabric with a log-diagonal weight matrix. The average serving rates of all links are calculated and plotted for every 1000 time slots. We study how well the algorithm tracks the max-min fair rates and how fast it adapts the serving rates to changes in the arrival pattern. We start the experiment with all queues backlogged; then, in the middle of the simulation, we stop the arrivals of links $(1,1)$ and $(2,2)$. The max-min fair rates before and after the shutdown are given below in the rate matrices $G_1$ and $G_2$, respectively.

[Figure 4.18: Mean delay for the different schemes under log-diagonal traffic.]
[Figure 4.19: Mean delay for the different schemes under uniform traffic.]
[Figure 4.20: Serving rate adaptation to arrival pattern changes (normalized rate vs. time).]

\[
G_1 = \begin{pmatrix}
0.533 & 0.267 & 0.133 & 0.067\\
0.067 & 0.533 & 0.267 & 0.133\\
0.133 & 0.067 & 0.533 & 0.267\\
0.267 & 0.133 & 0.067 & 0.533
\end{pmatrix},
\qquad
G_2 = \begin{pmatrix}
0 & 0.8 & 0.133 & 0.067\\
0.6 & 0 & 0.267 & 0.133\\
0.133 & 0.067 & 0.533 & 0.267\\
0.267 & 0.133 & 0.067 & 0.533
\end{pmatrix}
\]

The simulation results are given in Figure 4.20. The arrival pattern changes at time frame 100 in the plot, where each time frame consists of 1000 cell times. The serving rates adapt to the new arrival pattern immediately. This experiment illustrates that the serving rates of the A4 scheme are very close to the max-min rates; moreover, the algorithm adapts to changes in the arrival pattern very rapidly.

Chapter 5

Summary and concluding remarks

In this dissertation we investigated the problem of scheduling in an input-buffered, crossbar-based switch fabric. The switch fabric consists of $N$ input and $N$ output ports. At each input port, cells are buffered in separate virtual output queues (VOQ) corresponding to the different output ports. The scheduling problem is equivalent to finding a matching between the input and output ports. The objective is to develop scheduling algorithms that have low complexity and high performance. The performance metrics we considered are latency, throughput, and rate provisioning capability. Latency and throughput are common performance metrics for switch fabrics; we consider rate provisioning another vital requirement for next generation switch fabrics, one that most of the proposed solutions overlook.

Rate-based schedulers such as weighted round robin (WRR) and weighted fair queueing (WFQ) are the dominant solutions for line card schedulers, and several analytical results exist regarding the delay performance and memory requirements of data flows when a rate-based scheduler is used. In line cards, scheduling is a many-to-one problem, since multiple flows share a single outgoing link. In a crossbar switch fabric, scheduling is inherently a many-to-many problem, and line card solutions are no longer applicable. Nevertheless, it is desirable to have a rate-based scheduler with predictable performance.

In Chapter 2, we started from fluid policies, the origin of rate-based schedulers, and developed packetized tracking policies for fluid policies. Under a packetized tracking policy, the departure time of every cell is at most one cell time behind the fluid policy. For the special case of the $2\times2$ switch it is proved that tracking policies always exist, and a packetized tracking policy is provided. Furthermore, a counterexample for a $3\times3$ switch fabric shows that, in general, tracking policies do not exist. However, a heuristic tracking policy is proposed that performs reasonably well: under the heuristic scheduling algorithm, all cells are scheduled with bounded delay with respect to the fluid policy. The heuristic policy measures how many cells have been served for every link node and compares it with the number of cells that should have been served under the fluid policy. The difference is considered the credit of a node, and the heuristic algorithm always selects the matching with the highest credit.
The credits can also be considered numbers of virtual backlogged cells. The scheduler is capable of rate provisioning if the accumulated credits remain finite; if we envision credits as virtual backlogged cells, rate provisioning is possible if the scheduler is stable. This fact motivated us to focus on the heuristic algorithm and to try to prove that it is stable. In Chapter 3, we were able to go beyond that and introduce a class of scheduling algorithms that we call the maximum node contained matching (MNCM) algorithms, and to show that they are rate stable. We used fluid model techniques, which are very powerful tools for the analysis of queueing systems, to prove that these algorithms achieve 100% throughput with no speedup. The only assumption on the arrival pattern is that it satisfies the strong law of large numbers. This property was essential for us, since most of the classical proofs of stability are for Bernoulli i.i.d. arrivals; while the Bernoulli arrival model may seem reasonable for cell arrivals, it is definitely not a valid assumption for rate-based systems, where we deal with virtual cell (credit) arrivals.

We also introduced a new weighted matching algorithm belonging to MNCM, the maximum first matching (MFM), with complexity $O(N^{2.5})$. The MFM algorithm, to the best of our knowledge, is the lowest complexity deterministic scheduling algorithm that delivers 100% throughput. We further introduced the maximal sorted matching (MSM) algorithm, which for all practical purposes performs similarly to MFM. The complexity of MSM is $O(N^2)$, less than that of MFM; however, it does not belong to the MNCM class. One interesting path for future research is to prove that MSM is also rate stable, or to provide lower complexity algorithms that are rate stable.

For rate provisioning applications, we introduced MSUIM. MSUIM tries to find a matching containing the maximum number of nodes whose weight is at most one unit less than the maximum weight. We were able to find a deterministic (worst-case performance) bound for MSUIM.

In Chapter 4, we focused on self-randomized scheduling algorithms. Due to their low complexity, randomized algorithms have recently attracted considerable interest in the research community. Contrary to deterministic scheduling algorithms, randomized scheduling algorithms do not calculate a new matching every time; they select a matching from a predetermined set of matchings. The selection criterion is usually the weight of the matching, and the matching with the highest weight is always selected. Note that calculating the weights of a set of matchings and selecting the one with the highest weight is much simpler than constructing a matching with the maximum weight. The matching set is usually updated by inserting a randomly generated matching into the set; in self-randomized algorithms, the source of the randomization process is usually the arrival pattern of the cells.

We used fluid model techniques to show that the self-randomized scheduling algorithms deliver 100% throughput. We provided a general architecture for the design of self-randomized algorithms and introduced two algorithms that use the number of backlogged cells as the weight function. For rate provisioning, we introduced the concept of max-min fair self-randomized scheduling algorithms: self-randomized algorithms that provide QoS by sharing the switch bandwidth in proportion to the assigned weights.
We introduced three self-randomized scheduling algorithms and proved that they are max-min fair. To study and compare the performance of the proposed scheduling algorithms, several simulations were carried out and their results were presented and discussed. We showed that previously proposed self-randomized scheduling algorithms do not have acceptable delay performance: in particular, at very low throughput values, cells can experience high delays. The delay performance improves as the throughput increases to a certain level; after that, as expected, increasing the throughput increases the delay. Note that self-randomized scheduling algorithms rely on the arrival pattern as the source of randomization, and when the arrival rate is very low they do not have sufficient sample points to generate good randomized matchings. We addressed this problem in our proposed solutions and were able to resolve it to some extent. However, we believe that this field is far from maturity, and it is still possible to devise new randomized scheduling algorithms that perform well, have low complexity, and are capable of rate provisioning.

BIBLIOGRAPHY

[1] T. Anderson, S. Owicki, J. Saxe, and C. Thacker. High-speed switch scheduling for local area networks. ACM Transactions on Computer Systems, November 1993.

[2] J.C.R. Bennett and H. Zhang. WF2Q: Worst-case fair weighted fair queueing. IEEE INFOCOM '96, 1996.

[3] J.C.R. Bennett and H. Zhang. Hierarchical packet fair queueing algorithms. IEEE/ACM Transactions on Networking, 5(5):675-689, 1997.

[4] D. Bertsekas and R. Gallager. Data Networks. Prentice Hall, 1992.

[5] M.A. Bonuccelli and M.C. Clo. EDD algorithm performance guarantee for periodic hard-real-time scheduling in distributed systems. IPPS/SPDP, April 1999.

[6] C.S. Chang, W.J. Chen, and H.Y. Huang. Birkhoff-von Neumann input-buffered crossbar switches. IEEE INFOCOM '00, 2000.

[7] N. Chrysos and M. Katevenis. Transient behavior of a buffered crossbar converging to weighted max-min fairness. Inst. of Computer Science, FORTH, Heraklio, Crete, Greece, 2002.

[8] S. Chuang, A. Goel, N. McKeown, and B. Prabhakar. Matching output queueing with a combined input output queued switch. INFOCOM '99, 1999.

[9] T.H. Cormen, C.E. Leiserson, and R.L. Rivest. Introduction to Algorithms. McGraw-Hill, 1990.

[10] R.L. Cruz. A calculus for network delay, Part I: Network elements in isolation. IEEE Transactions on Information Theory, 37(1):114-131, 1991.

[11] R.L. Cruz. A calculus for network delay, Part II: Network analysis. IEEE Transactions on Information Theory, 37(1):132-141, 1991.

[12] J.G. Dai. Stability of fluid and stochastic processing networks. Miscellanea Publication No. 9, Center for Mathematical Physics and Stochastics, Denmark (http://www.maphysto.dk/), January 1999.

[13] J.G. Dai and B. Prabhakar. The throughput of data switches with and without speedup. INFOCOM '00, pages 556-564, 2000.

[14] A. Demers, S. Keshav, and S. Shenker. Analysis and simulation of a fair queueing algorithm. Proceedings of SIGCOMM '89, pages 1-12, 1989.

[15] J. Edmonds and R.M. Karp. Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the ACM, 19, 1972.

[16] G.L. Frazier and Y. Tamir. The design and implementation of a multiqueue buffer for VLSI communication switches. International Conference on Computer Design, Cambridge, Massachusetts, pages 466-471, 1989.
[17] L. Georgiadis, R. Guérin, and A.K. Parekh. Optimal multiplexing on a single link: Delay and buffer requirements. IEEE Transactions on Information Theory, 43(5):1518-1535, 1997.

[18] L. Georgiadis, R. Guérin, V. Peris, and K.N. Sivarajan. Efficient network QoS provisioning based on per node traffic shaping. IEEE/ACM Transactions on Networking, 4(4):482-501, 1996.

[19] P. Giaccone, B. Prabhakar, and D. Shah. Towards simple, high-performance schedulers for high-aggregate bandwidth switches. INFOCOM '02, 2002.

[20] J. Giles and B. Hajek. Scheduling multirate periodic traffic in a packet switch shaping. Conference on Information Science and Systems, Johns Hopkins University, 1997.

[21] S. Golestani. A self-clocked fair queueing scheme for broadband applications. Proceedings of INFOCOM '94, pages 636-646, April 1994.

[22] E.L. Hahne. Round-robin scheduling for max-min fairness in data networks. IEEE Journal on Selected Areas in Communications, 9(7), 1991.

[23] J.E. Hopcroft and R.M. Karp. An $n^{5/2}$ algorithm for maximum matching in bipartite graphs. SIAM Journal on Computing, 2(4):225-231, 1973.

[24] A. Hung, G. Kesidis, and N. McKeown. ATM input-buffered switches with guaranteed rate property. IEEE ISCC '98, Athens, 1998.

[25] T. Inukai. An efficient SS/TDMA time slot assignment algorithm. IEEE Transactions on Communications, 27, 1979.

[26] M. Karol, M. Hluchyj, and S.P. Morgan. Input versus output queueing on a space division packet switch. IEEE Transactions on Communications, 35(12):1347-1356, 1987.

[27] H.W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83-97, 1955.

[28] N. McKeown. iSLIP: A scheduling algorithm for input-queued switches. IEEE/ACM Transactions on Networking, 7(2):188-201, April 1999.

[29] N. McKeown, V. Anantharam, and J. Walrand. Achieving 100% throughput in an input-queued switch. INFOCOM '96, pages 296-302, 1996.

[30] A. Mekkittikul and N. McKeown. Scheduling VOQ switches under non-uniform traffic. Technical Report CSL-TR-97-747, Stanford University, 1997.

[31] A. Mekkittikul and N. McKeown. A practical scheduling algorithm to achieve 100% throughput in input-queued switches. INFOCOM '98, pages 792-799, 1998.

[32] A.K. Parekh and R.G. Gallager. A generalized processor sharing approach to flow control in integrated services networks: The single-node case. IEEE/ACM Transactions on Networking, 1(3):344-357, 1993.

[33] A.K. Parekh and R.G. Gallager. A generalized processor sharing approach to flow control in integrated services networks: The multiple node case. IEEE/ACM Transactions on Networking, 2(2):137-150, 1994.

[34] I.R. Philp. Scheduling real-time messages in packet-switched networks. PhD thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, 1997.

[35] V. Tabatabaee, L. Georgiadis, and L. Tassiulas. QoS provisioning and tracking fluid policies in input queueing switches. IEEE/ACM Transactions on Networking, 9(5), 2001.

[36] L. Tassiulas. Linear complexity algorithms for maximum throughput in radio networks and input queued switches. INFOCOM '98, 2:533-539, 1998.

[37] L. Tassiulas and A. Ephremides. Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks. IEEE Transactions on Automatic Control, 37(12):1936-1948, December 1992.

[38] L. Tassiulas and S. Sarkar. Maxmin fair scheduling in wireless networks. INFOCOM '02, 2002.