ABSTRACT

Title of Thesis: AN ALGORITHM TO EVALUATE THE ECHO SIGNAL AND THE VOICE QUALITY IN VOIP NETWORKS

Andre Neumann Kauffman, M.S., 2006

Directed By: Associate Professor Steven A. Tretter, Department of Electrical and Computer Engineering

Voice over the Internet Protocol (VoIP) has become increasingly popular, but reliability and voice quality remain important factors that limit the widespread adoption of VoIP systems. Providing good voice quality is of major importance for the transition from the PSTN to VoIP networks. There are several non-real-time algorithms that estimate the voice quality, such as the PESQ and the E-model. In this thesis we propose a real-time fuzzy algorithm to estimate the echo quality component of the voice quality in VoIP networks. Unlike the existing algorithms, the proposed algorithm does not need a reference signal and has low computational complexity. For these reasons, the proposed algorithm can be embedded in every VoIP system of a network to monitor live calls, giving an estimate of the instantaneous voice quality to the network provider.

AN ALGORITHM TO EVALUATE THE ECHO SIGNAL AND THE VOICE QUALITY IN VOIP NETWORKS

By Andre Neumann Kauffman

Thesis submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Master of Science, 2006

Advisory Committee:
Associate Professor Steven A. Tretter, Chair
Associate Professor Carol Y. Espy-Wilson
Assistant Professor Min Wu

© Copyright by Andre Neumann Kauffman 2006

Acknowledgements

I'm very grateful to the Electrical and Computer Engineering Department in the University of Maryland for all their support and for making the past two years a wonderful and fascinating experience for me. I also would like to express my gratitude to Texas Instruments for providing me a Research Assistantship during part of my graduate studies. I would like to thank Dr. Bogdan Kosanovic for all his support, friendship, helpful discussions and directions that greatly influenced and improved this work. I would like to thank Professor Steven Tretter for his friendship, support and guidance during my graduate studies and during the development of this work. Finally, I would like to thank my parents for their unlimited confidence in me and for encouraging me, since the beginning, to pursue knowledge and education.

Table of Contents

Acknowledgements
Table of Contents
List of Tables
List of Figures
1 Introduction
  1.1 Voice Quality in VoIP Networks
    1.1.1 Network Quality
    1.1.2 Subjective Voice Quality
    1.1.3 Objective Voice Quality
  1.2 Contributions
  1.3 Thesis Outline
2 Echo Control and Voice Quality
  2.1 Echo in Telecommunications Networks
    2.1.1 Network Delay Makes Echo Noticeable
    2.1.2 Delay in VoIP Networks
    2.1.3 Echo in VoIP Networks
  2.2 Line Echo
    2.2.1 Line Echo in VoIP Networks
  2.3 Acoustic Echo
  2.4 Line Echo Cancellation
3 Fuzzy Logic
  3.1 Introduction to Fuzzy Logic
  3.2 Advantages of Using Fuzzy Logic
  3.3 Fuzzy Logic and Embedded Systems
  3.4 Fundamentals of Fuzzy Logic
    3.4.1 Fuzzy Logic Versus Boolean Logic
    3.4.2 Fuzzy Sets and Membership Functions
    3.4.3 Fuzzy Set Operations
    3.4.4 Fuzzy Rules
    3.4.5 Fuzzy Implication
    3.4.6 Aggregation Operator
    3.4.7 Defuzzification
    3.4.8 Fuzzy Inference System
4 A Fuzzy Algorithm to Evaluate the Echo Component of the Voice Quality in a VoIP Network
  4.1 Objective Evaluation of the Voice Quality
  4.2 Channel Based Algorithm
  4.3 Network Based Algorithm
  4.4 Simulation Results
5 Summary and Future Work
  5.1 Summary
  5.2 Further Work
Appendix
Bibliography

List of Tables

Table 1-1 Service requirements that are taken into consideration when evaluating the performance of a VoIP network
Table 1-2 Description of MOS scores
Table 2-1 Some typical transmission facility delays
Table 2-2 ITU-T limits for one-way transmission time (delay) with echo control
Table 3-1 Some successful fuzzy systems
Table 4-1 Fuzzy sets associated to the input parameters
Table 4-2 Fuzzy rules to evaluate the echo component of the voice quality

List of Figures

Figure 1-1 Different perspectives for voice quality evaluation in a VoIP network
Figure 1-2 Common concepts for the PSQM and PAMS methods
Figure 1-3 Block diagram of the PSQM algorithm
Figure 1-4 Components required for evaluating the performance of a VoIP network, highlighting where the VoIP voice quality fits in the bigger picture
Figure 2-1 Impact of delay on call quality with and without echo
Figure 2-2 The hybrid device and line echo generation
Figure 2-3 Simplified block diagram of a TDM-IP gateway
Figure 2-4 Circuit with a line echo canceller
Figure 2-5 Block diagram of a line echo canceller
Figure 3-1 Block diagram of a generic fuzzy inference system
Figure 3-2 Example of a fuzzy member function
Figure 3-3 Comparing a characteristic function to a membership function
Figure 3-4 Set of rules being aggregated and the final defuzzification
Figure 3-5 Example of the fuzzy implication method chosen for our proposed algorithm
Figure 3-6 Example of the fuzzy aggregation method used in our proposed algorithm
Figure 3-7 Example of a defuzzification by center of mass
Figure 3-8 Detailed fuzzy inference system
Figure 4-1 Classification of voice quality algorithms for VoIP systems
Figure 4-2 Fuzzy inference system to estimate the voice quality in a VoIP network
Figure 4-3 Fuzzy inference system that estimates the performance of a VoIP network for a call
Figure 4-4 Output membership functions for the echo component of the voice quality
Figure 4-5 ERL fuzzy membership function
Figure 4-6 ACOM fuzzy membership functions
Figure 4-7 Receive speech fuzzy membership function
Figure 4-8 Transmit noise fuzzy membership function
Figure 4-9 Graphical interpretation of the fuzzy rule
Figure 4-10 Echo quality estimation for a simulated call
Figure 4-11 Speech and echo signals in the send path
Figure 4-12 Speech signal in the receive path
Figure 4-13 Estimating the voice quality for a subnetwork of VoIP channels
Figure 4-14 Average quality for calls with good echo component
Figure 4-15 Average quality for calls with bad echo component
Figure 4-16 Histogram showing the echo signal quality
Figure 4-17 Histogram for a call with a good estimated echo signal quality
Figure 4-18 Comparing the estimated echo quality with the estimated ACOM

1 Introduction

1.1 Voice Quality in VoIP Networks

Voice quality is essential in any communication system that is based on speech transmission. Voice over the Internet Protocol (VoIP) systems have become increasingly popular in the past few years and will continue to spread both in the carrier and enterprise sectors. In fact, current projections forecast that the total market value for services using VoIP will grow almost tenfold over the next five years. It is clear that VoIP will evolve from being a replacement service for the public switched telephone network (PSTN) market to providing truly converged services to the home and business.

Voice is one of the hardest services to provide on an IP network. The PSTN was built to provide an optimal service for time-sensitive voice applications, with low delay, low jitter, and constant but low bandwidth. IP networks, on the other hand, have been built to support non-real-time data applications such as email or file transfer. These applications are characterized by bursty traffic, with occasional peaks in demand for high bandwidth, but are not sensitive to delay.

During a conversation, humans have little tolerance for delays, jitter, echo (which is a direct consequence of the delay in VoIP networks) and noise (which, for instance, can be introduced during the low bit rate voice coding that is commonly implemented by VoIP systems). In addition to the degrading factors already present in the PSTN, VoIP networks introduce further factors such as latency, delay jitter and packet loss. In order to provide a good quality of service (QoS) for VoIP networks, the existence of an embedded module that assesses the voice quality in each live call is necessary.
This embedded module is the main concern of this work.

The performance of a VoIP network can be determined by a variety of parameters such as the availability of the network and dial tone, call setup request processing performance, call completion, call drop rate, one-way voice transport delay, voice quality during the call, and so on. Table 1-1 briefly lists the service requirements that are taken into consideration when evaluating the performance of a VoIP network.

Service requirements and their parameters:

Service requirements before call setup
• Availability of dial tone.
• Availability of computing and network resources for honoring call processing requests.

Service requirements during call setup
• Total amount of time to set up a call (can vary from 500 ms to 10 s, depending on availability of the network).
• The number of simultaneous calls that can be handled without any per call wait.

Service requirements during a VoIP session
• Voice coding and processing delay
• Voice packet loss
• Echo
• Jitter

Service requirements after a VoIP session is complete
• Maintenance of a complete call log and call detail record (CDR).

Table 1-1 Service requirements that are taken into consideration when evaluating the performance of a VoIP network

Providing good QoS in VoIP networks is of major importance for the transition from the PSTN to VoIP networks. We are so used to the QoS provided by the PSTN that anything less than that would become a barrier to the deployment of VoIP systems and networks. The evaluation of the QoS for a VoIP system or network depends on a set of parameters and requirements that includes those described in Table 1-1. In this work we will be concerned only with a subset of the requirements described in Table 1-1. We will not analyze the requirements related to the signaling and call control protocols.
More specifically, in this work we will be interested in evaluating the service requirements during a VoIP session, that is, the voice quality over VoIP networks (here we consider that the parameters associated with the requirements during a VoIP session in Table 1-1 are all grouped under what we are calling voice quality). One of the main components of the voice quality is the amount of echo present in the conversation. We will discuss this in some detail and present an algorithm to evaluate how effective the echo cancellation is (or how the echo signal is influencing the voice quality) in a VoIP call.

We can evaluate the quality of voice over IP networks from three different perspectives: the network quality, the objective quality, and the subjective quality, as illustrated in Figure 1-1.

Figure 1-1 Different perspectives for voice quality evaluation in a VoIP network

The network quality reflects the provider's perspective. The objective and subjective quality reflect the customer's perspective. The network quality can be measured relatively easily through network parameters such as the packet loss rate, packet delay or jitter. Subjective quality is generally more meaningful than network quality, as it relates directly to user-perceived quality. Assessing subjective voice quality, however, requires listening tests with a large number of test subjects. For this reason, objective quality measures that predict subjective quality are typically employed in the evaluation of voice transmission systems. In the next sections we briefly describe the different perspectives for voice quality evaluation in a VoIP network.

1.1.1 Network Quality

In general, poor network quality decreases the performance of a VoIP system. In VoIP applications, delay, jitter and packet loss are the main network impairments that affect perceived voice quality.
Jitter can be partially compensated for by using a playout buffer at the receiving end, but this introduces further delay and additional packet loss.

Several components (logical and physical) of the IP network cause delay, jitter and packet loss. Here we briefly describe some of the components that characterize the network quality and at the same time impact the quality of VoIP systems:

• Network protocols - routing protocols, traffic control protocols
• Router operation
• Bandwidth of the links
• Network reliability

Network reliability is an important component that introduces delay and packet loss, especially in the backbone of IP networks. There are two important scenarios that can directly influence the network reliability: routing reconfiguration and link failures.

• Link failure: There are many reasons that can lead to link failures, such as fiber cuts, linecard or router crashes and maintenance operations. In fact, long outage durations are typically attributed to a link failure in the IP network backbone [R 9].
• Routing reconfiguration: It is typical for a routing protocol to require around 5 seconds to converge to a new configuration when a link goes down and around 15 seconds when a link goes up. During this reconfiguration period, forwarding may be disrupted and voice packets may be lost.

All the network behavior described above can influence the amount of delay and packet loss present in a VoIP system. When this happens, the IP network can exhibit undesirable characteristics, such as large delay spikes, periodic delay patterns and packet loss on one or more paths. All these lead to poor VoIP performance.
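The tradeoff between playout delay and packet loss mentioned above can be illustrated with a small sketch. It assumes a fixed playout delay and treats every packet whose one-way network delay exceeds that value as lost. The function name `late_loss_fraction` and the delay values are hypothetical and chosen only for illustration; they are not measurements from this work.

```python
from typing import Sequence


def late_loss_fraction(network_delays_ms: Sequence[float],
                       playout_delay_ms: float) -> float:
    """Return the fraction of packets that arrive after their playout time.

    Assumption: a fixed (non-adaptive) playout buffer, so a packet is
    discarded whenever its one-way network delay exceeds the playout delay.
    """
    if not network_delays_ms:
        raise ValueError("need at least one packet delay")
    late = sum(1 for d in network_delays_ms if d > playout_delay_ms)
    return late / len(network_delays_ms)


# Hypothetical one-way delays (ms) for ten consecutive voice packets,
# including two delay spikes caused by jitter.
delays_ms = [40, 42, 41, 75, 43, 40, 120, 44, 41, 60]

# A deeper buffer adds end-to-end delay but discards fewer late packets.
for playout_delay_ms in (50, 80, 130):
    loss = late_loss_fraction(delays_ms, playout_delay_ms)
    print(f"playout delay {playout_delay_ms} ms -> late-packet loss {loss:.0%}")
```

With these illustrative numbers, increasing the playout delay from 50 ms to 130 ms removes the late-packet loss entirely, at the cost of 80 ms of additional delay on every packet.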
There are experiments showing that calls using the G.711 (that is, PCM) encoder with high intrinsic quality and good echo cancellation, but with some delay and packet loss, are barely able to provide acceptable VoIP service (MOS > 3.6; MOS is defined in the next section). So we can, to some extent, assess the voice quality in a VoIP system by assessing the quality of the IP network, but this reflects mostly the provider's perspective. In order to obtain a more precise evaluation of the voice quality (a user's evaluation) we need to go to the end points of the system, as shown in Figure 1-1. The next sections describe the other two possible perspectives on quality in the performance evaluation of a VoIP system.

1.1.2 Subjective Voice Quality

The MOS - mean opinion score - is a subjective voice quality assessment method. It is considered by many researchers to be the best method for assessing voice quality because its result is based directly on human listening. The MOS is a subjective rating system that is defined in ITU-T P.800. It is based on the opinions of several testing volunteers who listen to a sample of voice traffic and rate the quality of that transmission. The volunteers listen to a variety of voice samples and are asked to consider factors such as loss, noise and echo. The volunteers then rate the voice samples by giving a score in the range from 1 to 5, as described in Table 1-2. The MOS score is calculated as the average of the scores given by all listeners. The MOS scores are defined as follows:

MOS score   Description
5           Excellent
4           Good
3           Fair
2           Poor
1           Bad

Table 1-2 Description of MOS scores

While MOS represents the true perceptual assessment of speech quality, it has obvious limitations. It is a time-consuming process, it is not an automated method, and it cannot be applied to estimate the quality of a call in a real-time environment.
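The averaging step that defines the MOS can be sketched as follows. The listener ratings are hypothetical and the function name `mean_opinion_score` is an assumption made for illustration; the sketch only shows the arithmetic on the 1 to 5 scale of Table 1-2 and does not reproduce the test procedure of ITU-T P.800.

```python
from statistics import mean
from typing import Sequence

# Opinion scale from Table 1-2.
MOS_LABELS = {5: "Excellent", 4: "Good", 3: "Fair", 2: "Poor", 1: "Bad"}


def mean_opinion_score(ratings: Sequence[int]) -> float:
    """Average the individual listener ratings into a single MOS value."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        if rating not in MOS_LABELS:
            raise ValueError(f"rating {rating!r} is outside the 1-5 scale")
    return float(mean(ratings))


# Hypothetical ratings given by eight listeners to the same voice sample.
ratings = [4, 3, 4, 5, 3, 4, 4, 2]
mos = mean_opinion_score(ratings)
print(f"MOS = {mos:.3f}")
```

Because every rating has to come from a human listener, the cost of this otherwise trivial computation lies entirely in collecting the inputs, which is why the method does not scale to live calls.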
It is interesting to note that even with this time-consuming MOS methodology, most experiments can only indicate the speech quality of unidirectional connections [R 7]. For instance, the MOS test does not indicate how the final QoS degrades due to decreased interactivity when long transmission delays are introduced. We will refer to the MOS score throughout this work as a reference to quantify and compare the voice quality in different scenarios. This is what is normally used in research papers that assess voice quality.

1.1.3 Objective Voice Quality

One of the advantages of objective voice quality algorithms over subjective voice quality algorithms is that objective algorithms can be automated and may not require any human intervention at all. There are two main classes of objective voice quality algorithms: active and passive algorithms. Objective voice quality monitoring, whether active or passive, has recently gained ground among VoIP providers.

In active monitoring, a network analyzer injects traffic patterns that resemble a VoIP application into the network; the analyzer then observes the overall voice quality by comparing the impaired voice with the original voice sample using a perceptual model. Although this scheme can provide useful input for optimization and network dimensioning, it uses network resources, provides non-real-time results, and can't concretely determine the causes of degradation. A passive monitoring scheme, on the other hand, can operate in real time, and lets VoIP applications take corrective action when quality of service is unacceptable. For these reasons, the algorithm we propose in this work is a passive, objective voice quality algorithm.

Several active, objective algorithms have been proposed to automate the voice quality assessment of a call. The two most successful methods are the PAMS - Perceptual Analysis / Measurement System - and the PSQM - Perceptual Speech Quality Monitor.
In both methods a reference speech sample representing the transmitted speech signal is passed through a degradation process, producing the degraded speech sample that represents the received signal. Signal analysis is performed in both the time and frequency domains on the two speech samples and an estimate of the MOS score is provided. Figure 1-2 is a high level view of the common concepts behind the PAMS and PSQM methods.

Figure 1-2 Common concepts for the PSQM and PAMS methods
(Figure labels: Original speech signal; Network impairments; Voice quality evaluation; PSQM or PAMS score.)

Although the objective techniques described above are automated and simplify the voice quality evaluation process compared to the MOS method, they are considered intrusive or active. This is due to the assumption that the reference speech sample representing the transmitted signal is available. This is, in general, not true in live communication calls.

It is worth noting that ITU-T Recommendation P.861 specifies a model to map audio signals to their representation inside the head of a human. The basic idea of the modeling approach is to take measurements of the processed (compressed, encoded, etc.) signal, perform an objective analysis between the original and the processed version and offer an "opinion" as to the "goodness" of the signal. The result is an absolute number. The challenges of measuring voice quality in real-life situations are a little more complex, and more data is often necessary than is derived in Recommendation P.861. We can see this complexity in the block diagram description of the PSQM algorithm in Figure 1-3.

The PSQM algorithm derives objective numbers that are an estimate of the quality of the voice being delivered. The PSQM algorithm uses several steps in processing the input and output signals. Figure 1-3 shows a block diagram with the processing steps for the PSQM algorithm - extracted from [R 6].
In this block diagram, x[n] is the input signal (reference) and y[n] is a scaled version of the output signal. The PSQM also requires time-aligned input and output stream samples, which may be difficult to obtain in practice because this requires precise knowledge of the delay that affected the output signal.

Figure 1-3 Block diagram of the PSQM algorithm
(Blocks shown in the figure: for each of xi[n] and yi[n], Hanning windowing, FFT and power spectrum density calculation, frequency warping, filtering with the receiving characteristics of the handset, and intensity warping; further blocks are the local scaling factor calculation, noise, the loudness scaling factor calculation, and cognitive subtraction, asymmetry processing and silent interval weighting.)

We note that the PSQM, besides requiring a reference signal and time alignment of the signals, also requires considerable computational power in order to perform operations such as FFTs and filtering. This is a heavy requirement to place on embedded systems that are monitoring live calls, especially systems with a high density of channels (which sometimes have hundreds of channels being processed by a few DSPs in a single platform). This algorithm is simply not applicable in this scenario. Our proposed algorithm, on the other hand, is not as complete as the PSQM, but offers a tradeoff between computational requirements and accuracy of the estimation of the voice quality.

Combining some ideas of the PSQM and PAMS methods, the ITU-T recently created the PESQ - Perceptual Evaluation of Speech Quality - in Recommendation P.862.

In order to provide a voice quality measure in real time for ongoing calls, non-intrusive (or passive), objective approaches have also been proposed. One such methodology relies on measuring various parameters that can affect the voice quality in the IP network.
Those parameters are packet loss, latency and delay jitter. An example of this approach is the E-model of ITU-T Recommendation G.107. In this method the two ends of a VoIP call can exchange information on these instantaneous parameters and the voice quality measure can be obtained. However, in this method the actual voice is not used and for this reason the results will not be accurate enough. Most current methods [R 33] use the E-model to measure voice quality, but the E-model requires subjective tests to derive model parameters, which is time-consuming and often impractical. A real-life VoIP network may cross several countries and in this case the E-model will need to be trained for each country. As a result, the E-model is only applicable to a limited number of codecs and network conditions.

It is important to note here that all the algorithms that were described (even the objective ones) have some kind of heuristic motivation. For instance, in the PSQM algorithm the "silent interval weighting" operation (Figure 1-3) is believed to allow a fitting of the cognitive processing to cultural differences. Changing the way this operation is done may result in more precise voice quality estimation in different cultures. In the same way, some of the reasoning behind our proposed algorithm is based on heuristics, for instance, on how humans perceive echo.

1.2 Contributions

As was stated in the beginning of this chapter, voice quality is essential in any communication system that is based on speech transmission. We also emphasized that providing good QoS in VoIP networks is of major importance for the transition from the PSTN to VoIP networks. As was seen in Table 1-1, there are many parameters that should be controlled in order to provide a good QoS in a VoIP network or system. In this work we will focus on the voice quality portion of the QoS set of requirements. We will propose a real-time algorithm to evaluate the voice quality in a VoIP system or network.
We will show the details and implementation results of one building block of such an algorithm. Specifically, we will focus on the building block that evaluates the echo component of the voice quality in VoIP networks.

The state of the art in evaluating the voice quality was briefly highlighted in this chapter. As is stated in [R 8], objective assessment methods such as PSQM and PAMS were mainly developed for the evaluation of speech codec performance and are not suited to delayed and jittered speech signals. Another class of algorithms to estimate voice quality is represented here by the E-model (Section 1.1.3), which requires parameters that depend on the telephone terminal and are difficult to obtain. The conclusion is that "there is a need for simple speech objective evaluation methods" [R 8].

In this work we propose a real-time, low computational complexity fuzzy inference system to evaluate the echo component of the voice quality over VoIP networks (Figure 4-2). We also propose extensions to be incorporated in the algorithm in order to obtain the overall performance of the VoIP system or network (Figure 4-3). We suggest a simpler (compared to PSQM, PESQ or the E-model) objective voice quality evaluation method which divides the voice quality assessment into three main engines that separately compute the contributions of the three main factors that affect the voice quality: delay, jitter and echo. It should be clear that the main developments in this work are done for the echo component of the voice quality.

Playout buffers at the receiving side of a VoIP call can be used to compensate for the effects of jitter, based on a tradeoff between delay and packet loss [R 10]. However, as we can see in Figure 2-1, the effect of short delay (less than 300 ms) on the voice quality when echo is not present is minimal.
If we can eliminate jitter (at the same time decreasing the rate of packet loss, due to jitter compensation) by increasing the delay and keeping the echo under control, the final voice quality of the call should be better than when the jitter is left in the conversation. This is one of the reasons why we emphasize the study of the echo quality parameter for estimating the voice quality in VoIP networks.

Of course these elements are interconnected in non-trivial, non-linear ways. For instance, jitter requires a jitter buffer (playout buffer), which causes delay, and delay may amplify some existing echo that is generated by speech signal reflection somewhere in the network. Our algorithm doesn't try to evaluate precisely each separate contribution, nor does it try to precisely model how these components interact. We use fuzzy logic inference systems separately for each element (echo, jitter, delay) to try to approximate their contribution and also the final voice quality assessment.

Besides using fuzzy logic, which already requires low computational effort, we try to use as inputs for our fuzzy inference systems parameters that are "free" of computational effort, since they are already available in the system for some other purpose (e.g. the echo canceller has to estimate the loss in the echo path and we use this estimate as an input to our fuzzy inference system that estimates the echo component of the voice quality).

While most of the existing voice quality assessment techniques are active and cannot be applied to real-time analysis of live calls, our proposed algorithm does not interfere with the call or the signals present in the network. This ability to analyze live calls with low computational complexity is the main advantage of our proposed method over the methods described in this chapter.
The main disadvantages of our proposed algorithm, as we will show in Chapter 4, are the lack of accuracy (the usual fuzziness introduced by the fuzzy engine) and the need to calibrate the thresholds of the algorithm for each specific VoIP system or network.

Finally, we would like to give an idea of where our algorithm fits in the bigger picture of evaluating the performance of VoIP networks, that is, not only the voice quality but the performance of the whole VoIP system or network. As described in Table 1-1, we can split the analysis of the performance of a VoIP system according to the various stages of a call over IP networks. We have performance requirements before the call is set up, during the setup, during the VoIP session and after the VoIP session. For each of these steps that define a VoIP call, we have a set of parameters that are used to evaluate the performance of that step. In the next diagram we show how our proposed algorithm fits in the larger view of analyzing the performance of a VoIP system.

[Figure 1-4 diagram: the performance of a VoIP system or network is split into four stages (before call setup, during call setup, during a VoIP session, after session is complete); voice quality is broken down into delay, jitter, packet loss and echo.]

Figure 1-4 Components required for evaluating the performance of a VoIP network, highlighting where the VoIP voice quality fits in the bigger picture

1.3 Thesis Outline

In Chapter 2 we describe how echo is created in VoIP networks and the relationship between echo control and voice quality. The description of the echo problem given in that chapter is important because the main result of this work is an algorithm to evaluate the echo component of the voice quality in VoIP systems or networks. In Chapter 3 we give an overview of fuzzy logic and soft computing, which was the methodology used in the development and implementation of our proposed algorithm. Chapter 4 is our original contribution and the main result of this work.
It describes the ideas, implementation and simulation results of our proposed algorithm. We present our conclusions and some directions for further work in Chapter 5.

2 Echo Control and Voice Quality

2.1 Echo in Telecommunications Networks

In most cases our everyday conversations take place in the presence of echoes. We hear echoes of our speech waves as they are reflected, for instance, from the floor and the walls. However, if the reflected waves arrive shortly after we speak, we do not perceive them as echo but as reverberation. On the other hand, if the reflected wave takes 20 or 30 milliseconds (ms) to come back to us, we will identify it as an annoying echo. Similarly, in telecommunications networks echo can be quite annoying and, if left uncontrolled, can make it impossible to carry on a conversation.

Hearing your own voice in the receiver while you are talking is common and reassuring to the speaker. Hearing your own voice in the receiver after a delay of more than about 25 ms, however, can cause interruptions and can break the cadence of a conversation. Whether a caller hears echo depends chiefly on the amount of delay present in the circuit or network. Most callers will hear echo of their own voice if the circuit contains as little as 30 ms of round-trip delay. If the round-trip delay approaches 50 ms, virtually all callers will complain of echo if it is left uncontrolled.

2.1.1 Network Delay Makes Echo Noticeable

Delay is introduced into the telecommunications network primarily by transmission facilities and transmission equipment. Some types of transmission equipment, such as a digital switch, introduce negligible delay. Other transmission equipment, such as low bit rate voice encoders, often introduces significant delay.
Depending on the network topology and the type of transmission equipment used in the network, 30 ms of round-trip delay can occur in connections that are across the country or just across town. The next table lists some typical transmission facility delays.

Transmission facility      Delay per 100 miles
T1 carrier over copper     1 ms
Fiber optic cable          1 ms
Microwave radio            0.7 ms

Table 2-1 Some typical transmission facility delays

Even when echo control is present, there is a limit to the amount of delay that will be tolerated by an average user. ITU-T Recommendation G.114 provides limits for one-way transmission time (delay) on connections with adequately controlled echo.

One-way transmission time  User acceptance
0 to 150 ms                Acceptable for most users
150 to 400 ms              Acceptable but has impact
400 ms and above           Unacceptable

Table 2-2 ITU-T limits for one-way transmission time (delay) with echo control

The next figure shows how, for a specific network configuration, the quality of a call degrades with increasing delay. The figure also shows how this degradation is further affected by the presence of echo.

Figure 2-1 Impact of delay on call quality with and without echo

It is clear from the figure above that echo is a determining factor in decreasing the quality of the call. More specifically, echo has two drawbacks: it can be loud and it can be long. The louder and longer the echo, the more annoying it becomes.

So far, we have analyzed the phenomenon of echo in telecommunications networks in general. The next sections discuss in more detail what causes delay in VoIP networks (which is important because echo is noticeable only when delay is present) and the specifics of the echo problem in VoIP networks.
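As a small illustration of how Tables 2-1 and 2-2 can be used together, the following sketch estimates a one-way propagation delay and maps a delay value to its acceptance band. The function names are hypothetical, and the handling of the shared boundary values (150 ms and 400 ms) is an assumption, since the tables do not say which band they belong to.

```python
# Per-100-mile delays copied from Table 2-1 (milliseconds).
DELAY_MS_PER_100_MILES = {
    "T1 carrier over copper": 1.0,
    "Fiber optic cable": 1.0,
    "Microwave radio": 0.7,
}


def propagation_delay_ms(facility: str, distance_miles: float) -> float:
    """One-way propagation delay over a single facility type."""
    return DELAY_MS_PER_100_MILES[facility] * distance_miles / 100.0


def classify_one_way_delay(delay_ms: float) -> str:
    """Map a one-way delay to its Table 2-2 acceptance band.

    Assumption: 150 ms falls in the first band and 400 ms in the last one.
    """
    if delay_ms < 0:
        raise ValueError("delay must be non-negative")
    if delay_ms <= 150:
        return "Acceptable for most users"
    if delay_ms < 400:
        return "Acceptable but has impact"
    return "Unacceptable"


# Example: 3000 miles of fiber contribute about 30 ms of one-way delay.
long_haul_ms = propagation_delay_ms("Fiber optic cable", 3000)
print(long_haul_ms, classify_one_way_delay(long_haul_ms))
```

Note that this only accounts for propagation delay. The coding and buffering delays discussed in the following section would have to be added before classifying a real VoIP path.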
[Figure 2-1 axes: one-way delay from 0 to 400 ms against relative quality from worst to best, with one curve for "echo present" and one for "no echo".]

2.1.2 Delay in VoIP Networks

In the PSTN, delay is in large part due to propagation in the transmission facilities, which means that short distance calls rarely suffer from delay. In VoIP networks, by contrast, delay is introduced by several sources, as we describe next.

Traditionally used in the PSTN, the pulse code modulation (PCM) method of encoding voice signals (as defined in ITU-T's G.711 standard) generates a bit stream of 64 kbps. In VoIP applications, on the other hand, low bit rate voice encoding algorithms are widely used. For instance, the most popular frame-based vocoders that use linear prediction are the G.723 standard, generating a bit stream of 5.3 kbps, and the G.729 standard, producing a bit stream of 8 kbps. However, this reduction in rate does not come for free. There is a coding delay associated with each vocoder. For instance, in G.723 systems there is a delay of approximately 37.5 ms due to the algorithmic portion of the codec delay (the coder processes the voice signal in 30 ms frames). As we will see, this delay, when added to other delays introduced by the network, results in an end-to-end delay that greatly increases the perception of echo in VoIP networks.

Buffers are another cause of delay in VoIP networks. IP-based networks employ buffers for several reasons. At the access domain, a buffer provides temporary storage for packets before they are routed to the appropriate transport network. The amount of delay suffered by packets at this level of the network depends on buffer size, traffic density and packet priority. At the transport domain, buffering is needed to support proper routing and multiplexing of packets. In this domain, the total amount of delay depends on several aspects such as propagation time, transmission capacity and header processing delays.
Finally, at the packet delivery domain, packets that arrive earlier than expected need to be stored temporarily before being delivered. For VoIP applications, delayed packets may become useless after a specified amount of time. The delay jitter buffer holds both the packets that arrived early and the delayed packets in an attempt to neutralize the effects of packet inter-arrival jitter. This helps maintain the liveliness of real-time communication over IP networks, increasing the voice quality. These playout buffers at the receiving side of a VoIP call can be used to compensate for the effects of jitter based on a tradeoff between delay and packet loss.

Besides the "designed" delays described above, IP networks are susceptible to several network scenarios that can drastically increase the amount of delay in one or more paths of the network. These scenarios, such as link failure and routing reconfiguration, were described in Section 1.1.1. In an ideal VoIP network the one-way delay would be less than 150 ms.

2.1.3 Echo in VoIP Networks

As described in the previous section, in VoIP systems the delays introduced by coding the speech into packets and removing network jitter are long enough to make the system susceptible to echo problems even for short distance calls. Echo cancellation is therefore likely to be needed in most VoIP systems. This is in contrast to the PSTN, where echo cancellation is only necessary on long-haul connections.

In general, short-delay echoes are rarely distinguished from side-tone unless either the round-trip delay exceeds 30 ms or the echo level is extremely high. For this reason echo cancellation is not required on short PSTN connections. However, round-trip delays of VoIP systems are unlikely to be less than 30 ms, so some form of echo cancellation is almost always required. If a VoIP system connects to a local PSTN, echo cancellation is probably needed to cancel the local hybrid reflections.
If the system does not connect to a local PSTN, echo cancellation should still be included to remove any acoustic echo.

As mentioned in the last paragraph, in general there are two possible origins for echo in telecommunications networks. Echo can originate from hybrid reflections in the network or from poor acoustic isolation. Depending on how the echo originated, it is called line echo or acoustic echo. These two types of echo are described in Sections 2.2 and 2.3 respectively.

2.2 Line Echo

In a traditional network, line echo is caused by an impedance mismatch at the conversion from the four-wire network switch to the two-wire local loop. Echo in the PSTN is regulated with echo cancellers and tight control of impedance mismatches at the common reflection points.

The 2-wire local loop consists of a single pair of wires that carries both directions of the conversation. At the local telephone exchange, this 2-wire pair is connected to a 4-wire trunk by a device called a hybrid. The hybrid splits the 2-wire local loop into two separate pairs of wires, one for the send path and one for the receive path, as shown in the following figure.

Figure 2-2 The hybrid device and line echo generation

Because the hybrid cannot be made to split the 2-wire loop perfectly, some of the receive signal leaks into the send path; this leaked signal is called echo.

Telephony networks in parts of the world where analog voice is primarily used employ echo suppressors. This is not the best mechanism for removing echo. For instance, a line that has an echo suppressor cannot use ISDN because the echo suppressor cuts off the frequency range that is used by ISDN. Our proposed algorithm does not deal with echo suppressors. In IP networks, on the other hand, echo cancellers can be built into the codecs and operate on each DSP.
In our proposed algorithm we take advantage of the measurements made by the echo cancellers present in the DSP to draw conclusions about the echo quality of the call and, more generally, the voice quality of the call. Note that since the echo canceller has already computed these measurements (which will be described in more detail in Chapter 4), the algorithm requires no extra computational effort from the DSP to obtain them.

2.2.1 Line Echo in VoIP Networks

In VoIP networks line echo is generated from the telephone network (PSTN) toward the packet network. Normally the IP portion of the VoIP solution adds more than 50 ms of round-trip delay, and for this reason line echo cancellers are essential for VoIP networks when they interface with the PSTN.

The echo-cancellation tail length varies among different VoIP applications. The tail-length requirement is determined by the distance between the gateway equipment and the four-to-two line hybrid. Typically this ranges from an 8 ms tail length for residential applications to a 128 ms tail length for carrier applications.

The following figure is a very simplified block diagram of a TDM-IP gateway with a line echo canceller.

Figure 2-3 Simplified block diagram of a TDM-IP gateway

2.3 Acoustic Echo

While not as prevalent as echo caused by the hybrid (line echo), acoustic echo can also be encountered in telecommunications networks. Acoustic echo is caused by poor isolation between the microphone and the speaker of some telephone sets. Most hands-free speakerphone systems incorporate special echo control circuitry to ensure that echo is not a problem. Another example is the need for acoustic echo cancellation to protect the landline subscriber from acoustic echo originating from digital wireless networks. In the case of VoIP networks, acoustic echo is normally present when at least one of the callers is using a computer with a loudspeaker and a microphone.
As is the case for line echo, acoustic echo becomes audible when there is long delay. Unlike line echo, however, acoustic echo is usually not severe enough to make the conversation impossible. The methodology for canceling acoustic echo differs in many respects from the methodology used for canceling line echo. In this work we will not deal with acoustic echo in VoIP networks. Our proposed algorithm is valid only for line echo signals, and in the next section we describe in more detail the ideas behind line echo cancellation.

[Figure 2-3 block labels: PCM interface, voice, signaling, echo canceller, signal processing, voice codec, network encapsulation, packet interface.]

2.4 Line Echo Cancellation

In this section we give an overview of the building blocks of a line echo canceller. Our proposed algorithm to evaluate the echo component of the voice quality is based on measurements made by the echo canceller. This section also defines the notation and some parameters that will be used when we describe our algorithm.

In this work we adopt the notation used by ITU-T Recommendation G.165 [R 35] and by most books and articles on echo cancellation. A line echo canceller has four ports, two on the near end side and two on the far end side. The four ports are shown in the next figure, which was again extracted from ITU-T Recommendation G.165.

Figure 2-4 Circuit with a line echo canceller

The four ports of the echo canceller are denoted as follows:
• Receive-in (Rin)
• Receive-out (Rout)
• Send-in (Sin)
• Send-out (Sout)

An echo canceller monitors speech from the far end that passes through its receive path and uses this information to compute an estimate of the echo, which is then subtracted from its send path. If the estimate is good, the echo is cancelled and only the near end speech is sent to the far end. Good echo cancellation is essential for the quality of the voice in the network.
Echo cancellation occurs between the send-in and send-out ports, reducing the echo present in the send path. The total amount of echo attenuation that an echo canceller provides is called echo return loss enhancement (ERLE). ERLE is the difference in the echo level between the send-in and send-out ports and it is measured in dB.

[Figure 2-4 labels: far end, 4-wire, hybrid, near end, 2-wire, echo, send path, receive path, echo canceller with ports Rin, Rout, Sin and Sout.]

An echo canceller normally consists of three major building blocks:
• Adaptive filter
• Double-talk detector
• Nonlinear processor

In the next figure, we expand the echo canceller that was represented in Figure 2-4 to show the major building blocks listed above.

Figure 2-5 Block diagram of a line echo canceller

[Figure 2-5 labels: subtractor, adaptive filter, nonlinear processor, double-talk detector, error signal, echo estimate, ports Rin, Rout, Sin and Sout, far end, near end, send path, receive path.]

We now introduce two parameters that are commonly defined and used for echo cancellers as described by Figure 2-5. These parameters will also be used as inputs of our proposed algorithm. They are:
• Echo return loss (ERL): the amount of echo attenuation provided by the hybrid (Figure 2-4), that is, the attenuation of the signal from the Rout port to the Sin port of the echo canceller. The ERL is measured in dB.
• Combined loss (ACOM): the sum (in dB) of the ERL, the attenuation provided by the adaptive filter (cancellation loss) and the attenuation provided by the nonlinear processor (nonlinear processing loss); see Figure 2-5.

We now give a high level description of how the blocks shown in Figure 2-5 interact to perform the echo cancellation.

The first step in echo cancellation occurs when the signal coming from the Rin port is sampled and given to the adaptive filter. The signal then travels from the Rout port of the echo canceller to the hybrid, where most of the signal is transferred to the 2-wire loop connected to the near-end telephone.
A portion of the signal is leaked by the hybrid to the Sin port of the echo canceller. This is the echo that needs to be cancelled by the adaptive filter.

The echo path (Figure 2-4) is highly variable, so the filter that is required to perform the echo cancellation cannot be a fixed filter. In fact, the echo path must be estimated for the particular local loop to which the hybrid is connected. One option to derive the filter is to measure the impulse response of the echo path and then approximate it by a tapped delay line. However, in general the echo path is not stationary. Therefore, such measurements would have to be made repeatedly during a conversation. To eliminate the need for such measurements the filter is made adaptive. An algorithm is implemented which uses the residual error to adapt the filter to the characteristics of the local loop (Figure 2-5).

The adaptive filter computes an estimate of the echo. This estimate is then subtracted from the signal coming from the Sin port, which is composed of the echo and possibly some near end speech and noise. The resulting output is the residual echo, which is passed on to the nonlinear processor and is also fed back to the adaptive filter as the error signal. However, this error signal is truly an error signal only when there is no near end speech. If there is near end speech, the "error signal" does not accurately indicate the degree of success of the cancellation and the adaptation algorithm will not converge, resulting in a failed attempt to cancel the echo. For this reason, double talk detection is needed, so that adaptation occurs only when there is no double talk (both callers speaking simultaneously).
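The adaptation loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: the thesis does not say which adaptation algorithm the echo canceller uses, so normalized LMS (NLMS) is swapped in here as one common choice, the echo path is a made-up 4-tap impulse response, and all names (`nlms_echo_canceller`, `level_db`, `double_talk`) are hypothetical. The ERLE computed at the end is measured at the subtractor output, before any nonlinear processing.

```python
import math
import random


def nlms_echo_canceller(rin, sin_port, num_taps=8, mu=0.5, eps=1e-6, double_talk=None):
    """Cancel echo with an NLMS tapped delay line (assumed adaptation rule).

    rin: far-end samples; sin_port: samples at the Sin port.
    double_talk: optional list of booleans; adaptation is frozen where True.
    Returns the error signal (subtractor output) and the final filter taps.
    """
    weights = [0.0] * num_taps
    history = [0.0] * num_taps  # most recent far-end samples, newest first
    errors = []
    for n, (far_sample, sin_sample) in enumerate(zip(rin, sin_port)):
        history = [far_sample] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = sin_sample - echo_estimate  # residual echo (plus near-end speech)
        errors.append(error)
        if double_talk is not None and double_talk[n]:
            continue  # keep cancelling, but do not adapt during double talk
        energy = sum(h * h for h in history) + eps
        step = mu * error / energy
        weights = [w + step * h for w, h in zip(weights, history)]
    return errors, weights


def level_db(samples):
    """Average power of a block of samples, in dB."""
    power = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(power + 1e-12)


random.seed(0)
echo_path = [0.0, 0.5, -0.3, 0.1]  # hypothetical hybrid impulse response
rin = [random.uniform(-1.0, 1.0) for _ in range(4000)]
sin_port = [
    sum(echo_path[k] * rin[n - k] for k in range(len(echo_path)) if n - k >= 0)
    for n in range(len(rin))
]

errors, weights = nlms_echo_canceller(rin, sin_port)
# ERLE: echo level at Sin minus residual level after the subtractor (dB).
erle_db = level_db(sin_port[-500:]) - level_db(errors[-500:])

# With double talk flagged on every sample, the filter never adapts.
_, frozen_weights = nlms_echo_canceller(rin, sin_port, double_talk=[True] * len(rin))
```

In this noise-free toy setup the taps converge toward the simulated echo path. A real canceller would also face near-end speech, quantization noise and nonlinear echo, which is why the double talk detector and the nonlinear processor are needed.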
When the echo canceller's double talk detector senses that both the near end and far end callers are speaking at the same time, it informs the adaptive filter so that the filter can ignore the error signal that comes from the subtractor, freezing the filter adaptation. As we said before, near end speech can distort the error signal and confuse the adaptation process; for this reason adaptation is halted when double talk is detected. Of course, the echo canceller continues to cancel echo during double talk. As soon as the double talk detector senses that double talk is no longer present, it informs the adaptive filter so that it can, once again, use the error signal to adapt to the impulse response of the hybrid.

The quantization noise introduced by the PCM representation of speech samples and nonlinear echoes make it difficult for the adaptive filter to develop a perfect echo estimate. Nonlinear echoes can be caused by clipped speech signals, speech compression or poor quality speakerphones. It is extremely difficult to develop an accurate estimate of these nonlinear echoes because the echo canceller's linear impulse response model cannot represent them. Consequently, residual echo from the subtractor is reduced to an inaudible level by some nonlinear processing.

The nonlinear processor has a suppression threshold that is typically adaptive, based on the Rin and Sin signal levels. The threshold is made adaptive because, if the nonlinear processor simply blocked all signals in the send path, there would be noticeable clipping of speech. For a more detailed description of a nonlinear processor see [R 35].

3 Fuzzy Logic

3.1 Introduction to Fuzzy Logic

"A fuzzy design is an attempt to systematize the natural variations in human perception of truth and to imitate rudimentary skills of approximation" [R 29].
In other words, a fuzzy model of a system is a set of fuzzy rules (Section 3.4.4) by which the behavior of the system is approximately emulated. We will discuss the principles of fuzzy logic and how we can use it in more detail in the following sections.

The next table was extracted from reference [R 29] and shows some of the existing applications that use fuzzy logic.

Application                                                 Product
Automatic train operation (Sendai subway system, Japan)    Industrial
Nuclear reactor control (Art Fugen, Japan)                  Industrial
Home heating system (Viessmann-INFORM, Germany)             Commercial
Fingerprint classification (NIST, USA)                      Research
Camera tracking (NASA, USA)                                 Industrial
Target tracker in Patriot missile (MMES, USA)               Industrial
Autofocus still camera (Sanyo, Japan)                       Commercial
Fire detector (Cerberus, Switzerland)                       Industrial

Table 3-1 Some successful fuzzy systems

3.2 Advantages of Using Fuzzy Logic

Representing a solution with fuzzy sets generally reduces the computational requirements of the system, because a group of related data points is approximated by a few fuzzy categories. In some cases, fuzzy methodology makes possible a solution that would otherwise be impractical due to the cost of computing every single crisp data point.

The number of fuzzy sets used to represent the data gives a way of adjusting the precision level of a solution. If more fuzzy sets are used in the design, the system will require more memory and a faster CPU. At the limit, the number of fuzzy sets becomes equal to the number of crisp data points. That represents the most precise and most costly solution.

Two important characteristics of successful fuzzy systems are:
1. The fuzzy systems are simple in terms of their objective and structure, which we call a fuzzy inference system (Section 3.4.8).
2. The fuzzy systems employ solutions articulated in daily language by means of IF-THEN fuzzy rules (Section 3.4.4).
A successful fuzzy system is robust, has adjustable precision and, when compared with traditional systems of computation, is more practical and cost effective.

We have seen that the measure of how much the echo contributes to decreasing the quality of a call is a subjective measure. We will see in Chapter 4 that our proposed algorithm to evaluate the echo component of the voice quality in a VoIP system (Figure 4-2) is based on several parameters that are not precise values, but estimated values. Finally, as described in Section 3.4.2, fuzzy logic is useful for modeling approximate reasoning and is tolerant of imprecise data. In light of the two observations at the beginning of this paragraph, fuzzy logic appears to be an adequate tool for the implementation of our proposed algorithm.

3.3 Fuzzy Logic and Embedded Systems

An algorithm to evaluate the voice quality in a VoIP system or network should be a real-time algorithm in order to give operators the current voice quality in their network and a chance to react as fast as possible when the quality drops. There are two distinct approaches for where such an algorithm should run:
1. It can run inside the embedded system that processes the call.
2. It can run in a network server.

Approach one has the disadvantage that embedded systems normally have limited processing power, which is used for high priority tasks like call control, speech compression and echo cancellation. It is common for such embedded systems to be working very close to their processing capacity. An advantage of this approach is that each embedded system can take action based on the real-time results of the algorithm and try to improve its performance without having to rely on decisions made by a remote server that may even be offline for some reason.
On the other hand, approach two seems to relieve the embedded system of this processing requirement, since the algorithm would be running in a server somewhere in the network. This approach has the serious disadvantage of requiring all the embedded systems in the network to send information about each of their calls to this centralized server (which can be one or more servers). In this case, the bandwidth of the network is compromised. It also has the disadvantage of removing from the network device the ability to monitor its own voice quality, generate alarms or even attempt some self-correcting action. It should also be noticed that even in this approach some extra processing is required from the embedded systems, since they have to encode the required information and access the network in order to send it to the server.

Of course we can mix the ideas of these two approaches and try to find a compromise between network usage and the processing power required from the embedded systems. In this work, we focus on the first approach and we use the ideas of fuzzy logic to develop an algorithm that requires low processing power from the embedded system that carries the VoIP application.

3.4 Fundamentals of Fuzzy Logic

This section briefly describes some fundamental ideas of fuzzy logic that will be used in the development of our proposed algorithm. The description of the fuzzy logic tools given here is far from complete. In general, every step of the fuzzy algorithm (also called a fuzzy inference system) has several variants. For each step of the fuzzy algorithm, such as fuzzy rules (Section 3.4.4), fuzzy operations (Section 3.4.3), fuzzy implication (Section 3.4.5) and defuzzification (Section 3.4.7), we will describe only the variant that will be used in our proposed algorithm (Chapter 4). For instance, there are several proposed methods of defuzzification, but in Section 3.4.7 we will describe only the method used in our proposed algorithm.
Before going into the details of each step of a generic fuzzy inference system, we will give an overview of such a system. Generally, in introductory books on fuzzy logic such a description is given only after all steps have been described. We think there is some gain in having a first look at the whole system before the parts are explained. The next figure, extracted from [R 29], depicts a generic fuzzy inference system. The figure will later be analyzed and explained in more detail in Section 3.4.8.

Figure 3-1 Block diagram of a generic fuzzy inference system

[Figure 3-1 block labels: raw input data, input data processing, evaluation of input fuzzy variables, fuzzy inference system, evaluation of fuzzy rules, aggregation of fuzzy rule consequences, defuzzification, output (single number).]

In the next sections we give a brief introduction to fuzzy logic and describe in some detail the blocks present in Figure 3-1.

3.4.1 Fuzzy Logic Versus Boolean Logic

Some books describe fuzzy logic as an extension of Boolean logic. In this section we briefly draw some comparisons between Boolean logic and fuzzy logic.

Boolean logic consists of three elements: truth values, linguistic connectors and reasoning types. In Boolean logic, truth values are either 1 or 0. In fuzzy logic, truth is a matter of degree and truth values can range between 1 and 0 in a continuous manner. In fact, this idea of continuous variation of truth values constitutes the most outstanding difference between Boolean logic and fuzzy logic. This will be discussed in more detail in Section 3.4.2.

The other two elements that make up the theory of Boolean logic, linguistic connectors (union, intersection, negation) and modes of reasoning (such as syllogism), function in the same way for fuzzy logic and Boolean logic. However, their properties and interpretation are affected, and they are analyzed in more detail in Sections 3.4.3 and 3.4.4 respectively.
The first important definition in fuzzy logic is that of a fuzzy set, which captures exactly this idea of degree that makes fuzzy logic different from Boolean logic.

3.4.2 Fuzzy Sets and Membership Functions

In classical set theory, an element either belongs to a set or not. The characteristic function of a subset A of a set X is the indicator function $\chi_A(x) : X \to \{0, 1\}$, with domain X, which has value 1 at points of A and 0 at points of X - A, that is

$$\chi_A(x) = \begin{cases} 1, & x \in A \\ 0, & x \notin A \end{cases}$$

Using the characteristic function defined above, we can also express the set A as

$$A = \{\, x \mid \chi_A(x) = 1 \,\} = \{\, x \mid x \in A \,\}$$

As our proposed algorithm only takes real values as inputs, we will from now on consider only the case where the universe $X = \mathbb{R}$, the set of real numbers.

Motivated by the ideas of classical set theory described above, we can define a fuzzy set as a collection of elements with a varying degree of inclusion. In classical set theory the characteristic function defines a crisp boundary between the elements that belong to a set A and the elements that do not belong to A. In fuzzy theory the function that plays a role similar to the characteristic function is called the membership function. A membership function can take values in the interval [0, 1]. For a fuzzy set A, the membership function is defined as

$$\mu_A(x) : \mathbb{R} \to [0, 1]$$

That is, a membership function is a curve that defines how each point in $\mathbb{R}$ is mapped to a membership value (or degree of membership) between 0 and 1. If $\mu_A(x_1) > \mu_A(x_2)$, then "$x_1$ belongs more to A than $x_2$". In a similar way as described before for the characteristic function, we can express the fuzzy set A as

$$A = \{\, (x, \mu_A(x)) \mid x \in \mathbb{R} \,\},$$

where $(x, \mu_A(x))$ is a singleton.

As an example, suppose that all possible temperatures cover the real line $\mathbb{R}$ (imagine that there is no limit for negative temperatures). The following figure depicts a possible membership function $\mu_A(x)$ for the fuzzy set A = "good temperature to have the ECE annual barbecue".
Figure 3-2 Example of a fuzzy membership function

[Figure 3-2 plot: $\mu_A(x)$, between 0 and 1, against temperature from -10 °F to 100 °F.]

For instance, we can say that any temperature between 70 °F and 80 °F is very good for a barbecue. We can also say that a temperature between 50 °F and 60 °F will not be so good, because some people will complain that it is cold. That is why temperatures between 70 °F and 80 °F have a high value of $\mu_A(x)$ and temperatures between 50 °F and 60 °F have lower values of $\mu_A(x)$. However, it does not make much sense to say that 60 °F is a good temperature to have a barbecue, but 59 °F is not. So it makes more sense to define the set A = "good temperature to have the ECE annual barbecue" as a fuzzy set instead of a classical, crisp set.

The difference between crisp (or classical) sets and fuzzy sets is exemplified by the difference between a characteristic function and a membership function, as shown in the next figure. The membership function $\mu_A(x)$ describes the membership values for the fuzzy set A = "good temperature to have the ECE annual barbecue". The characteristic function $\chi_B(x)$ describes the set B = "temperature is equal to or larger than 30 °F and equal to or smaller than 90 °F".

Figure 3-3 Comparing a characteristic function to a membership function

It is clear that the fuzzy set A is imprecise and subjective, and that it is strongly based on human interpretation. Different temperatures $x \in \mathbb{R}$ belong to A to a certain extent. On the other hand, the set B is completely determined and precise, with a sharp boundary. Some temperatures are in B and all remaining temperatures are not.

The geometrical shape of the membership function should reflect the uncertainty in the corresponding fuzzy variable. For this reason we should not design such membership functions with a high level of detail. In our proposed algorithm we used only membership functions that are piecewise linear.
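To make the contrast between a piecewise-linear membership function and a characteristic function concrete, here is a minimal sketch. The trapezoid breakpoints (40, 70, 80 and 100 °F) are assumptions chosen only to agree with the description above (full membership between 70 °F and 80 °F, lower values between 50 °F and 60 °F); the actual curve of Figure 3-2 is not recoverable from the text. The function names are hypothetical.

```python
def trapezoid(x, a, b, c, d):
    """Piecewise-linear membership: 0 outside (a, d), 1 on [b, c], linear ramps between.

    Requires a < b <= c < d.
    """
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


def good_barbecue_temperature(temp_f):
    """Membership in the fuzzy set A (breakpoints are illustrative assumptions)."""
    return trapezoid(temp_f, 40.0, 70.0, 80.0, 100.0)


def in_crisp_set_b(temp_f):
    """Characteristic function of the crisp set B: 30 F <= temperature <= 90 F."""
    return 1 if 30.0 <= temp_f <= 90.0 else 0


for temp_f in (29, 30, 55, 59, 60, 75, 95):
    print(temp_f, in_crisp_set_b(temp_f), round(good_barbecue_temperature(temp_f), 3))
```

With these breakpoints, 59 °F and 60 °F receive nearly the same membership value in A, whereas the crisp set B jumps from 0 to 1 between 29 °F and 30 °F.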
In our work, we explore these ideas of fuzzy sets as a model of approximate reasoning for our proposed algorithm.

3.4.3 Fuzzy Set Operations

Compared to classical set theory, fuzzy set theory offers a family of set operations due to the nature of fuzzy sets. For example, the intersection between two crisp sets such as "ECE students with GPA greater than 3.0" and "ECE students that live in College Park" will be a deterministic set. On the other hand, the intersection between two fuzzy sets such as "ECE students with good GPA" and "ECE students who live close to the campus" will be a matter of degree that should be determined from the individual membership functions of the fuzzy sets.

In this section we define the intersection and union of fuzzy sets. These operations (together with fuzzy implication - Section 3.4.5) will be used to compose fuzzy rules, which will be the main logic behind our proposed algorithm.

a) Union of fuzzy sets

There are many possible ways to define the union operator in fuzzy logic and each definition potentially produces a different outcome. In our work we used the following definition for the union of fuzzy sets. Suppose we have N fuzzy sets that are represented by the membership functions

$$\mu_1(x), \mu_2(x), \ldots, \mu_N(x), \quad x \in \mathbb{R}$$

Then the union of these fuzzy sets is a new fuzzy set with a membership function $\mu_U(x)$ given by

$$\mu_U(x) = \max\left( \mu_1(x), \mu_2(x), \ldots, \mu_N(x) \right) \quad \text{for all } x \in \mathbb{R}$$

b) Intersection of fuzzy sets

Again, there are many possible ways to define the intersection operator in fuzzy logic and each definition potentially produces a different outcome. In our work we used the following definition for the intersection of fuzzy sets. Suppose we have N fuzzy sets that are represented by the membership functions

$$\mu_1(x), \mu_2(x), \ldots, \mu_N(x), \quad x \in \mathbb{R}$$

Then the intersection of these fuzzy sets is a new fuzzy set with a membership function $\mu_I(x)$ given by

$$\mu_I(x) = \min\left( \mu_1(x), \mu_2(x), \ldots, \mu_N(x) \right) \quad \text{for all } x \in \mathbb{R}$$

The fuzzy operations described above are the main building blocks of what are called composite fuzzy rules. A fuzzy system has its behavior dictated by the fuzzy rules, which are described in the next section.

3.4.4 Fuzzy Rules

The basic principles of inference in fuzzy logic are adaptations of the classical inference principles to the fuzzy domain. Fuzzy reasoning is based on inference rules of the form

IF <premise>, THEN <consequence>

similarly to classical logic, but now we use fuzzy sets instead of classical sets. As fuzzy sets define linguistic variables, fuzzy inference rules can model a system linguistically. In fact, our proposed algorithm does just that, modeling linguistically, for instance, what is a good and a bad echo signal in terms of the overall voice quality, as we will see in Chapter 4.

An example of a simple fuzzy rule is:

IF x is A, THEN y is B

where A and B are fuzzy sets. Fuzzy algorithms are just a set of fuzzy rules (which are also called IF/THEN rules). In Chapter 4, the main logic behind our proposed algorithm is described in terms of fuzzy rules and, as we will see, those fuzzy rules allow degrees of association between the inputs of the algorithm to better reflect what would be, for instance, a good and a bad echo signal with respect to the voice quality of the call.

Interpreting an IF/THEN rule involves two distinct parts:

• Evaluate the premise, that is, use membership functions to map the input values into fuzzy sets.
• Apply that result to the consequence - also known as implication.

In the next section we discuss how to compute this implication. However, before proceeding to the next section we will show in a block diagram how a set of fuzzy rules is used by the fuzzy algorithm to produce an output (extracted from [R 29]). This will help to clarify the discussions in the next sections (fuzzy implication, aggregation and defuzzification).
Figure 3-4 Set of rules being aggregated and the final defuzzification
[Figure: Rule 1, Rule 2, ..., Rule k, each of the form premise THEN consequence, feed an Aggregation block followed by a Defuzzification block that produces the output (a single number).]

If we now compare Figure 3-4 with the block diagram of a fuzzy inference system (Figure 3-1 in the beginning of Section 3.4), we see that Figure 3-4 shows in more detail what is described by the last (bottom) three blocks of Figure 3-1. Later, in Section 3.4.8, we will come back to these figures to put all the blocks together.

3.4.5 Fuzzy Implication

The fuzzy implication is a mechanism that performs the inference of a fuzzy rule. In the conditional proposition (fuzzy rule) described in the previous section we need to define how the consequence is affected by the premise. The idea here is that the consequence specifies a fuzzy set to be assigned to the output. The fuzzy implication then modifies that fuzzy set to the degree specified by the premise.

As shown in Figure 3-4, the implication process is the first step towards computing the output of the fuzzy inference system. That is, after we have computed the consequences of all fuzzy rules, which is done by the implication method, we aggregate them and defuzzify.

There are several ways to specify the fuzzy implication operator. Some of the most used implication operators are known as the Lukasiewicz, Zadeh, Larsen, Mamdani, standard and drastic product implication operators. There is an almost exhaustive list of such operators and their mathematical definitions in terms of membership functions in reference [R 29]. This same reference [R 29] also analyzes which set of operators is probably best adjusted to the characteristics of a problem.

In our proposed algorithm we first tried the Mamdani implication operator. The simulation results, after the tuning of the membership functions, were quite good.
However, we got even better results using the Larsen implication operator, which acts like a scaling operator, that is, the output membership function is scaled by a factor that is determined by the premise. The Larsen implication operator is the one chosen for our proposed algorithm.

In order to give an example of how this operator works, suppose we have a fuzzy rule like

IF x is A, THEN y is B

where the fuzzy sets A and B are respectively described by the fuzzy membership functions $\mu_A(x)$ and $\mu_B(x)$ as shown in the next figure. Then, for the Larsen implication method, if the input variable x is set to the value 3, the output fuzzy set of this rule is given by the membership function $\mu_C(x)$ as shown in the next figure.

Figure 3-5 Example of the fuzzy implication method chosen for our proposed algorithm

That is, the degree to which the particular value of the input variable x belongs to the fuzzy set A is used to scale the membership function of the output (consequence) of the rule.

3.4.6 Aggregation Operator

For a set of inputs to the fuzzy algorithm, several fuzzy rules may be used to provide the final output of the algorithm. However, in an intermediate step of the algorithm (Figure 3-4) we are required to aggregate the fuzzy output sets (or membership functions) that result from the fuzzy rules that were used. Aggregating two or more fuzzy output sets yields a new fuzzy set (or membership function) in the fuzzy algorithm.

There are a few different ways to specify the fuzzy aggregation operator. In our proposed algorithm we will use the fuzzy union (Section 3.4.3) as the aggregation operator. In order to exemplify how the aggregation of membership functions works using the fuzzy union operator, suppose that two fuzzy rules generate two membership functions $\mu_A(x)$ and $\mu_B(x)$ as described in Figure 3-6 below.
Then the aggregation of these two outputs using the union operator is given by $\mu_C(x)$ as shown in the figure below.

Figure 3-6 Example of the fuzzy aggregation method used in our proposed algorithm

3.4.7 Defuzzification

In the proposed fuzzy system for our algorithm we require a final crisp output. That is, the output of the algorithm is a number between zero and one that indicates how good the echo signal was with respect to the voice quality in the call. In order to convert a result from a fuzzy set to a crisp result we use a process that is called defuzzification.

Defuzzification is the process that selects a single value to represent the information contained within a fuzzy set. It is the final treatment to obtain a scalar that is suitable for the operation of systems in practice (Figure 3-4).

There are several methods suggested in the literature for the defuzzification process. Different methods may lead to different results. For our proposed algorithm we used what is called the centroid method (also called center of mass). In fact, this is probably the most commonly used defuzzification method. In the centroid method, the defuzzified output $x_c$ is defined by

$$x_c = \frac{\int_{-\infty}^{\infty} x \, \mu(x) \, dx}{\int_{-\infty}^{\infty} \mu(x) \, dx}$$

where $\mu(x)$ is the output membership function after the aggregation of the individual IF/THEN rules. So, for instance, in the example described in Figure 3-6, the output membership function after aggregation is $\mu_C(x)$. The defuzzification in that example is then computed as the center of mass of $\mu_C(x)$, as shown in the figure below.

Figure 3-7 Example of a defuzzification by center of mass

3.4.8 Fuzzy Inference System

In this section we put together all the ideas that we developed in the previous sections of this chapter. Fuzzy inference is the process of formulating the mapping from a given input to an output using fuzzy logic. The mapping then provides a basis from which decisions can be made.
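The chain described in Sections 3.4.5 to 3.4.7 (Larsen implication, aggregation by fuzzy union, centroid defuzzification) can be sketched numerically as follows. This is a minimal sketch, not the implementation used in this thesis: the function names, the two output sets `low` and `high`, and the firing strengths are hypothetical, and the centroid integral is approximated with a midpoint rule over [0, 1] under the assumption that the output membership functions vanish outside that interval.

```python
def larsen(firing_strength, mu_out):
    """Larsen implication: scale the output membership function by the
    degree to which the premise holds."""
    return lambda z: firing_strength * mu_out(z)


def aggregate(membership_functions):
    """Aggregation by fuzzy union: pointwise maximum of the rule outputs."""
    return lambda z: max(mu(z) for mu in membership_functions)


def centroid(mu, lo=0.0, hi=1.0, n=1000):
    """Centroid defuzzification, approximated with a midpoint rule on
    [lo, hi]; assumes mu is zero outside this interval."""
    step = (hi - lo) / n
    num = 0.0
    den = 0.0
    for i in range(n):
        z = lo + (i + 0.5) * step
        w = mu(z)
        num += z * w
        den += w
    if den == 0.0:
        raise ValueError("aggregated membership function is zero everywhere")
    return num / den


# Hypothetical piecewise-linear output sets on [0, 1], for illustration only.
def low(z):
    return max(0.0, 1.0 - 2.0 * z)


def high(z):
    return max(0.0, 2.0 * z - 1.0)


if __name__ == "__main__":
    # Two rules fired with (assumed) premise degrees 0.8 and 0.2.
    rule_outputs = [larsen(0.8, low), larsen(0.2, high)]
    print(centroid(aggregate(rule_outputs)))
```

Because the Larsen operator only rescales a membership function, a single fired rule yields the same centroid regardless of its firing strength; the firing strengths matter once several rule outputs are aggregated.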
A fuzzy inference system has a simple input-output relationship, as was shown in Figure 3-1. Input data is collected from the external world. Then it is processed by the fuzzy inference system to produce the output data to be used back in the external world.

Figure 3-8 Detailed fuzzy inference system
[Figure: raw input data is mapped to the input fuzzy variables $\mu_A(x)$, $\mu_B(y)$, $\mu_C(w)$; Rule 1 (IF x is A OR y is B THEN z is D1), Rule 2 (IF x is A AND w is C THEN z is D2), ..., Rule N (IF y is B THEN z is DN) produce the output membership functions $\mu_{D1'}(x)$, $\mu_{D2'}(x)$, ..., $\mu_{DN'}(x)$; these are aggregated into $\mu(x)$ and defuzzified into the output (a single number).]

4 A Fuzzy Algorithm to Evaluate the Echo Component of the Voice Quality in a VoIP Network

4.1 Objective Evaluation of the Voice Quality

In Section 1.1 we described the several parameters that can be used to evaluate the performance of a VoIP system or network (Table 1-1). In this work we focus on measuring the voice quality in a VoIP system or network, and in Section 1.1 we showed three different perspectives for voice quality evaluation in a VoIP network. We decided to adopt the objective evaluation of the voice quality (Section 1.1.3) for our proposed algorithm for two reasons:

• Unlike the subjective quality method (Section 1.1.2), an objective method can be automated, since it does not require human intervention or feedback. If well designed, such a method can even estimate the quality in real time.
• Unlike the network quality method (Section 1.1.1), an objective method takes into consideration the user's perspective of the call and not only parameters that qualify the performance of the IP network.

As is described in [R 5], there are three classes of objective voice quality evaluation metrics: network-parameter based metrics, psycho-acoustic metrics, and elementary metrics.

• Parameter-based metrics do not consider the actual voice signal.
Instead, these metrics sum impairment factors that characterize the individual components of the communication system. For instance, in the E-model (Section 1.1.3) the packet loss and delay in a VoIP system are translated into impairment factors. Parameter-based metrics such as the E-model hold promise for predicting subjective voice quality but still require extensive refinements and verifications.
• Psycho-acoustic metrics transform voice signals to a reduced representation to retain only perceptually significant aspects. These metrics aim to predict the subjective quality over a wide range of voice signal distortions. One example of such a metric is the PESQ algorithm (Section 1.1.3).
• Elementary objective voice quality metrics rely on low-complexity signal processing parameters and techniques to predict subjective voice quality. Elementary metrics generally have smaller correlations with subjective voice quality than highly complex psycho-acoustic metrics and do not provide the perception modeling needed for psycho-acoustic coder algorithm development. However, elementary metrics represent a good engineering tradeoff for communication and networking system researchers and developers in that they allow for fairly detailed conclusions about voice quality while having low computational complexity.

The algorithm that we propose in this chapter is a fuzzy algorithm that estimates the amount of echo present in a VoIP call after echo cancellation. Based on the descriptions given in Section 1.1 about the different perspectives for evaluating the voice quality in VoIP networks and on the objective voice quality metrics described above, we have the following diagram:

Voice quality
    • Network quality
    • Subjective voice quality
    • Objective voice quality
        • Parameter-based metrics
        • Psycho-acoustic metrics
        • Elementary metrics

Figure 4-1 Classification of voice quality algorithms for VoIP systems

So, the fuzzy algorithm proposed in this chapter estimates the echo quality factor of the voice quality. It is a building block of an objective, passive voice quality algorithm based on elementary metrics that can run in real time and estimate the voice quality for live calls in a VoIP system or network.

As we mentioned in Chapter 1, the critical issues in delivering good voice quality over IP networks are packet loss, delay, echo and jitter. These issues are all correlated, but there is a stronger correlation between jitter, delay and packet loss. Jitter in VoIP systems is normally compensated for by using a playout buffer at the receiving end, which introduces delay and additional packet loss. So we can imagine a fuzzy inference system that evaluates the voice quality in a VoIP network described by the following blocks:

Figure 4-2 Fuzzy inference system to estimate the voice quality in a VoIP network

We note that in the above figure, the block that combines the echo quality and the network quality into a final voice quality estimate may or may not be a fuzzy inference system.

We can go even further and imagine a fuzzy inference system that estimates the performance of a VoIP network (not only the voice quality - see Table 1-1) for a call.
Based on Table 1-1 and Figure 1-4 we can have such a fuzzy inference system described as follows:

Figure 4-3 Fuzzy inference system that estimates the performance of a VoIP network for a call
[Figure 4-2 blocks: an echo quality fuzzy inference system and a packet loss / delay / jitter fuzzy inference system, each fed by input parameters, produce the echo quality and the network quality; a further block combines them into the estimated voice quality. Together they form the voice quality fuzzy inference system.]
[Figure 4-3 blocks: a call setup (before and during) fuzzy inference system and the voice quality fuzzy inference system, each fed by input parameters, produce the call setup quality and the voice quality; a further block combines them into the estimated VoIP network performance for a call. Together they form the VoIP performance fuzzy inference system.]

That is, the fuzzy inference system that estimates the voice quality in a VoIP network (Figure 4-2) is a building block of the larger system described in Figure 4-3. However, this is a far more ambitious goal, and in this chapter we have a much simpler objective. In the next section we propose an echo quality fuzzy inference system, which is a component of a larger system (Figure 4-2) that can estimate the voice quality in a VoIP system or network.

4.2 Channel Based Algorithm

A VoIP system or network may carry a huge number of VoIP calls simultaneously. In fact, there are high density platforms available in the market that can handle dozens of VoIP calls per card. In this section we propose an algorithm to evaluate the quality of the echo signal for a single VoIP channel. There are many possible ways of aggregating the echo component of several channels in the network to provide an evaluation of the echo in the network as a whole, and this will be discussed later.

The objectives of the proposed algorithm have already been described in the previous sections and chapters. The following list is a summary of such objectives.
• Obtain the echo quality component of an objective voice quality algorithm based on elementary metrics to estimate the voice quality in a VoIP system (Section 4.1 and Figure 4-2).
• The algorithm must have a computational complexity low enough that it can run in an embedded module inside every VoIP channel in the VoIP system or network (Section 1.1).
• The algorithm must be able to run for live VoIP calls without the need for a reference signal, which is one of the limitations of several objective voice quality methodologies, including the PSQM (Section 1.1.3).
• The algorithm must give a real-time estimate of the echo signal by outputting a few parameters (or scores).
• The algorithm's output scores will not be as reliable and precise as the MOS (Section 1.1.2) or the PSQM / PESQ (Section 1.1.3) scores. As is common for elementary metrics methodologies of evaluating objective voice quality (Section 4.1), there will be a tradeoff between precision and computational complexity that we should be willing to accept.

In order to achieve the objectives listed above we propose the following design characteristics for the algorithm.

• Use a fuzzy inference system to estimate the echo component of the voice quality. This should give the algorithm a low computational complexity and the ability to run in real time for live calls, embedded in every VoIP channel. The usage of fuzzy logic will also result in some imprecision in the final scores.
• In order to obtain a real-time algorithm with low computational complexity, we chose to use as inputs to our fuzzy inference system parameters that are already being computed or estimated by the echo canceller, such as estimates of the ERL and ACOM (Section 2.4), speech power estimates and noise power estimates.

We should note that although the fuzzy logic implementation results in low computational complexity, it has the disadvantage of not being precise.
As we said before, it reflects approximate human reasoning; it will never be as good as the subjective MOS (Section 1.1.2) and it will not be as precise as the PSQM or PESQ.

Based on the ideas described in Chapter 3, we will now design a fuzzy inference system to evaluate the echo component of the voice quality. The first step is to define the input parameters that will be used by the fuzzy inference system. In order to keep the computational load to a minimum, we used only parameters that are already estimated and used by the echo canceller (Figure 2-5). Of course, different echo cancellers may estimate a different set of parameters, and in that case some extra computation would be needed to estimate the parameters required by the algorithm proposed here. The input parameters that we will use are:

• Echo return loss (ERL) - Section 2.4
• Combined loss (ACOM) - Section 2.4
• Receive speech power - an estimate of the speech power in the receive path (Figure 2-5)
• Receive noise power - an estimate of the noise in the receive path (Figure 2-5)
• Transmit speech power - an estimate of the speech power in the send path (Figure 2-5) after the echo cancellation
• Transmit noise power - an estimate of the noise in the send path (Figure 2-5)

Then we need to define the fuzzy sets and membership functions for each input so that we can associate the raw input parameters with a fuzzy set (Section 3.4.2). Based on the input parameters defined above, we will use the fuzzy sets described in the next table.

Fuzzy set | Description
Good ERL | Represents values of ERL that will help the echo canceller to achieve a good echo cancellation.
Bad receive speech power | The receive speech powers in this set are either too low or too high, making it difficult for the echo canceller to generate the signal that must be subtracted in the send path.
Bad transmit noise power | Represents values of the transmit noise that may disrupt the convergence of the adaptive filter.
Bad ACOM | With high probability, VoIP systems with ACOM values in this set will have echo problems and the voice quality will be bad.
Moderate ACOM | Represents values of ACOM that may indicate that the echo cancellation was not good enough and some echo may be leaked to the far end.
Good ACOM | VoIP systems with ACOM values in this set are able to cancel most of the echo in the calls.

Table 4-1 Fuzzy sets associated to the input parameters

The output of the fuzzy inference system will be an estimate of the echo component of the voice quality. We defined the following three output membership functions:

• The membership function for the fuzzy set "bad echo (be)"

$$\mu_{be}(x) = \begin{cases} 1 - 2x, & 0 \le x \le \frac{1}{2} \\ 0, & \text{otherwise} \end{cases}$$

• The membership function for the fuzzy set "moderate echo (me)"

[...]

$$\mu_{gerl}(x) = \begin{cases} \dfrac{x - 20}{10}, & 20 \le x \le 30 \\ 1, & x > 30 \\ 0, & \text{otherwise} \end{cases}$$

Graphically we have

Figure 4-5 ERL fuzzy membership function

Combined Loss (ACOM)

• The membership function for the fuzzy set "bad ACOM (bacom)"

$$\mu_{bacom}(x) = \begin{cases} 1, & x < 6 \\ \dfrac{23 - x}{17}, & 6 \le x \le 23 \\ 0, & \text{otherwise} \end{cases}$$

• The membership function for the fuzzy set "moderate ACOM (macom)"

[...]

$$\mu_{gacom}(x) = \begin{cases} \dfrac{x - 23}{17}, & 23 \le x \le 40 \\ 1, & x > 40 \\ 0, & \text{otherwise} \end{cases}$$

Graphically we have

Figure 4-6 ACOM fuzzy membership functions

Receive speech power

• The membership function for the fuzzy set "bad receive speech power (brsp)"

[...]

$$\mu_{btnp}(x) = \begin{cases} \dfrac{x + 45}{9}, & -45 \le x \le -36 \\ 1, & x > -36 \\ 0, & \text{otherwise} \end{cases}$$

Graphically we have

Figure 4-8 Transmit noise fuzzy membership function

Now we show an example of a fuzzy rule computation carried out by the fuzzy inference system proposed above. Suppose we have the following set of inputs (raw data):

ERL = 23 dB
ACOM = 28 dB
Receive speech power = -27 dBm
Transmit noise power = -50 dBm

Suppose we want to compute the fuzzy rule "IF ACOM is moderate AND ERL is good THEN echo is moderate" as described in Table 4-2. The fuzzy set "moderate ACOM"
is described by the membership function $\mu_{macom}(x)$, the fuzzy set "good ERL" is described by the membership function $\mu_{gerl}(x)$ and, finally, the fuzzy set "moderate echo" is described by the membership function $\mu_{me}(x)$. All these membership functions were previously defined in this section. Now we map the input data into the fuzzy sets:

$$\mu_{macom}(28) = \frac{36 - 28}{13} = \frac{8}{13}$$

$$\mu_{gerl}(23) = \frac{23 - 20}{10} = \frac{3}{10}$$

The fuzzy AND operator (intersection of fuzzy sets) that we are using is defined in Section 3.4.3, so the premise holds to the degree $\min\left(\frac{8}{13}, \frac{3}{10}\right) = \frac{3}{10}$. The fuzzy implication operator that we are using is the Larsen operator, so the output of this rule is given by:

$\mu(x) = \frac{3}{10}\,\mu_{me}(x) =$