ABSTRACT

Title of Dissertation: Digital Smart Health Via Physiological Signal Sensing And Learning

Xin Tian, Doctor of Philosophy, 2022

Dissertation Directed by: Professor Min Wu, Department of Electrical and Computer Engineering

Periodic blood volume changes underneath a person's skin induce subtle color variations in the skin area. These subtle changes can be captured by a noninvasive and low-cost optical technique called photoplethysmography (PPG). In this dissertation, we study the modeling of contact-based and contact-free PPG signals to facilitate their promising applications in physiological signal sensing and learning for digital smart health.

In the first part of the dissertation (Ch. 2), we propose a user-friendly and continuous electrocardiogram (ECG) measurement approach aided by contact-based PPG sensors for long-term cardiovascular health monitoring. ECG is a clinical gold standard for noninvasive cardiac diagnosis, but continuous ECG monitoring is challenging. PPG provides a low-cost alternative, although the clinical knowledge and practice of applying PPG directly to cardiovascular health management are considerably less developed than those for ECG. How can we leverage the advantages of these two measurement modalities for better and easier healthcare? We first study the physiological and signal relationship between PPG and ECG, and then infer the ECG waveform from PPG based on that relationship. Joint dictionary learning frameworks are proposed to learn the mapping that relates the sparse-domain coefficients of each PPG cycle to those of the corresponding ECG cycle. This line of research has the potential to fully utilize the easy measurability of PPG and the rich clinical knowledge of ECG for better preventive healthcare.

In the second part of the dissertation (Ch. 3), a physiological digital twin for personalized and continuous cardiac monitoring is developed.
Using our proposed dictionary learning based framework as the backbone model, this part of the dissertation focuses on the problem of inferring ECG signals from PPG signals under realistic conditions where available ECG data are scarce. With transfer learning, a generic digital twin model learned from a large set of paired ECG and PPG data is fine-tuned to precisely infer the ECG from the PPG of a target participant whose available ECG data are scarce. Experimental results validate the feasibility of using the proposed method to learn a reliable digital twin for precision continuous cardiac monitoring. Causality-inspired neural network models using the conditional variational autoencoder are also explored, based on the underlying physiological process of ECG generation, for better explainability and more flexibility in transfer learning.

In the third part of the dissertation (Ch. 4 and Ch. 5), we present contactless methods for measuring blood oxygen saturation from remote PPG signals captured by regular RGB smartphone cameras. Both a principled signal processing based method and a data-driven neural network based method are proposed for blood oxygen estimation, by either explicitly or implicitly extracting features from multi-channel skin color signals with color channel mixing and temporal analysis. Experimental results show that our proposed methods achieve more accurate blood oxygen estimates than traditional methods that use only two color channels, as well as prior art.

DIGITAL SMART HEALTH VIA PHYSIOLOGICAL SIGNAL SENSING AND LEARNING

by Xin Tian

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2022

Advisory Committee:
Professor Min Wu, Chair/Advisor
Professor Behtash Babadi
Professor Furong Huang
Professor Chau-Wai Wong
Professor Sushant M. Ranadive
Professor Guodong (Gordon) Gao, Dean's Representative

© Copyright by Xin Tian 2022

To My Parents, My Family, and Shoujing

Acknowledgment

Memory brings me back to the spring of 2018, when I was thrilled to begin my Ph.D. journey in Professor Wu's group. Several memorable years went by quickly, and I have come to this final stage of pursuing my Ph.D. degree with a more mature, humble, and grateful heart. Looking back, I owe my gratitude to all the people who have made this dissertation possible and made my graduate life a precious experience to reflect on in the future.

First of all, I'd like to express my appreciation to my advisor, Prof. Min Wu, for providing me with invaluable opportunities to work on challenging and meaningful projects over the past years. Prof. Wu is always supportive and has been a role model from whom I not only gain expertise but also learn to take ownership and step out of comfort zones with broader visions, curiosity, courage, and enthusiasm. It is a pleasure to have such an extraordinary and encouraging advisor, without whose professional expertise and wise guidance this dissertation could only have been a distant dream.

I want to thank Prof. Behtash Babadi, Prof. Furong Huang, Prof. Chau-Wai Wong, Prof. Sushant Ranadive, and Prof. Guodong Gao for agreeing to serve on my committee and providing invaluable insights for the dissertation. I also want to thank all the professors at the University of Maryland who taught me courses that helped pave a foundation for my subsequent research. In particular, I'd like to thank Prof. K. J. Ray Liu, Prof. Behtash Babadi, Prof. Alexander Barg, and Prof. Min Wu, from whom I learned the key signal processing and machine learning courses and received encouragement and support. I would also like to acknowledge the tremendous help, insightful discussions, and constructive feedback I have received from Prof. Chau-Wai Wong, Prof. Sushant Ranadive, Dr.
Clifton Watt, and visiting scholar Prof. Yuenan Li on multiple collaborative projects.

I am sincerely thankful to the members of the MAST-UMD group who have greatly helped me: Dr. Qiang Zhu, Dr. Mingliang Chen, Mr. Zachary Lazri, Mr. Fakai Wang, Mr. Ashira Jayaweera, Ms. Yiqi Li, and Mr. Carl Steinhauser. Dr. Zhu in particular patiently helped me when he introduced and guided me through my first research project. I would also like to acknowledge the help from graduate students Ms. Sara Mascone and Ms. Emily Blake in Prof. Sushant Ranadive's lab at the School of Public Health, and Ms. Joshua Mathew and Ms. Jisoo Choi from North Carolina State University in Prof. Chau-Wai Wong's lab.

Part of the research in this dissertation has been supported by the U.S. National Science Foundation under ECCS#2030502 and #IIS-2124291 and a UM Venture COVID Challenge Award, for which I am grateful. Any opinions, findings, conclusions, or recommendations expressed in this dissertation and the related publications are those of the author(s) and do not necessarily reflect the views of the funding agencies.

Words cannot express the gratitude I owe to my family, who have always stood by me and pulled me through seemingly impossible challenges at times. I would also like to thank Shoujing, who gives me unconditional support, encouragement, care, and company during those bittersweet times in life. I dedicate this dissertation to them. I want to give myself a cheer for the perseverance, courage, and passion to complete this journey with flying colors and memories to cherish. Last but not least, I offer my regards and blessings to all who supported me in all aspects during my graduate life!

Table of Contents

Dedication
Acknowledgments
Table of Contents
List of Tables
List of Figures

Chapter 1: Introduction
1.1 Background
1.1.1 Photoplethysmography (PPG) and Remote PPG
1.1.2 Electrocardiogram (ECG) and Its Physiological and Signal Relation With PPG
1.1.3 Digital Twins
1.1.4 Blood Oxygen Saturation
1.2 Main Contributions
1.2.1 Cross-domain Joint Dictionary Learning for ECG Inference from PPG
1.2.2 Never-Miss-A-Beat: A Physiological Digital Twins Framework for Cardiovascular Health
1.2.3 Noncontact Hand Video Based SpO2 Monitoring Using Smartphone Cameras

Chapter 2: Cross-domain Joint Dictionary Learning for ECG Inference from PPG
2.1 Motivation and Problem Formulation
2.2 Related Work
2.2.1 ECG reconstruction from PPG
2.2.2 Dictionary learning
2.3 Proposed Methods
2.3.1 Signal Preprocessing
2.3.2 Cross-domain Joint Dictionary Learning (XDJDL)
2.3.3 Label Consistent XDJDL (LC-XDJDL)
2.4 Experimental Evaluation
2.4.1 Dataset
2.4.2 Metrics for Evaluation
2.4.3 Overall Morphological Reconstruction
2.4.4 Subwave Morphological Reconstruction
2.4.5 Time Interval Recovery
2.5 Discussions
2.5.1 Result Using PPG-based Segmentation Scheme
2.5.2 Evaluation on the Capnobase TBME-RR Dataset
2.5.3 Feasibility Analysis of The Proposed Method for The Internet-of-Healthcare-Things (IoHT)
2.5.4 Limitations of The Proposed Method
2.5.5 Future Work Towards Explainable AI
2.6 Chapter Summary

Chapter 3: Never-Miss-A-Beat: A Physiological Digital Twins Framework for Cardiovascular Health
3.1 Digital Twins Relating PPG and ECG Sensing: Motivation and Problem Formulation
3.2 Related Background
3.3 Methodology
3.3.1 Backbone Model for ECG Inference from PPG
3.3.2 Transfer Learning for Building Precision Healthcare Digital Twins
3.3.3 Testing Modes for ECG Inference
3.4 Experimental Results Using XDJDL as The Backbone For The Personalized Digital Twin Model
3.4.1 Dataset
3.4.2 Hyperparameters Selection
3.4.3 Performance of ECG Inference
3.5 Discussions for XDJDL-based Personalized Digital Twin Model
3.5.1 Results Based on PPG Segmentation Scheme
3.5.2 Performance Evaluation for Long Time Scale Data
3.6 Using Neural Networks as The Backbone for ECG Inference from PPG to Build Digital Twins
3.6.1 A Retrospect: The Physiological Process Behind PPG and ECG Generation
3.6.2 Conditional Variational Autoencoder (CVAE) for PPG-to-ECG Inference
3.6.3 Transfer Learning to Build Personalized Digital Twin for Cardiovascular Monitoring
3.7 Incorporating Causality into CVAE Model Based on Structural Causal Model (SCM)
3.7.1 Importance of Incorporating Causality into Machine Learning Algorithms and Structural Causal Model
3.7.2 Causal CVAE Model for PPG-to-ECG Inference
3.7.3 ECG Reconstruction Performance of Personalized Digital Twins
3.7.4 Intervention Experiment
3.8 Chapter Summary

Chapter 4: A Multi-Channel Ratio-of-Ratios Method for Noncontact Hand Video Based SpO2 Monitoring Using Smartphone Cameras
4.1 Related Work
4.1.1 Contact-based SpO2 measurement using smart devices
4.1.2 Noncontact SpO2 measurement using cameras
4.2 Ratio-of-ratios (RoR) Model for Noncontact SpO2 Measurement
4.3 Proposed Multi-Channel RoR Method
4.3.1 ROI Localization and Spatial Combining
4.3.2 rPPG Extraction and HR Estimation
4.3.3 Feature Extraction
4.3.4 Regression and Postprocessing
4.4 Experimental Results
4.4.1 Data Collection
4.4.2 Performance Metrics
4.4.3 Results From Proposed Algorithm
4.4.4 Ablation Study of Proposed Pipeline
4.4.5 Leave-One-Out Experiments
4.5 Discussions
4.5.1 Performance on Contact SpO2 Monitoring
4.5.2 Resilience Against Blurring
4.5.3 Limitations and Further Verification with Intermittent Hypoxia Protocols
4.6 Chapter Summary

Chapter 5: Optophysiological Model Guided Neural Networks for Contactless Blood Oxygen Estimation From Hand Videos
5.1 Introduction
5.2 Proposed Optophysiology-Guided Neural Network Method for Estimating SpO2 From Videos
5.2.1 Extraction of Skin Color Signals
5.2.2 Neural Network Architectures
5.3 Experimental Results
5.3.1 Dataset and Capturing Conditions
5.3.2 Participant-Specific Results
5.3.3 Leave-One-Participant-Out Results
5.3.4 Ablation Studies
5.4 Discussions
5.4.1 Contact-based Dataset Testing
5.4.2 Ability to Track SpO2 Change
5.4.3 Visualizations of RGB Combination Weights
5.5 Chapter Summary

Chapter 6: Conclusions and Future Perspectives

Bibliography

List of Tables

2.1 Comparison of different ECG sensing techniques.
2.2 Composition of the collected mini-MIMIC-33 dataset.
2.3 Configurations of the models implemented for comparison of ECG reconstruction performance.
2.4 Quantitative performance comparison for ECG waveform inference.
2.5 Numerical comparison of ECG signal inference performance among XDJDL, LC1-XDJDL, and LC2-XDJDL methods.
2.6 Comparison of subwave reconstructions in the mean of ρ and rRMSE.
2.7 Comparison of timing interval recovery accuracy in MAE.
2.8 Quantitative comparison of different segmentation schemes.
2.9 Quantitative performance comparison for ECG waveform inference using the Capnobase TBME-RR database.
2.10 Computational resources consumed to reconstruct test ECG cycles using the proposed XDJDL method.
3.1 A research review of the dataset and its split method used by the emerging technologies for ECG waveform inference from continuous PPG.
3.2 Numerical results for each group of the mini-MIMIC-127 dataset and overall groups using transfer learning and baseline models.
3.3 Comparison of different segmentation schemes for ECG inference in numerical results.
3.4 The data collection time stamps for the participant during a week.
3.5 The personalized digital twin performance of different learning and evaluation schemes for the self-collected dataset.
3.6 The results using vanilla CVAE as the backbone model for the inferred ECG of each group in the mini-MIMIC-127 dataset.
3.7 The results from the proposed causal CVAE as the backbone model for the inferred ECG of each group in the mini-MIMIC-127 dataset.
3.8 The comparison for the subwave amplitude and intervals from each cycle of the inferred ECG and intervened ECGs.
4.1 Numerical results of the proposed method.
4.2 Configurations for the ablation study of the proposed pipeline.
4.3 Testing results of leave-one-participant-out and leave-one-session-out experiments.
4.4 Comparison of the proposed algorithm in both contact and contact-free SpO2 estimation settings.
4.5 Results for adding Gaussian blurring effect on hand videos.
5.1 Performance comparison of each model structure for participant-specific experiments.
5.2 Performance comparison of each model structure in leave-one-participant-out experiments.
5.3 Numerical results of the ablation studies for Model 1 in the leave-one-participant-out mode.
5.4 Experimental results of proposed methods on a contact-based video SpO2 dataset.

List of Figures

1.1 PPG measurement in both contact and contactless methods.
1.2 Formation of PPG signal according to the light-tissue interaction. Figures modified from [156] and [135].
1.3 Association between the electrical and mechanical activities of the heart and the simultaneous blood flow dynamics represented by ECG and PPG, respectively. The heart images are adopted from Servier Medical Art [126].
1.4 Spectrograms of PPG and ECG manifesting the route to discover and model their relationship from PPG to ECG.
1.5 Extinction coefficient curves of hemoglobin. Figure reproduced based on [40, 108].
2.1 Ambulatory and portable/wearable ECG devices (from left to right): Holter monitor, Zio Patch, KardiaMobile, and Apple Watch. Images in this figure are from [39, 66, 76, 138].
2.2 Illustration of the proposed joint dictionary learning based framework for ECG inference from PPG.
2.3 Qualitative comparison of the ECG signals inferred by different approaches.
2.4 Fiducial points and intervals in an ECG cycle.
2.5 Comparison of subwave reconstruction performance in boxplots.
2.6 Block diagram for RLS algorithm.
2.7 ECG reconstruction performance comparison before and after PPG denoising for motion artifact removal.
3.1 Never-miss-a-beat framework illustration.
3.2 Recapitulation of the XDJDL model.
3.3 Flowcharts for the transfer learning pipeline and baseline pipelines for comparison including mixed learning and leave-N-out.
3.4 The two testing modes that we examine for the learned digital twin model.
3.5 Composition of the mini-MIMIC-127 dataset.
3.6 The validation performance regarding the selection of dictionary size.
3.7 Performance comparison between transfer learning and baseline models in boxplots.
3.8 Qualitative comparison of the ECG signals inferred in different modes.
3.9 Qualitative comparison of different segmentation schemes for ECG inference.
3.10 Experimental setup for the self-collected PPG and ECG database.
3.11 The breakdown of everyday performance from the three learning and evaluation schemes.
3.12 The ECG and PPG signal generation paths during heartbeats considering the originating impulses from the heart.
3.13 The training and testing process of CVAE. The illustration is adopted from [41].
3.14 The vanilla CVAE model as the backbone for ECG inference from PPG.
3.15 The overall performance comparison using XDJDL and CVAE as the backbone models.
3.16 Illustration of causal representation learning for ECG inference.
3.17 The proposed causal CVAE architecture.
3.18 The learned DAG adjacency matrix and the corresponding DAG for a subject from the cardiac young group.
3.19 Distributions of the difference between the inferred ECG and the intervened ECGs for each evaluation metric, showing the intervention impact of the latent causal representation.
3.20 Visualization of the inferred ECG and intervened ECGs after tuning Nodes 3, 7, and 6, respectively.
4.1 System illustration for the SpO2 prediction using the smartphone captured hand videos.
4.2 HR tracking performance of AMTC and baseline algorithms, showing the superiority of AMTC.
4.3 Fitzpatrick skin types [10].
4.4 Experimental setup for data collection of hand videos and reference signals using an oximeter.
4.5 Predicted SpO2 signals for all participants using SVR when the palm is facing the camera.
4.6 Boxplots for comparison of results between different regression methods, sides of the hand, and skin tones.
4.7 Results for the ablation study of the proposed method.
4.8 Illustration of blurring effects using different blurry levels.
4.9 Experimental setup for the intermittent hypoxia protocol.
4.10 Illustration for the intermittent hypoxia (IH) protocol.
4.11 Hand images from participants using the IH protocol.
4.12 Comparison of the distributions of SpO2 collected using the breath-holding protocol and the intermittent hypoxia protocol.
4.13 Comparison of the correlations between HR and SpO2 from the breath-holding protocol and the intermittent hypoxia protocol.
4.14 Predicted SpO2 signals using SVR for all participants from the IH protocol.
5.1 Proposed neural network based contactless SpO2 estimation method.
5.2 Proposed network structures for predicting SpO2 levels from skin color signals.
5.3 Illustration of two hand-video capturing positions.
5.4 Overview of the breathing protocol and the distribution of SpO2 in the collected dataset.
5.5 Predicted SpO2 signals in training, validation, and test cases.
5.6 Boxplots comparing distributions of correlations between different skin tones and sides of the hand.
5.7 Posterior distribution of the difference of group means of an undecided case of the Bayesian statistical test.
5.8 Histograms of correlation values demonstrating the ability of the proposed neural networks to track SpO2.
5.9 Learned RGB channel weights demonstrating the alignment between the neural network and the underlying optophysiological model.

Chapter 1

Introduction

1.1 Background

Health is an essential part of living a happy life, and the recent pandemic has prompted many of us to pay closer attention to it.
We check our vital signs and other body signs to monitor our health status in different scenarios. For example, our blood pressure and blood oxygen level [114] are measured when we visit a clinic. We check our heart rate and/or breathing rate during exercise via contact sensing, such as a fitness watch [101], or contactless sensing [25, 27, 87, 171]. These are all applications of digital health in our daily lives. One of the key technologies behind these applications is photoplethysmography.

1.1.1 Photoplethysmography (PPG) and Remote PPG

Photoplethysmography (PPG) is a low-cost, user-friendly, and noninvasive optical technique that measures the periodic change of blood volume in the microvascular bed of tissue at the pace of the heartbeat. It can be obtained from an optoelectronic device clipped to a person's fingertip or from other wearable devices such as an Apple Watch, as shown in the "contact PPG" part of Fig. 1.1.

Figure 1.1: PPG measurement in both contact and contactless methods.

Figure 1.2: (a) Anatomical cross-section structure of human skin tissues and transmitted light captured by a detector when the skin is illuminated by a light source. (b) Variations in light attenuation by tissue, illustrating the rhythmic effect of arterial pulsation. Figures modified from [156] and [135].

PPG has become a common modality for heart activity monitoring in clinics, hospitals, and homes for healthcare and fitness purposes [7]. As illustrated in Fig. 1.2(a), the measurement of PPG requires a light source to illuminate the tissue and a photodetector to receive the light transmitted or reflected by the tissue. During each cardiac cycle, blood is first pumped into the body so that the blood volume increases in the capillaries in the skin, which causes increased light absorption. Then, as the blood travels back to the heart via the venous network, the light absorption at the capillaries decreases. Therefore, as shown in Fig.
1.2(b), the PPG signal is composed of an "AC" component, which reflects the cardiac change with each heartbeat, and a "DC" component, which contains information including respiration, venous flow, and thermoregulation [135].

To facilitate long-term and comfortable sensing, advances in video signal processing, computer vision, and artificial intelligence have opened up opportunities to use camera-captured video to monitor a person's health-related vital signs remotely, as illustrated in the "remote PPG" part of Fig. 1.1. This technology is commonly referred to as remote PPG (or rPPG) and was first proposed by Verkruysse et al. [152]. The basic principle of rPPG is to illuminate the tissue and to use the camera as the receiving sensor to capture the re-emitted light from the tissue. The captured video contains the periodically varying light absorption in the microvascular bed underneath the skin, and thus can convey information about the cardiovascular system. rPPG has been utilized to monitor important physiological parameters, including heart rate [37, 91, 148, 152, 157, 176], breathing rate [27, 28, 115], heart rate variability [47, 69, 100, 115], blood pressure [71], and blood oxygen saturation [80].

In this dissertation, we study two application directions of contact and contactless PPG: one is electrocardiogram waveform inference from the contact PPG waveform and its application and contribution to the emerging digital twin technology (Ch. 2 and Ch. 3); the other is noncontact blood oxygen saturation measurement using remote PPG captured by RGB cameras (Ch. 4 and Ch. 5).

1.1.2 Electrocardiogram (ECG) and Its Physiological and Signal Relation With PPG

Cardiovascular diseases (CVDs) have become a leading cause of death globally. According to alarming reports from the World Health Organization, an estimated 17.9 million people died from CVDs in 2019, representing 32% of all global deaths [20].
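Before turning to the ECG, the rPPG principle sketched in Sec. 1.1.1 can be made concrete with a toy example. The sketch below is not the pipeline developed in this dissertation; it assumes a pre-cropped skin region supplied as a NumPy array of RGB frames, spatially averages the green channel per frame, band-passes the trace to a plausible heart-rate band, and reads the heart rate off the FFT peak. The function name and parameter choices are illustrative.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def rppg_from_frames(frames, fps, lo=0.7, hi=4.0):
    """Toy rPPG pipeline on a pre-cropped skin region.

    frames: array of shape (n_frames, H, W, 3), RGB.
    Returns the filtered rPPG trace and an FFT-based heart-rate
    estimate in beats per minute.
    """
    # Spatially average the green channel, which typically carries
    # the strongest pulsatile component in RGB video.
    g = frames[..., 1].mean(axis=(1, 2))
    g = g - g.mean()                      # remove the "DC" level
    # Band-pass to the plausible heart-rate band (lo..hi Hz).
    b, a = butter(3, [lo / (fps / 2), hi / (fps / 2)], btype="band")
    rppg = filtfilt(b, a, g)
    # Heart rate from the dominant spectral peak within the band.
    spec = np.abs(np.fft.rfft(rppg))
    freqs = np.fft.rfftfreq(len(rppg), d=1.0 / fps)
    band = (freqs >= lo) & (freqs <= hi)
    hr_bpm = 60.0 * freqs[band][np.argmax(spec[band])]
    return rppg, hr_bpm
```

For a 10-second clip at 30 fps whose green channel pulses at 1.2 Hz, the sketch would report roughly 72 beats per minute; real skin videos additionally require ROI tracking and motion-robust channel combining, which the methods in Ch. 4 and Ch. 5 address.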
However, some CVDs, such as heart muscle dysfunction, show no obvious symptoms in the early stage. The presence of symptoms usually indicates the onset of heart failure. A study conducted on the aged population shows that around one-third to one-half of heart attacks are clinically unrecognized [38]. The unawareness of diseases makes some patients miss the opportunities of receiving an early medical intervention. Electrocardiogram (ECG) is a widely-used clinical gold-standard for the detection of ir- regular heart rhythms, cardiovascular diseases, and determination of how certain heart disease treatments are working in a painless and noninvasive manner. By measuring the electrical activ- ity of the heartbeat and conveying information regarding heart functionality, timely and continu- ous ECG monitoring is proven to be beneficial for the early detection of CVDs [120, 134]. The clinical standard ECG measurement device is the 12-lead ECG monitor that is commonly seen in the health care provider?s office, a clinic, or a hospital room. It sensitively picks up electrical potential changes spread in the heart from the skin during a cardiac cycle from 12 different per- spectives by attaching ten electrodes with sticky patches to each of the limbs and six positions 4 Figure 1.3: Association between the electrical and mechanical activities of the heart and the simultaneous blood flow dynamics represented by ECG and PPG, respectively. The heart images are adopted from Servier Medical Art [126]. across the chest of the patient. A normal ECG waveform and the corresponding electrical and mechanical activities in the heart are shown in Fig. 1.3. Specifically, different phases in one cardiac cycle progress as fol- lows [8, 60]: A cardiac cycle begins with the atria depolarization and systole triggered by the heart?s pacemaker at the sinoatrial (SA) node, represented by the P-wave of ECG. 
The electrical impulse then spreads across the atria to the atrioventricular (AV) node and proceeds to the ventricular walls through the bundle of His to initiate ventricular contraction, which is recorded by the QRS complex of the ECG. After the ventricles are completely activated, they start to repolarize (return to the resting electrical state) and relax, and the T-wave in the ECG depicts this phase. Finally, both the atria and ventricles complete repolarization and a new cycle is about to start.

As a dynamically involved system, the electrical stimulus that spreads through the heart drives the orderly contraction and relaxation of the heart muscles, pumping blood into the vessels and peripheral ends where it can be captured by PPG. As a result, the dynamics of blood flow are coupled with the transmission of the electrical stimulus throughout the heart, indicating that PPG and ECG represent the same cardiac process measured in different signal sensing modalities. As shown in Fig. 1.3, the ascending slope of PPG is caused by the ventricular contraction, represented by the QRS complex in an ECG cycle, which pumps blood into the vessels and microvasculature and correspondingly increases the blood volume there [7]. The descending slope of PPG forms when the blood flows back from the body toward the heart during ventricular repolarization and relaxation, represented by the T-wave and the P-wave of the next cycle.

From the signal processing perspective, we view the ECG as the source signal and the PPG as the signal on the receiver side of our cardiovascular system. The electrical cardiac activity causes our heart to beat, followed by variations of the aortic blood pressure and the blood circulation all over our body, which includes the peripheral site used to measure the PPG signal. The system from ECG to PPG can be treated as an equivalent lowpass filter given the time-frequency representation of the two signals shown in Fig.
1.4, which also inspires a way to model the relationship from PPG to ECG as an inverse-engineering problem. In particular, the low-frequency component of ECG can be recovered via some inverse filtering process from the low frequencies of PPG, and the high-frequency part of the ECG signal can be reconstructed by examining the correlation between the high- and low-frequency parts of ECG. In our previous work, the transition from PPG to ECG (route (a) + (b) in Fig. 1.4) in the frequency domain is characterized by a linear transform relating PPG and ECG in the DCT domain [174, 175] that embodies the underlying electrical, biomechanical, and optophysiological principles.

Figure 1.4: Spectrograms of PPG and ECG manifesting the route to discover and model their relationship from PPG to ECG. (a) The low-frequency component of ECG can be recovered via some inverse filtering process from the low frequencies of PPG. (b) The high-frequency part of the ECG signal can be reconstructed by examining the correlation between the high- and low-frequency parts of ECG.

1.1.3 Digital Twins

Accompanying the industrial revolutions over the past several centuries have been four eras of healthcare revolutions [86, 88, 137], the most recent of which is just beginning. Technological advances have enabled improvements in patient care and monitoring by introducing portable and affordable devices such as pulse oximeters [88], as well as communication and computing infrastructures that make telehealth and remote care possible, enabling people to obtain service from the comfort of their homes [88, 137]. In addition, precision health takes various types of patient-specific information into account, enabling personalized monitoring for preventative care, early detection of diseases, and individualized treatment(s) with potentially improved outcomes. The digital twin is a promising paradigm toward realizing precision health.
As a digital representation of a physical artifact that facilitates the monitoring of the artifact's status [16], the notion of a digital twin was introduced by Michael Grieves in his 2002 presentation on product life cycle management [54, 55] and adopted by the U.S. National Aeronautics and Space Administration in its aerospace missions [48, 51]. Broadening the scope of digital twins to healthcare has the potential of producing fine-grained, tailor-built models of the biological phenomena relating to an individual's health [16].

Given the high anatomical complexity of human bodies, it is nearly impossible to build one digital twin model that accounts for all aspects of health needs. Thus, we apply the common engineering strategy of divide-and-conquer to cluster digital twins for healthcare into the following categories:

1. Organ and structural level digital twins [3, 16]: a major application of these digital twins is to build models that help clinicians understand an individual patient's organ structure, support surgery planning, and reduce treatment risks and uncertainties;

2. Omics level digital twins [16]: these digital twins relate genome and other omics information to patients' health risks, reactions to drugs, and other high-level health conditions;

3. Physiology level digital twins [16, 95]: these digital twins leverage sensing technologies and data science to facilitate the monitoring, analysis, and management of a patient's health conditions on demand and/or at a chosen time scale.

We focus on the physiological digital twin in this work. One representative prior work in this area [16] describes a digital twin as a system for providing fast, accurate, and efficient medical simulation, which consists of a physical object, a virtual object/model, and healthcare data [95].
A physical object can be a medical or wearable device for monitoring a person's health, and the healthcare data can include data from wearable devices, real-time monitoring data, and simulation data from digital models. While increasingly large amounts of data are becoming available to potentially support data-driven approaches, we believe that healthcare digital twin research should strive to construct explainable digital twin models in light of the complex ethical and social aspects of healthcare.

1.1.4 Blood Oxygen Saturation

Peripheral blood oxygen saturation (SpO2) is the ratio of oxygenated hemoglobin to total hemoglobin in the blood, which serves as a vital health signal for the operational functions of organs and tissues [131]. The normal range of SpO2 is 95% to 100% [106]. Abnormality in the SpO2 level can serve as an early warning sign of respiratory diseases [106]. The estimation and monitoring of SpO2 are essential for the assessment of lung function and the treatment of chronic pulmonary diseases. They have become increasingly important during the COVID-19 pandemic, in which many patients have experienced "silent hypoxia," a low level of SpO2 even before obvious breathing difficulty is observed [35, 130, 145]. The vulnerable population with a high possibility of infection is recommended to monitor their oxygen status continuously for early COVID-19 detection [130, 141].

Figure 1.5: Extinction coefficient curves of oxygenated (HbO2) and deoxygenated (Hb) hemoglobin over the 250-1050 nm wavelength range, spanning the blue, green, red, and infrared bands; figure reproduced based on [40, 108]. The difference between oxygenated and deoxygenated hemoglobin at the red and blue wavelengths means these color channels contain most of the useful information for SpO2 prediction.

Pulse oximeters have been widely used for SpO2 measurement at home and in hospitals in the form of a finger clip [127, 159], which adopts the principle of the ratio of ratios (RoR) that was
first proposed by Aoyagi in the early 1970s [127]. The RoR principle is based on the different optical absorption rates of oxygenated hemoglobin (HbO2) and deoxygenated hemoglobin (Hb) at the 660 nm (red) and 940 nm (infrared) wavelengths, as indicated in Fig. 1.5. When red and infrared light illuminate the fingertip, the more oxygenated hemoglobin in the blood, the less infrared light and the more red light are received by the detector after transmission. In other words, the relative AC and DC amplitudes between the red and infrared PPG contain the pulsatile information needed to derive SpO2.

The gold standard for measuring blood oxygen saturation is blood gas analysis, which is invasive and painful and requires well-trained healthcare providers to perform the test. In contrast, the pulse oximeter is noninvasive and provides readings in nearly real time, and is therefore better tolerated and more convenient for daily use. The pulse oximeter is known to have a deviation of ±2% from the gold standard when the blood oxygen saturation is in the range of 70% to 99% [114], which is well known and accepted in clinical use.

1.2 Main Contributions

In this dissertation, we study the modeling of contact and contactless PPG signals to facilitate their promising applications in cardiovascular signal and vital sign sensing and learning for digital health. First, we explore the potential of user-friendly and continuous electrocardiogram (ECG) monitoring with the help of fingertip PPG sensors in Ch. 2. Next, we develop a physiological digital twin for personalized continuous cardiac monitoring in Ch. 3. Last, we study noncontact methods of blood oxygen saturation (SpO2) monitoring from remote PPG signals captured by smartphone cameras. Both a principled signal processing method and a neural network based method, for explicit (handcrafted) and implicit (data-driven) feature engineering from multi-channel color signals, are proposed in Ch. 4 and Ch.
5, respectively. Below are the detailed key contributions of this dissertation research.

1.2.1 Cross-domain Joint Dictionary Learning for ECG Inference from PPG

The inverse problem of inferring the clinical gold-standard electrocardiogram (ECG) from the photoplethysmogram (PPG), which can be measured by affordable wearable internet-of-healthcare-things (IoHT) devices, is a research direction receiving growing attention. It combines the easy measurability of PPG with the rich clinical knowledge of ECG for long-term continuous cardiac monitoring. The prior art for reconstruction using a universal basis, such as the discrete cosine transform (DCT), has limited fidelity for uncommon ECG waveform shapes due to its lack of representative power. To better utilize the data and improve data representation, in Ch. 2 we design two dictionary learning frameworks, cross-domain joint dictionary learning (XDJDL) and label-consistent XDJDL (LC-XDJDL), to further improve the ECG inference quality and enrich the PPG-based diagnosis knowledge. Building on the K-SVD technique, our proposed joint dictionary learning frameworks largely extend the expressive power by simultaneously optimizing a pair of signal dictionaries for PPG and ECG together with the transforms that relate their sparse codes and disease information. The proposed models are evaluated on a variety of PPG and ECG morphologies from benchmark datasets that cover various age groups and disease types. The results show that the proposed frameworks achieve better inference performance than previous methods, suggesting an encouraging potential for ECG screening using PPG based on the proactively learned PPG-ECG relationship. By enabling the dynamic monitoring and analysis of the health status of an individual, the proposed frameworks contribute to the emerging digital twin paradigm for personalized healthcare.
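The K-SVD building block mentioned above centers on a rank-1 SVD update of one dictionary atom and its nonzero coefficients. The following is our own minimal, illustrative sketch of that update (function name and test data are assumptions, not the dissertation's code):

```python
# Minimal sketch of one K-SVD atom update: restrict to the samples that use
# atom k, form the residual without atom k's contribution, and replace the
# atom and its coefficient row with the best rank-1 fit of that residual.
import numpy as np

def ksvd_atom_update(X, D, A, k):
    """Update atom D[:, k] and coefficient row A[k] in place; returns D, A."""
    omega = np.nonzero(A[k])[0]                 # samples whose codes use atom k
    if omega.size == 0:
        return D, A                             # unused atom: nothing to update
    R = X[:, omega] - D @ A[:, omega] + np.outer(D[:, k], A[k, omega])
    U, S, Vt = np.linalg.svd(R, full_matrices=False)
    D[:, k] = U[:, 0]                           # unit-norm updated atom
    A[k, omega] = S[0] * Vt[0]                  # matching coefficients
    return D, A

# Toy check: perturb one atom of an exact factorization; the SVD update can
# only reduce (or keep) the reconstruction error, since rank-1 SVD is optimal.
rng = np.random.default_rng(2)
D0 = rng.standard_normal((20, 30)); D0 /= np.linalg.norm(D0, axis=0)
A0 = np.zeros((30, 50))
for j in range(50):
    A0[rng.choice(30, 3, replace=False), j] = rng.standard_normal(3)
X = D0 @ A0
D, A = D0.copy(), A0.copy()
D[:, 4] = rng.standard_normal(20); D[:, 4] /= np.linalg.norm(D[:, 4])
err_before = np.linalg.norm(X - D @ A)
D, A = ksvd_atom_update(X, D, A, 4)
err_after = np.linalg.norm(X - D @ A)
print(err_after <= err_before)  # True: the update never increases the error
```

The restriction to the support set omega is what keeps the codes sparse: only coefficients that were already nonzero are refit.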
1.2.2 Never-Miss-A-Beat: A Physiological Digital Twins Framework for Cardiovascular Health

Digital twins are emerging as a promising framework for realizing precision health owing to their ability to represent an individual's health status. Ch. 3 of the dissertation introduces a physiological digital twin for personalized and precision continuous cardiac monitoring in the form of modeling the PPG-ECG relationship. Using the dictionary learning algorithm proposed in the previous chapter as the backbone model, the work in this chapter focuses on the problem of inferring ECG signals from PPG signals for continuous precision cardiac monitoring under realistic conditions in which available ECG data are scarce. By performing transfer learning, a generic digital twin model learned from a large portion of paired ECG and PPG data is fine-tuned to precisely infer the ECG from the PPG of a target participant whose available ECG data are scarce. Experimental results for interpolation and extrapolation testing scenarios show that the proposed transfer learning method yields better ECG reconstruction accuracy than other baseline comparison models. This suggests that it can be used as a reliable digital twin for precision continuous cardiac monitoring. In parallel, neural network and causality based backbone model designs are also proposed based on the underlying physiological process of ECG generation for better explainability.

1.2.3 Noncontact Hand Video Based SpO2 Monitoring Using Smartphone Cameras

SpO2 is an important indicator of pulmonary and respiratory functionality. Regular monitoring of the blood oxygen level is recommended as a precaution, especially for the vulnerable population. Recent works have investigated how ubiquitous smartphone cameras can be used to infer SpO2. Most of these works are contact-based, requiring users to cover a phone's camera and its nearby light source with a finger to capture reemitted light from the illuminated tissue.
Contact-based methods may lead to skin irritation and sanitary concerns, especially during a pandemic. In this dissertation, we propose a noncontact method for SpO2 monitoring using remote PPG signals in hand videos acquired by smartphones. The whole algorithm pipeline includes 1) receiving video of the hand of a subject captured by a regular RGB camera of a smartphone; 2) extracting a region of interest from the hand video; 3) performing feature extraction on the region of interest based on spatial and temporal data analysis of more than two color channels; and 4) estimating the blood oxygen saturation level of the subject from the features.

The contributions of this dissertation mainly focus on the feature engineering and estimation parts of the pipeline. Considering the optical broadband nature of the red (R), green (G), and blue (B) color channels of smartphone cameras, we exploit all three channels of RGB sensing to distill the SpO2 information, going beyond the traditional ratio-of-ratios (RoR) method that uses only two wavelengths. In the principled signal processing method (Ch. 4), the features are explicitly extracted based on the multi-channel RoR after adaptive narrow bandpass filtering centered at the accurately estimated heart rate, so as to obtain the most cardiac-related AC component of each color channel. Experimental results show that our proposed blood oxygen estimation method can reach a mean absolute error of 1.26% when a pulse oximeter is used as a reference, outperforming the traditional RoR method by 25%. Building on the understanding gained from the multi-channel principled signal processing method, convolutional neural network based schemes (Ch. 5) are further proposed for implicit data-driven feature extraction via skin color channel mixing and temporal analysis. The neural network architectures are designed inspired by the optophysiological models for SpO2 measurement.
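As a rough illustration of this explicit feature-extraction idea, the following sketch (a simplification under our own assumptions, not the exact Ch. 4 algorithm; function name, band limits, and filter settings are ours) estimates the pulse rate, narrowly band-passes each RGB trace around it to isolate the cardiac AC component, and forms pairwise AC/DC ratio features:

```python
# Hedged sketch: narrow band-pass around the estimated heart rate to keep
# only the cardiac AC part of each color trace, then ratio-of-ratios features.
import numpy as np
from scipy.signal import butter, filtfilt

def ror_features(rgb, fs):
    """rgb: (T, 3) spatially averaged color traces; returns pairwise AC/DC ratios."""
    dc = rgb.mean(axis=0)
    # Estimate pulse rate from the green-channel spectrum (0.7-4 Hz band).
    g = rgb[:, 1] - dc[1]
    freqs = np.fft.rfftfreq(len(g), 1 / fs)
    spec = np.abs(np.fft.rfft(g))
    band = (freqs > 0.7) & (freqs < 4.0)
    f_hr = freqs[band][np.argmax(spec[band])]
    # Narrow band-pass (+/- 0.3 Hz around f_hr) isolates the cardiac AC part.
    b, a = butter(2, [(f_hr - 0.3) / (fs / 2), (f_hr + 0.3) / (fs / 2)], btype="band")
    ac = np.array([np.std(filtfilt(b, a, rgb[:, c] - dc[c])) for c in range(3)])
    r = ac / dc                                  # per-channel AC/DC ratio
    return {"R/B": r[0] / r[2], "R/G": r[0] / r[1], "G/B": r[1] / r[2]}

# Synthetic traces: a 1.2 Hz pulse with channel-dependent AC amplitude.
fs = 30
t = np.arange(0, 20, 1 / fs)
rgb = np.stack([100 + amp * np.sin(2 * np.pi * 1.2 * t)
                for amp in (1.0, 0.5, 0.25)], axis=1)
feats = ror_features(rgb, fs)
print(feats["R/B"])  # ≈ 4, the ratio of the red and blue AC amplitudes
```

In a real pipeline these ratio features would then be mapped to an SpO2 estimate by a calibrated regression against reference oximeter readings.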
Through the visualization of the weights for the RGB channel combinations, we demonstrate the explainability of our model and show that the color bands learned by the neural network are consistent with the color bands suggested by the optophysiological methods.

Chapter 2 Cross-domain Joint Dictionary Learning for ECG Inference from PPG

2.1 Motivation and Problem Formulation

Asymptomatic and intermittent abnormalities in heart functionality could be missed without continuous ECG monitoring, which plays an important role in the early detection and prevention of life-threatening cardiovascular diseases. However, conventional continuous ECG equipment (e.g., the Holter monitor for 24 to 48 hours of recording) is bulky and can be restrictive on users' activities, making it impractical to wear in the long term. Newer clinical ambulatory ECG monitoring devices, such as the Zio patch [39], are lightweight and have alleviated the above-mentioned issues, although potential skin irritation during long-term adhesive wear remains, especially for people with sensitive skin. In addition, a prescription is needed to obtain the Zio patch, so it is not easily accessible to the general public. The Apple Watch [138] and similar wearable devices, such as the Omron KardiaMobile [76], are moderately affordable and can show real-time ECG without adhesion to the skin, but they generally require active user participation and are usually used for sporadic, short measurements of about 30 seconds, making them infeasible for long-term continuous ECG monitoring.

Figure 2.1: Ambulatory and portable/wearable ECG devices (from left to right): Holter monitor, Zio Patch, KardiaMobile, and Apple Watch. Images in this figure are from [39, 66, 76, 138].

ECG Sens. Tech.   Cost     Accessibility   Need No Active Participation?   Long-Term & Conts. Monitoring
Standard ECG      High*    Low             Yes                             No
Apple Watch       Medium   High            No                              No
KardiaMobile      Low      High            No                              No
Zio patch         Medium   Low             Yes                             Yes (skin irritation)
Our Proposed      Low      High            Yes                             Yes (little side effect)
*High cost in the U.S. if without medical insurance.

Table 2.1: Comparison of different ECG sensing techniques.

Table 2.1 summarizes the comparison of the ECG sensing techniques discussed above and shown in Fig. 2.1.

Given the constraints of the ECG sensors, researchers have made efforts toward long-term continuous ECG monitoring by inferring the full ECG waveform from optical sensors, such as photoplethysmogram (PPG) sensors [143, 174, 175]. PPG sensors are ubiquitous in wearable internet-of-healthcare-things (IoHT) devices and have become a common modality for monitoring heart conditions due to the maturity of the technology and its low cost [58]. PPG measures the optical response to blood volume changes at the peripheral ends, including the fingertips [7], and provides valuable information about the cardiovascular system through daily use of the pulse oximeter. Compared to ECG, PPG is more user-friendly for long-term continuous monitoring without constant user participation.

PPG and ECG are physiologically related, as they embody the same cardiac process in two different signal sensing domains. As explained in Chapter 1.1.2, the peripheral blood volume change recorded by PPG is influenced by the contraction and relaxation of the heart muscles, which are controlled by the cardiac electrical signals triggered by the sinoatrial node [75]. The waveform shape (i.e., signal morphology), pulse interval, and amplitude characteristics of PPG provide important information about the cardiovascular system [7], including heart rate, heart rate variability [50], respiration [73], and blood pressure [33].
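As a toy illustration of the pulse-timing information carried by PPG (a hypothetical helper of our own, not code from the dissertation), beat peaks detected in a PPG trace directly yield heart rate and a simple heart rate variability statistic:

```python
# Toy sketch: detect beat peaks in a PPG trace, then derive heart rate and
# a basic variability measure (SDNN, the standard deviation of beat intervals).
import numpy as np
from scipy.signal import find_peaks

def ppg_rate_and_sdnn(ppg, fs):
    peaks, _ = find_peaks(ppg, distance=int(0.4 * fs))  # >= 0.4 s between beats
    ibi = np.diff(peaks) / fs                           # inter-beat intervals (s)
    return 60.0 / ibi.mean(), 1000.0 * ibi.std()        # bpm, SDNN in ms

fs = 100
t = np.arange(0, 60, 1 / fs)
ppg = np.sin(2 * np.pi * 1.0 * t)        # idealized, perfectly regular 60-bpm pulse
hr, sdnn = ppg_rate_and_sdnn(ppg, fs)
print(round(hr))  # 60
```

Real PPG additionally requires the detrending and quality checks discussed later, since motion and baseline drift easily corrupt naive peak detection.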
Therefore, inferring the medical gold-standard ECG signal from the PPG sensor provides a solution for low-cost, long-term continuous cardiac monitoring, which facilitates further diagnosis and creates early intervention opportunities, especially for low-income, disadvantaged populations who have limited access to affordable preventive care. Our proposed technique embodies the trend of digital twins in healthcare [16], an emerging technology that plays a pivotal role in advancing personalized healthcare. The aspects of digital twins our work contributes to are the development of a rich representation of an individual, supported by data and models, through which the physiological status of this individual can be dynamically monitored and analyzed over time.

2.2 Related Work

2.2.1 ECG reconstruction from PPG

Much prior work extracts physiological parameters [13, 167] or classifies arrhythmia [1, 14, 61, 111] from input ECG or PPG signals using machine learning methods. However, direct parameter estimation or automatic diagnosis is insufficient for medical practitioners to interpret. The ECG signal, rather than results derived via black-box models, is still the gold-standard tool on which cardiologists rely to make further decisions. By providing the reconstructed ECG waveform, our proposed technique in this chapter offers complementary support and allows manual checks by cardiovascular experts with their medical expertise and clinical experience.

Very limited prior work has been devoted to PPG-based ECG inference. The pilot study [174, 175] proposed to relate the waveforms of PPG and ECG in the discrete cosine transform (DCT) domain by a linear model. In the participant-specific case, where a linear model is trained from and tested on the same individual, this DCT method achieved a mean reconstruction correlation of 0.94.
In contrast, for the group-based model, the achieved mean correlation degraded to 0.79. This suggests that there is still substantial room for improvement when extending to the group-based model case, where a universal mapping needs to be trained with a wider variety of ECG morphologies from multiple people. To address these issues, we consider dictionary learning based sparse representation for ECG and PPG, as it provides a richer and more adaptive representation than the universal DCT dictionary by better leveraging the data. We will use this as a foundation to develop joint dictionary learning models for reconstruction.

2.2.2 Dictionary learning

Algorithms that learn a single dictionary for signal representation have been well studied [2, 44, 97]. They have been successfully applied to cardiac signal processing, including recent research showing that ECG signals can be well represented as a sparse linear combination of atoms from an appropriately learned dictionary for applications such as ECG classification and compression [36, 94, 98].

In the domain of image processing and computer vision, these single dictionary learning strategies have been extended to joint dictionary learning tasks. For image super-resolution [161-163], coupled dictionary learning frameworks are proposed to learn a dictionary pair for low- and high-resolution image patches while enforcing the similarity of their sparse codes with respect to their dictionaries. One assumption of this model is that the transform matrix between the two sparse codes is an identity matrix. In person re-identification [89] and photo-to-sketch [155] problems, a linear mapping between the codings of the input and output images is introduced into the objective function for semi-coupled dictionary learning. In both training schemes, the updates of the mapping and the dictionaries are done separately within each iteration, making the dictionary computation less aware of the signal transform.
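Sparse coding with respect to a given dictionary is the building block all of these methods share. It can be illustrated with a minimal orthogonal matching pursuit (OMP) routine (our own didactic implementation; the sanity check uses an orthonormal dictionary so that recovery is exact, though OMP also applies to overcomplete dictionaries):

```python
# Minimal OMP sketch: greedily pick the atom most correlated with the
# residual, then refit all selected atoms by least squares.
import numpy as np

def omp(D, x, t):
    """Approximate min ||x - D a||_2 s.t. ||a||_0 <= t."""
    residual, support = x.copy(), []
    a = np.zeros(D.shape[1])
    for _ in range(t):
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    a[support] = coef
    return a

# Sanity check: recover a 2-sparse code exactly from an orthonormal dictionary.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((64, 64)))   # orthonormal atoms
a_true = np.zeros(64); a_true[[5, 40]] = [2.0, -1.5]
a_hat = omp(Q, Q @ a_true, t=2)
print(np.allclose(a_hat, a_true))  # True
```

The least-squares refit after each atom selection is what distinguishes OMP from plain matching pursuit and gives it the exact-recovery behavior seen here.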
Our method aims at boosting the reconstruction performance from PPG to ECG by using a joint dictionary learning framework. Unlike the super-resolution problem [161-163], where the input and output reside in the same signal domain and are highly correlated, XDJDL introduces a PPG-to-ECG mapping that spans two sensing modalities with low waveform correlation, providing more flexibility and generalization for the two learned dictionaries. Different from [89, 155], we update the linear transform and the dictionary in the same step, which optimizes the capability of the obtained dictionaries for both signal representation and transformation. This transform-aware joint dictionary learning formulation is one of the major differences from other coupled dictionary learning frameworks. The framework can also be easily generalized to different constraints. For instance, in the proposed LC-XDJDL model, we add a label-consistency regularization term to the objective function of the XDJDL model, which encourages the transformed sparse codes from the same class to be similar.

2.3 Proposed Methods

The previous work on ECG reconstruction from PPG using a universal, data-independent basis, the discrete cosine transform (DCT) [174, 175], has limited fidelity for representing uncommon ECG waveform shapes, especially in the group-based case with a broader range of signal morphologies [162]. We focus on such group-based cases in this chapter and consider data science and learning techniques with richer representative power to answer the following research question:

• Group-based model: Can a single model, trained from a group of subjects sharing a certain determinant of physiology (e.g., age, weight, disease type, etc.), predict the ECG waveforms from unseen PPG measurements for individuals in the training group?
To overcome the limitation of the DCT method and develop the synergy of model and data, our work aims at improving data representation through a more versatile and adaptive framework based on dictionary learning, demonstrating the feasibility of ECG waveform inference from the PPG signal as an inverse filtering problem. In addition to the algorithmic improvement, sparse coding and dictionary learning frameworks have proven to perform efficiently on IoT platforms in terms of cutting down power consumption and computation cost [6, 93]. Thus, by investigating the dictionary learning based approach in this chapter, we strike a balance between model complexity and practical cost in IoT applications.

Our proposed cross-domain joint dictionary learning (XDJDL) method for ECG reconstruction from PPG is summarized in Fig. 2.2. A further-developed label-consistent XDJDL model (LC-XDJDL) is also proposed for when the label information for the ECG/PPG cycles is available.

Figure 2.2: Illustration of the proposed framework. The ECG and PPG signals are first preprocessed to obtain temporally aligned and normalized pairs of cycles. 80% of the pairs of ECG and PPG signal cycles from each subject are used for training the paired dictionaries Dp, De, and a linear transform W, which are applied in the test phase to infer the ECG signals.

The PPG and ECG signals are first preprocessed into normalized signal cycles to facilitate the subsequent training. In the training phase, the ECG/PPG dictionary pair is jointly updated with a stable linear mapping that relates the sparse representations of the two measurements.
In LC-XDJDL, one more linear mapping, which enforces label consistency for the PPG sparse codes, is learned to further improve the ECG reconstruction performance and enrich the PPG diagnosis knowledge base.

2.3.1 Signal Preprocessing

To establish the quantitative relationship between the corresponding cycles of ECG and PPG, we preprocess the two signals during the training phase to obtain temporally aligned and normalized pairs of signals, so that the critical temporal features of both waveforms are synchronized for learning and evaluation. The preprocessing method we adopt is rooted in the underlying physiological relationships between PPG and ECG signals discussed in Chapter 1.1.2 and is independent of the dataset selection. First, considering the synchronization issue between separate ECG and PPG devices, we align the whole ECG and PPG sequences according to the moment when the ventricles contract and the blood flows into the vessels, which corresponds to the R peaks of ECG and the onsets of PPG in the same cycle. Both the onsets and R peaks are detected by the beat detection functions from the PhysioNet Cardiovascular Signal Toolbox [153]. Then we detrend the aligned signals with a second-order difference operator based algorithm [174] to eliminate the baseline drift related to respiration, motion, vasomotor activity, and changes in the contact surface [7]. To prepare for learning the cycle-wise relation during one heartbeat, the detrended PPG and ECG signals are partitioned into cycles by the R2R segmentation scheme [174], where the R-peaks of the concurrent ECG waveform are used to partition the signals on a heartbeat-by-heartbeat basis. After segmentation, each cycle is linearly interpolated to length d to mitigate the influence of heart rate variation. Finally, we normalize the amplitude of each cycle by subtracting the sample mean and dividing by the sample standard deviation.
The preprocessed PPG and ECG signal cycles are stored in data matrices P and E, respectively.

2.3.2 Cross-domain Joint Dictionary Learning (XDJDL)

We denote the PPG and ECG datasets as P = [X_p, T_p] ∈ R^{d×(n+m)} and E = [X_e, T_e] ∈ R^{d×(n+m)}, respectively. Each column of P and E, denoted p_i ∈ R^{d×1} and e_i ∈ R^{d×1}, represents one PPG/ECG cycle during the same cardiac cycle. The goal is to learn the patterns (in terms of dictionaries, mappings, etc.) from the training data X_p ∈ R^{d×n} and X_e ∈ R^{d×n} to infer the test ECG dataset T_e ∈ R^{d×m} from the PPG T_p ∈ R^{d×m}. We formulate our XDJDL framework as:

  min_{D_e, A_e, D_p, A_p, W}  ||X_e − D_e A_e||_F^2 + α ||X_p − D_p A_p||_F^2 + β ||A_e − W A_p||_F^2     (2.1)
  s.t.  ||a_{p,j}||_0 ≤ t_p  and  ||a_{e,j}||_0 ≤ t_e,  j = 1, ..., n,

where α and β are weighting scalars; D_p ∈ R^{d×k_p} and D_e ∈ R^{d×k_e} are dictionaries learned for X_p and X_e, respectively; and A_p ∈ R^{k_p×n} and A_e ∈ R^{k_e×n} are the corresponding sparse coding matrices associated with the data matrices X_p, X_e when D_p, D_e are the current dictionaries. Each column of A_p and A_e is denoted as a_{p,j} and a_{e,j}, with sparsity upper bounded by t_p and t_e.

For the objective function in Eq. (2.1), ||X_e − D_e A_e||_F^2 and ||X_p − D_p A_p||_F^2 are the data fidelity terms for the ECG and PPG cycle sets, respectively. The term ||A_e − W A_p||_F^2 represents the mapping error between the sparse coding coefficients of the ECG and PPG signals, which enforces the transformed sparse codes of PPG to approximate those of ECG. Intuitively, we could instead enforce the two sparse representations for ECG and PPG from the same cycle to be identical and set the regularization term as ||A_e − A_p||_F^2. However, since ECG and PPG come from two different signal sensing modalities and the waveform difference between the two signals is significant, directly pushing their sparse representations to be similar could compromise the generalization of the two learned dictionaries. From the formulation in Eq.
(2.1), we can jointly learn the dictionaries for the ECG and PPG datasets, which produce a good representation for each sample in the training set under strict sparsity constraints. Meanwhile, we learn the linear approximation W of the transform that relates the sparse codes of PPG and ECG, and use it to capture the intrinsic relationship between certain PPG atoms and ECG atoms from their corresponding dictionaries.

The optimization process is described as follows. Eq. (2.1) can be rewritten as:

  min_{D_e, A_e, D_p, A_p, W}  || [X_e; √α X_p; 0] − [D_e, 0; 0, √α D_p; √β I, −√β W] [A_e; A_p] ||_F^2     (2.2)
  s.t.  ||a_{e,j}||_0 ≤ t_e  and  ||a_{p,j}||_0 ≤ t_p,  j = 1, ..., n,

where semicolons separate block rows, I is an identity matrix, and 0 is a zero matrix, with valid dimensions for matrix multiplication.

Let X̃ = [X_e; √α X_p; 0] ∈ R^{(2d+k_e)×n}, D̃ = [D_e, 0; 0, √α D_p; √β I, −√β W] ∈ R^{(2d+k_e)×(k_e+k_p)}, and Ã = [A_e; A_p] ∈ R^{(k_e+k_p)×n}. The optimization of (2.2) can then be written as the following problem:

  min_{D̃, Ã}  ||X̃ − D̃ Ã||_F^2     (2.3)
  s.t.  ||a_{+,j}||_0 ≤ t_e  and  ||a_{−,j}||_0 ≤ t_p,  j = 1, ..., n,

where A_+ is defined as the first k_e rows of the sparse matrix Ã, A_− as the last k_p rows, and a_{+,j} and a_{−,j} are their j-th columns. The formulation in Eq. (2.3) is now similar to the original K-SVD formulation [2], suggesting that K-SVD can be adapted for this optimization. The difference is the local sparsity constraint, which will be addressed in the following optimization procedures.

Step 0: Initialization. To initialize D̃ and Ã, we need to initialize their components D_e, D_p, W, A_e, and A_p. First, we randomly select a subset of columns from the training data X_e and X_p to form D_e and D_p. Then, we initialize the sparse codes A_e and A_p by solving Eq. (2.6) with respect to {D_e, X_e, t_e} and {D_p, X_p, t_p}, respectively. Finally, we use the ridge regression model to initialize W:

  min_W  ||A_e − W A_p||_F^2 + λ ||W||_F^2,     (2.4)

which has the closed-form solution:

  W = A_e A_p^T (A_p A_p^T + λ I)^{−1}.
(2.5)

After the initialization, we use a two-step iterative optimization to minimize the energy in (2.3), where step one is sparse coding and step two is dictionary updating by SVD.

Step 1: Sparse coding

Given $D$, the sparse coding step finds the sparse representation $a_j$ for $x_j$, $j = 1, \ldots, n$, by solving

$$\min_{a_j} \|x_j - D a_j\|_2^2 \quad \text{s.t. } \|a_j\|_0 \le t, \qquad (2.6)$$

where $a_j$ is the $j$th column of the sparse representation matrix $A$ and $x_j$ is the $j$th training sample in matrix $X$.

Many approaches have been proposed to solve Eq. (2.6) [168]. Here we adopt the orthogonal matching pursuit (OMP) method [146], a greedy method that provides a good approximation. As mentioned earlier, the local sparsity constraints imposed on Eq. (2.3) prevent the direct application of OMP. One workaround is to solve the following problem in Eq. (2.7) in place of Eq. (2.3):

$$\min_{D, A} \|X - D A\|_F^2 \quad \text{s.t. } \|a_j\|_0 \le t_e + t_p, \quad j = 1, \ldots, n, \qquad (2.7)$$

where $a_j$ is the vertical concatenation of $a_{+,j}$ and $a_{-,j}$ in Eq. (2.3), and $t_e$ and $t_p$ are the sparsity constraints for the upper and lower parts of $a_j$, respectively. During the OMP process in each iteration, we keep only the largest sparse coefficients in $a_j$ to enforce the local sparsity constraints.

Step 2: Dictionary update

To update the $k$th atom $d_k$ in dictionary $D$ and its corresponding coefficients $a_R^k$ in the $k$th row of $A$, we apply SVD to the residue term $R_k \triangleq X - \sum_{j \ne k} d_j a_R^j$. In practice, we only select the training samples that use the atom $d_k$ and avoid filling in the zero entries of $a_R^k$ during the update. We do so by denoting the nonzero entries of $a_R^k$ as $\tilde{a}_R^k$ and, correspondingly, $R_k$ restricted to those samples as $\tilde{R}_k$. The updated atom $d_k$ and the related coefficients $\tilde{a}_R^k$ are then computed by:

$$\min_{d_k, \tilde{a}_R^k} \|\tilde{R}_k - d_k \tilde{a}_R^k\|_F^2. \qquad (2.8)$$

To solve Eq. (2.8), we use the SVD method on the residue term [2], i.e., $\tilde{R}_k = U \Sigma V^T$. Then $d_k$ and $\tilde{a}_R^k$ can be updated as follows:

$$d_k = U(:, 1), \quad \tilde{a}_R^k = \Sigma(1, 1)\,V^T(1, :). \qquad (2.9)$$

Note that taking $D \triangleq (D_e, 0; 0, \sqrt{\alpha}\,D_p; \sqrt{\beta}\,I, -\sqrt{\beta}\,W)$ as a whole in the dictionary update phase does not solve this optimization problem, because the zero-matrix parts and the identity-matrix part in $D$ cannot be guaranteed when the dictionary is updated by SVD. A remedy is to decompose the dictionary update problem for $D$ into the following two subproblems by revisiting the matrix form of the optimization problem in Eq. (2.2).

(i) Update $D_e, A_e$:

$$\langle \hat{D}_e, \hat{A}_e \rangle = \arg\min_{D_e, A_e} \|X_e - D_e A_e\|_F^2. \qquad (2.10)$$

We use SVD to update all atoms in $D_e$ and the corresponding nonzero entries in $A_e$ by solving Eq. (2.10) with the same procedure as in Eqs. (2.8) and (2.9). The columns of $D_e$ are $\ell_2$ normalized.

(ii) Update $D_p, A_p$, and $W$: The updated ECG sparse representation matrix $\hat{A}_e$ from subproblem (i) then serves as an input to the second subproblem, which updates $W$, $D_p$, and $A_p$:

$$\langle \hat{D}_p, \hat{A}_p, \hat{W} \rangle = \arg\min_{D_p, A_p, W} \left\| \begin{pmatrix} \sqrt{\alpha}\,X_p \\ \sqrt{\beta}\,\hat{A}_e \end{pmatrix} - \begin{pmatrix} \sqrt{\alpha}\,D_p \\ \sqrt{\beta}\,W \end{pmatrix} A_p \right\|_F^2. \qquad (2.11)$$

We treat $(\sqrt{\alpha}\,D_p; \sqrt{\beta}\,W)$ as a whole dictionary and use the SVD method in Eqs. (2.8) and (2.9) to update it together with the nonzero entries in $A_p$. The linear transform and the dictionary are updated simultaneously, which addresses the problem of isolated updates raised in [89, 155] and is one of the major differences from other coupled dictionary learning models. After solving the two subproblems, $D$ and $A$ can be assembled by filling in the submatrices. The main steps of XDJDL are summarized in Algorithm 1.

Algorithm 1 Cross-domain joint dictionary learning
Input: Training data $X_e$ and $X_p$ of ECG and PPG cycles, testing data $T_e$ and $T_p$, and sparsity constraints $t_e, t_p$
Training phase:
Initialization:
• Initialize $\{D_e, D_p\}$ by randomly selecting atoms from the training data.
• Initialize $A_e, A_p$ by solving Eq. (2.6) with OMP.
• Initialize $W$ by Eq. (2.5).
while not converged do
• Update $D, A$ by combining the updated submatrices.
• Sparse coding: compute $A$ in Eq. (2.6) with OMP. Zero out the smallest nonzero entries in the columns of $A$ if any local sparsity constraint does not hold.
• Dictionary update:
  – Update $D_e, A_e$ in Eq. (2.10) by the SVD method illustrated in Eqs. (2.8)-(2.9).
  – Update $D_p, A_p, W$ in Eq. (2.11) by the SVD method illustrated in Eqs. (2.8)-(2.9).
end while
Testing phase:
for each sample $t_p^j \in T_p$ do
• Compute the sparse code $s_p^j$ of $t_p^j$ under $D_p$ using Eq. (2.6).
• Calculate $s_e^j = W s_p^j$.
• Compute the reconstructed ECG sample as $r_e^j = D_e s_e^j$, and store it in matrix $R_e$.
end for
Output: $R_e$

2.3.3 Label Consistent XDJDL (LC-XDJDL)

For cases where the disease type is known or can be predicted, such as from the PPG signals that we have, we can further leverage the disease label. In this section, we examine the effect of adding a label-consistency regularization term to the objective function in Eq. (2.1) as follows:

$$\min_{D_e, A_e, D_p, A_p, W, H} \|X_e - D_e A_e\|_F^2 + \alpha \|X_p - D_p A_p\|_F^2 + \beta \|A_e - W A_p\|_F^2 + \gamma \|Q - H A_p\|_F^2 \qquad (2.12)$$
$$\text{s.t. } \|a_{p,j}\|_0 \le t_p \text{ and } \|a_{e,j}\|_0 \le t_e, \quad j = 1, \ldots, n,$$

where $Q \triangleq [q_1, q_2, \ldots, q_n] \in \mathbb{R}^{r \times n}$ is a discriminative representation matrix [72] in which each column $q_i = [0, 0, \ldots, 0, 1, 1, 0, \ldots, 0]^T \in \mathbb{R}^{r \times 1}$ is a discriminative coding for an input signal. The nonzero elements in $q_i$ occur at the positions corresponding to the disease label, similar to one-hot encoding but with the number of ones as a tunable parameter. The additional regularization term $\|Q - H A_p\|_F^2$ represents the discriminative sparse code error, which enforces the transformed sparse codes of PPG to approximate the discriminative codes in $Q$. It yields dictionaries such that signals from the same class have very similar sparse codes, i.e., it enforces label consistency in the sparse representations. We add the label-consistency regularization term for two main purposes: One is to improve the ECG reconstruction quality by using additional class information to constrain the degrees of freedom of the PPG sparse codes.
The other is to enrich the knowledge base of PPG for the diagnosis of a certain set of diseases of interest. CVDs weaken the heart's functionality, which further impacts blood circulation in the body; thus PPG manifests certain disease information. By enforcing consistency between the sparse codes of PPG and the disease labels, one can gain insight into how a disease is revealed in PPG by inspecting the specific columns of the PPG sparse coding matrix $A_p$ and the label matrix $Q$.

Similarly, Eq. (2.12) can be written in matrix form:

$$\min_{D_e, A_e, D_p, A_p, W, H} \left\| \begin{pmatrix} X_e \\ \sqrt{\alpha}\,X_p \\ 0 \\ \sqrt{\gamma}\,Q \end{pmatrix} - \begin{pmatrix} D_e & 0 \\ 0 & \sqrt{\alpha}\,D_p \\ \sqrt{\beta}\,I & -\sqrt{\beta}\,W \\ 0 & \sqrt{\gamma}\,H \end{pmatrix} \begin{pmatrix} A_e \\ A_p \end{pmatrix} \right\|_F^2 \qquad (2.13)$$
$$\text{s.t. } \|a_{e,j}\|_0 \le t_e \text{ and } \|a_{p,j}\|_0 \le t_p, \quad j = 1, \ldots, n.$$

The two-step optimization method in Chapter 2.3.2 can still be applied to find the optimal solution for the dictionary pair and the linear mappings $W$ and $H$. In the test phase, the PPG sparse representation matrix $A_p$ is obtained by applying sparse coding with the learned $D_p$, $H$, the test sample matrix $T_p$, and the label matrix $Q$.

2.4 Experimental Evaluation

2.4.1 Dataset

The Medical Information Mart for Intensive Care III (MIMIC-III) [52, 74] is a publicly available database assembled by researchers at MIT. It comprises a large number of ICU patients with de-identified health data from their hospital stays. To evaluate our proposed framework and algorithm, we have extracted a small subset of the MIMIC-III database as follows. First, we select waveforms that contain both lead-II ECG and PPG signals sampled at 125 Hz from the MIMIC-III waveform database. Then the selected waveforms are cross-referenced with the corresponding patient profiles by subject ID in the MIMIC-III clinical information database.
Patients with the four types of CVDs are further selected: congestive heart failure (CHF), myocardial infarction (MI), including ST-segment elevated (STEMI) and non-ST-segment elevated (NSTEMI), hypotension (HYPO), and coronary artery disease (CAD). These diseases are all included in the "diseases of the circulatory system" category of the ICD-9 international disease classification codes. After that, we analyze the signal pair quality using the PPG SQI function from the PhysioNet cardiovascular signal toolbox [153] and keep the paired segments that are evaluated as "acceptable" or "excellent." The resulting mini-MIMIC-33 dataset consists of 33 patients, with each patient having only one of the four diseases on record. Each patient has three sessions of 5-min paired ECG and PPG recordings collected within several hours, resulting in 34,243 ECG/PPG cycle pairs in total. Table 2.2 shows the composition of the collected dataset.

Cardiovascular Diseases                                | # of Patients | # of Cycles
Congestive Heart Failure (CHF)                         | 7             | 7075 (20.6%)
Myocardial Infarction (MI): ST-Segment Elevated (STEMI)| 3             | 2962 (8.7%)
Myocardial Infarction (MI): Non-ST Segment Elevated (NSTEMI) | 4       | 4144 (12.1%)
Hypotension (HYPO)                                     | 7             | 8281 (24.2%)
Coronary Artery Disease (CAD)                          | 12            | 11781 (34.4%)
Total                                                  | 33            | 34243 (100%)

Table 2.2: Composition of the collected mini-MIMIC-33 dataset.

2.4.2 Metrics for Evaluation

As shown in Fig. 2.4(a), a complete ECG cycle contains five major points, P, Q, R, S, and T, which segment the ECG cycle into the P wave, QRS complex, and T wave. The shape information of those waves is useful for further diagnosis. The interval parameters (PR interval, QRS interval, QT interval) defined by those five fiducial points are also important for examining a patient's heart condition. Thus, to evaluate the quality of the reconstructed ECG, we consider both morphological metrics and the accuracy of time interval recovery.

Evaluation of Waveform Morphology: We apply the Pearson correlation ($\rho$)
and the relative root mean squared error (rRMSE) as the metrics for evaluating the ECG morphological reconstruction. They are defined as follows:

$$\rho = \frac{(x - \bar{x})^T (\hat{x} - \bar{\hat{x}})}{\|x - \bar{x}\|_2 \, \|\hat{x} - \bar{\hat{x}}\|_2}, \qquad \text{rRMSE} = \frac{\|x - \hat{x}\|_2}{\|x\|_2}, \qquad (2.14)$$

where $x$ and $\hat{x}$ denote the ground-truth and the recovered ECG cycle, and $\bar{x}$ and $\bar{\hat{x}}$ denote the averages of all coordinates of the vectors $x$ and $\hat{x}$, respectively.

Evaluation of Time Interval Recovery: Three important ECG interval parameters are studied in this work: the PR interval, the QRS duration, and the QT interval. Normally, the PR interval lasts 0.12-0.20 seconds; it begins at the onset of the P wave and ends at the beginning of the QRS complex. We use the segment from the P point to the R point of the ECG as the approximated PR interval. A prolonged PR interval can indicate the possibility of first-degree heart block [60]. The duration of the QRS complex, corresponding to ventricular depolarization, is normally 0.12 seconds or less. A prolonged QRS complex indicates impaired conduction within the ventricles. The QT interval runs from the onset of the QRS complex to the end of the T wave and is normally less than 0.48 seconds. A prolonged QT interval may lead to ventricular tachycardia [60]. We apply a combination of several established algorithms [110, 124, 125] to detect the major fiducial points of both the ground-truth ECG and the reconstructed ECG to obtain the above-mentioned interval parameters.

Reconstruction Scheme | Sparsity Constraint | Linear Mapping Between Representations
DCT [175]             | n.a.                | ✓
CPDL [89]             | n.a.                | n.a.
ScSR [163]            | ℓ1                  | n.a.
SCDL [155]            | ℓ1                  | ✓
CDL [161]             | ℓ0                  | n.a.
XDJDL (proposed)      | ℓ0                  | ✓

Table 2.3: Configuration comparison of the models implemented for ECG reconstruction, including the sparsity constraint on the representations and the learnable linear mapping between the representations of PPG and ECG.

We apply the mean absolute error (MAE) in Eq. (2.15) to evaluate the time recovery accuracy:

$$\text{MAE} = \frac{1}{N} \sum_{n=1}^{N} |L_{\text{rec}} - L_{\text{ref}}|, \qquad (2.15)$$
where $L_{\text{rec}}$ and $L_{\text{ref}}$ are the interval lengths (in seconds) of the reconstructed and ground-truth ECG signals, respectively, and $N$ is the total number of cycles for evaluation.

2.4.3 Overall Morphological Reconstruction

We compare our proposed XDJDL method with the state of the art in ECG reconstruction from PPG, which uses a DCT-based method [175]. In addition, we apply several representative and state-of-the-art coupled or semi-coupled dictionary learning models, including CPDL [89], ScSR [163], SCDL [155], and CDL [161], to compare with the proposed XDJDL method on the ECG reconstruction task. The code for the prior-art methods was downloaded from the respective authors' websites, and their configurations are listed in Table 2.3. These models can be characterized by (1) the way they represent the signals, with or without sparsity constraints, and (2) whether the cross-domain signal representations are assumed to be identical or linearly related by a learnable mapping.

Reconstruction Scheme | ρ mean | ρ med | ρ std | rRMSE mean | rRMSE med | rRMSE std
DCT [175]             | 0.71   | 0.83  | 0.31  | 0.67       | 0.60      | 0.26
CPDL [89]             | 0.74   | 0.85  | 0.31  | 0.63       | 0.56      | 0.35
ScSR [163]            | 0.82   | 0.89  | 0.23  | 0.54       | 0.52      | 0.21
SCDL [155]            | 0.83   | 0.89  | 0.21  | 0.52       | 0.49      | 0.22
CDL [161]             | 0.85   | 0.95  | 0.25  | 0.49       | 0.34      | 0.51
XDJDL (proposed)      | 0.88   | 0.96  | 0.23  | 0.39       | 0.29      | 0.31

Table 2.4: Quantitative performance comparison for ECG waveform inference.

To make a fair comparison, we evaluate the DCT-based reconstruction system in the subject-independent training mode, where a linear transform $W_{\text{DCT}}$ is learned using training data from all patients. The normalized PPG/ECG cycle length is chosen as $d = 300$. For XDJDL, the dictionary size for ECG cycles is $k_e = 320$, and the dictionary size for PPG cycles is $k_p = 9000$. The sparsity parameters are set to $t_e = 10$ and $t_p = 10$. The weights for the regularization terms are $\alpha = 1$ and $\beta = 1$.
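The morphology metrics in Eq. (2.14) are simple to compute per cycle; the following minimal NumPy sketch (our own illustration, not the evaluation code used in this work) makes the definitions concrete:

```python
import numpy as np

def morphology_metrics(x, x_hat):
    """Pearson correlation rho and relative RMSE between a ground-truth
    cycle x and a reconstructed cycle x_hat, as in Eq. (2.14)."""
    xc = x - x.mean()          # mean-removed ground truth
    xh = x_hat - x_hat.mean()  # mean-removed reconstruction
    rho = (xc @ xh) / (np.linalg.norm(xc) * np.linalg.norm(xh))
    rrmse = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
    return rho, rrmse

# A perfect reconstruction gives rho = 1 and rRMSE = 0.
t = np.linspace(0, 1, 300)
x = np.sin(2 * np.pi * t)
rho, rrmse = morphology_metrics(x, x.copy())
```

Note that $\rho$ is invariant to amplitude scaling and offset of the reconstruction, while rRMSE is not, which is why the two metrics are reported together.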
For the other dictionary learning models, we also performed grid search for hyperparameter tuning to achieve their best performance. We split the data from each patient into training and test sets, with a training data ratio of 80%.

Table 2.4 shows the quantitative comparison of the ECG morphological reconstruction performance. From the statistics of the sample mean, standard deviation, and median of $\rho$ and rRMSE, we can see that our proposed XDJDL method outperforms both the DCT-based algorithm and the other representative coupled/semi-coupled dictionary learning models. Specifically, the average rRMSE is reduced from 0.49 to 0.39, or 20.4% lower than CDL [161], the second best among all competing models.

In Fig. 2.3, we present visualization examples of ECG waveform reconstruction using all the competing models and our proposed XDJDL model.

Figure 2.3: Qualitative comparison of the ECG signals inferred by different approaches. Examples are from (a) a 71-year-old female with congestive heart failure, (b) an 87-year-old female with myocardial infarction, and (c) an 82-year-old female with coronary artery disease. From top to bottom: the input PPG signal from which the ECG is inferred in subject-independent mode, and the results by the DCT method [175], CPDL [89], ScSR [163], SCDL [155], CDL [161], and our proposed XDJDL.

The three patients have different disease diagnoses.
We observe that even though the waveform variations among the PPG signals are relatively smaller than those among the ECG signals, our proposed XDJDL method can recover most of the details of the ECG signal from the PPG signal, suggesting that our method preserves the intrinsic relation between the atoms of the PPG and ECG dictionary pair. In particular, the second-best CDL [161] method, while reconstructing the overall shape of the ECG cycles reasonably well, shows glitches in recovering the details, such as the P wave of the first and last cycles of Patient 2 and the QRS complex of the first cycle of Patient 3.

When cycle-wise disease information is available, we can apply the proposed label-consistent XDJDL (LC-XDJDL) model from Chapter 2.3.3 to leverage the label information for more accurate monitoring of ECG from the PPG signal. We consider the following scenarios: 1) For cases where the disease information is not directly provided in the test phase, we can first predict it from the PPG signals. Here, we have trained an SVM classifier for multi-class disease classification from PPG and chosen the best hyperparameters with five-fold cross-validation. The classification accuracy on the PPG test set reaches 92%. We denote the corresponding label-consistent model as LC1-XDJDL; it takes the predicted labels to build the discriminative representation matrix $Q$. 2) When we have the ground-truth disease labels in the test phase, we can use that disease information directly to build the matrix $Q$; the corresponding model is named LC2-XDJDL. We list the comparison of ECG reconstruction performance using the XDJDL, LC1-XDJDL, and LC2-XDJDL models in Table 2.5.

Reconstruction Scheme | ρ mean | ρ med | ρ std | rRMSE mean | rRMSE med | rRMSE std
XDJDL                 | 0.88   | 0.96  | 0.23  | 0.39       | 0.29      | 0.31
LC1-XDJDL             | 0.90   | 0.96  | 0.20  | 0.36       | 0.27      | 0.28
LC2-XDJDL             | 0.92   | 0.97  | 0.17  | 0.33       | 0.26      | 0.25

Table 2.5: Numerical comparison of ECG signal inference performance among the XDJDL, LC1-XDJDL, and LC2-XDJDL methods.
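To make the discriminative matrix $Q$ of Eq. (2.12) concrete, the following sketch builds it from (predicted or ground-truth) class labels. The function name and the block-of-ones layout per class are our own illustration of the scheme described in Chapter 2.3.3, where the number of ones per column is a tunable parameter:

```python
import numpy as np

def build_q(labels, num_classes, ones_per_class=2):
    """Hypothetical construction of the discriminative matrix Q:
    each column carries a block of ones at the rows reserved for that
    sample's class -- a block version of one-hot encoding."""
    r = num_classes * ones_per_class      # number of rows in Q
    Q = np.zeros((r, len(labels)))
    for j, c in enumerate(labels):
        Q[c * ones_per_class:(c + 1) * ones_per_class, j] = 1.0
    return Q

# Three samples with class labels 0, 2, 1 out of 4 classes.
Q = build_q([0, 2, 1], num_classes=4)
```

Columns sharing a label share the same nonzero block, which is what pushes same-class PPG cycles toward similar sparse codes under the $\|Q - H A_p\|_F^2$ term.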
On average, the Pearson coefficient improves from 0.88 to 0.90 with the predicted label information, and to 0.92 with the ground-truth disease type as input. The improvement in rRMSE is consistent with that of the Pearson coefficient. Beyond the reconstruction performance improvement, the label-consistent mapping that relates the PPG sparse codes to the disease type in LC-XDJDL helps us understand the role of PPG in diagnosis with a rich ECG knowledge base.

Figure 2.4: (a) shows two cycles of the reference ECG signal and (b) shows two cycles of the inferred ECG signal. In the first cycle of (a), the green curve represents the P wave, the red curve is the QRS complex, and the dark blue curve shows the T wave. The PR interval, QRS duration, and QT interval are all labeled in the second cycle of (a).

2.4.4 Subwave Morphological Reconstruction

In the above section, we have shown that our proposed XDJDL outperforms the DCT model and other representative dictionary learning models, and that its performance improves further when the disease label can be utilized (LC-XDJDL) for ECG reconstruction and monitoring. In this section, we zoom into the reconstruction performance on the subwaves of the ECG cycle using the XDJDL and LC-XDJDL methods. Because each subwave reflects different atrial and ventricular depolarization and repolarization activities, zooming in gives a better picture of how our methods behave when inferring the different phases of the heart's activity.

Reconstruction | ———— mean ρ ————            | ———— mean rRMSE ————
Scheme         | P wave | QRS complex | T wave | P wave | QRS complex | T wave
XDJDL          | 0.81   | 0.92        | 0.84   | 0.53   | 0.33        | 0.41
LC1-XDJDL      | 0.83   | 0.93        | 0.86   | 0.49   | 0.30        | 0.37
LC2-XDJDL      | 0.86   | 0.94        | 0.89   | 0.45   | 0.28        | 0.34

Table 2.6: Comparison of subwave reconstructions in the mean of ρ and rRMSE.
A combination of the ECG major-point detection algorithms [110, 124, 125] is used to locate the P/Q/R/S/T points of the ECG waveform, which helps segment each ECG cycle into subwaves for the evaluation of morphological reconstruction.

Fig. 2.4 shows an example of the major-point detection results on two cycles of the reference ECG (Fig. 2.4(a)) and the reconstructed ECG (Fig. 2.4(b)) from a patient with coronary artery disease. In this example, we observe that the locations of the detected major points in the two signals are close, indicating a good reconstruction of the ECG waveform. We empirically separate adjacent ECG cycles at the point that splits the neighboring R-R peaks at a ratio of six to four. After that, a complete ECG cycle is divided into three subwaves: the P wave, which starts from the border point on the left of the ECG cycle and ends at the Q point; the QRS complex, from the Q to the S point; and the T wave, from the S point to the right border point. Only a very small portion of the reference and reconstructed ECG cycle pairs cannot be detected with a consistent set of fiducial points. The number of effective cycles for subwave evaluation is around 92% of all test cycles, and those effective cycles have only a slightly higher Pearson coefficient (1% on average) than the original test dataset.

Table 2.6 lists the reconstruction performance on the three subwaves of the ECG cycle in terms of the mean Pearson coefficient and rRMSE using the XDJDL, LC1-XDJDL, and LC2-XDJDL models. The comparison of results across models is consistent with the results of the overall comparison in Table 2.5.

Figure 2.5: Comparison of subwave reconstruction performance across the XDJDL, LC1-XDJDL, and LC2-XDJDL models. The statistics of (a) the Pearson coefficient ρ and (b) the rRMSE are summarized using boxplots.
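The cycle-separation convention above, splitting each neighboring R-R interval at a six-to-four ratio, can be sketched as follows. The function name is our own illustration, not the dissertation's code:

```python
import numpy as np

def cycle_boundaries(r_peaks, ratio=0.6):
    """Place the border between adjacent ECG cycles at the point that
    splits each neighboring R-R interval at the given ratio
    (six to four by default)."""
    r = np.asarray(r_peaks, dtype=float)
    # border_i = R_i + ratio * (R_{i+1} - R_i)
    return r[:-1] + ratio * np.diff(r)

# R peaks at samples 0, 100, 220 -> borders at samples 60 and 172.
borders = cycle_boundaries([0, 100, 220])
```

The left and right border points produced this way are then used as the start of the P wave and the end of the T wave for the subwave evaluation.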
We also observe that the reconstruction of the QRS complex is better than that of the T wave, which in turn is better than that of the P wave. The mean Pearson coefficient of the QRS complex by LC2-XDJDL is 0.94, higher than the overall cycle reconstruction of 0.92, while that of the T wave is slightly lower than the overall performance at 0.89, and that of the P wave is 0.86.

In addition to the means of the Pearson coefficient and rRMSE, Fig. 2.5 shows boxplots of the statistics of the two metrics for the three subwaves of the ECG, so that we can see the overall distribution of the results. We observe that the medians of ρ and rRMSE for each of the three subwaves are very similar across the proposed models. Specifically, the medians of ρ for the P wave are 0.95, 0.96, and 0.96, respectively; those of the QRS complex are all 0.98; and those of the T wave are all 0.97. The medians of rRMSE for the P wave are 0.35, 0.33, and 0.32; those of the QRS complex are 0.25, 0.23, and 0.22; and those of the T wave are 0.27, 0.25, and 0.24, respectively. Analysis of these boxplots suggests that our proposed models preserve the relation between PPG and the QRS complex well. The overall reconstruction performance could be improved if the relations between PPG and the P and T waves were better learned.

2.4.5 Time Interval Recovery

In addition to the morphological reconstruction evaluation, we evaluate whether the time intervals are well preserved. The labeling of those intervals is shown in Fig. 2.4.

Reconstruction | — Mean (in seconds) —  | — MAE (in seconds) —
Scheme         | PR    | QRS   | QT     | PR    | QRS   | QT
XDJDL          | 0.164 | 0.115 | 0.331  | 0.030 | 0.012 | 0.030
LC1-XDJDL      | 0.166 | 0.116 | 0.331  | 0.026 | 0.011 | 0.027
LC2-XDJDL      | 0.167 | 0.115 | 0.331  | 0.025 | 0.010 | 0.025
Reference      | 0.172 | 0.113 | 0.328  | -     | -     | -

Table 2.7: Comparison of timing interval recovery accuracy in MAE.

From columns 2-4 in Table 2.7, we can compare the averages of the reconstructed intervals against the reference intervals.
For PR intervals, the difference between the reconstructed and reference values is approximately 4%; for QRS durations, the difference is within 3%; and for QT intervals, the difference is less than 1%. This suggests that, on average, the timing information of the intervals is preserved well. From columns 5-7 in Table 2.7, we also note that the MAEs of the PR interval are 0.030 s, 0.026 s, and 0.025 s using the XDJDL, LC1-XDJDL, and LC2-XDJDL models, respectively. The relatively large error in PR interval recovery is consistent with the P wave reconstruction performance shown in Chapter 2.4.4. Nevertheless, the MAE of the timing of the QRS complex is around 11 ms, just over a quarter of the smallest grid on conventional hard-copy ECG recorders (40 ms), and is negligible given the sampling rate (125 Hz) of the ECG signal in the MIMIC-III dataset. The MAE of the QT interval is around 27 ms, less than three-quarters of the smallest grid on ECG graph paper and around 8% of the mean QT interval (0.331 s).

2.5 Discussions

2.5.1 Result Using a PPG-based Segmentation Scheme

In Chapter 2.4, we evaluated our proposed models under the assumption that the cycle information from the ECG signals is available to separate the ECG/PPG time series into training and test cycles. In practice, however, we may not have ground-truth cycle segmentation from the ECG. Thus, we consider the realistic scenario of reconstructing the ECG from "estimated cycles" of PPG that are segmented by the PPG onsets instead of the R peaks of the ECG signals. The PPG onsets are used for segmentation rather than the PPG peaks because of their underlying physiological meaning, as mentioned in Chapter 2.3.1. For ease of notation, we denote:

• R2R: segmentation based on the R peaks of the ECG for both training and test data, which is used in Chapter 2.4;
• O2O-1: segmentation based on PPG onsets for both training and test data;
• O2O-2: segmentation based on the R peaks of the ECG for training data and on PPG onsets for test data.

Reconstruction Scheme | ρ mean | ρ med | ρ std | rRMSE mean | rRMSE med | rRMSE std
XDJDL (O2O-1)         | 0.70   | 0.84  | 0.32  | 0.66       | 0.57      | 0.39
XDJDL (O2O-2)         | 0.80   | 0.88  | 0.24  | 0.55       | 0.48      | 0.32
XDJDL (R2R)           | 0.88   | 0.96  | 0.23  | 0.39       | 0.29      | 0.31

Table 2.8: Quantitative comparison of different segmentation schemes.

Due to the discrepancy between the detected locations of the PPG onset and the ECG R peak within the same cycle, the "estimated PPG cycles" obtained with the O2O schemes vary slightly from the PPG cycles segmented by R2R. To single out the contribution to the ECG reconstruction error from the discrepancy in waveform shape, rather than from the misalignment of the ECG peaks, we evaluate the O2O schemes after compensating for the time offset between the reconstructed and original ECG signals. This is done by shifting each reconstructed ECG cycle in time so that the original and reconstructed ECG signals are matched according to their R peaks. The comparison is shown in Table 2.8. Compared to R2R, under the O2O-1 scheme the average Pearson coefficient drops from 0.88 to 0.70, and the average rRMSE rises from 0.39 to 0.66. Using the O2O-2 scheme improves the performance relative to O2O-1, with a mean Pearson coefficient of 0.80 and a mean rRMSE of 0.55.

2.5.2 Evaluation on the Capnobase TBME-RR Dataset

In this section, we experimented with the Capnobase TBME-RR database [77], which contains forty-two eight-minute sessions from 29 children and 13 adults during elective surgery and routine anesthesia. Each session corresponds to a unique participant and contains simultaneously recorded PPG and ECG signals. The signals are recorded at a sampling frequency of 300
Hz. The dataset covers a wide range of participant ages, from one year old to sixty-three years old, with a median age of fourteen. Thus, this dataset serves as a supplementary evaluation of the proposed method from the angle of age variety, in addition to the disease variety of the mini-MIMIC-33 dataset.

We first pruned the signals according to the artifact labels provided in the dataset and preprocessed the signals using the method in Chapter 2.3.1 to obtain aligned and normalized signal pairs. To be consistent in the evaluation, as in Chapter 2.4, we selected the first 80% of the data from each subject as the training set and the rest for testing.

Reconstruction Scheme | ρ mean | ρ med | ρ std | rRMSE mean | rRMSE med | rRMSE std
DCT [175]             | 0.902  | 0.919 | 0.066 | 0.427      | 0.413     | 0.128
CPDL [89]             | 0.956  | 0.968 | 0.049 | 0.282      | 0.247     | 0.150
ScSR [163]            | 0.967  | 0.976 | 0.039 | 0.286      | 0.247     | 0.165
SCDL [155]            | 0.971  | 0.978 | 0.038 | 0.191      | 0.166     | 0.101
CDL [161]             | 0.980  | 0.991 | 0.062 | 0.219      | 0.145     | 0.296
XDJDL (proposed)      | 0.979  | 0.990 | 0.048 | 0.146      | 0.105     | 0.122

Table 2.9: Quantitative performance comparison for ECG waveform inference using the Capnobase TBME-RR database.

Table 2.9 summarizes the performance comparison on the Capnobase TBME-RR dataset. Our proposed XDJDL method outperforms all the other methods in mean and median rRMSE by a large margin. Even though the CDL [161] method is 0.1% better than our proposed method in the mean and median correlation coefficient ρ, our method achieves a 26% smaller standard deviation of ρ than CDL, showing that our proposed method achieves good ECG reconstruction performance more consistently across all participants.

2.5.3 Feasibility Analysis of the Proposed Method for the Internet-of-Healthcare-Things (IoHT)

In this section, we analyze two important practicality issues in applying our proposed ECG reconstruction techniques to healthcare IoT devices. One issue is energy consumption.
The sensors used to capture physiological signals, e.g., PPG signals, are mostly wearable devices powered by batteries [58]. Thus, energy efficiency is necessary to ensure continuous signal acquisition, data transmission, and monitoring. The other issue is computational cost. As mentioned in [9, 58], applications that require lower latency need higher computational capabilities, so the computational load of the algorithms must be considered in real-world scenarios.

The first issue, energy consumption in wearable devices, can be addressed by existing mature technologies such as the Bluetooth Low Energy module commonly used for low-power wireless communication in wearable healthcare devices [160]. In the test phase of our proposed XDJDL and LC-XDJDL frameworks, the PPG signals acquired by the wearable devices can be transmitted to IoT devices, such as smartphones, at low power with the help of those modules. As for the second issue, computational cost, with the dictionary pairs constructed locally and stored in the cloud or on edge devices, the cost mainly comes from sparse coding and lightweight matrix multiplication. Since sparse coding via OMP, as used in our proposed methods, has been shown to run on IoT platforms in real time [6], we envision that our proposed frameworks can satisfy the practical requirements well.

To further evaluate quantitatively the feasibility of applying our proposed method to IoHT platforms, we examine the following metrics measuring the computational resources used to reconstruct one ECG cycle:

1. Computational time
2. Memory space
3. Energy consumption

The specifications of the laptop used for the experiment are as follows: Processor: i7-8650U; Architecture: Intel x86; CPU Frequency: 1.90 GHz; Cores: 4; RAM: 24 GB.
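A minimal sketch of how the per-cycle computational time might be measured is shown below. This is our own illustration with assumed names: `reconstruct_cycle` stands in for the XDJDL test-phase inference, and a least-squares solve is substituted for the OMP sparse-coding step so the harness is self-contained; it is not the actual implementation:

```python
import time
import numpy as np

def reconstruct_cycle(p_cycle, D_p, W, D_e):
    """Placeholder for the XDJDL test phase: code the PPG cycle under
    D_p (least squares here instead of OMP), map the code with W, and
    synthesize the ECG cycle with D_e."""
    s_p, *_ = np.linalg.lstsq(D_p, p_cycle, rcond=None)
    return D_e @ (W @ s_p)

rng = np.random.default_rng(1)
d, k_e, k_p = 300, 320, 900          # toy sizes (k_p shrunk for the demo)
D_p = rng.standard_normal((d, k_p))
D_e = rng.standard_normal((d, k_e))
W = rng.standard_normal((k_e, k_p))
p = rng.standard_normal(d)

# Average wall-clock time per reconstructed cycle over repeated runs.
n_rep = 20
t0 = time.perf_counter()
for _ in range(n_rep):
    r_e = reconstruct_cycle(p, D_p, W, D_e)
per_cycle_ms = (time.perf_counter() - t0) / n_rep * 1e3
```

Repeating the reconstruction and averaging, as done 100 times in our experiment, smooths out scheduler noise in the wall-clock measurement.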
Our test here is designed to resemble an online inference scenario in which new sequences of continuous ECG waveform must be inferred by the IoHT system from the input PPG waveform. The experiment is repeated 100 times to evaluate the memory space and the average computational time per cycle. Note that estimating the actual energy consumption can be complex, as it depends on the operating system, the temperature inside and outside the device, and the efficiency of the power supply. We therefore use FLOP (floating-point operations) as the measure for energy consumption, since it is independent of the hardware configuration for a given algorithm. With FLOP, the energy in joules can be estimated as proportional to the FLOP count, given the FLOPS (FLOP per second) per watt, i.e., FLOPS/W, specified for the IoT device.

We list the computational resources consumed by the proposed XDJDL method in Table 2.10. The average computational time to reconstruct each ECG cycle is 15.7 ms, one to two orders of magnitude shorter than a heart cycle (around 0.5 s to 1 s per beat at rest), suggesting that the processing can be done in real time. In addition, the 31.4 MB of memory and 60.2 MFLOP required by the proposed XDJDL method are well within the capability of such common IoT platforms as the Raspberry Pi 3B (RAM: 1 GB, 0.73 GFLOPS/W) [5], even for a research prototype that has not been optimized for deployment. Considerable reductions in computing resources are possible with an industry-grade implementation.

Reconstruction Scheme | Computational Time (ms) | Memory Space (MB) | FLOP Consumption
XDJDL (proposed)      | 15.7 ± 0.9              | 31.4 ± 1.3        | 60.2 M

Table 2.10: Computational resources consumed to reconstruct test ECG cycles using the proposed XDJDL method.

2.5.4 Limitations of the Proposed Method

2.5.4.1 Performance of the Leave-One-Out Experiment

As a proof of concept, and considering the currently moderate amount of available data, we have so far split each patient's data into training and test sets.
This corresponds to the trend of "precision medicine" to tailor healthcare practice to individual patients. Meanwhile, we are curious how the algorithm would behave if the test patient is never seen in the training phase, corresponding to the situation of training models for the whole population or for patient groups categorized by gender, age, race, or other attributes. We examine this through leave-one-out experiments.

We apply a pre-clustering process based on the ECG data to select a subgroup of patients with similar ECG features for the leave-one-out experiment. First, we reduce the dimension of the ECG cycles by principal component analysis (PCA), and then we use K-means to cluster the ECG features after PCA. Based on the clustered ECG features, we select the largest cluster of ECGs, which comes from 19 patients. The mean Pearson coefficient for the leave-one-out experiment on the 19 patients is 0.74 (std: 0.15, median: 0.77).

From the result, we can see that, as expected, the leave-one-out experiment is a more challenging case given the large variability of ECG morphologies among ICU patients and the limited number of patients in the collected dataset. Based on the results in Chapter 2.4.3, we see the encouraging capability of recovering large variations in ECG from relatively small variations in PPG across cycles and patient populations. This suggests a strong potential for predicting ECG from PPG of unseen patients through further research and larger data collection. In our follow-up work, we are considering an improved problem definition and data collection procedure to enhance the generalization capability of learning.

2.5.4.2 Performance Evaluation on A Motion Dataset

So far, we have demonstrated the feasibility and improved accuracy of ECG waveform inference from PPG using the proposed methods on two benchmark datasets [74, 77] in Chapter 2.4.3 and Chapter 2.5.2.
Those datasets were collected under a resting condition with relatively small movement artifacts. Noises and artifacts were still present in those datasets but in a controlled manner, which leads to good quality of data acquisition and is beneficial for the feasibility study and for improving the accuracy of reconstructing ECG from PPG. In this section, we consider a more challenging scenario where IoHT devices are worn during exercise, and we show preliminary results with the motion-contaminated signals.

Dataset Description: We adopt the 2015 IEEE Signal Processing Cup dataset [169] for evaluation, which consists of paired PPG and ECG signals from 13 participants during physical exercises. This dataset, provided by the Samsung Research Lab in the U.S., aimed to facilitate the study of accurate heart rate (HR) monitoring from the PPG signals of wrist-type sensors and included ECG signals as a reference. The PPG signals were collected from the wrist while the subjects ran on a treadmill at speeds of 6 km/h, 8 km/h, 12 km/h, or 15 km/h, respectively. Simultaneously, the ECG signals were collected from the chest, and the acceleration signal was recorded from the wrist by a three-axis accelerometer. All signals were sampled at 125 Hz. Each subject ran once, and the total length of the recording was 5 minutes per subject.

Dataset Preprocessing With HR Reference and Adaptive Filtering: Since the quality of PPG signals is crucial to ECG reconstruction, we first use the absolute error of the PPG-estimated HR as a metric to exclude the participants with extremely corrupted PPG signals, because HR represents the frequency characteristic of PPG that affects the accuracy of determining a PPG cycle. The HR is estimated from the PPG by a state-of-the-art adaptive multi-trace carving (AMTC) [173] algorithm that tracks the HR from the spectrogram of PPG by dynamic programming and adaptive trace compensation. The reference HR values are given in the dataset.
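As a rough, self-contained illustration of this screening step, the sketch below picks a per-frame spectral peak within a plausible heart-rate band rather than performing AMTC's dynamic-programming trace tracking; the function names, parameters, and the error threshold are illustrative assumptions, not those of [173].

```python
import numpy as np
from scipy.signal import spectrogram

def hr_from_ppg(ppg, fs=125, lo=0.5, hi=3.0):
    """Coarse HR track (bpm): per-frame spectral peak within a plausible
    heart-rate band. A simplified stand-in for AMTC trace tracking."""
    f, _, S = spectrogram(ppg, fs=fs, nperseg=2048, noverlap=1024)
    band = (f >= lo) & (f <= hi)
    # Per time frame, take the strongest in-band frequency and convert to bpm.
    return 60.0 * f[band][np.argmax(S[band], axis=0)]

def keep_recording(ppg, hr_ref_bpm, fs=125, tol_bpm=10.0):
    """Exclude a recording when the mean absolute HR error against the
    reference exceeds a tolerance (threshold value is illustrative)."""
    return np.abs(hr_from_ppg(ppg, fs) - hr_ref_bpm).mean() <= tol_bpm
```

A recording whose PPG-estimated HR track deviates grossly from the dataset's reference HR is flagged for exclusion, mirroring the participant-screening criterion described above.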
Three out of the thirteen participants are excluded as their HR estimation errors are substantially off, likely due to data collection issues, and the remaining ten participants' data are used for learning and testing the XDJDL model.

In addition, to improve the quality of the noise-contaminated PPG, we conducted recursive least squares (RLS) adaptive filtering [78]. We view the contaminated PPG as the sum of the underlying clean PPG and motion-induced noise. Suppose the motion-artifact-corrupted PPG signal at time n ∈ [1, N] is p(n) = d(n) + m(n), where d(n) is the underlying clean PPG and m(n) is the noise introduced by motion, which is unknown and can be modeled and estimated with the acquired accelerometry signals a = [a_x; a_y; a_z]. In this way, the estimated m̂ may be subtracted from p to form an estimate of d for PPG motion-artifact compensation. The block diagram for the RLS adaptive filter structure is shown in Fig. 2.6, and the RLS algorithm is described in Algorithm 2, where the objective function is ξ = Σ_{n=0}^{N} λ^{N−n} |e(n)|², in which e(n) is the a priori estimation error of p by the RLS adaptive filter and λ is the forgetting factor that is set to 1, i.e., assuming infinite memory. Since there are three channels of accelerometer data, we adopt a series processing method in which the raw PPG is denoised with a_x first, then a_y is used to denoise the PPG after a_x, and lastly, a_z is input into the RLS to further denoise the PPG signal after a_x and a_y.

Figure 2.6: Block diagram for the RLS algorithm.

Algorithm 2 RLS algorithm [63]
Variables: ŵ_M is the M-tap weight vector; P is the inverse of the correlation matrix; k is the gain vector; a_M is the input accelerometer data in an M-length window; p is the raw PPG signal to be filtered.
Initialize: ŵ_M(0) = 0; P(0) = δ^{−1} I
for n = 1, 2, ... do
    k(n) = P(n−1) a_M(n) / (λ + a_M^H(n) P(n−1) a_M(n))
    e(n) = p(n) − ŵ_M^H(n−1) a_M(n)
    ŵ_M(n) = ŵ_M(n−1) + k(n) e*(n)
    P(n) = λ^{−1} P(n−1) − λ^{−1} k(n) a_M^H(n) P(n−1)
end for
return ŵ_M, e

In the next part, we will compare the ECG reconstruction performance from PPG signals before and after denoising.

Experimental Performance: Fig. 2.7(a) compares the statistics of the Pearson coefficient and rRMSE in boxplots for ECG reconstructed from the PPG signal without denoising (referred to as "raw PPG") and from the RLS-filtered PPG signal (referred to as "cleaned PPG"). The average Pearson coefficient of the reconstructed ECG using raw PPG is 0.49 (median: 0.69, std: 0.51), and using cleaned PPG it is improved to 0.61 (median: 0.72, std: 0.37). This improvement can be attributed to the removal of spurious peaks and waves in the motion-contaminated PPG by the RLS filtering. While the noise due to motion is mitigated, distortions in the PPG and even the ECG waveforms are still present, as shown in Fig. 2.7(b). Treating potentially corrupted ECG as the reference and distorted PPG as the input might misguide the learning system and produce unreliable waveform reconstruction.

Figure 2.7: (a) Statistical distribution of Pearson coefficient (ρ) and rRMSE for reconstructed ECG from PPG signals before denoising ("raw PPG") and PPG signals after denoising ("cleaned PPG"). (b) Qualitative comparison of raw PPG, cleaned PPG, and the ECG signals inferred from them.

Fig. 2.7(b) shows an example with close to average performance.
We observe that, on the one hand, the cleaned PPG has clearer cycle shapes than the raw PPG; on the other hand, some of the physiological characteristics representing the blood-flow process are irregular after RLS motion-artifact removal, such as the peak in the third cycle and the ascending and descending slopes in the fifth cycle. Also, the reference ECG signals contain varying ST-segment elevations over consecutive cycles during motion. We expect such limitations can be addressed with the development of more advanced PPG and ECG denoising and waveform-preserving approaches for preprocessing, and with the availability of a larger dataset under different types of activities (such as walking, running, driving, climbing stairs, etc.).

2.5.5 Future Work Towards Explainable AI

Our proposed XDJDL and LC-XDJDL models accomplish ECG inference from PPG by leveraging the biomedical and statistical relationship between the signals. This is an initial effort to demonstrate a potential benefit of our "explainable" AI, rather than black-box data-driven AI: providing medical professionals with ECG data inferred from user-friendly PPG measurements to interpret and offer medical insights. Our framework also helps transfer the rich ECG knowledge base from decades of medical practice to augment PPG-based diagnosis for public health.

Given the challenge of making the ECG inference more accurate for an unseen group of subjects, e.g., grouped by age, gender, or other medical and health conditions, we are extending our current work with a neural network to further enrich the representation and learn the relation when sufficient data is available. Our ongoing efforts have been focused on both developing a data collection pipeline for more diversity and coverage of training data and exploring an explainable generative model with strong expressive power to improve the generalization performance.
By capturing the complex models step by step, utilizing their biomedical, statistical, and physical meanings, as well as harnessing the power of the data, we aim to provide explainable AI through our ongoing efforts.

2.6 Chapter Summary

We have proposed a cross-domain joint dictionary learning (XDJDL) framework and the extended label-consistent XDJDL (LC-XDJDL) model for ECG waveform inference from the PPG signal. Compared to the prior art using the DCT method, our proposed method better leverages the data to improve data representation while extending over a model-based approach. The promising experimental results validate that our proposed models can learn the relation between PPG and ECG and reconstruct ECG well. From the analysis of subwave reconstruction and timing-interval recovery, we observe that we can restore the QRS complex and the QT interval with high precision, which is essential for ECG monitoring and for gaining more PPG-based diagnostic knowledge. This work reveals the potential of long-term and user-friendly ECG screening from the PPG signals that we can acquire from the daily use of low-cost, low-power wearable devices for IoT and digital twin applications in healthcare.

Part of the research in this chapter was published in [143] and submitted for journal publication [144].

Chapter 3

Never-Miss-A-Beat: A Physiological Digital Twins Framework for Cardiovascular Health

3.1 Digital Twins Relating PPG and ECG Sensing: Motivation and Problem Formulation

Under the umbrella of physiological digital twins as described in Chapter 1.1.3, the contribution of this chapter focuses on a particular application of a digital twin in healthcare: monitoring a person's cardiac activity. Cardiovascular disease (CVD) is the leading cause of mortality worldwide, accounting for 18.6 million deaths in 2019 [121], and clinical data suggest that the susceptibility to and outcomes of COVID-19 are strongly associated with CVD [105].
Thus, the ability to consistently and accurately monitor cardiac activity is extremely important. Two commonly used cardiac sensing modalities that we are already familiar with are the electrocardiogram (ECG) [49] and the photoplethysmogram (PPG) [7]. ECG and PPG each have strengths and limitations in clinical practice: most notably, the clinical gold standard of ECG is monitored sporadically (commonly for 30-second intervals and, even with specialty devices, rarely over two weeks) and requires a user's attention and cooperation, as summarized in Table 2.1, while PPG can be monitored continuously but has a significantly smaller clinical knowledge base than ECG and tends to be noisy (although denoising is possible [175]). The ability to leverage the advantages of both technologies could have major impacts on the healthcare system, leading to easier everyday health monitoring.

ECG and PPG represent different but closely related physiological quantities, yet how they are related is not well understood quantitatively. In this work, we have made some early-stage efforts toward understanding this relationship as it depends on age groups and cardiovascular conditions, as well as its individualized nature. Thus, developing an explainable cardio-physiological digital twin model provides an excellent opportunity for monitoring a person's cardiac activities.

Figure 3.1: Our goal in this work is to build a personalized digital twin model for a targeted patient with his/her limited sporadic paired PPG and ECG cycles, such that his/her ECG can be faithfully and continuously inferred from the continuous PPG measured by daily wearable devices.
More specifically, we pose the following question: is it possible to leverage continuous PPG monitoring, build a digital twin model that establishes a patient's personalized PPG-ECG relationship during sporadic ECG sensing sessions, and then use the digital twin to infer continuous ECG waveforms? Through smart interpolation or extrapolation enabled by the digital twin, we can support continuous ECG monitoring, never missing a beat, as illustrated in Fig. 3.1. This can be particularly valuable for helping patients and physicians capture details of cardiac events that are not commonly exhibited during a patient's clinical visits. By monitoring the digital twin that represents the real-time heart electrical activities and blood circulation in cyberspace, a centralized server or cardiologists can identify sudden cardiac risks so that high-risk populations can receive early medical intervention and even prevent premature mortality.

3.2 Related Background

Work | TBME-RR [77] | MIMIC-III [74] | BIDMC [113] | SPCup 2015 [169]
Zhu et al. [175] | 8-min from each of the 42 patients; 80%:20% train/test split from each patient | Three 5-min sessions from each of the 103 patients; 2:1 train/test split from each patient | n/a | n/a
Tian et al. [144] | Same as Zhu et al. [175] | Three 5-min sessions from each of the 33 patients; 80%:20% train/test split from each patient | n/a | 5-min from each of the 13 participants; 80%:20% train/test split from each patient
Li et al. [92] | n/a | Same as Tian et al. [144] | 8-min from each of the 53 patients; 80%:20% train/test split from each patient | Same as Tian et al. [144]
Vo et al. [154] | n/a | 8-min from each of the 276 patients; 80%:20% train/test split from each patient | n/a | n/a
Chiu et al. [31] | n/a | n/a | Not mentioned | n/a

Table 3.1: A research review of the datasets and split methods used by the emerging technologies for ECG waveform inference from continuous PPG.
In recent years, researchers have begun to bridge the ECG-PPG knowledge gap by modeling the relationship between these two signals [92, 143, 144, 154, 174, 175], including the work presented in Chapter 2. Among those works, Vo et al. [154] used randomly selected 8-minute-long PPG-ECG signals from 276 patients in the MIMIC-II database [52] to analyze their models. Zhu et al., 2019 [174]; Zhu et al., 2021 [175]; Tian et al., 2020 [143]; Tian et al., 2021 [144]; and Li et al. [92] evaluated their models on PPG-ECG signal pairs taken from the MIMIC-III database. In all of these works, results are provided in which 80% of the pairs were used for training and validation, while the remaining 20% of the data were used for model testing. Table 3.1 provides an overview of the emerging technologies for ECG waveform inference from continuous PPG. Because this split is carried out over all the data, these results are discussed in an average/generic sense that is out of context with precision healthcare. On the other hand, Zhu et al., 2021 [175] also trained subject-specific models in which the data from a particular subject are used to train a personalized model for analyzing the ECG reconstruction performance from PPG. Even though the subject-specific results are more relevant for precision healthcare, they also used the 80%-20% training-testing splits to train their model. In real-world application scenarios, subject-specific data may be much scarcer, which may cause these models to break down.

3.3 Methodology

3.3.1 Backbone Model for ECG Inference from PPG

Among the prior arts [92, 143, 144, 154, 174, 175] dedicated to the PPG-based ECG inference problem summarized in Chapter 3.2, the pilot study [174] first proved the feasibility of inferring ECG waveforms from PPG sensors by relating the two signals in the discrete cosine transform (DCT) domain using linear regression.
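As a hedged sketch of that DCT-domain idea (not the exact pipeline of [174]): paired, length-normalized cycles are mapped to truncated DCT coefficients, and a least-squares linear map is learned between the two coefficient spaces. The function names and the truncation length `n_coef` are illustrative assumptions.

```python
import numpy as np
from scipy.fft import dct, idct

def fit_dct_map(ppg_cycles, ecg_cycles, n_coef=30):
    """Learn a least-squares linear map from PPG DCT coefficients to ECG
    DCT coefficients. Inputs: (N, d) arrays of paired cycles."""
    Ap = dct(ppg_cycles, norm='ortho')[:, :n_coef]  # truncated DCT features
    Ae = dct(ecg_cycles, norm='ortho')[:, :n_coef]
    W, *_ = np.linalg.lstsq(Ap, Ae, rcond=None)     # linear regression
    return W

def infer_ecg_dct(ppg_cycle, W, n_coef=30):
    """Infer one ECG cycle: truncated DCT -> linear map -> inverse DCT."""
    a_e = np.zeros(len(ppg_cycle))
    a_e[:n_coef] = dct(ppg_cycle, norm='ortho')[:n_coef] @ W
    return idct(a_e, norm='ortho')
```

The truncation is what gives the method its efficiency: each cycle is summarized by a handful of DCT coefficients, and inference reduces to a single matrix-vector product.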
Despite its computational efficiency, the DCT method [174, 175] lacks enough data representation power to faithfully reproduce ECG from PPG signals when the morphology of the ECG waves becomes complex due to cardiovascular complications. Neural networks, with strong expressive power and high structural flexibility, have also been adopted to solve this problem [92, 154]. However, the computational cost of deep neural networks hinders their widespread deployment in practical applications. Also, large black-box neural network models are difficult for cardiologists to interpret, making them less receptive to the results. To strike a balance between the accuracy of ECG inference and computational resources in real-world scenarios, we first start with the dictionary-learning-based framework XDJDL proposed in Chapter 2 as a backbone model for PPG-to-ECG inference, which provides a proper solution: compared to the DCT method, it improves the data representation with versatile and adaptive models, and it can perform efficiently in terms of power consumption and computational cost [6, 93]. The neural network based backbone models will be proposed and evaluated in Chapter 3.6 and Chapter 3.7.

Here is a summary and recapitulation of the key points of the XDJDL model that we adopt as the backbone model. Two dictionaries, D_p ∈ R^{d×k_p} and D_e ∈ R^{d×k_e}, are learned jointly to estimate sparse representations (A_p ∈ R^{k_p×N} and A_e ∈ R^{k_e×N}) for the PPG and ECG datasets X_p ∈ R^{d×N} and X_e ∈ R^{d×N}, respectively. Each column of X_p and X_e is denoted as p_i ∈ R^{d×1} and e_i ∈ R^{d×1}, representing one PPG/ECG signal pair from the same cardiac cycle. Simultaneously, a linear transformation W is learned to map the sparse codes from the PPG to the ECG. The problem of solving for D_p, D_e, and W is formalized in Eq. (3.1):

    min_{D_e, A_e, D_p, A_p, W}  ‖X_e − D_e A_e‖²_F + α‖X_p − D_p A_p‖²_F + β‖A_e − W A_p‖²_F        (3.1)
    s.t.  ‖a_{p,j}‖₀ ≤ t_p,  ‖a_{e,j}‖₀ ≤ t_e.

The first two terms in Eq.
(3.1), coupled with the constraints on the upper limits for sparsity, are used to learn the dictionary pair and the sparse PPG and ECG representations iteratively by the two-step optimization strategy explained in Chapter 2.3.2, which is composed of sparse coding and dictionary update, while the third term in the equation facilitates learning the mapping between the two sparse domains simultaneously. In this way, the representation error in the first two terms and the mapping error in the third term are minimized. Fig. 3.2 summarizes the learning procedure of XDJDL.

Figure 3.2: The XDJDL model proposed in Chapter 2 is adopted here as the backbone model for ECG inference from PPG. The dictionary pair D_p, D_e and the linear mapping W are learned during the training phase and are later applied to infer ECG from PPG in the validation or testing phase.

3.3.2 Transfer Learning for Building Precision Healthcare Digital Twins

Consider a group of people with abundant paired PPG and ECG signals from their in-hospital stays or annual physical examinations; we denote their corresponding PPG and ECG datasets as X_p ∈ R^{d×N} and X_e ∈ R^{d×N}, respectively. Each column of X_p and X_e represents one PPG/ECG signal pair from the same cardiac cycle. Given X_p and X_e, we can learn a generic digital twin model to simulate the PPG-to-ECG mapping. The XDJDL backbone model we adopt from Chapter 2 has shown that this group-based model (referred to as the generic digital twin model in this chapter) can be applied to predict future ECG waveforms well from the PPG waveforms of people in the same group.
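The testing phase of Fig. 3.2 (analysis, transform, synthesis) amounts to three steps per cardiac cycle. Below is a minimal numpy sketch, with a basic orthogonal matching pursuit (OMP) solver standing in for the sparse-coding step; the function names are illustrative, and the dictionaries and mapping are assumed to be already learned.

```python
import numpy as np

def omp(D, x, t):
    """Orthogonal matching pursuit: sparse-code x over dictionary D with at
    most t nonzero coefficients."""
    residual, support = x.astype(float).copy(), []
    coefs = np.zeros(0)
    for _ in range(t):
        # Greedily pick the atom most correlated with the current residual.
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        coefs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coefs
    alpha = np.zeros(D.shape[1])
    alpha[support] = coefs
    return alpha

def infer_ecg(Dp, De, W, p, t_p=8):
    """One PPG cycle p -> inferred ECG cycle (Fig. 3.2, testing phase)."""
    alpha_p = omp(Dp, p, t_p)   # analysis: sparse code of the PPG cycle
    alpha_e = W @ alpha_p       # transform: map PPG codes to ECG codes
    return De @ alpha_e         # synthesis: reconstruct the ECG cycle
```

Because inference is a sparse code plus two matrix products, it matches the lightweight per-cycle cost reported for XDJDL in Chapter 2.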
In this chapter, we consider a more practical scenario in which we would like to perform continuous ECG monitoring for a new target participant who only provides sporadic short (mostly 30-second) PPG/ECG paired segments acquired from his/her wearable devices, such as the Apple Watch [138], AliveCor [76], Zio patch [39], and Empatica E4 watch [43]. We denote the corresponding PPG and ECG datasets as T_p ∈ R^{d×M} and T_e ∈ R^{d×M} (M ≪ N), respectively. Our aim in this work is to propose a method that fully utilizes the sporadic data of the new participant, so that a precision healthcare digital twin model can be learned for the specific participant to infer and monitor his/her ECG from PPG wearable devices.

Figure 3.3: Flowcharts for (a) the proposed transfer learning and (b) baseline comparisons, including the mixed learning and leave-N-out training scenarios. For the proposed transfer learning mode, a generic digital twin model is initially trained and validated using data from training participants, yielding the paired dictionaries D_p, D_e and a linear transform W. This model is then refined into a personalized digital twin (D′_p, D′_e, and W′) with the sporadic ECG and PPG pairs of the target patient.
In the baseline comparisons, the generic digital twin is learned solely from the data of the training participants in the leave-N-out mode, while additional sporadic pairs from the target patient are used for training and validation in the mixed learning mode.

To address the challenge of data scarcity from the target participant, we propose to transfer the knowledge inherited in the generic digital twin model learned from the training participants with abundant PPG/ECG recordings from in-hospital stays or annual examinations, so that the generic digital twin can be refined and tailored to the new participant. In this study, we learn the healthcare digital twin model in the following training modes:

1. Transfer Learning Mode (Proposed): As visualized in Fig. 3.3(a), during the transfer learning phase, the generic digital twin model learned from the training participants serves as the initialization. This is followed by continued training of the model on sporadic PPG/ECG pairs from the target participant, which updates the generic model variables D_p, D_e, and W to D′_p, D′_e, and W′. These updates result in the proposed personalized digital twin model tailored to the target participant for precision healthcare.

2. Mixed Learning Mode (Baseline Comparison 1): As illustrated in Fig. 3.3(b), on top of using the long PPG/ECG paired recordings from the training participants, sporadic PPG/ECG pairs from the target patient are also included to learn the generic digital twin model D_p, D_e, and W. Compared to the transfer learning mode, the mixed learning mode requires model training from scratch with mixed data from the training and target participants, which can be time-consuming and unrealistic if the training data are not accessible. In transfer learning mode, by contrast, the generic digital twin model is used in a plug-and-play form that does not require the data from the training participants to retrain the model.

3. Leave-N-Out Mode (Baseline Comparison 2): As displayed in Fig.
3.3(b), in this mode, we apply the generic digital twin model learned solely from the training participants to the new target patient. This mode provides the baseline performance without making use of the sporadic data from the target patients and reveals the adaptation capability of the generic digital twin model to unseen participants.

Figure 3.4: The two testing modes that we examine for the learned digital twin model. (a) Interpolation mode, where we can "rewind" the ECG during its detachment between two sporadic time stamps that contain known paired PPG and ECG signals. (b) Extrapolation mode, where we can check the past ECG or predict the future ECG.

3.3.3 Testing Modes for ECG Inference

To detect symptoms of underlying heart conditions (like an elevated heartbeat [4]) early for proper intervention, continuous long-term ECG monitoring is critical to pick up subtle deviations from a person's normal ECG patterns. Discontinuous ECG signals may not fully capture critical deviating behavior, which can lead to a wrong evaluation and deteriorate the effectiveness of treatments [19]. For this reason, once the digital twin model is learned, we present two testing modes in our analysis for addressing the issue of discontinuous ECG monitoring in realistic situations: interpolation and extrapolation.

1. Interpolation Mode (Illustrated in Fig. 3.4(a)): Suppose we have two short pairs of PPG/ECG signals with some time interval in between from the target participant, and we aim to "interpolate" the ECG waveforms from the continuous PPG signal acquired between the two sporadic time stamps. This interpolation mode corresponds to realistic situations where the participant wishes to detach the ECG nodes from his/her body for some time. The ECG information before detachment and after reattachment can be used to "rewind time" to reconstruct the signal that was lost during the detached period.

2. Extrapolation Mode (Illustrated in Fig.
3.4(b)): Suppose we have two sporadic short pairs of PPG/ECG signals from the target participant, and we aim to "extrapolate" the ECG waveforms from the continuously acquired PPG signal before and after the two sporadic time stamps. This extrapolation mode corresponds to realistic situations where a medical practitioner wants to know what the ECG signal looked like in the past or to predict what will happen in the future. In the former case, physiological abnormalities that otherwise would have been missed may be detected. In the latter case, preventative measures can be taken should the predicted future signal display physiological abnormalities that could lead to health concerns.

3.4 Experimental Results Using XDJDL as The Backbone For The Personalized Digital Twin Model

3.4.1 Dataset

Medical Information Mart for Intensive Care III (MIMIC-III) [74] is a large database comprised of health information related to patients admitted to the intensive care unit at the Beth Israel Deaconess Medical Center in Boston, Massachusetts. Timestamped bedside vital sign measurements are provided for each of the 53,423 patient hospital admissions.

The analysis in this study is performed on a subset of data from the MIMIC-III database that was collected using the following methodology. Patients who had paired lead II ECG and PPG signals in the record were selected from the waveform database and were linked to their patient profiles (sex, disease, etc.) according to the subject IDs. Of these signals, only those of high quality that belonged to patients with specific cardiovascular/non-cardiovascular diseases were retained for analysis. Cardiovascular diseases were chosen from the list of "diseases of the circulatory system"
based on the ICD-9 codes of the patients, and the following cardiovascular diseases are included in the collected dataset: atrial fibrillation, myocardial infarction, cardiac arrest, congestive heart failure, hypotension, hypertension, and coronary artery disease. For non-cardiovascular diseases, we selected sepsis, pneumonia, gastrointestinal bleed, diabetic ketoacidosis, and altered mental status under other categories of ICD-9 codes. The result was a set of 127 subjects, as displayed in Fig. 3.5 with the age distributions of the cardiovascular-disease and non-cardiovascular-disease subjects. Each subject has three 5-minute sessions of paired ECG and PPG recordings, which were collected within a few hours of each other. To differentiate this dataset from the mini-MIMIC-33 dataset evaluated in Chapter 2, we denote it as the mini-MIMIC-127 dataset.

Figure 3.5: Distribution of the 127 patients collected from the MIMIC-III database in different age groups and disease types (mini-MIMIC-127 dataset). Within each age group, the patients with cardiovascular-related diseases are marked in blue on the left, and the patients with non-cardiovascular-related diseases are marked in orange on the right.

3.4.2 Hyperparameters Selection

In the XDJDL framework described in Eq. (3.1), the dictionary sizes for the PPG and ECG signals are important hyperparameters to be chosen for good data representation. In this section, we explain how we select the best sizes of the PPG and ECG dictionaries by examining their impact on the performance of the ECG reconstruction in terms of the Pearson coefficient.

Figure 3.6: The validation performance in terms of Pearson correlation coefficient with respect to different combinations of PPG and ECG dictionary sizes.

Different combinations of dictionary sizes are used to train the XDJDL models with training data from the first two sessions of each training participant.
The trained models are later evaluated on the validation set built from the third session of each training participant to select the proper size of the PPG and ECG dictionary pair. From Fig. 3.6, we observe that, given the same ECG dictionary size, the Pearson coefficient on the validation set improves and becomes saturated as the PPG dictionary size grows toward 10000; the trend of convergence suggests potential model overfitting beyond that point. Another observation is that, given the same PPG dictionary size, the performance remains almost unchanged or deteriorates as the ECG dictionary size increases. Hence, using fewer atoms for the ECG dictionary is a good choice.

The experimental results, indicating that the number of atoms in the PPG dictionary needs to be much greater than the number of atoms in the ECG dictionary, suggest that there are more detailed differences among PPG signals than among ECG signals in the collected dataset. This phenomenon, that PPG needs far more atoms than ECG, can be counterintuitive at first glance, because frequency analysis may suggest that ECG has more high-frequency components and thus needs more atoms to be represented. But if we view ECG as the source and PPG as the downstream signal, then, according to information theory, the entropy of the system will increase as the information flows from the heart to the peripheral vasculature after the processing of all the blood vessels along the way. As a result, PPG contains more subtlety and nuance and needs more atoms to represent. One example that reinforces this assumption is that, during severe hemorrhage (blood volume loss) caused by trauma injury, ECG contains few useful features for early detection of hemorrhage until irreversible harm or cardiovascular collapse occurs, whereas PPG senses this extreme medical situation sooner and is often used as an important biomarker to detect blood loss in its early stage [30, 117, 118].
3.4.3 Performance of ECG Inference

We split the overall dataset of 127 patients into four groups according to their health-related physical attributes (age and disease type): 16 cardiac young patients (age less than 60 with cardiac diseases), 54 cardiac old patients (age 60 or above with cardiac diseases), 24 noncardiac young patients (age less than 60 with noncardiac diseases), and 33 noncardiac old patients (age 60 or above with noncardiac diseases). In this way, the generic digital twin model corresponding to each attribute group can be learned separately and applied to target patients with the same attributes. For each group, three patients are randomly selected as the target participants, and the data from the rest of the patients in the group are used for training and validation. The training set is composed of the first two sessions from each patient, and the validation set consists of the last session from the same patients. Thus, the current data split is training:validation:testing = 6:3:1 on average. This corresponds to the realistic setting of building a precision healthcare digital twin, with the generic digital twin learned from a large portion of patients and applied to a few target patients. For the interpolation test mode, the first 45 cycles (approximately a 30-second segment) from the first session and the last 45 cycles from the last session of the target participants are regarded as the known sporadic pairs for either transfer learning or mixed learning. The second session of the target participants is used to evaluate the interpolation performance. For the extrapolation test mode, the first and last 45 cycles from the second session of the target participants are regarded as the known sporadic pairs for either transfer learning or mixed learning. The first and last sessions are used to evaluate the extrapolation performance.
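The interpolation/extrapolation protocol above can be sketched as a small helper that, given a target participant's three sessions of paired cycles, returns the known sporadic pairs for fine-tuning and the held-out test set (function and parameter names are ours):

```python
def sporadic_split(sessions, n=45, mode="interpolation"):
    """Split a target participant's three sessions (each a list of
    paired PPG/ECG cycles) into known sporadic pairs and a test set.

    mode == "interpolation": the first n cycles of session 1 and the
    last n cycles of session 3 are the known pairs; session 2 is tested.
    mode == "extrapolation": the first and last n cycles of session 2
    are the known pairs; sessions 1 and 3 are tested.
    n = 45 cycles corresponds to roughly a 30-second segment.
    """
    s1, s2, s3 = sessions
    if mode == "interpolation":
        known = s1[:n] + s3[-n:]
        test = s2
    else:
        known = s2[:n] + s2[-n:]
        test = s1 + s3
    return known, test
```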
It is worth noting that each participant only has three 5-min sessions collected at most within a few hours of each other, meaning the interpolation and extrapolation results in this work cover a relatively short period. For longer time windows, such as daily or weekly, a preliminary performance evaluation is shown in Chapter 3.5.2. We use the Pearson correlation coefficient (ρ) and the relative root-mean-squared error (rRMSE) to evaluate the morphological fidelity of the inferred ECG ê:

\rho = \frac{(e - \mu[e])^{\mathsf{T}} (\hat{e} - \mu[\hat{e}])}{\|e - \mu[e]\|_2 \, \|\hat{e} - \mu[\hat{e}]\|_2}, \qquad (3.2)

\mathrm{rRMSE} = \frac{\|e - \hat{e}\|_2}{\|e\|_2}, \qquad (3.3)

where e denotes the reference ECG cycle and μ[·] represents the element-wise average of a vector.

Figure 3.7: Statistical distribution of (a) Pearson correlation coefficient (ρ) and (b) rRMSE for the inferred ECG signals in both interpolation and extrapolation testing modes using different training modes (leave-N-out, mixed learning, and transfer learning).

Fig. 3.7 depicts the overall distribution comparison of the reconstruction performance, summarized in boxplots, using XDJDL as the PPG-to-ECG inference model. Each boxplot is composed of the results from all groups. We observe that the medians and spreads of ρ and rRMSE improve from the leave-N-out mode to the mixed learning mode, and the transfer learning mode achieves the best result in both interpolation and extrapolation testing scenarios. Specifically, the medians of ρ in the interpolation testing mode are 0.70, 0.86, and 0.94 across the three training modes, respectively, while those in the extrapolation testing mode are 0.72, 0.87, and 0.95, respectively.
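The two fidelity metrics of Eqs. (3.2) and (3.3) are straightforward to compute; a minimal NumPy sketch (function names are ours):

```python
import numpy as np

def pearson_rho(e, e_hat):
    """Pearson correlation coefficient between a reference ECG cycle e
    and an inferred cycle e_hat, per Eq. (3.2)."""
    ec = e - e.mean()
    ehc = e_hat - e_hat.mean()
    return float(ec @ ehc / (np.linalg.norm(ec) * np.linalg.norm(ehc)))

def rrmse(e, e_hat):
    """Relative RMSE per Eq. (3.3): ||e - e_hat||_2 / ||e||_2."""
    return float(np.linalg.norm(e - e_hat) / np.linalg.norm(e))
```

Note that ρ is invariant to scale and offset of the inferred cycle, while rRMSE penalizes amplitude errors, which is why the two metrics are reported together.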
The median rRMSE values in the interpolation testing mode are 0.77, 0.54, and 0.36 across the three training modes, respectively, while those in the extrapolation testing mode are 0.74, 0.53, and 0.31, respectively. Analysis of these boxplots suggests that the transfer learning mode can both interpolate and extrapolate ECG signals for the target participants from their sporadic PPG/ECG pairs with high fidelity, indicating the effectiveness of our proposed method in learning the precision healthcare digital twin.

Table 3.2: Experimental results from each group and the overall result from all groups for the inferred ECG in terms of the mean and the standard deviation (in parentheses) of Pearson coefficient (ρ) and rRMSE.

                                       Interpolation              Extrapolation
                                       ρ            rRMSE        ρ            rRMSE
Cardiac young group
  Transfer learning                    0.90 (0.09)  0.42 (0.21)  0.91 (0.11)  0.40 (0.24)
  Mixed learning                       0.76 (0.26)  0.62 (0.33)  0.82 (0.22)  0.54 (0.29)
  Leave-N-out                          0.67 (0.23)  0.76 (0.26)  0.73 (0.17)  0.71 (0.21)
Cardiac old group
  Transfer learning                    0.95 (0.07)  0.28 (0.17)  0.95 (0.06)  0.28 (0.16)
  Mixed learning                       0.66 (0.39)  0.69 (0.41)  0.60 (0.45)  0.76 (0.45)
  Leave-N-out                          0.58 (0.40)  0.82 (0.34)  0.61 (0.37)  0.81 (0.32)
Noncardiac young group
  Transfer learning                    0.87 (0.11)  0.49 (0.23)  0.88 (0.13)  0.48 (0.28)
  Mixed learning                       0.78 (0.25)  0.57 (0.30)  0.74 (0.30)  0.62 (0.36)
  Leave-N-out                          0.32 (0.56)  1.07 (0.54)  0.34 (0.55)  1.06 (0.52)
Noncardiac old group
  Transfer learning                    0.92 (0.14)  0.37 (0.42)  0.95 (0.05)  0.28 (0.14)
  Mixed learning                       0.80 (0.24)  0.52 (0.35)  0.89 (0.17)  0.39 (0.26)
  Leave-N-out                          0.59 (0.28)  0.85 (0.35)  0.66 (0.26)  0.76 (0.30)
Overall
  Transfer learning                    0.90 (0.11)  0.40 (0.28)  0.92 (0.10)  0.36 (0.23)
  Mixed learning                       0.75 (0.30)  0.60 (0.35)  0.76 (0.33)  0.59 (0.38)
  Leave-N-out                          0.52 (0.43)  0.90 (0.42)  0.56 (0.42)  0.86 (0.40)

In addition to the overall statistical distribution, Table 3.2 lists the ECG inference performance in terms of the mean and standard deviation of the Pearson coefficient (ρ)
and rRMSE for each group along with the overall results for all groups. The results in each group are consistent with the overall results shown in Fig. 3.7 and the last three rows of Table 3.2: leave-N-out sets the baseline performance, mixed learning improves it with the target participant's sporadic data mixed into the training phase, and the proposed transfer learning mode further boosts the ECG inference performance. The only exception is in the cardiac old group, where mixed learning and leave-N-out achieve comparable performance in the extrapolation testing mode. This could be because the cardiac old group is the largest group (54 people in total), so the weight of the target participant is relatively small in mixed learning, making it comparable to the leave-N-out case. Another observation is that, except for the noncardiac young group, in the remaining three groups the leave-N-out training mode can achieve reasonably fair reconstruction performance with a Pearson coefficient ρ of at least 0.58 and as high as 0.73. Since leave-N-out is the most challenging case, with the target patient's data totally unseen in the training phase, its acceptable reconstruction quality indicates that separating patients into groups of similar attributes helps achieve good reconstruction performance for people belonging to the same group, given the current dataset. This generalization capability needs further validation on an even larger dataset, and more attributes can be considered, such as ethnicity and hospital site. Fig. 3.8 shows three visualization examples comparing reconstructed ECG signals to their reference ECG signals. In Fig. 3.8(a), the leave-N-out mode infers the ECG of this patient from the knowledge learned from all the training patients in the same group (noncardiac old). The first four cycles are reconstructed with high similarity to the reference ECG, with the T-wave slightly shifted in time.
Mixed learning incorporates the target patient's sporadic ECG and PPG pairs into the training set; thus the last four cycles show improved reconstruction performance compared to the leave-N-out case, though a glitch still appears in the inference of the first cycle. Transfer learning further improves the reconstruction performance, with all inferred cycles matching the reference signal.

Figure 3.8: Qualitative comparison of the ECG signals inferred in different modes. Examples are from (a) a 65-year-old female with sepsis, (b) a 52-year-old female with coronary artery disease, and (c) a 53-year-old male with coronary artery disease. From top to bottom: the input PPG signal from which the ECG is inferred, results from the leave-N-out mode, results from the mixed learning mode, and results from the transfer learning mode. At the bottom of (b) and (c), we also provide a side-by-side view of specific cycles for illustration.

In Fig. 3.8(b), we show a side-by-side view in the blue box comparing the ECG inferred via the transfer learning method to the reference ECG signal. The third cycle (the second cycle in the blue box) is slightly different from the typical ECG waveform of this patient, with an extra ascending and descending slope before the T-wave. From the side-by-side view in the blue box, transfer learning can recover this variant well. In Fig. 3.8(c), the ECG waveform of a participant with coronary artery disease is displayed. This patient's ECG waveform typically contains an obviously inverted T-wave, though the inversion is milder in the first cycle of the highlighted blue box.
Nevertheless, the transfer learning model is able to accurately capture both the more and the less pronounced inversion characteristics. From the illustrations in Fig. 3.8(b) and (c), the ECG variation of the target patients is captured well in the transfer learning mode but not in the mixed learning mode, suggesting that it is more useful to inherit the knowledge from a generic digital twin and then fine-tune it with the target patient's data.

3.5 Discussions for XDJDL-based Personalized Digital Twin Model

3.5.1 Results Based on PPG Segmentation Scheme

In Chapter 3.4, we evaluated different training and testing modes for digital twin models based on the assumption that the timestamps of the R peaks in the reference ECG signals are available to segment the paired ECG and PPG signals into cycles. In realistic settings, we may not have the reference ECG signal for segmentation. Thus, we consider a practical scenario of reconstructing the ECG from the "estimated cycles" of PPG that are segmented by the PPG onsets instead of the R peaks of the ECG signals. We denote:

• R2R scheme: segmentation based on the R peaks of ECG, as used in Chapter 3.4;
• O2O scheme: segmentation based on the PPG onsets.

Figure 3.9: Qualitative comparison of different segmentation schemes. From top to bottom: the input PPG signal from which the ECG is inferred, results from the O2O segmentation scheme, results after shifting the O2O-inferred cycle in time to align the R peaks of the inferred ECG and the reference ECG, and results from the R2R segmentation scheme.

Due to the discrepancy between the detected locations of the PPG onset and the R peak of the ECG from the same cycle, the "estimated" PPG/ECG cycles obtained with the O2O scheme differ from those segmented by the R2R scheme.
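The two ingredients of this comparison — onset-based cycle segmentation and the time-shift compensation used later to isolate waveform error — can be sketched as follows. Both functions are our own toy stand-ins (the onset detector here is a simple local-minimum picker, not the detector used in the dissertation):

```python
import numpy as np

def segment_by_onsets(ppg, min_gap=40):
    """O2O-style segmentation: cut the PPG at its onsets, taken here as
    local minima, with a refractory period of min_gap samples between
    consecutive onsets to reject small ripples."""
    onsets = []
    for i in range(1, len(ppg) - 1):
        if ppg[i] < ppg[i - 1] and ppg[i] <= ppg[i + 1]:
            if not onsets or i - onsets[-1] >= min_gap:
                onsets.append(i)
    return [ppg[a:b] for a, b in zip(onsets, onsets[1:])]

def align_by_r_peak(e_ref, e_hat):
    """Circularly shift an inferred ECG cycle so that its R peak (taken
    as the maximum-amplitude sample) lines up with the reference's."""
    return np.roll(e_hat, int(np.argmax(e_ref)) - int(np.argmax(e_hat)))
```

The circular shift is a reasonable approximation only because alignment is done cycle by cycle; across cycle boundaries a non-circular shift with padding would be needed.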
Thus, compared to the R2R scheme, the O2O scheme incurs further ECG inference error from 1) the time misalignment between the R peak of the inferred ECG and that of the reference ECG, and 2) the reconstructed waveform error. To single out the error caused by 2), on top of the O2O scheme we compensate for the time offset caused by 1) by shifting each inferred ECG cycle in time so that the reference and reconstructed ECG signals are matched according to their R peaks. We denote the results after aligning the R peaks of the O2O-inferred ECG with the reference ECG as O2O′. One qualitative comparison example is shown in Fig. 3.9.

Table 3.3: Comparison of different segmentation schemes for ECG inference, presented as the mean and standard deviation (in parentheses) of Pearson coefficient (ρ) and rRMSE.

                             ρ            rRMSE
Transfer learning (O2O)      0.45 (0.38)  1.08 (0.55)
Transfer learning (O2O′)     0.75 (0.22)  0.75 (0.45)
Transfer learning (R2R)      0.91 (0.10)  0.38 (0.25)

The overall comparison result is listed in Table 3.3. Compared to the R2R scheme, when using the O2O scheme, the average Pearson coefficient drops from 0.91 to 0.45, and the average rRMSE rises from 0.38 to 1.08. By compensating for the error from the misalignment of R peaks so as to account only for the waveform inference discrepancy, O2O′ improves the mean Pearson coefficient and the mean rRMSE to 0.75 and 0.75, respectively, compared to O2O.

3.5.2 Performance Evaluation for Long Time Scale Data

This section examines the performance of the personalized digital twins for ECG inference when the data are collected over a longer time window, e.g., a week, in addition to the mini-MIMIC-127 dataset (Chapter 3.4.1) in which each participant only has three 5-min sessions collected within a few hours. We self-collected ECG and PPG data using consumer-grade sensors to test the temporal consistency of the personalized digital twins.
Self-collected Dataset: One 27-year-old female subject participated in this week-long data collection (approved by University of Maryland IRB #1786518). This participant had not been diagnosed with any CVDs according to the most recent medical records. As shown in Table 3.4, 24 sessions were recorded for the subject at different times (morning, afternoon, and evening) of each day during a week. In each session, the participant was asked to hold the FDA-cleared EMAY portable ECG monitor (Model: EMG-10) to record the lead-I ECG. We measure the lead-I ECG signal from the two hands, which is the easiest and most accessible way to use the EMAY device. Simultaneously, the index finger was placed in the CMS-50E pulse oximeter for PPG monitoring. The setup is shown in Fig. 3.10. It is worth noting that the EMAY device can only record a 30-second ECG at a time, so we asked the participant to hold it for 6 consecutive ECG snapshots (3 minutes) in each session for longer recordings. To reduce movement-induced artifacts and false diagnoses during the recording, the participant was asked to sit comfortably and keep both hands on the desk as still as possible. The sampling rates of the EMAY ECG monitor and the PPG sensor are 250 Hz and 60 Hz, respectively. The PPG signal is upsampled to 250 Hz with spline interpolation. We then preprocessed the signals using the same method as explained in Chapter 2.3.1.

Table 3.4: The data collection timestamps for the participant during a week (Subject 1, Year 2022).

Session  Time          Session  Time          Session  Time          Session  Time
1        04-04, 11:53  8        04-06, 17:25  15       04-08, 21:37  22       04-11, 10:36
2        04-04, 16:08  9        04-06, 22:27  16       04-09, 10:31  23       04-11, 15:23
3        04-04, 20:38  10       04-07, 09:27  17       04-09, 15:39  24       04-11, 23:15
4        04-05, 09:04  11       04-07, 17:37  18       04-09, 23:31
5        04-05, 15:15  12       04-07, 21:30  19       04-10, 09:01
6        04-05, 21:38  13       04-08, 09:54  20       04-10, 15:12
7        04-06, 08:58  14       04-08, 18:03  21       04-10, 21:19
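The resampling step above — bringing the 60 Hz PPG to the ECG's 250 Hz grid with spline interpolation — can be sketched with SciPy's cubic-spline interpolator (function name and the exact spline variant are our assumption; the dissertation only specifies "spline interpolation"):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def upsample_ppg(ppg, fs_in=60, fs_out=250):
    """Resample a PPG segment from fs_in to fs_out Hz with cubic-spline
    interpolation, so it can be paired sample-wise with the 250 Hz ECG."""
    t_in = np.arange(len(ppg)) / fs_in          # original time stamps
    t_out = np.arange(0, t_in[-1], 1.0 / fs_out)  # target time grid
    return CubicSpline(t_in, ppg)(t_out)
```

Because the PPG is band-limited well below 30 Hz, a cubic spline reconstructs the in-between samples with negligible error at these rates.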
Learning and Evaluation Schemes: Given the attributes of the self-collected data (a young participant with no known CVDs), we first learn a generic digital twin using the data from the 40 young patients in both the cardiac and noncardiac groups of the mini-MIMIC-127 dataset from Chapter 3.4.1. Starting from the generic digital twin model, we use the proposed transfer learning methodology (Chapter 3.3) to update it into a personalized model with the sporadic short paired PPG and ECG segments from the target participant.

Figure 3.10: Experimental setup for the self-collected PPG and ECG database. The CMS-50E pulse oximeter measured the PPG signal from the index finger while the EMAY device recorded the lead-I ECG signal through the metal electrodes held by both hands.

We learn and evaluate the personalized digital twin in the following schemes:

• Scheme (a) Interpolation & Extrapolation Within One Day: Can we use paired PPG and ECG data from the morning and evening of each day to obtain the personalized digital twin and infer the afternoon data (i.e., interpolation within a day) and, vice versa, use the afternoon data to fine-tune the digital twin and infer the morning and evening data (i.e., extrapolation within a day)?

• Scheme (b) Interpolation & Extrapolation Within Half A Week: Can we use data from Day 3 morning & Day 6 evening to learn the personalized model to infer both the interpolation case (sessions between Day 3 and Day 6) and the extrapolation case (sessions from Days 1, 2, 7, and 8)?

• Scheme (c) Interpolation Within A Week: Can we use data from Day 1 morning & Day 8 evening to update the generic digital twin model to infer all the sessions in between?

ECG Inference Performance: Table 3.5 summarizes the overall performance for each of the learning and evaluation schemes, excluding the training and validation sessions from Day 1 morning, Day 3 morning, Day 6 evening, and Day 8 evening for a fair comparison.
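The session bookkeeping for Schemes (b) and (c) can be sketched as an idealized indexing of Table 3.4 (three sessions per day over Days 1-8; Scheme (a) is fine-tuned per day and omitted here; helper names are ours):

```python
def scheme_sessions(scheme):
    """Map evaluation schemes (b) and (c) to (fine-tune, test) session
    numbers 1..24, assuming morning/afternoon/evening slots per day."""
    sess = lambda day, slot: 3 * (day - 1) + slot + 1  # slot: 0 morning, 1 afternoon, 2 evening
    if scheme == "b":    # Day 3 morning & Day 6 evening
        tune = [sess(3, 0), sess(6, 2)]
    elif scheme == "c":  # Day 1 morning & Day 8 evening
        tune = [sess(1, 0), sess(8, 2)]
    else:
        raise ValueError(scheme)
    test = [s for s in range(1, 25) if s not in tune]
    return tune, test
```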
We observe that the personalized digital twin updated with the Day 1 morning and Day 8 evening data (Scheme (c)) achieves slightly better inference performance than the other two schemes, suggesting that the Day 1 morning data are representative of the whole week's ECG-PPG relation for this target participant during the week of data collection.

Table 3.5: The personalized digital twin performance of different learning and evaluation schemes. Scheme (a) learns and evaluates the personalized digital twin daily, while Scheme (b) and Scheme (c) are conducted for data spanning several days to a week. Results are presented as means and standard deviations (in parentheses) of Pearson correlation coefficient ρ and rRMSE.

        Scheme (a)    Scheme (b)    Scheme (c)
ρ       0.87 (0.20)   0.88 (0.16)   0.88 (0.22)
rRMSE   0.49 (0.32)   0.49 (0.23)   0.44 (0.26)

The breakdown of everyday performance in terms of the Pearson coefficient for the three schemes is shown in Fig. 3.11. The height of each bin shows the average correlation coefficient ρ of the ECG reconstruction results from the overlapped test sessions of the three schemes each day. Each error bar corresponds to the 95% confidence interval, calculated as ±1.96·σ̂/√N, where σ̂ is the sample standard deviation and N is the sample size (number of ECG cycles).

Figure 3.11: The breakdown of everyday performance in terms of the Pearson coefficient for the three schemes. (a), (b), and (c) show the results from Schemes (a), (b), and (c), respectively. Each bar represents the average Pearson coefficient and each error bar represents the 95% confidence interval.

In Fig. 3.11a, the blue bar shows the results of the "interpolation within a day" experiment that uses the morning and evening data to fine-tune a personalized digital twin to infer the afternoon ECG from the same day, the red bar shows the "extrapolation within a day" results that use the afternoon data to update the personalized digital twin to predict the morning and evening data,
and the yellow bar is the averaged performance of the "interpolation" and "extrapolation" modes. Comparing the results across the three schemes, we observe that the results are similar to each other from Day 1 to Day 7, and on more than half of the days Scheme (a) achieves slightly better performance than the other two schemes, suggesting that inference within a day is more accurate than prediction across several days. One exception/outlier is Day 8, when the averaged ρ of Scheme (a) is much lower than that of the other two schemes, especially in the "extrapolation" mode of Scheme (a). This indicates that the afternoon data from Day 8 are not as representative for updating the personal digital twin for the morning data of Day 8 as the morning data of Day 3 (i.e., Scheme (b)). In retrospect, the generalization performance of the personalized digital twin may be limited by a) the attribute difference between the training data and the self-collected data (ICU patients vs. healthy subjects) and b) the different leads of the ECG signals in the training data and the self-collected data (lead II vs. lead I). Note that lead II is the most common and generally the best view because the placement of the positive electrode in lead II views the wavefront of the impulse from the inferior aspect of the heart as it travels from the right arm (RA) towards the left leg (LL). Lead-I ECG "views" the heart activity from the left arm (LA) to the right arm (RA) [79]. According to Einthoven's law, lead I + lead III = lead II, i.e., the sum of the potentials in leads I and III equals the potential in lead II. That may help explain why, in the self-collected dataset, the amplitude of the R peak of the ECG is generally less than 0.5 mV while that of the training data is generally around 1 mV to 2 mV.
3.6 Using Neural Networks as The Backbone for ECG Inference from PPG to Build Digital Twins

In this section, we aim to improve the personalized digital twins with neural network based methods, which are more flexible for various transfer learning techniques than the XDJDL model as the backbone. A conditional variational autoencoder (CVAE) model is adopted here as the backbone model for PPG-to-ECG inference. Its capability of learning latent variables is suitable for manifesting the interpretability of the underlying physiological process relating the PPG and ECG signals. Furthermore, in Chapter 3.7, a causal representation learning structure is proposed based on the CVAE architecture here for better explainability. To differentiate it from the causal CVAE model that will be proposed in Chapter 3.7, we denote the CVAE model used in this section as the "vanilla CVAE" model.

3.6.1 A Retrospect: The Physiological Process Behind PPG and ECG Generation

In our previous work on PPG-to-ECG inference (Chapter 2), we considered the ECG as the source signal and the PPG as the downstream filtered signal and viewed the task as an inverse engineering problem, as shown in the yellow box of Fig. 3.12. However, if we take the full signal generation path into consideration, as illustrated in the pink box of Fig. 3.12, the myocardial activities (such as the impulse from the SA node) initiate the electrical signal in the heart. On the one hand, the varying electrical potentials are captured by the skin electrodes of ECG sensors. On the other hand, the electrical pulse spreads in the heart, leading to the mechanical movements of the heart and the corresponding aortic pressure wave that later passes into the blood vessel network.

Figure 3.12: The ECG and PPG signal generation paths during heartbeats considering the originating impulses from the heart.
The peripheral pulse wave is measured from the extremities with a PPG sensor, which receives the light modulated by the transmissive and reflective interactions with the human skin. With this full physiological process in mind, we aim to bring the common factor, the heart activity, that generates both PPG and ECG into the picture and leverage the CVAE model to learn a latent variable z representing this common source. The assumption we make with the vanilla CVAE is that this common source is Gaussian i.i.d. for all people. This is a relatively general assumption, and we will see how to refine it in Chapter 3.7 with a causal interpretation.

3.6.2 Conditional Variational Autoencoder (CVAE) for PPG-to-ECG Inference

To start with, before diving into the CVAE model, we draw the connection with the previously proposed PPG-to-ECG methods, such as the DCT-based [175], XDJDL-based [144] (Chapter 2), and autoencoder-based frameworks [92]. They can all be viewed as designed to maximize the log-likelihood $\log P(Y|X,\theta)$. This is because if we suppose $Y = \theta(X) + z$, where $z \sim \mathcal{N}(0, \sigma^2 I)$, then $P(Y|X,\theta) \sim \mathcal{N}(\theta(X), \sigma^2 I)$, and the maximum log-likelihood problem translates to minimizing $\|\theta(X) - Y\|^2$. In the DCT-based framework, X is the DCT feature of the PPG and Y is that of the ECG; in the XDJDL-based framework, X is the sparse representation of the PPG and Y is that of the ECG; and in the autoencoder-based framework, X and Y are the PPG and ECG signals themselves.

Figure 3.13: The left panel shows the CVAE structure implemented as a feed-forward neural network during the training process. The upper right panel shows the model at test time when we want to sample from P(Y|X). The illustration is adapted from [41].

The CVAE structure illustrated in Figure 3.13 represents the core CVAE mathematical model in Equation (3.4).
Instead of maximizing the log-likelihood on the left-hand side of Equation (3.4), CVAE optimizes a surrogate objective, the variational lower bound (ELBO) of the log-likelihood, which is the right-hand side of Equation (3.4). The first part of the ELBO can be regarded as the reconstruction accuracy, shown in the top blue box of Fig. 3.13. The second part of the ELBO is the KL divergence between the conditional distribution Q(z|Y,X) and P(z|X), represented in the leftmost blue box in Fig. 3.13, where P(z|X) is N(0, I) because the CVAE model assumes the latent variable z is sampled independently of X at test time.

\log P(Y|X) - \mathrm{KL}\left[Q(z|Y,X) \,\|\, P(z|Y,X)\right] = \mathbb{E}_{z \sim Q(\cdot|Y,X)}\left[\log P(Y|z,X)\right] - \mathrm{KL}\left[Q(z|Y,X) \,\|\, P(z|X)\right] \qquad (3.4)

Figure 3.14: The vanilla CVAE model as the backbone for ECG inference from PPG.

Following the structure of CVAE, we build it with a convolutional neural network (CNN). The overview of the model architecture is shown in Figure 3.14. We treat the ECG signal as the Y to be predicted and the PPG signal as the X on which the prediction is conditioned. The encoder and the decoder are each composed of a three-layer CNN. Each layer starts with a convolution/deconvolution kernel (channel numbers 60, 40, and 40 and kernel sizes 30, 15, and 5 in the convolution layers; the parameters of the deconvolution layers are reversed), then a group normalization, followed by a LeakyReLU activation layer. The outputs of the encoder are the mean and variance of the latent variable, which is sampled using the reparameterization trick. The latent variable z is then concatenated with the PPG cycle as the conditioning information to generate an ECG cycle. The size of the latent variable is chosen based on the best validation performance among 16, 32, and 64.
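The two CVAE-specific pieces of Eq. (3.4) — sampling z from the Gaussian encoder with the reparameterization trick and the KL term against N(0, I) — have simple closed forms. A NumPy sketch (function names are ours; the dissertation's actual model is a CNN trained in a deep learning framework):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
    which keeps the sampling step differentiable in the real model."""
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL[N(mu, diag(exp(log_var))) || N(0, I)], the second
    ELBO term in Eq. (3.4) when P(z|X) = N(0, I)."""
    return float(0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var))
```

During training, the negative ELBO is the sum of the reconstruction error of the decoded ECG cycle and this KL term; at test time, z is simply drawn from N(0, I) and concatenated with the PPG cycle.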
3.6.3 Transfer Learning to Build Personalized Digital Twin for Cardiovascular Monitoring

We repeat what was proposed and done in Chapter 3.3 and Chapter 3.4 to build the personalized digital twin, this time with the vanilla CVAE rather than XDJDL as the backbone. As mentioned above, neural networks provide more options for transfer learning, as it is flexible to fine-tune specific layers or add a few layers. We adopt three different fine-tuning methods: (a) tuning the first deconvolution layer in the decoder, (b) tuning all deconvolution layers in the decoder, and (c) tuning all parameters in the CVAE model.

Figure 3.15: The overall performance comparison using XDJDL and CVAE as the backbone models for PPG-to-ECG inference in different training modes, including leave-N-out, mixed learning, and transfer learning. The error bars correspond to the 95% confidence intervals.

Table 3.6: The results using the vanilla CVAE as the backbone model for the inferred ECG of each group in terms of the mean and the standard deviation (in parentheses) of Pearson coefficient (ρ) and rRMSE.

                                       Interpolation              Extrapolation
                                       ρ            rRMSE        ρ            rRMSE
Cardiac young group
  Leave-N-Out                          0.70 (0.19)  0.73 (0.23)  0.75 (0.14)  0.66 (0.16)
  Mixed Learning                       0.86 (0.12)  0.48 (0.16)  0.89 (0.12)  0.41 (0.20)
  Transfer Learning (one layer)        0.92 (0.08)  0.38 (0.12)  0.95 (0.07)  0.29 (0.10)
  Transfer Learning (encoder)          0.92 (0.07)  0.39 (0.13)  0.95 (0.07)  0.30 (0.11)
  Transfer Learning (all parameters)   0.94 (0.06)  0.32 (0.10)  0.96 (0.03)  0.27 (0.10)
Cardiac old group
  Leave-N-Out                          0.61 (0.26)  0.77 (0.19)  0.65 (0.26)  0.74 (0.20)
  Mixed Learning                       0.80 (0.24)  0.51 (0.29)  0.81 (0.27)  0.50 (0.29)
  Transfer Learning (one layer)        0.91 (0.16)  0.33 (0.22)  0.93 (0.14)  0.32 (0.18)
  Transfer Learning (encoder)          0.91 (0.16)  0.33 (0.21)  0.91 (0.18)  0.34 (0.20)
  Transfer Learning (all parameters)   0.95 (0.08)  0.26 (0.16)  0.95 (0.11)  0.27 (0.15)
Noncardiac young group
  Leave-N-Out                          0.43 (0.52)  0.91 (0.47)  0.47 (0.50)  0.88 (0.45)
  Mixed Learning                       0.89 (0.11)  0.42 (0.18)  0.85 (0.16)  0.48 (0.23)
  Transfer Learning (one layer)        0.92 (0.05)  0.39 (0.11)  0.92 (0.08)  0.39 (0.13)
  Transfer Learning (encoder)          0.92 (0.05)  0.40 (0.11)  0.91 (0.08)  0.41 (0.13)
  Transfer Learning (all parameters)   0.94 (0.05)  0.32 (0.11)  0.93 (0.08)  0.34 (0.15)
Noncardiac old group
  Leave-N-Out                          0.74 (0.15)  0.66 (0.19)  0.76 (0.13)  0.64 (0.17)
  Mixed Learning                       0.89 (0.14)  0.40 (0.23)  0.93 (0.10)  0.31 (0.17)
  Transfer Learning (one layer)        0.94 (0.11)  0.31 (0.16)  0.96 (0.07)  0.27 (0.12)
  Transfer Learning (encoder)          0.93 (0.12)  0.33 (0.17)  0.95 (0.07)  0.28 (0.12)
  Transfer Learning (all parameters)   0.95 (0.10)  0.26 (0.16)  0.97 (0.04)  0.23 (0.10)

Fig. 3.15 presents the comparison of the overall results between using XDJDL and CVAE as the backbone models for leave-N-out, mixed learning, and transfer learning with the three fine-tuning methods. The height of each bin shows the average correlation coefficient ρ
or the rRMSE of the ECG reconstruction results from both interpolation and extrapolation test modes of all participants. Each error bar corresponds to the 95% confidence interval, calculated as ±1.96·σ̂/√N, where σ̂ is the sample standard deviation and N is the sample size (number of ECG cycles). The breakdown of performance for each group of target participants is listed in Table 3.6. First, compared to Table 3.2 with XDJDL as the backbone model, we observe an improvement in the ECG reconstruction performance using the vanilla CVAE as the backbone model in all training modes, across all participant groups (Table 3.6) and over all participants (Fig. 3.15). Second, transfer learning with tuning all parameters achieves better performance than tuning only part of the parameters. In addition, tuning only the first layer in the decoder is almost comparable to tuning all the parameters. For practical applications, we may consider tuning just one layer, as this strikes a balance between algorithm performance and computing resources.

3.7 Incorporating Causality into CVAE Model Based on Structural Causal Model (SCM)

In the previous vanilla CVAE model for PPG-to-ECG inference, we assumed that the latent vector z, representing the factors of the heart muscle mechanism that generate ECG signals conditioned on PPG, is multivariate independent Gaussian for all people. In the real world, this assumption may be too general. To better fit our aim of building a personalized digital twin model, in this section, we take one step further and propose to learn a causal representation for generating ECG signals to improve upon the previous assumption.

Figure 3.16: Illustration of causal representation learning for ECG inference. The factors that form a causal mechanism in the latent space are assumed to generate the higher dimensional data of the ECG waveform in the observed space. An arrow indicates that the parent node causes the child node.
The key underlying assumption we make here is that the high dimensional observational data, which is the ECG signal in our case, is a manifestation of a lower dimensional set of factors, and that those factors have causal relationships among each other, such as the sample causal graph shown in Fig. 3.16, considering the physiological process of a heartbeat. The factors that affect the ECG signal of each person may be personalized and more clinically interpretable, rather than following a generic i.i.d. multivariate Gaussian for all people. We aim to discover proper causal representations in the sense that, under the "do" operation on a latent representation factor, the higher dimensional observational data (e.g., the ECG waveform) will change causally accordingly. The semantic/medical meaning of the nodes in the latent vectors may be subject-specific, and we will analyze them on a case-by-case basis.

3.7.1 Importance of Incorporating Causality into Machine Learning Algorithms and Structural Causal Model

With the fast development of big data and enhanced computational power, machine learning models, including deep learning models, have grown rapidly in the past decade. In the healthcare field, they have been widely applied and have shown great predictive power, such as for disease classification [45, 62] and physiological signal sensing [15, 32]. However, good prediction performance, which indicates a statistical association between the input data and output labels, does not necessarily imply causation between them [158]. For example, in [21], the authors aimed to predict the probability of death for patients with pneumonia so that high-risk patients could be admitted to the hospital while low-risk patients were treated as outpatients.
From their feature analysis, they found some counterintuitive relations between the input features and the output prediction, e.g., if a patient has a history of asthma, then the patient has a lower risk of death from pneumonia. This observation does not comply with common-sense cause and effect; the reason could be that if a patient had asthma before, the patient was likely treated earlier and thus has a lower probability of death. Adding causal analysis to machine learning models would be tremendously useful to avoid such counterintuitive results and make the models more interpretable, and taking advantage of both fields is drawing increasing attention nowadays [122].

Causal Directed Acyclic Graph (DAG) and Structural Causal Model (SCM): Consider a set of random variables X1, ..., Xn forming a DAG structure, which is a graph with directed links between nodes but without directed cycles (acyclic). Note that a Bayesian network is a DAG in which the joint probability distribution of the nodes (random variables) in the graph factorizes as p(X1, ..., Xn) = ∏_{i=1}^{n} p(Xi | PAi), i.e., a node is independent of its non-descendants given its parents. However, a general DAG that carries the Markov property of the conditional independence assumption is not enough to depict the quantitative causal relations among the nodes in the DAG that account for the generation of the data. An SCM describes the causal mechanisms of a system with structural equations. A functional causal model is proposed in [112] to illustrate how the child vertices in the DAG are influenced by their parents in an ordering from the hypothesized cause-effect relations, i.e., Xi := fi(PAi, Ui), i = 1, ..., n, where Ui represents arbitrary disturbance due to omitted factors that are mutually independent, fi is a linear or nonlinear function, and PAi is the set of Markovian parents of Xi.

The linear structural equation model (SEM) [112] is a specialization of the functional causal model with a generalized functional relation fi, i.e., Xi := ∑_{j ∈ PAi} Aji Xj + Ui. Suppose the adjacency matrix of the SEM associated with the DAG structure is A ∈ R^{n×n}, where Aij represents the causal strength from Node i to Node j (Aij = 0 if there is no causal edge from Node i to Node j). Then the linear SEM can be expressed in matrix form as follows:

x = A^T x + ε,    (3.5)

where εi ∼ N(0, 1), i = 1, ..., n, and x ∈ R^{n×1}. Useful properties of A are that: (1) A can be permuted into a strictly upper triangular matrix if the nodes in the DAG are strictly in causal ordering; (2) the ith column of A indexes the parents of the ith factor, and the ith row of A indexes the children of the ith factor.

Notion of Do Intervention: In [112], Pearl introduced the notion of "do(x)" for setting X = x, to distinguish it from the notion of pure "x" for observing X = x. In particular, the operation do(xj) means: (1) deleting the edges directed into the variable node Xj from PAj in the DAG and the corresponding structural equation xj = fj(PAj, uj) in the SCM, and (2) setting Xj = xj in the right-hand sides of the other equations of the causal structure in the SCM. By investigating the mapping from x to P(y | do(x)) for all x, the causal effect of X on Y can be examined.

3.7.2 Causal CVAE Model for PPG-to-ECG Inference

On top of the vanilla CVAE model, which assumes the learned latent factors in the latent vector are i.i.d. Gaussian (ε), we develop the causal CVAE model in this section, as shown in Fig. 3.17.

Figure 3.17: The proposed causal CVAE architecture. Compared to the vanilla CVAE structure in Fig. 3.14, the causal CVAE model incorporates the causal representation learning module that helps to learn the causal latent vector z = [z1, ..., zn]^T ∈ R^n, where zi represents the ith node in the learned DAG.

Instead of directly inputting ε together with the PPG cycle into the decoder, we add a causal representation learning module after ε
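As a concrete illustration of Eq. (3.5), the following sketch (with a hypothetical 3-node adjacency matrix) draws a sample from a linear SEM both by ancestral propagation and by the closed form x = (I − A^T)^{-1} ε; the two agree when A encodes a DAG in causal order:

```python
import numpy as np

# Hypothetical adjacency matrix: A[i, j] is the causal strength from node i to j.
# Strictly upper triangular => nodes 0 -> 1 -> 2 are already in causal order.
A = np.array([[0.0, 0.8, 0.0],
              [0.0, 0.0, -0.5],
              [0.0, 0.0, 0.0]])
n = A.shape[0]

rng = np.random.default_rng(0)
eps = rng.standard_normal(n)  # epsilon_i ~ N(0, 1)

# Closed form from Eq. (3.5): x = (I - A^T)^{-1} eps.
x_closed = np.linalg.solve(np.eye(n) - A.T, eps)

# Ancestral sampling: each node is its parents' weighted sum plus its noise.
x_anc = np.zeros(n)
for i in range(n):            # valid because the causal order is 0, 1, 2
    x_anc[i] = A[:, i] @ x_anc + eps[i]
```

Note how property (2) above appears in the code: `A[:, i]` (the ith column) selects the parents of node i.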
to learn the causal representation vector z based on the linear SEM via the linear layer, and then pass z into the DAG layer to validate that the causal mechanism holds for z. This causal representation learning module is inspired by the CausalVAE work proposed in [164]. The differences between CausalVAE [164] and our work in this section mainly lie in the following: 1. Our model is based on the conditional VAE, while their method is proposed for the VAE. This is not a trivial update, especially when we are dealing with a different application scenario of ECG waveform inference from PPG; 2. The true causal label in [164] is assumed to be known and incorporated in training as a supervised problem, while in our medical setting it is unknown and complicated, and is proposed to be estimated from the intervention experiment.

Here is the detailed design of the two additional layers in the causal representation learning module:

1. Linear layer: z = A^T z + ε = (I − A^T)^{-1} ε. This layer is designed based on the SEM in Eq. (3.5). The adjacency matrix A is learned during training to achieve the optimal causal representation z;

2. DAG layer: z = f(A ◦ z) + ε, where ◦ represents the element-wise multiplication of each column of A with z. This layer resembles the SCM, which depicts how child nodes are generated/influenced by their corresponding parental variables; f adds nonlinearity during training. Note that this layer is necessary to conduct the intervention experiment that will be discussed later in Chapter 3.7.4.

Based on the architecture, the following loss functions are taken into account during the training process:

1. Acyclicity enforcement on the DAG-related adjacency matrix A [166]: L_dagness = tr((I + (1/n) A ◦ A)^n) − n;

2. Enforcing A to be a non-zero matrix: L_nonzero = 1 / tanh((1/n²) ∑_{i=1}^{n} ∑_{j=1}^{n} |A_{i,j}| + τ), where τ is set to a small value, e.g., 1e-4;

3. DAG layer loss to make sure the causal representation z and its reconstructed self are close to each other: L_causal = ||z − f(A ◦ z; θ)||²₂.

Thus the overall loss function is L = −L_ELBO + λ1·L_dagness + λ2·L_nonzero + λ3·L_causal, where the hyperparameters λ1, λ2, and λ3 are set to 20, 0.1, and 1, respectively, selected to balance the relative value of each term in the loss function.

3.7.3 ECG Reconstruction Performance of Personalized Digital Twins

                         Interpolation              Extrapolation
                         ρ            rRMSE         ρ            rRMSE
Cardiac young group
  Vanilla CVAE           0.92 (0.08)  0.38 (0.12)   0.95 (0.07)  0.29 (0.10)
  Causal CVAE            0.94 (0.04)  0.32 (0.10)   0.96 (0.04)  0.28 (0.10)
Cardiac old group
  Vanilla CVAE           0.91 (0.16)  0.33 (0.22)   0.93 (0.14)  0.32 (0.18)
  Causal CVAE            0.97 (0.04)  0.21 (0.11)   0.97 (0.03)  0.23 (0.10)
Noncardiac young group
  Vanilla CVAE           0.92 (0.05)  0.39 (0.11)   0.92 (0.08)  0.39 (0.13)
  Causal CVAE            0.95 (0.04)  0.30 (0.10)   0.94 (0.08)  0.31 (0.15)
Noncardiac old group
  Vanilla CVAE           0.94 (0.11)  0.31 (0.16)   0.96 (0.07)  0.27 (0.12)
  Causal CVAE            0.96 (0.09)  0.22 (0.15)   0.97 (0.04)  0.21 (0.10)

Table 3.7: The results from the proposed causal CVAE as the backbone model for the inferred ECG of each group in terms of the mean and the standard deviation (in parentheses) of the Pearson coefficient (ρ) and rRMSE. Transfer learning is applied by tuning the first layer of the decoder and the newly added causal layers. The vanilla CVAE comparison group is copied from Table 3.6 (tuning the first layer of the decoder) for easier reference.

In this section, we examine the performance of ECG reconstruction using the proposed causal CVAE model as the backbone for transfer learning. From Chapter 3.6, we find that tuning the first layer of the decoder achieves reasonably good ECG reconstruction performance with fewer parameters to tune. Thus, for the causal CVAE model, we also load the parameters from the leave-N-out case and fine-tune the first layer of the decoder together with the newly added causal representation learning layers. The dimension of the latent vector is chosen as 8. The ECG inference performance is listed in Table 3.7.
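The acyclicity and sparsity penalties used in the causal CVAE training objective (Section 3.7.2) can be sketched in NumPy as follows; A is a hypothetical learned adjacency matrix, and in actual training these terms would be computed on tensors with autograd:

```python
import numpy as np

def dagness_loss(A):
    """tr((I + (1/n) A◦A)^n) − n; zero when A encodes a DAG [166]."""
    n = A.shape[0]
    M = np.eye(n) + (A * A) / n          # A◦A here is the element-wise square
    return np.trace(np.linalg.matrix_power(M, n)) - n

def nonzero_loss(A, tau=1e-4):
    """1 / tanh((1/n^2) * sum |A_ij| + tau): large when A collapses to zero."""
    n = A.shape[0]
    return 1.0 / np.tanh(np.abs(A).sum() / n**2 + tau)

A_dag = np.array([[0.0, 0.8],
                  [0.0, 0.0]])           # acyclic: node 0 -> node 1
A_cyc = np.array([[0.0, 0.8],
                  [0.7, 0.0]])           # cyclic: node 0 <-> node 1
# dagness_loss(A_dag) is 0, while dagness_loss(A_cyc) is strictly positive.
```

The dagness term is zero exactly when A ◦ A is nilpotent, i.e., the weighted graph has no directed cycle, which is why minimizing it pushes A toward a DAG.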
Compared to the vanilla CVAE model results in Table 3.6, the proposed causal CVAE model achieves better results in terms of ECG reconstruction.

3.7.4 Intervention Experiment

From Chapter 3.7.1, we know that with the "do" operation intervening on each of the nodes in the DAG, the child nodes change as their parent node is changed, and the intervention can generate counterfactual outputs, indicating the underlying cause and effect represented by the corresponding nodes according to the causal system. In this section, we conduct the intervention experiment at test time. We call the ECG reconstructed in a non-intervened way the "inferred ECG", which is generated by inputting a sample from a normal distribution into the causal representation learning layer and concatenating it with the PPG cycle as the new input to the decoder (Fig. 3.17). We then intervene on each of the nodes in the latent vector z by updating its original value (the one that generates the "inferred ECG") to a different value (e.g., 300); the values of its child nodes change as well, forming an intervened vector that complies with the relations in the learned DAG adjacency matrix and is used to generate the "intervened ECG". Since the vector dimension is set to 8, we analyze the impact of Node 1 to Node 8 on the intervened ECG for the target patient, in terms of both timing interval and amplitude changes. In this way, we can better understand the role each of the causal representation nodes plays in the ECG generation.

Quantitative Evaluation Metrics of Effect for Causal Analysis: We consider the following three intervals and three wave amplitudes to quantitatively evaluate the impact of the intervention: the PR interval, the QRS duration, and the QT interval; and the amplitudes of the P wave, QRS complex, and T wave.
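The "do" operation on a linear SEM can be illustrated in code (the adjacency matrix here is hypothetical; the intervention value 300 mirrors the one used above): the intervened node's incoming edges are cut, its value is clamped, and its descendants are recomputed in causal order.

```python
import numpy as np

def do_intervention(A, z, node, value):
    """do(z_node = value) on a linear SEM z_i = sum_j A[j, i] z_j + eps_i.

    Assumes node indices 0..n-1 are already in causal order.
    """
    n = A.shape[0]
    eps = z - A.T @ z                 # recover each node's exogenous noise
    A_cut = A.copy()
    A_cut[:, node] = 0.0              # delete edges directed into the node
    z_new = np.zeros(n)
    for i in range(n):                # propagate in causal order
        z_new[i] = value if i == node else A_cut[:, i] @ z_new + eps[i]
    return z_new

A = np.array([[0.0, 0.9, 0.0],        # node 0 -> node 1 -> node 2
              [0.0, 0.0, -0.4],
              [0.0, 0.0, 0.0]])
z = np.array([1.0, 1.2, -0.1])
z_tilde = do_intervention(A, z, node=0, value=300.0)
# Node 0 is clamped at 300; its descendants (nodes 1 and 2) shift accordingly.
```

Intervening on a leaf node leaves all other coordinates untouched, which is why the analysis below favors nodes with few children.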
PR interval: Normally, the PR interval lasts 0.12-0.20 seconds; it begins at the onset of the P wave and ends at the beginning of the QRS complex, representing the time for the electrical impulse to spread from the atria to the ventricles through the AV node and His bundle. We use the segment from the P point to the R point of the ECG as the approximated PR interval. The duration of the PR interval indicates the functionality of the conduction pathway from the atria to the ventricles [60]. On the one hand, a prolonged PR interval can indicate the possibility of first-degree heart block. On the other hand, a shortened PR interval indicates either that the atria have been depolarized from close to the AV node, or that there is abnormally fast conduction from the atria to the ventricles.

QRS complex duration and amplitude: The duration of the QRS complex, corresponding to ventricular depolarization, is normally 0.12 seconds or less. A prolonged QRS complex indicates impaired conduction within the ventricles caused by bundle branch block or an erroneous impulse pathway [60]. Increased height of the QRS complex indicates ventricular hypertrophy.

QT interval: The QT interval extends from the onset of the QRS complex to the end of the T wave and is normally less than 0.48 seconds. An unusually prolonged or short QT interval may be due to electrolyte abnormalities or drugs [60].

P wave amplitude: The P wave represents the electrical activation (depolarization) of the atria. If the P wave is missing or its amplitude is inverted, then the atria are not activated normally from the SA node.

T wave amplitude: The T wave shows the repolarization of the ventricles to their resting state. If the T wave is inverted, the likely causes are ischemia or ventricular hypertrophy [60].

To summarize, in the dissertation author's understanding, the timing of the ECG represents the functionality of the impulse pathway, and the shape and amplitude of the subwaves indicate the functionality of the heart muscles.
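Given detected fiducial points, the interval metrics above reduce to simple sample-index arithmetic. A sketch follows; the fiducial indices are hypothetical, and fs denotes the sampling rate:

```python
def ecg_intervals(p_on, q, s, t_end, r, fs):
    """Approximate ECG timing metrics (in seconds) from fiducial sample indices.

    p_on: onset of the P wave; q/s: QRS boundaries; t_end: end of the T wave;
    r: the R peak; fs: sampling rate in Hz.
    """
    return {
        "pr_interval": (r - p_on) / fs,   # P point to R point, as approximated above
        "qrs_duration": (s - q) / fs,
        "qt_interval": (t_end - q) / fs,
    }

# Hypothetical fiducial indices for one ECG cycle sampled at 250 Hz.
metrics = ecg_intervals(p_on=0, q=33, s=58, t_end=133, r=40, fs=250)
# metrics["pr_interval"] is 0.16 s, within the normal 0.12-0.20 s range.
```

The same arithmetic, averaged over cycles, produces the interval rows of Table 3.8.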
Case Study: Female, 52, Coronary Artery Disease (CAD)

We take the result of a 52-year-old female patient with CAD from the extrapolation learning mode as an example for analysis. By quantitatively evaluating the impact of intervening on the nodes, we aim to infer their possible meaning in the heart process for better interpretability.

Figure 3.18: (a) The learned DAG adjacency matrix A for a 52-year-old female subject from the cardiac young group with CAD. (b) The DAG drawn based on the DAG adjacency matrix A in (a).

The visualization of the learned DAG map from training and the corresponding graph showing the causal relationships among different nodes in the causal latent vector z are illustrated in Fig. 3.18. From the DAG map in Fig. 3.18a, we know that it can be permuted in both rows and columns to form an upper triangular matrix, implying that the "DAGness" is preserved after training, with an acyclicity value of 0.0003. As we know, in a causal graph the intervention on a parent node is translated to its child nodes; thus, the fewer children a node has, the easier it is to analyze its independent impact on the ECG. In this case study, we focus on the impact of Nodes 3, 7, and 6 in Fig. 3.18b in our following analysis.

                            Inferred   Intervened ECGs
                            ECG        Node 3  Node 7  Node 6  Node 1  Node 8  Node 4  Node 5  Node 2
PR Interval (s)             0.13       0.27    0.12    0.17    0.12    0.27    0.18    0.20    0.21
QRS Complex Duration (s)    0.10       0.10    0.10    0.10    0.10    0.10    0.10    0.10    0.10
QT Interval (s)             0.40       0.41    0.40    0.42    0.40    0.41    0.40    0.40    0.41
P Wave Amplitude (mV)       0.09       -0.07   0.11    0.11    0.10    -0.07   0.06    0.00    -0.01
QRS Complex Amplitude (mV)  1.15       0.73    1.03    0.89    0.96    0.69    0.92    1.01    0.68
T Wave Amplitude (mV)       0.10       0.12    0.14    0.04    0.12    0.16    0.09    0.13    0.09

Table 3.8: The mean of each evaluation metric for both the inferred ECG and the intervened ECGs. The results for the intervened ECGs of each node are ordered by their positions in the DAG (Fig. 3.18b).
Table 3.8 lists the averaged intervals and subwave amplitudes of the inferred ECG and the intervened ECGs. Several significant changes can be drawn from the table: tuning Node 3 increases the average PR interval length from 0.13 s to 0.27 s, inverts the P wave (amplitude changed from 0.09 mV to −0.07 mV), and reduces the amplitude of the QRS complex from 1.15 mV to 0.73 mV; tuning Node 7 (along with Node 3, because of the negative causal relation between them) leads to a 40% increase in the T wave amplitude; tuning Node 6 (along with Node 7 and Node 3) decreases the amplitude of the T wave by 60%.

Figure 3.19: Distributions of the difference between the inferred ECG and the intervened ECGs for each evaluation metric, showing the impact of intervening on each node in the latent causal representation. Red circle markers represent increased values of the metrics after intervention, and blue triangle markers represent decreased values. For each node, the first number in each bracket represents the number of cycles that are greater than, equal to, or less than zero difference after intervention, and the second number represents the corresponding average of the difference.

In addition to the results in Table 3.8, which only show the averaged intervention effects (i.e., the differences between the inferred and intervened ECGs), we plot a more detailed distribution of the differences after intervention in Fig. 3.19 to check whether the impact of each node is consistent. Each red circle marker indicates that the corresponding metric increased in the intervened ECG cycle relative to the inferred ECG cycle, and each blue marker indicates a decrease after intervention. For each node, there are three brackets beside the markers; the first number in each is the number of cycles that are greater than, equal to, or less than zero difference after intervention, respectively, and the second number is the corresponding average of the difference.

First, we examine the effects of intervening on Node 3: (1) all the intervened ECG cycles have a longer PR interval than the inferred ECG, with an average increase of 0.14 s; (2) the P wave amplitude decreases for all ECG cycles after intervention by 0.159 mV, inverting the P wave; (3) all intervened ECG cycles have a reduced QRS complex amplitude, by an average of 0.413 mV. Similarly, the effects of tuning Node 7 and Node 6, which include increasing and decreasing the T wave amplitude, respectively, are confirmed in Fig. 3.19. Note that even though tuning Nodes 3, 6, and 7 reduces the QRS duration by approximately 0.01 s and tuning Node 6 elongates the QT interval by 0.02 s for almost all ECG cycles, considering that the smallest time scale on the ECG grid is 0.04 s, both changes are considered not significant. Also, even though the averaged P wave amplitude in Table 3.8 is shown to increase from 0.09 mV to 0.11 mV after intervening on Node 6 and Node 7, from Fig. 3.19 we know that this increase is not consistent across all cycles (approximately 2/3 increase and 1/3 decrease); thus, this change is not considered significant either.

Figure 3.20: Visualization of the inferred ECG and intervened ECGs after tuning Nodes 3, 7, and 6, respectively. Established algorithms [110, 124, 125] are applied to detect the P, Q, R, S, and T fiducial points. The border point is defined as the 60%:40% segmentation point between each RR interval.

The intervened ECG signals generated by changing the values of Nodes 3, 7, and 6 are visualized in Fig. 3.20. From Fig. 3.20, tuning Node 3 leads to an inverted P wave, elongates the PR interval, and lowers the QRS amplitude, which aligns with the numerical results in Table 3.8. In addition, Fig. 3.20 shows that tuning Node 7 and Node 6 leads to a peaked T wave (increased T wave amplitude) and a flattened T wave (decreased T wave amplitude), respectively, which is also aligned with the numerical results in Table 3.8.
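The per-node summary shown in the brackets of Fig. 3.19 (counts of cycles with positive, zero, and negative difference, plus the mean difference) can be computed as follows; the difference array here is hypothetical:

```python
def summarize_intervention_effect(diffs):
    """Counts of positive / zero / negative differences and their mean.

    diffs: per-cycle differences (intervened minus inferred) for one metric.
    """
    greater = sum(1 for d in diffs if d > 0)
    equal = sum(1 for d in diffs if d == 0)
    less = sum(1 for d in diffs if d < 0)
    mean_diff = sum(diffs) / len(diffs)
    return (greater, equal, less), mean_diff

# Hypothetical PR-interval differences (s) after intervening on one node.
counts, mean_diff = summarize_intervention_effect([0.14, 0.15, 0.13, 0.14])
# counts is (4, 0, 0): every cycle's PR interval lengthened.
```

A change is treated as significant only when the sign of the difference is consistent across (nearly) all cycles, as in the Node 3 analysis above.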
With the quantitative effects of intervening on Nodes 3, 7, and 6 being clear, we attempt to infer the possible physiological/medical meaning behind each of the nodes during a heart process, i.e., to trace the cause from the effect; at the same time, their mutual causal relations should also be borne in mind for inference. For example, Node 6 could be an electrolyte disorder [42, 151] that causes the lower amplitude of the T wave, as shown in Fig. 3.19. Node 7 could be the medication prompted by Node 6 that helps to balance the electrolytes and leads to the increased T wave amplitude. Node 3 is the child of both Node 6 and Node 7, and could be an AV block or SA block caused by the joint effect of the drug and electrolyte abnormalities. Thus, when Node 3 is intervened on, a prolonged PR interval due to the impaired conduction pathway from the SA to the AV node appears, as the statistical results show, as well as an inverted P wave caused by improper functioning of the SA node. So far, we have examined the possible medical meanings of the latent causal representation vector using intervention with the causal CVAE model. Because of the complexity of the causes of disorders in the ECG waveform, with further professional input from doctors, our hypothesized interpretations can be validated and complemented for a more clinically solid causal analysis.

3.8 Chapter Summary

This chapter presents a novel application of digital twins for continuous precision cardiac monitoring by inferring the ECG waveform from PPG signals. Different from the previous chapter, this chapter deals with real-world scenarios in which only limited ECG signals are available from the target individuals for whom the personalized digital twin is designed.
A transfer learning method is proposed to fine-tune the generic digital twin model, which is pre-learned from a large portion of available paired PPG and ECG data in the training corpus, with limited paired PPG and ECG data from the target participant. Experimental results validate that the proposed transfer learning scenario achieves better continuous ECG reconstruction accuracy for the target participants than the baseline comparison models. This suggests that our proposed method can generate a reliable digital twin for accurate and personalized continuous cardiac monitoring, pointing toward a promising future in which people can receive early medical intervention through personalized digital twins. In addition to using the previously proposed dictionary learning framework as the backbone model for fine-tuning, the vanilla CVAE model and the causal CVAE model are proposed to learn the underlying latent vector that represents the heart process, taking the electrical and mechanical physiological processes into account for better explainability and better inference performance.

Chapter 4
A Multi-Channel Ratio-of-Ratios Method for Noncontact Hand Video Based SpO2 Monitoring Using Smartphone Cameras

4.1 Related Work

4.1.1 Contact-based SpO2 measurement using smart devices

Early detection of changes in SpO2 is significant for facilitating timely management of asymptomatic patients with clinical deterioration. Conventional SpO2 measurement methods rely on contact-based sensing, such as the pulse oximetry introduced in Chapter 1.1.4, designed around the RoR principle. With the ubiquity of smartphones and the growing market of smart fitness devices, the RoR principle has been applied in new nonclinical settings for SpO2 measurement. The Apple Watch Series 6 has blood oxygen measurement functionality, and it requires skin contact with the watch that is neither too tight nor too loose for the best results [67].
The recent scientific literature has also explored methods for SpO2 estimation using a smartphone. These methods require a user to cover an optical sensor and a nearby light source with a fingertip to capture the reemitted light from the illuminated tissue [17, 40, 96, 123, 140]. In this setup, an adapted ratio-of-ratios model is utilized with the red and blue (or green) channels of color videos in lieu of the traditional narrowband red and infrared wavelengths.

The aforementioned SpO2 estimation methods based on smartphones and smartwatches are contact-based, which can present a risk of cross-contamination between individuals using the same measurement device. An additional issue with contact-based methods is that they may irritate sensitive skin or cause a burning sensation from built-up heat if a fingertip is in contact with the flashlight for an extended period of time. Also, pulse oximeters may not be widely available in marginalized communities and some less developed countries [64].

4.1.2 Noncontact SpO2 measurement using cameras

Researchers have recently investigated measuring blood oxygen saturation by means of contactless techniques [57, 80, 149, 150]. These methods typically acquire a user's face video under ambient light with CCD cameras to estimate SpO2 from the pulsatile information of monochromatic wavelengths. Shao et al. [129] also use a facial video-based method to monitor SpO2, implemented with a CMOS camera and a light source alternating between two different wavelengths. Tsai et al. [147] acquire hand images with CCD cameras under two monochromatic lights to analyze SpO2 from the reflective intensity of the shallow skin tissue. These contactless methods can provide alternatives to contact-based SpO2 measurements for individuals with finger injuries or nail polish [34, 165], for whom traditional pulse oximeters may be inaccurate.
However, the setups used in the abovementioned studies rely on either high-end monochromatic cameras with selected optical filters or controlled monochromatic light sources, making them expensive and uncommon for daily use.

As more economical camera devices, smartphones and webcams have also been applied for contactless SpO2 estimation. Most of the SpO2 estimation works using digital RGB cameras under ambient light [12, 22, 119, 139] adapt the conventional RoR model, based on the red and infrared wavelengths, directly to the red and blue channels of RGB videos. It is worth noting that the SpO2 data collected in [22, 119] only cover a small dynamic range (mostly above 95%), and Tarassenko et al. [139] and Bal et al. [12] show a fitted linear relation between RoR and SpO2 for only several minutes of data. The above limitations can be due to: i) signals extracted from the red and blue channels are noisier than those extracted from the green channel [152], and ii) unlike the narrowband signals modeled in the conventional RoR model, the RGB color channels capture a wide range of wavelengths from the ambient light. The aggregation of the broad range of wavelengths lowers the optical difference between Hb and HbO2, making the channels less optically selective than narrowband oximeter sensors and SpO2 sensing more challenging. We are therefore motivated to disentangle the aggregation effect through a meaningful combination of the pulsatile signals from all three channels of RGB videos to distill the SpO2 information.

4.2 Ratio-of-Ratios (RoR) Model for Noncontact SpO2 Measurement

Consider a light source with spectral distribution I(λ) illuminating the skin and a remote color camera with spectral responsivity r(λ) recording an image.
According to the skin-reflection model [156], the color camera receives the specularly reflected light from the skin surface and the diffusely reflected light from the tissue-light interaction that contains the pulsatile information. Based on the assumption proposed in [57] that the specular reflection can be ignored if the color change from movement is properly treated and minimized, the camera sensor response at time t can be expressed as:

S_c(t) = ∫_{Λ_c} I(λ) · e^{−μ_d(λ,t)} · r_c(λ) dλ,    (4.1)

where λ is the wavelength, the integral range Λ_c is the sensitive response wavelength band of the c-th channel of the camera, I(λ) is the spectral intensity of the light source, μ_d(λ, t) is the diffusion coefficient, and r_c(λ) is the sensor response of the c-th channel of the camera. According to Beer-Lambert's law, the diffusion coefficient μ_d(λ, t) can be expanded into:

μ_d(λ, t) = ε_t(λ) C_t l_t + [ε_Hb(λ) C_Hb + ε_HbO2(λ) C_HbO2] · l(t),    (4.2)

where ε_Hb, ε_HbO2, and ε_t are the extinction coefficients of arterial deoxyhemoglobin, arterial oxyhemoglobin, and other tissues including the venous blood vessels, respectively. C_t, C_Hb, and C_HbO2 are the concentrations of the corresponding substances. l_t is the path length that the light travels in the tissue, which is assumed to be static and invariant in time. l(t) is the path length that the light travels in the arterial blood vessels; it is modeled as time-varying because the arteries dilate with increased blood during systole compared to diastole.

When the camera is monochromatic, the incoming light is filtered by a narrowband optical filter, or the light source is a narrowband LED, the integral range Λ_c simplifies to a single value λ_i, such that the response of the camera sensor in Eq. (4.1) can be written as:

S_c(t) = I(λ_i) · e^{−ε_t(λ_i) C_t l_t} · r_c(λ_i) · e^{−[ε_Hb(λ_i) C_Hb + ε_HbO2(λ_i) C_HbO2] · l(t)}.    (4.3)

Let Δl = l_max − l_min denote the difference of the light path of the pulsatile arterial blood between diastole, when l(t) = l_min, and systole, when l(t) = l_max. Then the log-ratio of the response of the c-th channel of the camera sensor during diastole and systole is:

R(λ_i) = log( S_c|_{l=l_min} / S_c|_{l=l_max} )    (4.4a)
       = [ε_Hb(λ_i) C_Hb + ε_HbO2(λ_i) C_HbO2] · Δl.    (4.4b)

The ratio-of-ratios (RoR) between two different wavelengths λ_1 and λ_2 is:

RoR(λ_1, λ_2) = R(λ_1) / R(λ_2) = [ε_Hb(λ_1) C_Hb + ε_HbO2(λ_1) C_HbO2] / [ε_Hb(λ_2) C_Hb + ε_HbO2(λ_2) C_HbO2].    (4.5)

Since SpO2(%) = C_HbO2 / (C_HbO2 + C_Hb), the relation between RoR and SpO2 can be derived from Eq. (4.5) as:

SpO2 = [ε_Hb(λ_1) − ε_Hb(λ_2) · RoR] / {ε_Hb(λ_1) − ε_HbO2(λ_1) + [ε_HbO2(λ_2) − ε_Hb(λ_2)] · RoR}    (4.6a)
     ≈ α · RoR + β,    (4.6b)

where the linear approximation can be obtained by Taylor expansion.

The linear RoR model in Eq. (4.6b) has been applied under different SpO2 measurement scenarios. For pulse oximeters, λ_1 = 660 nm and λ_2 = 940 nm are used to leverage the optical absorption difference of Hb and HbO2 at the two wavelengths. In some prior art using narrowband light sources or monochromatic camera sensors [80, 129] for contactless SpO2 monitoring, different combinations of (λ_1, λ_2) have been explored. In the prior art using consumer-grade RGB cameras [12, 22, 119, 136, 139], only two of the three available RGB channels were used with the linear RoR model.

Among the abovementioned SpO2 estimation methods using consumer-grade RGB cameras, the SpO2 data collected in [22, 119] only cover a small dynamic range (mostly above 95%), which is not very informative. Bal et al. [12] and Tarassenko et al. [139] show a fitted linear relation between RoR and SpO2 for data that last merely several minutes. These limitations can be attributed to the fact that, unlike the signals captured in the narrowband setting that is modeled precisely by Eq. (4.3) and Eq. (4.4), all three RGB color channels capture a wide range of wavelengths from the ambient light, as described in Eq. (4.1).
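Equations (4.4)-(4.6b) can be turned into a few lines of code. The sketch below uses hypothetical diastole/systole intensities and hypothetical calibration coefficients α and β; in practice these coefficients are fitted by regression against reference SpO2 readings:

```python
import math

def log_ratio(s_diastole, s_systole):
    """R in Eq. (4.4a): log of the diastole-to-systole response ratio."""
    return math.log(s_diastole / s_systole)

def spo2_from_ror(r1, r2, alpha, beta):
    """Linear model of Eq. (4.6b): SpO2 ~= alpha * RoR + beta."""
    ror = r1 / r2          # ratio-of-ratios, Eq. (4.5)
    return alpha * ror + beta

# Hypothetical camera responses for two channels at diastole and systole.
r_red = log_ratio(s_diastole=0.92, s_systole=0.90)
r_blue = log_ratio(s_diastole=0.85, s_systole=0.82)
# Hypothetical calibration: alpha and beta must be fitted per device and dataset.
spo2 = spo2_from_ror(r_red, r_blue, alpha=-25.0, beta=112.0)
```

Note that R is positive, since the diastolic response is larger (less attenuation through the less dilated artery), which keeps the RoR well defined.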
The aggregation of the broad range of wavelengths lowers the optical difference between Hb and HbO2 and makes the channels less optically selective than the narrowband sensors used in oximeters. To address this issue, we disentangle the aggregation through a careful combination of the pulsatile signals from all three channels of RGB videos to efficiently distill the SpO2 information.

4.3 Proposed Multi-Channel RoR Method

In this work, we propose a multi-channel RoR method for noncontact SpO2 monitoring using hand videos captured by smartphone cameras under ambient light. Fig. 4.1 illustrates the proposed procedure for SpO2 estimation from the smartphone-captured hand videos. First, the hand is detected as the region of interest (ROI) for each frame. Second, the spatial average over the ROI is calculated to obtain three time-varying signals of the RGB channels. The averaged RGB signals are extracted for two purposes: i) to estimate the heart rate (HR), and ii) to acquire the filtered cardio-related AC components using an HR-based adaptive bandpass filter. Third, the ratio between the AC and the DC components for each color channel and the pairwise ratios of the resulting three ratios are computed as the features for a regression model in which SpO2 is treated as the label. The details of each step are provided as follows.

Figure 4.1: System illustration for the SpO2 prediction using the smartphone-captured hand videos. The pixels from the hand region are utilized for prediction, and an rPPG signal is extracted for heart rate (HR) estimation. Multi-channel RoR features are derived from the spatially combined RGB signals with the help of the HR-guided filters. The extracted features are then used for SpO2 prediction.
4.3.1 ROI Localization and Spatial Combining

First, we manually draw a rectangle to include the target hand region. This RGB region is converted to YCrCb color space, and the Cr channel is used [23] to determine a threshold that differentiates the skin pixels from the background based on the Otsu algorithm [109]. We apply an erosion and a dilation algorithm with a median filter to exclude noise pixels outside of the binary hand mask region. The final hand-shaped mask is considered as the ROI, and an example is shown in the second picture in Fig. 4.1. For all n frames in the video, we calculate the spatial average of the RGB channels in the ROI as A = [r̄; ḡ; b̄], where r̄, ḡ, b̄ ∈ R^{1×n} and A ∈ R^{3×n}.

4.3.2 rPPG Extraction and HR Estimation

Typically in the RoR method, after the matrix A in Section 4.3.1 is calculated, the AC component for each channel of A is determined by either the standard deviation [123] or the peak-to-valley amplitude [129]. Since the signal-to-noise ratio (SNR) is lower for the video captured by a smartphone in a contactless manner, we propose to use an adaptive bandpass filter centered at the HR to filter the RGB channel signals and extract the AC components more precisely.

The HR can be measured contact-free by capturing the pulse-induced subtle color variations of the skin. The pulse signal, referred to as remote photoplethysmogram (rPPG), can be obtained by applying the plane-orthogonal-to-skin (POS) algorithm [156], which defines a plane orthogonal to the skin tone in the RGB space for robust rPPG extraction. The HR is then tracked from the rPPG signal via a state-of-the-art adaptive multi-trace carving (AMTC) [172, 173] algorithm that tracks the HR from the spectrogram of rPPG by dynamic programming and adaptive trace compensation.
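The ROI localization and spatial combining of Section 4.3.1 can be sketched as follows. This NumPy-only sketch makes several simplifications: the hypothetical `average_rgb_over_roi` helper omits the manual rectangle and the erosion/dilation cleanup, the Cr conversion uses BT.601 coefficients, and the assumption that skin pixels have the higher Cr values determines the mask direction.

```python
import numpy as np

def cr_channel(frame):
    """Cr component of YCrCb (BT.601 coefficients) from an RGB frame."""
    r, g, b = frame[..., 0], frame[..., 1], frame[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    return 0.713 * (r - y) + 128.0

def otsu_threshold(values, bins=256):
    """Otsu's method: pick the threshold maximizing between-class variance."""
    hist, edges = np.histogram(values, bins=bins)
    p = hist / hist.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0, m = np.cumsum(p), np.cumsum(p * centers)
    with np.errstate(divide="ignore", invalid="ignore"):
        var_b = (m[-1] * w0 - m) ** 2 / (w0 * (1.0 - w0))
    return centers[np.argmax(np.nan_to_num(var_b))]

def average_rgb_over_roi(frames):
    """Return A in R^{3 x n}: per-frame spatial mean of R, G, B over the skin mask."""
    signals = []
    for frame in frames:
        cr = cr_channel(frame.astype(float))
        mask = cr > otsu_threshold(cr)          # assume skin pixels have higher Cr
        signals.append(frame[mask].mean(axis=0))
    return np.stack(signals, axis=1)            # shape (3, n)

# Synthetic 2-frame "video": a reddish square (stand-in for skin) on a dark background.
frames = np.zeros((2, 32, 32, 3))
frames[:, 8:24, 8:24] = [180.0, 120.0, 100.0]   # hypothetical skin color
A = average_rgb_over_roi(frames)
```

On this toy input the Otsu threshold separates the reddish square from the background, so each column of A is simply the mean RGB of the square.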
To study the role of accurate HR tracking for feature extraction, we also implemented a peak-finding method and a weighted energy method for frequency estimation [59] to compare with AMTC. The peak-finding method takes the peaks of the squared magnitude of the Fourier transform of rPPG as the estimated HR values, which was used in [149] and [57]. The weighted energy method finds the heart rate by weighting the frequency bins in the corresponding frame of the spectrogram of rPPG. Compared to the peak-finding method, the weighted energy method is more robust to outliers in frequency. Fig. 4.2 illustrates an example of the HR estimation results by the peak-finding method, the weighted energy algorithm, and AMTC, respectively.

Figure 4.2: (a) Spectrogram of an rPPG signal. (b) Reference HR signals and HR signals estimated by the "naive" peak-finding algorithm, the weighted energy frequency estimation algorithm, and the AMTC algorithm, respectively. The mean absolute errors (MAE) of the HR estimation algorithms are 6.00 bpm, 4.94 bpm, and 2.50 bpm, respectively.

4.3.3 Feature Extraction

We use a processing window of 10 seconds with a step size of 1 second to segment the whole video into L windows. Within each window, the DC and AC components of the RGB channels are calculated to build a feature vector f.

DC component: We use a second-order lowpass Butterworth filter with a cutoff frequency of 0.1 Hz. The DC component is estimated using the median of the lowpass filtered signal of each window.
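A minimal sketch of the window-level DC and AC computation, assuming SciPy is available. The 10-second window, the ±0.1 Hz band, and the simple max-minus-min stand-in for the average peak-to-valley amplitude are illustrative choices, not the exact thesis implementation.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 30.0  # video frame rate in fps

def dc_component(x, fs=FS):
    """DC: median of the 2nd-order, 0.1 Hz lowpass Butterworth output."""
    sos = butter(2, 0.1, btype="low", fs=fs, output="sos")
    return np.median(sosfiltfilt(sos, x))

def ac_component(x, hr_hz, fs=FS, half_bw=0.1):
    """AC: 8th-order Butterworth bandpass of +/-0.1 Hz around the estimated HR,
    then peak-to-valley size (max - min as a crude stand-in for the average)."""
    sos = butter(4, [hr_hz - half_bw, hr_hz + half_bw], btype="band",
                 fs=fs, output="sos")   # order 4 -> 8th-order bandpass
    y = sosfiltfilt(sos, x)
    return y.max() - y.min()

# A 10-second window: baseline 100 plus a 1.2 Hz (72 bpm) pulse of amplitude 1.5.
t = np.arange(int(10 * FS)) / FS
x = 100.0 + 1.5 * np.sin(2 * np.pi * 1.2 * t)
dc, ac = dc_component(x), ac_component(x, hr_hz=1.2)
```

Second-order sections (`sos`) are used instead of transfer-function coefficients because a narrow 8th-order bandpass is numerically fragile in the latter form.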
AC component: The estimated heart rate values from Section 4.3.2 are used as the center frequencies for the adaptive bandpass (ABP) filters to extract the AC components of the RGB channels, which eliminates all frequency components that are unrelated to the cardiac pulse. We adopt an 8th-order Butterworth bandpass filter with ±0.1 Hz (±6 bpm) bandwidth, centered at the estimated HR of the current window. The magnitude of the AC component is estimated using the average peak-to-valley amplitudes of the filtered signals within the current processing window.

We define the normalized AC components at the ith window as R(i, c) = AC(i, c) / DC(i, c), where c ∈ {r, g, b} represents the color channel and i ∈ {1, 2, ..., L}. We define the multi-channel ratio-of-ratios based feature vector of the ith window as

f_i = [R(i, r), R(i, g), R(i, b), R(i, r)/R(i, g), R(i, r)/R(i, b), R(i, g)/R(i, b)] ∈ R^{1×6}.

4.3.4 Regression and Postprocessing

As a proof-of-concept, we use linear regression and support-vector regression (SVR) to learn the mapping between the features and the SpO2 values.

The linear regression has limited learning capability since it captures only the linear relationship, so we use it as a baseline approach. In the objective function in Eq. (4.7), y = [y_1, ..., y_l]ᵀ ∈ R^{l×1} is the target SpO2 value, β ∈ R^{6×1} is the predictor, and F is the feature matrix that serves as input. We add an L2 regularization term in Eq. (4.7) to avoid collinearity. To select the optimal weight λ for the L2 regularization term, we use 5-fold cross-validation:

min_β  ‖y − Fβ‖²₂ + λ‖β‖²₂.                                  (4.7)

SVR models are adopted for exploring a possibly nonlinear relation between the feature vectors and the SpO2 estimation. The Libsvm library [24] is used for training the ε-SVR in Eq. (4.8). In our implementation, we use the nonlinear radial basis function (RBF) kernel for the SVR. The hyperparameters, including the penalty cost C and the parameter γ of the kernel function K(f_i, f_j) = φ(f_i)ᵀφ(f_j) = exp(−γ‖f_i − f_j‖²), are selected via grid search and a 5-fold cross-validation:

min_{ω, b, ξ, ξ*}  (1/2)‖ω‖²₂ + C ∑_{i=1}^{l} (ξ_i + ξ*_i)
s.t.  φ(f_i)ᵀω + b − y_i ≤ ε + ξ_i,                          (4.8)
      y_i − φ(f_i)ᵀω − b ≤ ε + ξ*_i,
      ξ_i, ξ*_i ≥ 0,  i = 1, ..., l.

Once an estimated weight vector ŵ is learned from the linear or support vector regression, ŵ is then used to predict a preliminary SpO2 signal. Finally, a 10-second moving average window is applied to smooth out the preliminarily predicted signal to obtain the final predicted SpO2 signal.

4.4 Experimental Results

4.4.1 Data Collection

Fourteen volunteers, including eight females and six males, aged between 21 and 30, were enrolled in our study under protocol #1376735 approved by the University of Maryland Institutional Review Board (IRB). Participants were asked to categorize their skin tone based on the Fitzpatrick skin types [10] shown in Fig. 4.3. There are two, eight, one, and three participants having skin types II, III, IV, and V, respectively. None of the participants had any known cardiovascular or respiratory diseases. During the data collection, participants were asked to hold their breath to induce a wide dynamic range of SpO2 levels. The typical SpO2 range for a healthy person is from 95% to 100%. By holding breath, the SpO2 level can drop below 90%. Once the participant resumes normal breathing, the SpO2 will return to the level before the breath-holding.

Figure 4.3: Fitzpatrick skin types [10].

Figure 4.4: Experimental setup for data collection of hand videos and reference signals using an oximeter. The left index finger was placed in a CMS-50E pulse oximeter to record the reference HR and SpO2 signals. The smartphone camera records the video of both hands.

Each participant was recorded for two sessions. During the recording, the participant sat comfortably in an upright position and put both hands on a clean dark foam sheet placed on a table. As shown in Fig.
4.4, the palm side of the right hand and the back side of the left hand were facing the camera. These two hand-video capturing positions are defined as palm up (PU) and palm down (PD), respectively. The participant was asked to keep his/her hands still on the table to avoid hand motion. Simultaneously, a Contec CMS-50E pulse oximeter was clipped to the left index finger to measure the participant's SpO2 level at a sampling rate of 1 Hz. As we have reviewed earlier, the oximeter is clinically accepted to be within a ±2% deviation from the invasive gold standard for SpO2 [114], so we use the oximeter measurement results as the reference in our experiments. An iPhone 7 Plus camera was fixed by a smartphone stand mounted on a tripod for video recording at a frame rate of 30 fps. The video started 30 seconds before the oximeter started and stopped immediately after the oximeter ended to allow for proper time synchronization.

The participants were asked to hold their breath for generally 30–40 seconds to lower the SpO2 level, as long as they were comfortable and able to do so. Then the participants resumed normal breathing for generally 30–40 seconds until they recovered and felt ready for the next breath-holding. The recovery period was long enough for the participants' SpO2 to return to the levels before the breath-holding. The aforementioned process is defined as one breath-holding cycle. In each session, the breath-holding cycles were repeated three times. After the first session, the participants were asked to relax for at least 15 minutes before attending the second session for data collection.

From our data collection protocol using breath-holding, we were able to obtain SpO2 measurements ranging from 89% to 99%. The total length of recording time for all fourteen participants is 138.9 minutes. In terms of each participant, the minimum duration is 103 seconds and the maximum duration is 468 seconds. The average duration is 298 seconds.
The current data size is relatively small for large-scale neural network training. This is in large part due to the restrictions on human-subject data collection imposed during the COVID-19 pandemic. The available data, however, are adequate for our principled multi-channel signal based approach to SpO2 monitoring, showing a benefit of combining signal processing and biomedical knowledge and modeling with data over a primarily data-driven approach.

Delay Estimation of Pulse Oximeter: When the CMS-50E oximeter is turned on and ready for measurement, the first reading is displayed a few seconds after the finger is inserted. This delay may be due to the oximeter's internal firmware startup and algorithmic processing. Since we need to synchronize the video and the oximeter readings using their precise starting time stamps, the delay in the oximeter can introduce misalignment errors in the reference data that we use to train the regression model. To avoid misalignment, we first estimate the delay and then subtract it from the oximeter's internal timestamp as the corrected oximeter's timestamp. To estimate the internal delay, we asked one participant to repeatedly place the left index finger, middle finger, and ring finger into the oximeter 50 times each and obtained average delay times of 1.8 s, 1.9 s, and 1.7 s, respectively. Because the left index finger is used for reference data collection in our setup, we take 1.8 s as the delay. To further examine whether there exists any difference among the delays from the three fingers, we conducted a one-way ANOVA test. The p-value is 0.14, which indicates no statistically significant difference in delay among the three fingers.

4.4.2 Performance Metrics

The performance of the algorithm is evaluated using the mean absolute error (MAE) and Pearson's correlation coefficient ρ given in Eq. (4.9). The correlation is adopted to evaluate how well the trend of the SpO2 signal is tracked:

MAE(y, ŷ) = (1/N) ∑_{i=1}^{N} |y_i − ŷ_i|,   ρ(y, ŷ) = (y − ȳ)ᵀ(ŷ − ȳ̂) / (‖y − ȳ‖₂ ‖ŷ − ȳ̂‖₂),   (4.9)

where y = [y_1, ..., y_N]ᵀ and ŷ = [ŷ_1, ..., ŷ_N]ᵀ denote the reference SpO2 signal and the estimated SpO2 signal, and ȳ and ȳ̂ denote the average values of all coordinates of the vectors y and ŷ, respectively.

4.4.3 Results From Proposed Algorithm

In this subsection, we use the training data from one participant to train the regression model for the prediction of his/her testing session recorded later. We call the aforementioned training and testing procedure the participant-specific mode, in which the models are specifically learned for each participant. We will discuss the leave-one-out mode of the performance of the proposed algorithm in Section 4.4.5.

Fig. 4.5 presents the learning results for all the participants using SVR for PU cases. Both training and testing sessions are shown for each participant. The SpO2 curves in each session contain three dips that result from breath-holding, except for participant #8 who had a shorter session due to limited tolerance of breath-holding. For each participant, we provide the skin-tone information in the subplot and show the accuracy indicators, MAE and ρ, for SpO2 prediction. In all training sessions, MAE is below 2.4% and ρ is above 0.6. From this observation, we find that all the predicted SpO2 signals in the training sessions closely follow the reference signals' trends, despite the exact value differences between the predicted and the reference signals, such as the differences around the last dip for participant #13. Furthermore, all testing MAE values
Furthermore, all testing MAE values 116 MAE = 1.66%, = 0.87 MAE = 1.23%, = 0.84 MAE = 0.97%, = 0.88 MAE = 1.34%, = 0.71 MAE = 1.08%, = 0.76 MAE = 1.34%, = 0.65 Training Test Training Test Training Test (a) Participant 1, Skin Type III (b) Participant 2, Skin Type III (c) Participant 3, Skin Type III MAE = 1.23%, = 0.74 MAE = 1.57%, = 0.65 MAE = 0.96%, = 0.89 MAE = 1.37%, = 0.47 MAE = 0.41%, = 0.81 MAE = 0.54%, = 0.75 Training Test Training Test Training Test (d) Participant 4, Skin Type III (e) Participant 5, Skin Type III (f) Participant 6, Skin Type III MAE = 0.83%, = 0.77 MAE = 0.74%, = 0.82 MAE = 2.38%, = 0.74 MAE = 1.50%, = 0.76 MAE = 1.70%, = 0.60 MAE = 1.36%, = 0.64 Training Test Training Test Training Test (g) Participant 7, Skin Type II (h) Participant 8, Skin Type IV (i) Participant 9, Skin Type V MAE = 0.99%, = 0.67 MAE = 1.01%, = 0.62 MAE = 2.00%, = 0.61 MAE = 1.29%, = 0.58 MAE = 1.26%, = 0.83 MAE = 1.71%, = 0.71 Training Test Training Test Training Test (j) Participant 10, Skin Type V (k) Participant 11, Skin Type II (l) Participant 12, Skin Type III MAE = 1.96%, = 0.74 MAE = 1.64%, = 0.69 MAE = 1.13%, = 0.80 MAE = 1.02%, = 0.68 Training Test Training Test Reference Signal Predicted Signal Time (s) Time (s) Time (s) Time (s) (m) Participant 13, Skin Type V (n) Participant 14, Skin Type III Figure 4.5: Predicted SpO2 signals for all participants using SVR when the palm is facing the camera, i.e., the palm-up scenario. Prediction results of training and testing sessions are shown for each participant with reference SpO2 in red dash lines and predicted SpO2 in solid black lines. The higher the correlation ? and the lower the MAE, the better the predicted SpO2 captures the trend of the reference signal. 117 SpO2 (%) SpO2 (%) SpO2 (%) SpO2 (%) SpO2 (%) Training Testing MAE Correlation ? MAE Correlation ? 
1.69% 0.63 1.52% 0.62 PU LR (?0.57%) (?0.16) (?0.54%) (?0.11) 1.74% 0.61 1.53% 0.56 PD (?0.76%) (?0.21) (?0.53%) (?0.21) 1.33% 0.76 1.26% 0.68 PU SVR (?0.54%) (?0.09) (?0.33%) (?0.10) 1.35% 0.75 1.28% 0.65 PD (?0.45%) (?0.09) (?0.40%) (?0.14) Table 4.1: Performance of the proposed method. Results using linear regression (LR) and support vector regression (SVR) for both sides of the hand are quantified in terms of the sample mean and sample standard deviation (in parentheses). are within 1.8%, suggesting that those trained models adapt well to the testing data. While there are a few cycles where the predicted signal does not fully follow the reference signal, such as the second dip for participant #4 and participant #11, the trends are consistent. Table 4.1 summarizes the training and testing SpO2 estimation performance of both LR and SVR based methods for both PU and PD cases. The best performance is achieved using the SVR method in the PU case. We further examine the difference between the two regression methods using boxplots in Fig. 4.6(a) that show the distributions of the correlation ? for testing by LR and SVR, respectively. Each boxplot in Fig. 4.6(a) contains both PU and PD cases from all participants. The results are compared in terms of the median and the interquartile range (IQR). IQR quantifies the spread of the distribution by measuring the difference between the first quartile and the third quartile. The boxplots in Fig. 4.6(a) reveal that the SVR method outperforms LR with a higher median of 0.68 compared to 0.59 and with a narrower IQR of 0.17 compared to 0.19. This suggests that there may exist a nonlinear relationship between the extracted features and the SpO2 values. To examine the impact of the side of a hand and the skin tone on the performance of SpO2 118 Palm down Palm up (a) (b) (c) Figure 4.6: Boxplots of testing correlation coefficient ? for all participants when grouped using different criteria. 
(a) Distributions contrasting linear and support vector regressions. (b) Distri- butions of palm-up and palm-down cases. (c) A detailed breakdown of (b) in terms of skin-tone subgroups. estimation, we analyze the following two research questions: (i) whether the side of the hand makes a difference in lighter skin (type II and III) or darker skin (type IV and V) or mixed skins (all participants), and (ii) whether the different skin tones matter in PU or PD case. To answer question (i), we first focus on the distributions from PU and PD cases in Fig. 4.6(b) with each boxplot representing the correlation ? in testing for all participants. We observe that the PU case outperforms the PD case with a higher median of 0.64 compared to 0.60 and a nar- rower IQR of 0.15 compared to 0.25. We then zoom into each subgroup of skin tones shown in Fig. 4.6(c). For the lighter skin group, even though the median of PD case is 0.71, which is 9% better than that of PU, the IQR of PD case is 0.24, which is worse than the IQR of 0.17 of PU case. This suggests that the distributions are comparable between PU and PD cases for the lighter skin group. For the darker skin group, the PU case outperforms the PD case with a higher median of 0.62 compared to 0.54 and a narrower IQR of 0.07 compared to 0.15. In summary, there is no substantial difference between PU and PD cases in the lighter skin group, whereas, for the darker 119 Correlation coefficient skin group and overall participants, the PU case is better than the PD case. To answer question (ii), we first focus on the left two boxplots of Fig. 4.6(c). In the PD case, the median of the lighter skin group is significantly larger than that of the darker skin group by 31%, however, the lighter skin group also has a larger IQR. This makes it difficult to make a conclusion from the median?IQR analysis, hence we apply the t-test to complement our analysis. 
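The group comparison above can be reproduced in form with SciPy's two-sample t-test. The per-participant correlation values below are made up for illustration and are not the values measured in this study.

```python
import numpy as np
from scipy import stats

# Hypothetical per-participant testing correlations for the PD case
# (illustrative numbers only, NOT the values from this study):
lighter_pd = np.array([0.71, 0.64, 0.80, 0.58, 0.75, 0.69])
darker_pd = np.array([0.50, 0.55, 0.48, 0.60])

# Two-sided independent two-sample t-test at the 5% significance level.
t_stat, p_value = stats.ttest_ind(lighter_pd, darker_pd)
significant = p_value < 0.05
```

A positive t statistic indicates the lighter-skin group mean exceeds the darker-skin group mean; whether the difference is significant is read from the p-value.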
We note that the p-value is 0.037 < 0.05, showing that there is a significant difference between these two groups. In the PU case shown in the right half of Fig. 4.6(c), the medians of the lighter skin group and darker skin group are 0.65 and 0.62, with IQRs of 0.17 and 0.07, respectively. Thus, in our current dataset, no substantial performance difference is observed between lighter and darker skin tones in the PU case.

4.4.4 Ablation Study of Proposed Pipeline

In Sections 4.3.2 and 4.3.3, we have proposed three key designs in our algorithm, including a) the feature vector f containing pulsatile information from all RGB channels, b) the narrow ABP filter, and c) the passband of the ABP filter centered at the precise HR frequency tracked by AMTC. To study the importance of each component, we conducted three controlled experiments by removing one factor at a time; the configurations of the methods corresponding to the experiments are listed in Table 4.2. The results for the methods are illustrated in Fig. 4.7. The height of each bar shows the average correlation coefficient ρ or the MAE of the SpO2 estimation results from the testing sessions (SVR, PU case) of all participants. Each pair of error bars corresponds to the 95% confidence interval, calculated as ±1.96 · σ̂/√N, where σ̂ is the sample standard deviation and N is the sample size (the number of participants).

  Method      Multi-channel       Narrow           Accurate
  Index       RoR features?       ABP filter?      HR tracking?
  I           Two-channel RoR     ✓                ✓ (AMTC)
  II          ✓                   No ABP           n/a
  III         ✓                   Wide ABP         ✓ (AMTC)
  IV          ✓                   ✓                Peak-finding
  V           ✓                   ✓                Weighted energy
  Proposed    ✓                   ✓                ✓ (AMTC)

Table 4.2: Configurations for the ablation study of the proposed pipeline. The controlled experiments are conducted by replacing or removing one component at a time.
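The error-bar computation ±1.96·σ̂/√N can be written directly. The ρ values below are placeholders for illustration, not measurements from this study.

```python
import numpy as np

def ci95_halfwidth(samples):
    """Half-width of the 95% normal-approximation confidence interval:
    1.96 * sample standard deviation / sqrt(N)."""
    samples = np.asarray(samples, dtype=float)
    return 1.96 * samples.std(ddof=1) / np.sqrt(len(samples))

# Hypothetical per-participant correlations (N = 14, illustration only):
rho = [0.68, 0.60, 0.72, 0.55, 0.70, 0.66, 0.63,
       0.74, 0.58, 0.65, 0.69, 0.61, 0.71, 0.67]
mean_rho, hw = float(np.mean(rho)), ci95_halfwidth(rho)
# An error bar would span [mean_rho - hw, mean_rho + hw].
```

Note `ddof=1` gives the sample standard deviation, matching the σ̂ defined in the text.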
Figure 4.7: Ablation study of the proposed method ((a) correlation ρ; (b) MAE (%)). The bar plots are from testing sessions (SVR, PU case) of all participants. The error bars correspond to the 95% confidence intervals.

4.4.4.1 Advantage of the Proposed Multi-Channel RoR Over Two-Channel RoR

In this part, we compare our proposed algorithm with "Method (I): RoR with nABP (AMTC)." Method (I) follows the feature extraction method proposed in Section 4.3.3, including the narrow adaptive bandpass filter (nABP) centered at the AMTC-tracked HR. The only exception is that, instead of using the feature vector f that contains multi-channel information, only the ratio of ratios between the red and blue channels is used, as in traditional RoR methods. Fig. 4.7 reveals that our proposed method outperforms method (I) by a large margin. More specifically, our proposed method improves the correlation coefficient from 0.22 to 0.68 and the MAE from 1.67% to 1.26%. This improvement confirms that our proposed multi-channel feature set helps with more accurate SpO2 monitoring.

4.4.4.2 Contribution of Narrowband ABP Filter for Feature Extraction

Here we compare the following two methods to show the necessity of using a narrowband HR-guided bandpass filter:

• Method (II): Feature vector without ABP uses a nonadaptive, generic bandpass filter with the passband over [1, 2] Hz, covering the normal range of heart rate in sedentary mode, to replace the HR-based narrow ABP filter proposed in Section 4.3.3 for feature extraction.

• Method (III): Feature vector with wide ABP (AMTC) applies a wider ABP filter with ±0.5 Hz bandwidth, compared with the ±0.1 Hz one used in our proposed method. This wider ABP filter's center frequency is provided by the AMTC tracking algorithm of the HR described in Section 4.3.2.
The bandpass filters used for methods (II) and (III) have the same bandwidth, 1 Hz. In terms of center frequency, method (II) uses a fixed setting at 1.5 Hz, while method (III) is adaptively centered at the estimated HR value. Compared to method (II), method (III) improves the testing MAE by 18%. Furthermore, compared to method (III), our proposed method with a narrow ABP filter improves the correlation coefficient ρ for testing by 13% and the MAE by 9%, suggesting the contribution of the narrow HR-based ABP filter strategy for AC computation.

4.4.4.3 Importance of Accurate HR Tracking on SpO2 Monitoring

We consider the following two methods to compare with our proposed method:

• Method (IV): Feature vector with narrow ABP (peak-finding) applies a narrow ABP filter of bandwidth ±0.1 Hz for extracting the feature vector f. The center frequency of the ABP filter is the HR estimated from the peak-finding algorithm described in Section 4.3.2.

• Method (V): Feature vector with narrow ABP (weighted) is similar to method (IV), except that the frequency estimation algorithm is replaced by the weighted energy method in Section 4.3.2.

The average MAEs of the HR estimation for all participants by the peak-finding algorithm, the weighted frequency estimation algorithm, and the AMTC algorithm are 7.11 (±3.66) bpm, 6.42 (±3.02) bpm, and 4.14 (±1.72) bpm, respectively.

Fig. 4.7 shows that methods (IV) and (V) perform similarly, with 0.56 vs. 0.57 for the correlation ρ and 1.43% vs. 1.40% for the MAE, respectively. Our proposed method guided by the AMTC-tracked HR outperforms methods (IV) and (V) by 21% and 19% in correlation, and by 12% and 10% in MAE, respectively. These results suggest that accurate HR estimation for the ABP filter design improves the quality of the AC magnitude by preserving the most cardiac-related signal from the RGB channels, which in turn helps with accurate SpO2 monitoring.
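The two baseline frequency estimators compared above can be sketched as follows. AMTC itself, a dynamic-programming spectrogram tracker, is not reproduced here, and the 0.7-3.0 Hz search band is an assumed plausible HR range rather than the thesis's exact setting.

```python
import numpy as np

FS = 30.0  # frame rate (fps)

def hr_peak_finding(rppg, fs=FS):
    """Peak-finding: HR (bpm) from the largest squared-magnitude FFT bin."""
    spec = np.abs(np.fft.rfft(rppg)) ** 2
    freqs = np.fft.rfftfreq(len(rppg), d=1.0 / fs)
    band = (freqs >= 0.7) & (freqs <= 3.0)   # assumed plausible HR range, 42-180 bpm
    return 60.0 * freqs[band][np.argmax(spec[band])]

def hr_weighted_energy(rppg, fs=FS):
    """Weighted energy: HR (bpm) as the energy-weighted mean in-band frequency."""
    spec = np.abs(np.fft.rfft(rppg)) ** 2
    freqs = np.fft.rfftfreq(len(rppg), d=1.0 / fs)
    band = (freqs >= 0.7) & (freqs <= 3.0)
    return 60.0 * np.average(freqs[band], weights=spec[band])

# Synthetic 10-second rPPG window at 72 bpm (1.2 Hz): both estimators agree here;
# they diverge when the spectrum contains outlier energy away from the HR peak.
t = np.arange(int(10 * FS)) / FS
rppg = np.sin(2 * np.pi * 1.2 * t)
hr_pf, hr_we = hr_peak_finding(rppg), hr_weighted_energy(rppg)
```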
4.4.5 Leave-One-Out Experiments

As a proof of concept and considering the currently limited amount of available data, we have so far discussed the SpO2 estimation under the participant-specific (PS) scenario in Section 4.4, where the models are calibrated for each individual. This PS mode corresponds well to the trending "precision telehealth" that tailors the healthcare service to individuals. In this subsection, we consider a more practical scenario where the test participant's data are never seen or only form a limited portion of the training data. In this scenario, we can develop a group-based model based on skin tone or other determinants of health, and for each subgroup, the model is "universal" and participant-independent. We will examine this group-based model through the following two modes of leave-one-out experiments:

• Leave-one-session-out (LOSessO): when testing on a given participant, we include his/her training session data together with other participants' data for training.

• Leave-one-participant-out (LOPartO): when testing on a given participant, we only use other people's data for training and leave out the data from this test participant.

We group the participants by skin type into lighter skin color (skin types II and III) and darker skin color (skin types IV and V) groups. We conduct LOSessO and LOPartO experiments on each subgroup and obtain the SVR generated testing results from all participants in Table 4.3. The MAE and correlation coefficient ρ improve from LOPartO to LOSessO to PS for both PU and PD cases.

              LOPartO                       LOSessO                       PS
        MAE             ρ             MAE             ρ             MAE             ρ
  PU    1.70% (±0.60%)  0.53 (±0.38)  1.59% (±0.58%)  0.55 (±0.36)  1.26% (±0.33%)  0.68 (±0.10)
  PD    1.76% (±0.59%)  0.48 (±0.38)  1.70% (±0.59%)  0.50 (±0.39)  1.28% (±0.40%)  0.65 (±0.14)

Table 4.3: Testing results of leave-one-participant-out (LOPartO) and leave-one-session-out (LOSessO) experiments, measured in the sample mean and the sample standard deviation (in parentheses).
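A leave-one-participant-out split like LOPartO can be generated in a few lines. The participant IDs below are a toy example; a real run would map each ID to that participant's feature windows.

```python
import numpy as np

def leave_one_participant_out(participant_ids):
    """Yield (pid, train_idx, test_idx): each participant is held out once,
    and the model is trained only on the remaining participants' windows."""
    ids = np.asarray(participant_ids)
    for pid in np.unique(ids):
        test = np.where(ids == pid)[0]
        train = np.where(ids != pid)[0]
        yield pid, train, test

# Toy example: 3 participants with 2 feature windows each.
ids = [1, 1, 2, 2, 3, 3]
splits = {pid: (tr.tolist(), te.tolist())
          for pid, tr, te in leave_one_participant_out(ids)}
```

The LOSessO variant would instead hold out only the test session's windows while keeping the same participant's training-session windows in the training set.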
This result suggests that the precision telehealth inspired PS mode is the most accurate approach to monitoring SpO2 for an individual. Based on the overall results shown in Table 4.3, most participants demonstrate a consistent trend of improving accuracy of SpO2 estimation from the LOPartO to the LOSessO to the PS case. The correlation ρ of participant #12 is less than −0.5 in both leave-one-out modes, suggesting that this participant may have an uncommon relation between the extracted features and the SpO2 values compared to other participants.

4.5 Discussions

4.5.1 Performance on Contact SpO2 Monitoring

In addition to contact-free SpO2 monitoring, we evaluate whether our proposed algorithm can be applied to a contact-based smartphone setup. To collect data, the left index finger covers the smartphone's illuminating flashlight and the nearby built-in camera, and the camera captures a pulse video at the fingertip. Another smartphone is used to simultaneously record a top view video of the back side of the right hand, whose index finger is placed in the oximeter for SpO2 reference data collection. One participant took part in this extended experiment, where one training session with three breath-holding cycles was recorded, and three testing sessions were recorded 30 minutes after the training session.

In Table 4.4, we compare the performance of our proposed algorithm in both the contact-based and contact-free SpO2 measurement settings.

                                      Training            Testing
                                      MAE      ρ          MAE      ρ
  Contact        RoR [123] (LR)       1.60%    0.54       1.38%    0.64
                 RoR [123] (SVR)      1.14%    0.73       1.32%    0.60
                 RoR [96] (LR)        1.47%    0.62       1.39%    0.63
                 RoR [96] (SVR)       0.99%    0.83       1.27%    0.66
                 Proposed             0.91%    0.84       1.17%    0.81
  Contact-free   RoR (2-channel)      1.61%    0.73       1.75%    0.36
                 Proposed             1.36%    0.62       1.29%    0.68

Table 4.4: Comparison of the proposed algorithm in both contact and contact-free SpO2 estimation settings. The testing results are measured in the average MAE and correlation coefficient ρ.
The conventional RoR models used in [123] and [96] were implemented as baseline models for contact-based SpO2 measurement. In [123], the mean and standard deviation of each window from the red and blue channels are calculated as the DC and AC components. A linear model was built to relate the ratio-of-ratios from the two color channels with SpO2. In [96], the median of the pulsatile peak-to-valley amplitude is regarded as the AC component. For the two RoR methods, we implemented both LR and SVR. For contact-free SpO2 measurement, we take the traditional two-color-channel RoR method implemented in Section 4.4.4 as the baseline to compare with the proposed method.

Table 4.4 reveals that our proposed algorithm outperforms the other conventional RoR models in contact-based SpO2 monitoring. Even in the contact-free case, our algorithm presents a performance comparable to that of the contact-based cases, even though the SNR of the fingertip video is better than the SNR of a remote hand video.

Figure 4.8: Illustration of blurring effects using different blur levels σ on hand videos. The wider the kernel is, the blurrier the videos are.

4.5.2 Resilience Against Blurring

In this subsection, we explore the robustness of our algorithm to the blurring effect on hand images. In the current setup, the hands are placed on a stable table with a cellphone camera acquiring the skin color of both hands. Ideal laboratory conditions are often not satisfied in practical scenarios, and the hand images captured by cellphone cameras may be blurred due to being out of focus. The point spread function is modeled as a 2D homogeneous Gaussian kernel. The finite support of the kernel is defined manually to generate perceptually different blur effects, and then the standard deviation σ is computed based on the given support. To test different blur effects, we experimented with two blur levels, σ = 1.1 (5 × 5 pixels) and σ = 2.6 (15 × 15 pixels), respectively.
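The blur simulation can be sketched with SciPy's Gaussian filter, used here as a stand-in for the truncated-support kernel described above; the input frame is random noise just to exercise the code, not a hand image.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def blur_frame(frame, sigma):
    """Apply a 2D homogeneous Gaussian blur to each color channel."""
    return np.stack([gaussian_filter(frame[..., c], sigma=sigma)
                     for c in range(3)], axis=-1)

# Blurring smooths individual pixels but roughly preserves the spatial mean,
# which is one intuition for why the spatially averaged RGB signals (and hence
# the AC/DC features) degrade gracefully under defocus.
frame = np.random.default_rng(0).uniform(0.0, 255.0, size=(64, 64, 3))
blurred = blur_frame(frame, sigma=2.6)
```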
We show the blurring effects in Fig. 4.8. Table 4.5 presents the SVR generated results for PU cases with different σ and kernel sizes. We use the SVR, PU scenario to showcase here as it achieves the best SpO2 prediction performance, which is verified in Section 4.4.3. From the table, we find that our algorithm is robust to the Gaussian blurring effect. After the σ = 1.1 blurring, the testing ρ remains the same, and the testing MAE is 6.3% higher than in the no-blurring case. After the σ = 2.6 blurring, the testing ρ is 1.5% lower and the MAE is 4.0% higher than in the no-blurring case.

                               Training                      Testing
                         MAE             ρ             MAE             ρ
  σ = 2.6 blur (15×15)   1.41% (±0.50%)  0.72 (±0.11)  1.31% (±0.35%)  0.67 (±0.09)
  σ = 1.1 blur (5×5)     1.42% (±0.59%)  0.70 (±0.16)  1.34% (±0.41%)  0.68 (±0.10)
  No blur                1.33% (±0.54%)  0.76 (±0.09)  1.26% (±0.33%)  0.68 (±0.10)

Table 4.5: Simulation of the Gaussian blurring effect on hand videos. SVR generated results for PU cases are listed for different σ and Gaussian kernel sizes. The results are quantified in terms of the sample mean and sample standard deviation (in parentheses).

4.5.3 Limitations and Further Verification with Intermittent Hypoxia Protocols

From the recordings of our data collection protocol for voluntary breath-holding, we observed that HR and SpO2 are correlated for many participants. That is, in one breath-holding cycle, when the participant starts to hold his/her breath, his/her HR increases and SpO2 drops as the oxygen runs out. As he/she resumes normal breathing, his/her HR and SpO2 recover to within the normal range. Due to individuals' different physical conditions, in some participants, the peak of the HR signal and the valley of the SpO2 signal happen in such a short time interval that HR and SpO2 are significantly negatively correlated. This observation is in line with the biological literature [56]. In the literature, breath-holding exercises were found to yield significant changes in the cardiovascular system.
In the central circulation, they caused significant changes in heart rate, and in the peripheral circulation, they caused significant changes in arterial blood flow and oxygen saturation. Based on the above observation that HR is correlated with SpO2 during breath-holding, we are curious whether our method also works for a different protocol in which the instantaneous HR change is less correlated with SpO2. An intermittent hypoxia (IH) protocol used in the literature shows that by receiving hypoxic air (inspired fraction of oxygen between 12% and 15%) intermittently with normoxic air, the participant can have a much milder HR change than in breath-holding, while a significant decrease in SpO2 can still be achieved during the hypoxia [46]. Research restrictions affecting human subject research at many U.S. institutions previously limited our ability to carry out the abovementioned hypoxia protocol; as these restrictions have recently eased, we investigate the performance of our proposed algorithm when applied to the new hypoxia protocol.

Figure 4.9: Experimental setup for the intermittent hypoxia protocol. The participant lies down on a bed with a mask controlling the breathed-in air, which alternates between hypoxia and normoxia. The right index finger is clipped by the CMS-50E pulse oximeter to record the reference SpO2 and HR signals. The palm side of the left hand (PU) faces the smartphone camera during hand video recording sessions.

IH Protocol and Data Collection Setup: Similar to the breath-holding protocol used in Chapter 4.4.1, the data collection setup of the IH protocol (shown in Fig. 4.9) includes a Contec CMS-50E pulse oximeter attached to the right index finger to measure the participant's SpO2 and HR levels as the reference, and an iPhone 7 Plus camera mounted on a tripod for hand video recording.
In lieu of breath-holding to induce variation in SpO2 values, in the IH protocol the participant is equipped with a face mask that controls the breathed-in air. The face mask is connected to a one-way non-rebreathing valve, which is attached to a two-way switching valve. The two-way switching valve is used to control the input of either hypoxic air (5% oxygen, 3% carbon dioxide, balanced nitrogen) or room air (normoxia: 21% oxygen). Throughout the protocol, the switching valve alternates between acute (25-second) exposures to hypoxic air and 90-second exposures to the normoxic medical gas for a total of 15 hypoxic events. Three hand video sessions are recorded for each participant during the process, and each video takes around 4.5 minutes. The procedure is illustrated in Fig. 4.10.

Figure 4.10: Illustration of the intermittent hypoxia (IH) protocol. The IH breathing is composed of 15 cycles of exposure to an alternating 25-second hypoxia (5% oxygen) period and 90-second normoxia (21% oxygen) period. Three hand video sessions are recorded during the process, and each video takes around 4.5 minutes. The first, second, and third videos start at the 2nd, 6th, and 10th IH cycle, respectively. In the practical data collection, the start times of the second and third videos can be delayed by 1 or 2 cycles for the participants to adjust their hand positions after the previous video session.

Overview of Participant Information and Collected Data: Three participants, one male and two females, were enrolled in the study under protocol #1511266 approved by the University of Maryland IRB, with one female's Fitzpatrick skin type being type I and the other two participants' being type II. One frame from the hand videos in which the palm side faces the camera (PU case) for each participant is shown in Fig. 4.11.
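Under the nominal timing stated above (25-second hypoxia, 90-second normoxia, 15 cycles, and videos starting at cycles 2, 6, and 10), the protocol schedule can be sketched as follows; as noted, actual session start times may be delayed by one or two cycles.

```python
# Nominal timing of the IH protocol described above.
HYPOXIA_S, NORMOXIA_S, N_CYCLES = 25, 90, 15
CYCLE_S = HYPOXIA_S + NORMOXIA_S          # 115 s per IH cycle

def cycle_start(k):
    """Start time in seconds of the k-th IH cycle (1-indexed)."""
    return (k - 1) * CYCLE_S

total_s = N_CYCLES * CYCLE_S              # 1725 s, i.e., about 28.75 minutes
video_starts = [cycle_start(k) for k in (2, 6, 10)]   # nominal session starts (s)
```

At roughly 270 seconds per video, each session spans a little over two IH cycles, so every recording captures at least two hypoxia-induced SpO2 dips.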
According to the IH protocol described in the previous paragraph, each participant had three hand video sessions recorded while their SpO2 and HR were measured by the pulse oximeter during the intermittent hypoxia process.

Figure 4.11: Hand images from (a) a male participant whose Fitzpatrick skin type is II, (b) a female participant whose Fitzpatrick skin type is I, and (c) a female participant whose Fitzpatrick skin type is II.

The histograms of the SpO2 values in the datasets collected using the breath-holding protocol and the new IH protocol are shown in Fig. 4.12. Recall that in our previous breath-holding protocol used in Chapter 4.4.1, we observed that some participants have their HR and SpO2 correlated due to the reaction of the cardiac system during breath holding. This is manifested in the histogram shown in the left panel of Fig. 4.13, where 79% (22/28) of the participants' SpO2 and HR have an absolute correlation greater than a threshold of 0.4. In the new intermittent hypoxia protocol, as shown in the right panel of Fig. 4.13, only 22% (2/9) have an absolute correlation greater than 0.4. This indicates that the new IH protocol induces less correlation between HR and SpO2, serving as a new scenario for testing the robustness of our proposed algorithm.

Figure 4.12: Comparison of the distributions of SpO2 collected using the breath-holding protocol and the intermittent hypoxia protocol.

Figure 4.13: Comparison of the correlations between HR and SpO2 from the breath-holding protocol and the intermittent hypoxia protocol.

SpO2 Prediction Performance: The SpO2 prediction is conducted in a participant-specific manner. The first video session of each participant is used for training and validation, and the second and third video sessions are
used for testing. SVR is used for regression.

Figure 4.14: Predicted SpO2 signals using SVR are shown for all participants from the IH protocol: (a) Participant IH-1, skin type II (training MAE = 0.32%, ρ = 0.78; test MAE = 0.65%, ρ = 0.22; test MAE = 0.57%, ρ = −0.26); (b) Participant IH-2, skin type I (training MAE = 0.86%, ρ = 0.58; test MAE = 0.98%, ρ = 0.27; test MAE = 1.64%, ρ = 0.48); (c) Participant IH-3, skin type II (training MAE = 0.30%, ρ = 0.84; test MAE = 0.92%, ρ = 0.29; test MAE = 1.15%, ρ = 0.61). The reference SpO2 is shown in red dashed lines and the predicted SpO2 in solid black lines. The higher the correlation ρ and the lower the MAE, the better the predicted SpO2 captures the trend of the reference signal.

Fig. 4.14 shows the training and testing results for the three participants. For Participant IH-1, the variation in the reference SpO2 values is small, with the lowest SpO2 being 96% during the video sessions, resulting in no obvious dips in the SpO2 trend. This may be due to interpatient variability in tolerance to hypoxia. Thus, even though his predicted test SpO2 signals do not follow the trend of the reference SpO2 well (with ρ being 0.22 and −0.26, respectively), the MAEs are less than 0.65%. Overall, the test MAEs are within 1.64%, although the dips of some of the SpO2 trends are not captured well, such as for Participant IH-2. From the limited data that we have collected so far with the IH protocol, our proposed algorithm achieved reasonable results, and the results need to be verified and further improved with more data collected.
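The MAE and Pearson correlation ρ used throughout to score a predicted SpO2 trace against the oximeter reference can be computed as in this minimal sketch (function names are illustrative):

```python
import math

def mae(pred, ref):
    """Mean absolute error between a predicted and a reference SpO2 trace."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)

def pearson(pred, ref):
    """Pearson correlation coefficient between two equal-length traces."""
    n = len(ref)
    mp, mr = sum(pred) / n, sum(ref) / n
    cov = sum((p - mp) * (r - mr) for p, r in zip(pred, ref))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    sr = math.sqrt(sum((r - mr) ** 2 for r in ref))
    return cov / (sp * sr)
```

A high ρ indicates that the prediction tracks the dips in the reference trend, while a low MAE indicates that the absolute SpO2 levels are close.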
Discussions for Future Development: From the comparison of SpO2 distributions between the breath-holding protocol and the IH protocol shown in Fig. 4.12, and the SpO2 trends in Fig. 4.14 versus those in Fig. 4.5, we observe that the drop of SpO2 does not get as deep and wide with the new IH protocol as described in the literature with similar IH protocols [46, 65]. The differences between our IH protocol and those in the literature lie mainly in the duration of the hypoxia period (in each episode and overall), its duration relative to the normoxia phase, and the fraction of inspired oxygen. For example, in [65], the hypoxia environment induced by the FiO2 (fraction of inspired oxygen) protocol lasts for 16 consecutive minutes on average per participant to create a much wider range of SpO2, from 61% to 100%, although the fraction of oxygen is unclear in the paper. In [46], the IH experiment is conducted in 5 sessions per week over a 3-week duration. They found that the most prominent decrease in SpO2 was 10% on average, which happened in week 3 with five 5-minute hypoxia periods provoked by 12% oxygen interspersed with 3-minute normoxia intervals in each session. With the IH protocols applied in the literature and the suggestions of a proper level and duration of hypoxia that lead to safe and positive effects and therapeutic potential of intermittent hypoxia [103], we consider the following modifications in our future design of the protocol, with advice and supervision from physicians to prevent adverse effects on the participants:

- having a relatively longer hypoxia period (e.g., increasing from 25 seconds to several minutes); and/or
- inducing a modest fraction of inspired oxygen (e.g., 9% to 16%) [103] that can match the increased duration of the hypoxia period.

With the longer and larger decrease in SpO2 values created by the updated protocol, we may obtain more meaningful training samples and better take advantage of the IH protocol.
4.6 Chapter Summary

This chapter presents a contact-free method of measuring blood oxygen saturation from hand videos captured by smartphone cameras. The whole algorithm pipeline includes 1) receiving a video of the hand of a subject captured by a regular RGB camera of a smartphone; 2) extracting a region of interest of the hand video; 3) performing feature extraction on the region of interest based on spatial and temporal data analysis of more than two color channels; and 4) estimating the blood oxygen saturation level of the subject from the features. The key contributions of this chapter center on the proposed feature engineering method, which is a synergistic combination of several key components, including the multi-channel ratio-of-ratios feature set, the narrowband filtering adaptively centered at the heart rate, and the accurately estimated heart rate. We have seen encouraging results of a mean absolute error of 1.26% with a commercial pulse oximeter as the reference, outperforming the conventional ratio-of-ratios method by 25%. We have also analyzed the impact of the side of the hand and skin tone on the SpO2 estimation. We have found that, on our collected dataset, the palm side performs well regardless of the skin tone; for palm-up cases, we do not observe significant performance differences between lighter and darker skin tones. Part of the research in this chapter was published in [142].

Chapter 5
Optophysiological Model Guided Neural Networks for Contactless Blood Oxygen Estimation From Hand Videos

5.1 Introduction

Deep learning has demonstrated promising performance in camera-based physiological measurements, such as heart rate, breathing rate, and body temperature [29, 107, 133, 170]. An end-to-end convolutional attention network was proposed in [29] to estimate the blood volume pulse from face videos. Frequency analysis is then conducted on the estimated pulse signal for heart rate and breathing rate tracking.
The study in [107] demonstrates that heart rate can be directly inferred using a convolutional network with spatial-temporal representations of face videos as its input. Mobile applications have been developed that utilize CNNs to measure body temperature from facial images [170]. Deep learning for SpO2 monitoring from videos is still in its early stage. Ding et al. [40] proposed a convolutional neural network architecture for contact-based SpO2 monitoring with smartphone cameras. Even though the work in [40] showed better performance than the conventional ratio-of-ratios method, their technique requires the user's fingertip to be in contact with the illuminated flashlight and camera, which not only may lead to a burning sensation over a continuous period of time but also raises sanitation concerns, especially if the sensing device is shared by multiple participants during pandemics. Video-based contactless methods for physiological signal sensing provide a comfortable and unobtrusive option for monitoring SpO2 and have the potential to be adopted in health screening and telehealth. In Chapter 4, we took advantage of contact-free sensing with a regular RGB camera as well as the well-known two-channel RoR mechanism from pulse oximeters for accurate SpO2 estimation. In particular, we proposed a strategic use of the hand video data by performing spatial and temporal data analysis of more than two color channels. Under the umbrella of the synergistic framework that takes advantage of both biophysical imaging principles and the availability of participants' video and SpO2 data to learn and determine the details for obtaining SpO2-relevant features and making SpO2 estimates, Chapter 4 determined the specific features and the related detailed parameters explicitly from the biophysical imaging principles, while in this chapter, we propose to use these principles to guide the design of neural network architectures to "learn"
the specific SpO2-relevant features from the input color signals with a data-driven implicit approach and perform SpO2 estimation in a holistic manner. Compared to the principled signal processing scheme for feature engineering proposed in Chapter 4, the neural network based schemes proposed in this chapter learn features implicitly from data and use synergy with the principled methodology to guide the selection of the neural network architectures. Specifically, inspired by the optophysiological model for SpO2 measurement [123, 149, 159], in this chapter we develop convolutional neural network (CNN) based SpO2 estimation schemes designed around the optophysiological models for better explainability, wherein the data-driven feature extraction and estimation of the blood oxygen saturation level implement a combination of spatial averaging, color channel mixing, and temporal trend analysis. The schemes analyze videos of a participant's hand captured by regular RGB cameras in a contactless way, which is convenient and comfortable for users, can protect their privacy compared to face videos [26], and allows face masks to be kept on.

5.2 Proposed Optophysiology-Guided Neural Network Method for Estimating SpO2 From Videos

Fig. 5.1 is an overview of the system design. First, the ROI, including the palm and the back of the hand, is extracted from the smartphone-captured videos. Second, the ROI is spatially averaged to produce R, G, and B color time series. Next, the three color-channel signals are fed into an optophysiology-inspired CNN to extract features to achieve more explainable and accurate SpO2 predictions.

Figure 5.1: Proposed neural network based contactless SpO2 estimation method.
Three color time series are extracted from the skin area of a hand video by spatial averaging and are then fed into an optophysiology-inspired neural network to extract features by color channel mixing and temporal analysis for SpO2 prediction.

5.2.1 Extraction of Skin Color Signals

The physiological information related to SpO2 is embedded in the color of the reflected/reemitted light from a person's skin. Hence, a preprocessing step that precisely extracts the color information from the skin area is crucial to the design of an effective SpO2 estimation method. For each participant's video, we aim to extract the R, G, and B time series, and we refer to these 1-D time series as skin color signals. As explained in Chapter 4.3, the ROI of the skin pixels is separated using Otsu's method [109], which determines a threshold that best separates the skin pixels from the background by minimizing the variance within the skin and non-skin classes along the Cr axis of the YCbCr color space [18]. Once the ROI corresponding to the hand is located, the R, G, and B time series are generated by spatially averaging the values of the skin pixels in each frame of the video. In this chapter, the skin color signals are split into 10-second segments using a sliding window with a step size/stride of 0.2 seconds to serve as the inputs to the neural networks. From an optophysiological perspective, the reflected/reemitted light from the skin over the duration of one heartbeat cycle, i.e., 0.5 to 1 second for a heart rate of 60 to 120 bpm, should contain almost all the information necessary to estimate the instantaneous SpO2 [127]. In our system design, we use longer segments to add resilience against sensing noise.
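The sliding-window segmentation described above can be sketched as follows, assuming a 30 fps recording so that a 10-second window spans 300 frames and a 0.2-second stride spans 6 frames (consistent with the 300-point segments noted in Section 5.2.2):

```python
def segment(signal, fps=30, win_s=10.0, stride_s=0.2):
    """Split a 1-D skin color signal into overlapping fixed-length segments."""
    win, stride = int(win_s * fps), int(stride_s * fps)
    return [signal[i:i + win] for i in range(0, len(signal) - win + 1, stride)]

red = [0.0] * (60 * 30)     # e.g., a 60-second red-channel trace at 30 fps
segs = segment(red)         # 300-sample windows, one every 6 frames
```

Each of the R, G, and B signals is segmented the same way, and the three aligned windows form one network input.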
Since the segment length is one order of magnitude longer than the minimally required length to contain the SpO2 information, we can use a fully-connected or convolutional structure to adequately capture the temporal dependencies without resorting to a recurrent neural network structure.

5.2.2 Neural Network Architectures

The previous neural network work on SpO2 prediction mainly explored prediction, but not model explainability [40]. Explainability/interpretability is highly desirable in many applications yet often not sufficiently addressed, partly due to the black box nature of neural networks. From a healthcare standpoint, explainability is a key factor that should be taken into account from the beginning of a system's design. To extract features from the skin color signals and estimate SpO2, we propose three physiologically motivated neural network structures. These structures are inspired by domain knowledge-driven physiological sensing methods and designed to be physically explainable. For heart rate sensing [107, 176] and respiratory rate sensing [102, 132], the RGB skin color signals are often combined first to form one "rPPG" signal, followed by temporal feature extraction, as is done in the plane-orthogonal-to-skin (POS) algorithm [156]. In contrast, for conventional SpO2 sensing methods such as the ratio-of-ratios [159], the temporal features are extracted first (i.e., extracting AC and DC from a time window for each color channel) and the color components are combined at the end (i.e., taking the ratio and pairwise ratio of ratios) before doing regression fitting. Our proposed neural network structures explore different arrangements of channel combination and temporal feature extraction, and we systematically compare the performance of these explainable model structures. Color Channel Mixing Followed by Temporal Analysis: In Model 1, shown as the leftmost structure depicted in Fig.
5.2, we combine the color channels first using several channel combination layers and then extract temporal features using temporal convolution and max pooling.

Figure 5.2: Proposed network structures for predicting SpO2 levels from a fixed-length segment of skin color signals. We highlight the differences among the three model configurations instead of showing the exact model structures. Model 1 combines the RGB channels before temporal feature extraction. Model 2 extracts the temporal features from each channel separately and fuses them toward the end. Model 3 interleaves color channel mixing and temporal feature extraction.

A channel combination layer first linearly combines the Cin input channels/vectors into Cout activation vectors and then applies a rectified linear unit (ReLU) activation function to obtain the output channels/vectors. Mathematically, the channel combination layer is described as follows:

    V = σ(WU + b·1ᵀ),    (5.1)

where U ∈ R^(Cin×L) is the input, comprised of Cin time series/vectors of length L. The initial channel combination layer has an input of three channels with 300 points along the time axis. W ∈ R^(Cout×Cin) is a weight matrix, where each of the Cout rows of the matrix is a different linear combination of the input channels. A bias vector b ∈ R^(Cout) contains the bias term for each of the Cout output channels, which ensures that each data point in the created segment of length L has the same intercept. 1ᵀ ∈ R^(1×L) is a row vector of all ones. The nonlinear ReLU function σ(x) = max(0, x) is applied elementwise to the activation map/matrix. The output of the channel combination layer, V ∈ R^(Cout×L), contains Cout channels of nonlinearly combined input channels. The channel mixing section concatenates multiple channel combination layers with decreasing channel counts to provide significant nonlinearity. The output of the last channel combination layer has seven channels.
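A minimal sketch of one channel combination layer in Eq. (5.1), written with plain Python lists rather than the deep learning framework actually used for training:

```python
def channel_combination(U, W, b):
    """One channel combination layer, Eq. (5.1): V = ReLU(W U + b 1^T).
    U is a Cin x L list of lists, W is Cout x Cin, and b has length Cout;
    the bias b[o] is broadcast across all L time samples, as b 1^T does."""
    Cin, L, Cout = len(U), len(U[0]), len(W)
    return [[max(0.0, sum(W[o][i] * U[i][t] for i in range(Cin)) + b[o])
             for t in range(L)]
            for o in range(Cout)]

# Toy example: mix Cin = 3 color channels into Cout = 2 output channels.
U = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]     # 3 channels, L = 2
W = [[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]]    # two linear combinations
b = [0.0, 0.5]
V = channel_combination(U, W, b)             # [[6.0, 6.0], [0.0, 0.0]]
```

The second output channel is clipped to zero by the ReLU, illustrating the nonlinearity that stacking several such layers provides.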
After the channel mixing, for temporal feature extraction, we utilize multiple convolutional and max pooling layers with a downsampling factor of two to extract the temporal features of the channel-mixed signals. When there are multiple filters in a convolutional layer, additional channel combining also takes place, with each filter outputting a channel-mixed signal. Finally, a single node is used to represent the predicted SpO2 level. This model has three channel combination layers and three temporal feature extraction layers. Temporal Analysis Followed by Color Channel Mixing: In Model 2, the middle structure depicted in Fig. 5.2, we reverse the order of color channel mixing and temporal feature extraction from that in Model 1. The three color channels are fed separately for temporal feature extraction. The convolutional layers learn different features unique to each channel. At the output of the temporal feature extraction section, each color channel has been downsampled to retain only the important temporal information. The color channels are then mixed together in the same way as described for Model 1 before the SpO2 value is output. This model has two temporal feature extraction layers and three channel combination layers. Interleaving Feature Extraction and Channel Mixing: In our third model, we explore the possibility of interleaving the color channel mixing and temporal feature extraction steps. As illustrated by the rightmost structure depicted in Fig. 5.2, the input is first put through a convolutional layer with many filters and then passed to max pooling layers, resulting in feature extraction along time as well as channel combination through each filter. The number of filters is reduced with each successive convolutional layer, gradually decreasing the number of combined channels and downsampling the signal in the time domain. This model has five convolutional layers. Loss Function and Parameter Tuning:
We use the root-mean-squared error (RMSE) as the loss function for all models. During training, we save the model instance at the epoch that has the lowest validation loss. The neural network inputs are scaled to have zero mean and unit variance to improve the numerical stability of the learning. The parameters and hyperparameters of each model structure were tuned using the HyperBand algorithm [90], which allows for a faster and more efficient search over a large parameter space than grid search or random search. It does this by running random parameter configurations on a specific schedule of iterations per configuration and using earlier results to select candidates for longer runs. The tuned parameters include the learning rate, the number of filters and kernel size of the convolutional layers, the number of nodes, the dropout probability, and whether to apply batch normalization after each convolutional layer.

5.3 Experimental Results

5.3.1 Dataset and Capturing Conditions

Our proposed models are evaluated on the self-collected dataset studied in Chapter 4. To recapitulate, the dataset consists of two sessions of hand video recordings and simultaneously recorded reference SpO2 data from each of 14 participants, of whom six were male and eight were female, between the ages of 21 and 30. The distribution of the participants' skin types is as follows: two participants of type II, eight participants of type III, one participant of type IV, and three participants of type V. This research was conducted under protocol #1376735 approved by the University of Maryland Institutional Review Board (IRB). Each participant was asked to place his/her hands still on a table to avoid hand motion.

Figure 5.3: Illustration of two hand-video capturing positions. Left hand: palm down (PD). Right hand: palm up (PU).

The palm of the right hand and the back of the left hand face the camera, as illustrated in Fig. 5.3.
We refer to these two hand-video capturing positions as palm up (PU) and palm down (PD), respectively. Each participant was asked to follow the breathing protocol outlined in Fig. 5.4(a). The participant breathes normally for 30 to 40 seconds, exhales all the way, and then holds his/her breath for 30 to 40 seconds. This process is repeated three times for each session. The distribution of the collected SpO2 values is shown in Fig. 5.4(b).

Figure 5.4: (a) Breathing protocol that participants were asked to follow, including 3 cycles of normal breathing and breath holding. (b) Histogram of SpO2 values in the collected dataset.

In this chapter, we increase the data size by interpolating the reference SpO2 signal to 5 sample points per second, to match the segment sampling rate (Chapter 5.2.1), using a smoothing spline approximation [53]. Each RGB segment and SpO2 value pair is fed into our models as a single data point; the models output a single SpO2 estimate per segment. To evaluate a model on a video recording, the model is sequentially fed all RGB segments from the recording to generate a time series of preliminary SpO2 predictions. All predictions greater than 100% SpO2 are clipped to 100% based on physiological knowledge. A 10-second-long moving average filter is then applied to generate a refined time series of predicted SpO2 values.

5.3.2 Participant-Specific Results

To investigate how well the proposed models can learn to estimate a specific individual's SpO2 from his/her own data, we first conducted participant-specific experiments; that is, we learn individualized models for each participant. Experimental Setting: Two recordings per participant were captured with at least 15 minutes in between. One recording is used for training and validation of the model, and the remaining recording is used for testing. An example of the training and validation prediction curves is shown in Fig.
5.5(a). Each recording contains three breathing cycles; for each training/validation recording, the first two breathing cycles are taken for training and the third cycle is used for validation. Splitting the recordings into cycles instead of randomly sampling the 10-second overlapping RGB segments ensures that there are no overlapping segments of data between the training and validation sets. Example test prediction curves and their correlation and mean absolute error (MAE) are shown for reference in Fig. 5.5(b). It should be noted that if the correlation is low, e.g., for a constant temporal estimate, then the MAE and RMSE metrics are less meaningful.

Figure 5.5: (a) Training vs. validation predictions (training MAE = 0.74, Pearson correlation = 0.96; validation MAE = 1.24, Pearson correlation = 0.80). (b) Test predictions of varying performance with reference SpO2 (MAE/correlation of the four examples: 1.09/0.74, 1.10/0.31, 1.38/0.61, and 1.59/0.45). The higher the Pearson correlation, the better the prediction captures the reference SpO2 trend. The lower the MAE, the better the prediction captures the dips in SpO2.

For the participant-specific experiments, due to the small dataset size, we augment the training and validation data by sampling with replacement. This is an example of the bootstrapping data reuse strategy [70, Chapter 5]. The oversampling also helps address the imbalance in SpO2 data values shown in Fig. 5.4(b).
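A sketch of the oversampling step follows. The inverse-frequency weighting toward rare SpO2 values shown here is one illustrative way to address the label imbalance; the dissertation does not specify the exact weighting, and simple uniform resampling with replacement is also consistent with the description above.

```python
import random
from collections import Counter

def bootstrap_oversample(segments, labels, n, seed=0):
    """Draw n (segment, SpO2) pairs with replacement, weighting each example
    by the inverse frequency of its rounded SpO2 value so that rare (low)
    SpO2 levels are drawn more often."""
    counts = Counter(round(y) for y in labels)
    weights = [1.0 / counts[round(y)] for y in labels]
    rng = random.Random(seed)
    idx = rng.choices(range(len(labels)), weights=weights, k=n)
    return [segments[i] for i in idx], [labels[i] for i in idx]
```

With four examples at 99% SpO2 and one at 91%, the single 91% example receives half the total sampling weight, so the resampled set is roughly balanced between the two levels.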
In each experiment, the model structure and hyperparameters are first tuned using the training and validation data. Once the model has been tuned, we train multiple instances of the model using the best tuned hyperparameters. Between instances, we vary the random seed used for model weight initialization and random oversampling. Each model instance is evaluated on the training/validation recording, and the model instance that achieves the lowest validation RMSE is selected and evaluated on the test recording to obtain the final test results. Results: Table 5.1 shows the performance comparison of our proposed models with the prior-art model from Ding et al. [40]. To the best of our knowledge, Ding et al.'s model is the only convolutional neural network structure that has been tried for contact-based SpO2 estimation. Its structure is similar to our Model 3 but with fewer layers. We also compare with the classic ratio-of-ratios method proposed by Scully et al. [123]. The performance is measured in Pearson's correlation, mean absolute error (MAE), and root mean square error (RMSE), and the results for each condition are summarized by the median and interquartile range (IQR). The IQR quantifies the spread of an empirical distribution of a set of data points as the difference between the first and third quartiles of the distribution. Table 5.1 reveals that Model 2 achieves the best correlation in both PD and PU cases, whereas Model 3 achieves the best MAE and a correlation comparable to Model 2, suggesting that Model 2 and Model 3 are comparably the best in the individualized learning. Even though the method proposed by Scully et al. [123] achieves the best (lowest) RMSE, its correlations are the worst (lowest). This suggests that the classic ratio-of-ratios method cannot track the trend of SpO2 well using the contactless measurement by smartphone.
All of our model configurations outperform Ding et al. [40]. For example, in the PU case for Model 3, the correlation is improved from 0.34 to 0.41 and the MAE is lowered from 3.40% to 1.81%. It is worth noting that the international standard for clinically acceptable pulse oximeters allows an error of 4% [68], and our estimation errors are all within this range.

Table 5.1: Performance comparison of each model structure for participant-specific experiments. Results are given as the test median and IQR over all participants.

                        Hand    Correlation      MAE (%)          RMSE (%)
                        Mode    Median  IQR      Median  IQR      Median  IQR
  Model 1 (Proposed)    PD      0.41    0.40     2.12    0.91     2.51    0.78
                        PU      0.39    0.37     2.16    1.80     2.70    2.09
  Model 2 (Proposed)    PD      0.46    0.44     2.09    1.32     2.52    1.63
                        PU      0.41    0.32     1.96    0.68     2.48    0.89
  Model 3 (Proposed)    PD      0.44    0.40     1.93    1.11     2.48    1.31
                        PU      0.41    0.46     1.81    1.83     2.43    2.44
  Scully et al. [123]   PD      0.08    0.37     1.94    0.92     2.22    0.77
                        PU      0.19    0.24     2.01    0.80     2.36    0.78
  Ding et al. [40]      PD      0.38    0.39     3.25    2.85     3.83    3.24
                        PU      0.34    0.56     3.40    3.16     4.58    3.12

Figure 5.6: Boxplots comparing distributions of correlations for (a) lighter vs. darker skin types, and (b) PD vs. PU for all skin types. The PD results are better for darker skin tones in both the participant-specific and leave-one-out cases.

There are two factors, the skin type and the side of the hand, that might influence the performance of SpO2 estimation. We therefore investigate the following two questions: (1) whether the different skin types matter in PU or PD cases, and (2) whether the side of the hand matters for lighter skin (types II + III) or darker skin (types IV + V). The box plots in Fig.
5.6 summarize the distributions of the test correlations from all three proposed models in the PU and PD modes of (a) lighter-skin and darker-skin participants, and (b) all participants.

Bayesian statistical test: We use Bayesian statistical tests to further analyze the results in Fig. 5.6 by providing a probabilistic assessment of whether the two groups being compared have the same mean [81-85]. We avoid the popular t-test because it makes only a binary decision and lacks direct information about the probability of a difference between the group means of the given data [81]. In contrast, the Bayesian statistical test computes the posterior distribution of the difference between the two group means to quantify the certainty of its possible values [85]. Given the region of practical equivalence (ROPE) of zero difference [83], the decision rule of the Bayesian statistical test for the null hypothesis that the two groups have the same mean can be stated as follows:

- (Accepted): If the percentage of the posterior distribution of the group-mean difference inside the ROPE is sufficiently high (e.g., greater than 95% [83]), then the null hypothesis is accepted.

- (Rejected): If the percentage of the posterior distribution of the group-mean difference inside the ROPE is sufficiently low (e.g., less than 2.5% [83]), then the null hypothesis is rejected.

- (Undecided): When the null hypothesis is neither accepted nor rejected, the percentage of the posterior distribution of the group-mean difference inside the ROPE can be used to quantify the certainty that the two group means are the same. One example is shown in Fig. 5.7.

[Figure: posterior histogram with the ROPE band marked; x-axis: difference of group means, μ1 − μ2]
Figure 5.7: Posterior distribution of the difference of group means.
This shows an example of an undecided case of the Bayesian statistical test given that the ROPE of zero difference is set to [-0.03, 0.03] and 33% of the posterior distribution falls within the ROPE. The percentage of coverage can be used to quantify the certainty that two groups have the same mean.

To conduct the Bayesian statistical tests, we use an R statistical package named BEST [116]. To determine the ROPE on the difference between the means, we use Cohen's established convention that the ROPE of a small standardized mean difference is [-0.1, 0.1] [82, 84]. Given that the standard deviation of our data is 0.3, the ROPE for the difference of means of our data is scaled to [-0.03, 0.03].

To answer question (1) about the impact of the skin type on the prediction performance, we focus on the left panel of Fig. 5.6(a). For the PD case, only 14% of the posterior distribution of the difference between the means of the lighter and darker skin groups falls in the ROPE. For the PU case, 23% of the posterior distribution falls in the ROPE. This suggests that it is highly credible that the skin type makes a difference in SpO2 prediction, and the difference is more certain to be observed when using the back of the hand as the ROI than when using the palm.

To answer question (2), we first focus on the left panel of Fig. 5.6(b), where participants of all skin colors are considered together: 33% of the posterior distribution of the difference of means between the PU and PD cases falls in the ROPE. We then zoom into the darker skin group shown in the left panel of Fig. 5.6(a): only 15% of the posterior distribution of the difference of means between the PD and PU cases falls in the ROPE, whereas for the lighter skin group, 31% of the posterior distribution falls in the ROPE. This implies that it is highly credible that the side of the hand may have some impact on SpO2 prediction, especially for the darker skin group.
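The ROPE decision rule above can be sketched in a few lines. This is an illustration only, assuming posterior samples of the group-mean difference are already available; in the dissertation these come from the BEST package's Bayesian model, so the sample array here is a placeholder.

```python
import numpy as np

def rope_decision(posterior_diff, rope=(-0.03, 0.03), accept=0.95, reject=0.025):
    """Apply the ROPE decision rule to posterior samples of the difference
    of group means (mu1 - mu2). Returns the verdict and the fraction of
    posterior mass inside the ROPE."""
    samples = np.asarray(posterior_diff, dtype=float)
    frac_in_rope = float(np.mean((samples >= rope[0]) & (samples <= rope[1])))
    if frac_in_rope > accept:
        verdict = "accepted"    # groups credibly have the same mean
    elif frac_in_rope < reject:
        verdict = "rejected"    # groups credibly differ
    else:
        verdict = "undecided"   # frac_in_rope quantifies remaining certainty
    return verdict, frac_in_rope
```

For instance, the 14% and 23% ROPE coverages reported above for question (1) would both map to the "undecided" branch, with the coverage itself quantifying how credible a zero difference is.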
5.3.3 Leave-One-Participant-Out Results

To investigate whether the features learned by the model from other participants are generalizable to new participants whom it has not seen before, we conduct leave-one-participant-out experiments. For each experiment, when testing on a certain participant, we use all the other participants' data for training and leave the test participant's data out. The recordings from all the non-test participants are used for participant-wise cross-validation to select the best model structure and hyperparameters. The selected model is evaluated on the two recordings of the test participant, whose data was never seen by the model during training.

Table 5.2 shows the performance comparison of each model in the leave-one-participant-out experiments. Model 1 achieved the best performance in terms of correlation and achieved the best MAE and RMSE for the PU case. Similar to the participant-specific case, the classic ratio-of-ratios method proposed in Scully et al. [123] achieved better MAE and RMSE results for the PD case, but its correlation was low, suggesting that it achieved low error by simply predicting a nearly constant SpO2 near the middle of the SpO2 range. The best performance of Model 1 in the leave-one-participant-out experiment may imply that the features extracted after combining the color channels at the beginning of the pipeline generalize better to unseen participants than the features extracted before channel combination or through interleaving as in Models 2 or 3.

                       Hand    Correlation      MAE (%)        RMSE (%)
                       Mode   Median   IQR   Median   IQR   Median   IQR
Model 1 (Proposed)      PD     0.33    0.42   2.33    1.07   3.07    1.52
                        PU     0.46    0.36   1.97    0.80   2.32    0.87
Model 2 (Proposed)      PD     0.15    0.50   2.43    0.94   3.35    1.11
                        PU     0.33    0.39   2.08    0.73   2.41    0.71
Model 3 (Proposed)      PD     0.23    0.38   2.48    1.18   2.98    1.33
                        PU     0.27    0.31   2.02    1.03   2.54    1.28
Scully et al. [123]     PD     0.05    0.43   2.08    0.65   2.44    1.14
                        PU     0.01    0.54   2.08    0.60   2.43    1.20
Ding et al. [40]        PD     0.11    0.56   3.19    1.61   3.76    1.52
                        PU     0.26    0.42   2.43    1.22   2.85    1.51

Table 5.2: Performance comparison of each model structure in leave-one-participant-out experiments. Results are given as the test median and IQR of all participants.

In the participant-specific case, the model is specifically tailored to the test individual, whereas the leave-one-participant-out case is more difficult because the model needs to accommodate the variation in the population. As expected, in Fig. 5.6, we observe that the overall results from the leave-one-participant-out experiments do not match those from the participant-specific experiments. Because of the modest size of the dataset, the model has not seen as diverse data as a larger and richer dataset would offer. The generalization capability to new participants can be improved when more data are available.

We now revisit the two research questions raised in Section 5.3.2 under the leave-one-participant-out scenario. First, we analyze the impact of skin type given the same side of the hand. From the right panel of Fig. 5.6(a), in the PD case, only 0.04% of the posterior distribution of the difference of means between the lighter and darker skin groups is within the ROPE, suggesting that the null hypothesis is rejected and that the darker skin group outperforms the lighter skin group. In the PU case, 18% of the posterior distribution falls within the ROPE. This observation is consistent with the participant-specific experiments: when using the back of the hand as the ROI, the skin color is more credibly a factor in the accuracy of SpO2 estimation than when using the palm.

Second, we analyze the impact of the side of the hand for the two skin color groups. For the darker skin group shown in the right panel of Fig. 5.6(a), only 9% of the posterior distribution of the difference of means of the PU and PD cases falls in the ROPE.
This shows that a zero difference is highly uncertain, which is consistent with the results from the participant-specific experiments. However, unlike the participant-specific experiments, for the lighter skin group, 0.2% of the posterior distribution of the difference of means between the PU and PD cases falls in the ROPE. This suggests that the null hypothesis is rejected and that PU outperforms PD for the lighter skin group. As for the mixed group illustrated in the right panel of Fig. 5.6(b), only 8% of the posterior distribution of the difference of means falls in the ROPE, suggesting high uncertainty in concluding that the PU and PD cases are comparable.

This different generalization capability in the PU and PD cases may be attributed to the skin color difference between the palm and the back of the hand. The color of the back of the hand tends to be darker than the color of the palm and has larger color variation among participants due to different degrees of sunlight exposure. In contrast, the color variation of the palms is much milder among participants. Furthermore, in the participant-specific experiments, the individualized models learn the traits of the skin type and the side of the hand from each participant, whereas, in the leave-one-participant-out experiments, the learned model must capture the general characteristics of the population.

Method                                                     ρ     MAE (%)  RMSE (%)
Linear Ch. Comb.                                Median   0.46     2.14     2.66
  + Conv. Layer for Feat. Extraction            IQR      0.38     0.73     0.93
Nonlinear Ch. Comb.                             Median   0.41     2.29     2.66
  + Fully Connected Layer for Feat. Extraction  IQR      0.39     0.63     0.70
Model 1 (Proposed): Nonlinear Ch. Comb.         Median   0.46     1.97     2.32
  + Conv. Layer for Feat. Extraction            IQR      0.36     0.80     0.87

Table 5.3: Numerical results of the ablation studies for Model 1 (M1) in the leave-one-participant-out mode. Comparisons among the proposed (nonlinear) M1, a modified M1 with only linear channel combinations, and a modified M1 with fully connected dense layers instead of convolutional layers are listed. The ablation studies confirm that the nonlinear channel combinations and convolutional layers improve model performance.

5.3.4 Ablation Studies

To justify the use of nonlinear channel combinations and convolutional layers for temporal feature extraction in our proposed models, we conduct two ablation studies comparing the performance of these model components to other generic ones. We focus on the PU case to avoid the uncontrolled impact of such factors as skin tone and hair. In the first ablation study, we compare nonlinear to linear channel combinations. We create a variant of Model 1 with only a single linear channel combination layer with no activation function and repeat the leave-one-participant-out experiments. In the second study, we compare the performance of using convolutional layers for temporal feature extraction to that of using fully connected dense layers. We create this second variant of Model 1 and repeat the leave-one-participant-out experiments. Table 5.3 presents the medians and IQRs for numerical comparison of the ablation studies.

First, we compare the first and the third rows in Table 5.3 for ablation study 1. Our proposed Model 1 achieves a better correlation, with a median of 0.46 and IQR of 0.36, and a better RMSE, with a median of 2.32 and IQR of 0.87, than its linear channel combination variant. In addition, Model 1 achieves a comparable MAE with a better median of 1.97 but a wider IQR of 0.80. The overall better performance of Model 1 suggests the necessity of the nonlinear channel combination. Second, in ablation study 2, we compare the second and the third rows in Table 5.3. We observe that Model 1 outperforms its second variant with fully connected layers for feature extraction, with better medians in terms of correlation (0.46 vs.
0.41), MAE (1.97 vs. 2.29), and RMSE (2.32 vs. 2.66), and a narrower IQR of correlation. This suggests that convolutional layers are better than fully connected layers for temporal feature extraction.

5.4 Discussions

5.4.1 Contact-based Dataset Testing

We also test our models on the publicly available dataset gathered by Nemcova et al. for their SpO2 estimation work [104]. This dataset consists of contact-based smartphone video recordings in which a participant placed a finger on the smartphone camera and was illuminated by the camera flashlight. Participants were asked to breathe normally without following any sophisticated breathing protocol. Each recording lasts about 10 to 20 seconds. The subject of each recording is not identified, so participant-specific and leave-one-participant-out experiments cannot be conducted. There is a single reference SpO2 value associated with each recording. We used 14 recordings for training and seven recordings for testing, and compared with the modified ratio-of-ratios method proposed in their paper.

As shown in Table 5.4, Models 1 and 2 outperform the method used by Nemcova et al. on both the training and test recordings. Model 3 is not able to generalize well from the training set to the test set, which may be due to the small size of the dataset.

                         MAE (%)            RMSE (%)
                      Training   Test    Training   Test
Model 1                 0.86     1.19      0.94     1.36
Model 2                 0.50     1.28      0.59     1.64
Model 3                 0.75     3.28      0.99     3.69
Nemcova et al. [104]    2.05     2.18      2.24     2.36

Table 5.4: Experimental results of proposed methods on the contact-based video SpO2 dataset from Nemcova et al. [104]. One SpO2 estimate was output per recording, and MAE and RMSE were calculated across all recordings. Models 1 and 2 outperform the method proposed by Nemcova et al.; Model 3 was unable to generalize well to the test set.

It should be noted that because the participants were not asked to follow any sophisticated breathing protocol, the dynamic range of SpO2 values is narrow. These results show that our CNN Models 1 and 2 work well for contact-based video recordings in addition to contactless video recordings.

5.4.2 Ability to Track SpO2 Change

By employing the standard machine learning methodology of a training-validation-test split in Section 5.3 to learn neural networks that perform well on unseen data, we have already ensured the generalizability of our models [128, Chapter 11]. As further evidence that our models are capable of outputting meaningful predictions, we compare SpO2 predictions from our learned models to randomly generated SpO2 values. For each reference signal, a random prediction signal was generated by choosing SpO2 values between the minimum and maximum values of the reference signal and applying a moving average window in the same way as is applied to the neural network predictions. Fig. 5.8(a) shows a histogram of the correlations between the reference SpO2 signals and the randomly generated predictions, and Fig. 5.8(b) shows a histogram of the correlations between the reference SpO2 signals and the predictions generated by Model 2. The neural network, with a median correlation of 0.411, outperforms random guessing, with a median correlation of -0.02, confirming Model 2's capability to track SpO2.

[Figure: two histogram panels, (a) and (b)]
Figure 5.8: Histograms of correlation values between reference SpO2 signals and (a) randomly generated SpO2 signals, or (b) SpO2 signals predicted by neural network Model 2. The correlation distribution for Model 2 is centered much higher than the random guess, confirming Model 2's capability to track SpO2.

5.4.3 Visualizations of RGB Combination Weights

To understand and explain what our physiologically inspired models have learned, we conduct a separate investigation to visualize the learned weights for the RGB channels.
Our goal is to understand the best way to combine the RGB channels for SpO2 prediction. Having an explainable model is important for a physiological prediction task like this. Our neural network models can be considered as nonlinear approximations of the hypothetically true function that extracts the physiological features related to SpO2 buried in the RGB videos. The ratio-of-ratios method, for example, is another such extractor that combines the information from the different color channels at the end of the pipeline. For this experiment, we use the modified version of Model 1 from the ablation studies that has only a single linear channel combination at the beginning. Seeing that using a single linear channel combination did not significantly reduce model performance in the ablation studies, and understanding that the linear component may dominate the Taylor expansion of a nonlinear function, we use only linear combinations for this model to facilitate more interpretable visualizations.

[Footnote 1: It has been shown in other applications that even low correlation coefficients can be meaningful. For example, in photo response non-uniformity (PRNU) work, the device used to take a photo can be predicted with correlation values below 0.1 [11].]

[Figure: four scatter panels, (a)-(d)]
Figure 5.9: Learned RGB channel weights. Plots (a) and (b) are the channel weights learned by different model instances trained on the data of all study participants together, projected onto the RB and RG planes in the RGB space. Plots (c) and (d) are the RB and RG projections of the learned channel weights for model instances trained on random subsets of the participants' data. Each point is color-coded according to the correlation ρ achieved by the instance.

We have trained 100 different instances of the model on the first two cycles from all the recordings and tested on the third cycle from all recordings. The difference between each instance is that the weights are randomly initialized.
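The cross-instance comparison of learned channel weights can be sketched as follows. This is an illustrative helper only: how the [R, G, B] weight vector is extracted from a trained instance is model-specific, so the vectors are passed in as a plain array, and normalizing each vector by its red weight makes the blue/red and green/red ratios comparable across instances.

```python
import numpy as np

def channel_weight_ratios(weight_vectors):
    """Given learned [R, G, B] channel-combination weights from many model
    instances (shape: n_instances x 3), normalize by the red weight so that
    blue/red and green/red ratios can be compared across instances."""
    w = np.asarray(weight_vectors, dtype=float)
    b_over_r = w[:, 2] / w[:, 0]   # blue weight relative to red
    g_over_r = w[:, 1] / w[:, 0]   # green weight relative to red
    return b_over_r, g_over_r
```

Instances whose weight vectors lie on a common line through the origin in RGB space share the same ratios, which is what the projections onto the RB and RG planes make visible.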
The weights for each channel learned by the model instances were visualized as points representing the heads of the linear combination vectors in RGB space. Each point is colored according to the average test correlation achieved by the model instance. Figs. 5.9(a) and 5.9(b) show the projections of these points onto the RB and RG planes. The subfigures reveal that the majority of the channel weights lie along certain lines in the RGB space. For the weights on the line, the ratio of the blue channel weight to the red channel weight is 0.87, and the ratio of the green channel weight to the red channel weight is 0.18. It is clear that the red and blue channels are the dominating factors for SpO2 prediction.

To further verify this result, we repeat this experiment under the following setup: instead of using the data from all participants, for each model instance, we randomly select seven participants and use their data for training and testing. In this case, the difference between each model instance is not only the initialized weights but also the random subset of participants that the model was trained on. Fig. 5.9(d) reveals that most of the better-performing instances (with ρ ≥ 0.45) have little contribution from the green channel. In Fig. 5.9(c), we again see that most of the points lie on a line in the RB plane; the ratio of the blue channel weight to the red channel weight for these points is 0.80.

These results are in accordance with the biophysical understanding of how light is absorbed by hemoglobin in the blood. Recall that Fig. 1.5 reveals a large difference between the extinction coefficients, or the amount of light absorbed, of deoxygenated and oxygenated hemoglobin at the red wavelength. There is a significantly smaller difference at the blue wavelength and almost no difference at green. The amount of light absorbed influences the amount of light reflected, which can be measured through the camera.
A larger difference in extinction coefficients makes it easier to measure the ratio of light absorbed by oxygenated vs. deoxygenated hemoglobin over time. This ratio indicates the level of blood oxygen saturation. Therefore, from a physiological perspective, it makes sense for the neural networks to give larger weights to the red and then the blue channels and little to the green channel. These visualizations indicate that the models are learning physically meaningful features.

5.5 Chapter Summary

We have proposed the first CNN-based work to solve the challenging problem of video-based remote SpO2 estimation. We have designed three optophysiologically inspired neural network architectures. In both participant-specific and leave-one-participant-out experiments, our models are able to achieve better results than the state-of-the-art method. We have also analyzed the effect of skin color and the side of the hand on SpO2 estimation and have found that, in the leave-one-participant-out experiments, the side of the hand plays an important role, with better SpO2 estimation results achieved in the palm-up case for the lighter-skin group. We have also shown the explainability of our designed architectures by visualizing the weights for the RGB channel combinations learned by the neural network, and have confirmed that the choice of color band learned by the neural network is consistent with established optophysiological methods.

The research in this chapter was submitted for journal publication [99] and was conducted in close collaboration with Mr. Joshua Mathew from North Carolina State University.

Chapter 6

Conclusions and Future Perspectives

Periodic blood volume change underneath a person's skin induces subtle color variations in the skin area. These subtle changes can be captured by PPG, a noninvasive and low-cost optical technique.
The methods of PPG measurement have evolved over the past decades from using contact-based devices (e.g., fingertip PPG from pulse oximeters) to using contactless devices (e.g., remote PPG from RGB cameras). In this dissertation, we studied the modeling of contact-based and contact-free PPG signals to facilitate their promising applications in cardiovascular signal and vital sign sensing and learning for digital smart health.

In the first part of the dissertation (Ch. 2), we explored the potential of user-friendly and continuous electrocardiogram (ECG) monitoring with the help of contact-based PPG sensors. ECG is a clinical gold standard for non-invasive cardiac monitoring. Given that continuous ECG monitoring in consumer products is challenging, PPG provides a low-cost alternative, though it carries less clinical knowledge than ECG. To leverage the advantages of these two measurement modalities for better and easier healthcare, we first studied the physiological and signal relationship between PPG and ECG signals, and then inferred the waveform of ECG from the PPG signals based on their relationship. To address this cardiovascular inverse problem, joint dictionary learning frameworks were proposed to learn the mapping that relates the sparse domain coefficients of each PPG cycle to those of the corresponding ECG cycle. This line of research has the potential to fully utilize the easy measurability of PPG and the rich clinical knowledge of ECG for better preventive healthcare. Future directions may include applying our proposed technique in a real-time processing platform and extending it to contact-free ECG sensing with a more robust and mature remote PPG technique.

In the second part of the dissertation (Ch. 3), we developed a physiological digital twin for personalized continuous cardiac monitoring. Digital twins are emerging as a promising framework for realizing precision health, given their ability to represent an individual's health status.
Using our proposed dictionary learning based algorithm in Ch. 2 as the backbone model, this chapter of the dissertation focused on the problem of inferring ECG signals from PPG signals for continuous precision cardiac monitoring under realistic conditions in which available ECG data are scarce. By performing transfer learning, a generic digital twin model learned from a large portion of paired ECG and PPG data was fine-tuned to precisely infer the ECG from the PPG of a target participant whose available ECG data are scarce. Experimental results showed that the proposed transfer learning method yielded the best ECG reconstruction accuracy among the baseline comparison models, suggesting that it can serve as a reliable digital twin for precision continuous cardiac monitoring. In parallel, convolutional neural network based and causality-incorporated backbone model designs were also proposed based on the underlying physiological process of ECG generation for better explainability. In future work, the digital twin framework can be enriched by incorporating additional vital signs (such as blood pressure and blood oxygen level) and other data from electronic medical records into the machine learning formulation, in addition to PPG and ECG, for a broader umbrella of cardiovascular health monitoring.

In the third part of the dissertation (Ch. 4 and Ch. 5), we presented noncontact methods of blood oxygen saturation (SpO2) monitoring from remote PPG signals captured by smartphone cameras. SpO2 is an important indicator of pulmonary and respiratory functionality. Recent works have investigated how ubiquitous smartphone cameras can be used to infer SpO2. Most of these works are contact-based, requiring users to cover a phone's camera and its nearby light source with a finger to capture reemitted light from the illuminated tissue. Contact-based methods may lead to skin irritation and cross contamination, especially during a pandemic.
Thus, we aimed for contactless methods of SpO2 monitoring using hand videos acquired by regular RGB cameras of smartphones. Both a principled signal processing based method and a data-driven neural network based method were proposed for SpO2 estimation by either explicitly or implicitly extracting features from multi-channel skin color signals with color channel mixing and temporal analysis. Experimental results showed that our proposed methods achieve better accuracy of blood oxygen estimates compared to traditional methods using only two color channels and prior art. Future work may consider verifying our methods with more data collected under different hypoxia protocols and investigating the effectiveness and performance of our methods in clinical applications while the participants are in motion.

Bibliography

[1] U Rajendra Acharya, Hamido Fujita, Shu Lih Oh, Yuki Hagiwara, Jen Hong Tan, and Muhammad Adam. Application of Deep Convolutional Neural Network for Automated Detection of Myocardial Infarction Using ECG Signals. Information Sciences, 415:190-198, 2017.

[2] Michal Aharon, Michael Elad, Alfred Bruckstein, et al. K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation. IEEE Transactions on Signal Processing, 2006.

[3] Hanad Ahmed and Laurence Devoto. The potential of a digital twin in surgery. Surgical Innovation, 28(4):509-510, 2021.

[4] Nuzhat Ahmed and Yong Zhu. Early detection of atrial fibrillation based on ECG signals. Bioengineering, 7(1):16, 2020.

[5] Zia Uddin Ahmed, Mohammad Golam Mortuza, Mohammed Jashim Uddin, Md Humayun Kabir, Md Mahiuddin, and MD Jiabul Hoque. Internet of Things based patient health monitoring system using wearable biomedical device. In 2018 International Conference on Innovation in Engineering and Technology (ICIET), pages 1-5. IEEE, 2018.
[6] Mohammed Al-Disi, Hamza Djelouat, Christos Kotroni, Elena Politis, Abbes Amira, Faycal Bensaali, George Dimitrakopoulos, and Guillaume Alinier. ECG Signal Reconstruction on the IoT-gateway and Efficacy of Compressive Sensing Under Real-Time Constraints. IEEE Access, 2018.

[7] John Allen. Photoplethysmography and Its Application in Clinical Physiological Measurement. Physiological Measurement, 2007.

[8] Euan A Ashley and Josef Niebauer. Cardiology explained. Remedica, 2004.

[9] Md. Asif-Ur-Rahman, Fariha Afsana, Mufti Mahmud, M. Shamim Kaiser, Muhammad R. Ahmed, Omprakash Kaiwartya, and Anne James-Taylor. Toward a Heterogeneous Mist, Fog, and Cloud-Based Framework for the Internet of Healthcare Things. IEEE Internet of Things J., 2019.

[10] Australian Radiation Protection and Nuclear Safety Agency. Fitzpatrick skin phototype.

[11] Teun Baar, Wiger van Houten, and Zeno Geradts. Camera identification by grouping images from database, based on shared noise patterns, 2012.

[12] Ufuk Bal. Non-contact Estimation of Heart Rate and Oxygen Saturation Using Ambient Light. Biomed. Opt. Exp., Jan. 2015.

[13] Rohan Banerjee, Aniruddha Sinha, Anirban Dutta Choudhury, and Aishwarya Visvanathan. PhotoECG: Photoplethysmography to Estimate ECG Parameters. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.

[14] Syed Khairul Bashar, Dong Han, Shirin Hajeb-Mohammadalipour, Eric Ding, Cody Whitcomb, David D McManus, and Ki H Chon. Atrial Fibrillation Detection from Wrist Photoplethysmography Signals Using Smartwatches. Scientific Reports, 9(1):1-10, 2019.

[15] Dwaipayan Biswas, Luke Everson, Muqing Liu, Madhuri Panwar, Bram-Ernst Verhoef, Shrishail Patki, Chris H Kim, Amit Acharyya, Chris Van Hoof, Mario Konijnenburg, et al. CorNET: Deep learning framework for PPG-based heart rate estimation and biometric identification in ambulant environment. IEEE Transactions on Biomedical Circuits and Systems, 13(2):282-291, 2019.
[16] Koen Bruynseels, Filippo Santoni de Sio, and Jeroen van den Hoven. Digital Twins in Health Care: Ethical Implications of An Emerging Engineering Paradigm. Frontiers in Genetics, 9:31, 2018.

[17] Nam Bui, Anh Nguyen, Phuc Nguyen, Hoang Truong, Ashwin Ashok, Thang Dinh, Robin Deterding, and Tam Vu. Smartphone-Based SpO2 Measurement by Exploiting Wavelengths Separation and Chromophore Compensation. ACM Trans. Sens. Netw., Jan. 2020.

[18] Wilhelm Burger and Mark J. Burge. Digital Image Processing - An Algorithmic Introduction using Java. Springer, 2008.

[19] A John Camm. The role of continuous monitoring in atrial fibrillation management. Arrhythmia & Electrophysiology Review, 3(1):48, 2014.

[20] Cardiovascular diseases (CVDs). https://www.who.int/en/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds). Accessed: 2022-06-09.

[21] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1721-1730, 2015.

[22] Gabriella Casalino, Giovanna Castellano, and Gianluca Zaza. A mHealth solution for contact-less self-monitoring of blood oxygen saturation. In IEEE Symposium on Computers and Communications (ISCC), Jul. 2020.

[23] Douglas Chai and King N Ngan. Face segmentation using skin-color map in videophone applications. IEEE Trans. Circuits and Systems for Video Technology, 9(4):551-564, Jun. 1999.

[24] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A Library for Support Vector Machines. ACM Trans. Intelligent Systems and Technology, May 2011.

[25] Mingliang Chen. Security Enhancement and Bias Mitigation for Emerging Sensing and Learning Systems. PhD thesis, University of Maryland, College Park, 2021.

[26] Mingliang Chen, Xin Liao, and Min Wu. PulseEdit: editing physiological signals in facial videos for privacy protection.
IEEE Transactions on Information Forensics and Security, 17:457-471, 2022.

[27] Mingliang Chen, Qiang Zhu, Min Wu, and Quanzeng Wang. Modulation model of the photoplethysmography signal for vital sign extraction. IEEE Journal of Biomedical and Health Informatics, Aug. 2020.

[28] Mingliang Chen, Qiang Zhu, Harrison Zhang, Min Wu, and Quanzeng Wang. Respiratory rate estimation from face videos. In 2019 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI), pages 1-4. IEEE, 2019.

[29] Weixuan Chen and Daniel McDuff. DeepPhys: Video-based physiological measurement using convolutional attention networks. In The European Conference on Computer Vision (ECCV), pages 349-365, 2018.

[30] Yang Chen, Joo Heung Yoon, Michael R Pinsky, Ting Ma, and Gilles Clermont. Development of hemorrhage identification model using non-invasive vital signs. Physiological Measurement, 41(5):055010, 2020.

[31] Hong-Yu Chiu, Hong-Han Shuai, and Paul C.-P. Chao. Reconstructing QRS complex from PPG by transformed attentional neural networks. IEEE Sensors Journal, 20(20):12374-12383, 2020.

[32] Youngjun Cho, Nadia Bianchi-Berthouze, and Simon J Julier. DeepBreath: Deep learning of breathing patterns for automatic stress recognition using low-cost thermal imaging in unconstrained settings. In 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), pages 456-463. IEEE, 2017.

[33] Eric Chern-Pin Chua, Stephen J Redmond, Gary McDarby, and Conor Heneghan. Towards Using Photo-plethysmogram Amplitude to Measure Blood Pressure During Sleep. Annals of Biomedical Engineering, 2010.

[34] Charles J Coté, E Andrew Goldstein, William H Fuchsman, and David C Hoaglin. The effect of nail polish on pulse oximetry. Anesthesia and Analgesia, Jul. 1988.

[35] Jennifer Couzin-Frankel. The Mystery of the Pandemic's "Happy Hypoxia". Science, 2020.

[36] Darren Craven, Brian McGinley, Liam Kilmartin, Martin Glavin, and Edward Jones.
Adaptive Dictionary Reconstruction for Compressed Sensing of ECG Signals. IEEE Journal of Biomedical and Health Informatics, 2016. [37] Gerard De Haan and Vincent Jeanne. Robust pulse rate from chrominance-based rPPG. IEEE Transactions on Biomedical Engineering, Jun. 2013. [38] Anneke de Torbal, Eric Boersma, Jan A Kors, Gerard van Herpen, Jaap W Deckers, Deirdre AM van der Kuip, Bruno H Stricker, Albert Hofman, and Jacqueline CM Witteman. Incidence of recognized and unrecognized myocardial infarction in men and women aged 55 and older: the Rotterdam Study. European Heart Journal, Mar. 2006. [39] Diagnose your irregular heart rhythm faster and more reliably with Zio. https://www.irhythmtech.com/patients/how-it-works. Accessed: 2022-06-17. [40] Xinyi Ding, Damoun Nassehi, and Eric C Larson. Measuring Oxygen Saturation With Smartphone Cameras Using Convolutional Neural Networks. IEEE Journal of Biomed. Health Informat., Dec. 2018. [41] Carl Doersch. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908, 2016. [42] ECG changes due to electrolyte imbalance (disorder). https://ecgwaves.com/topic/ecg-electrolyte-imbalance-electrolyte-disorder-calcium-potassium-magnesium/. Accessed: 2022-07-06. [43] Empatica Care: Unlock better health for thousands. https://www.empatica.com/care/. Accessed: 2022-07-14. [44] Kjersti Engan, Sven Ole Aase, and J Hakon Husoy. Method of Optimal Directions for Frame Design. In IEEE International Conference on Acoustics, Speech, and Signal Processing, Proceedings (ICASSP), 1999. [45] Andre Esteva, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter, Helen M Blau, and Sebastian Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115–118, 2017. [46] Martin Faulhaber, Hannes Gatterer, Thomas Haider, Tobias Linser, Nikolaus Netzer, and Martin Burtscher.
Heart rate and blood pressure responses during hypoxic cycles of a 3-week intermittent hypoxia breathing program in patients at risk for or with mild COPD. International Journal of Chronic Obstructive Pulmonary Disease, 2015. [47] Riccardo Favilla, Veronica Chiara Zuccala, and Giuseppe Coppini. Heart rate and heart rate variability from single-channel video and ICA integration of multiple signals. IEEE Journal of Biomedical and Health Informatics, Nov. 2018. [48] Aidan Fuller, Zhong Fan, Charles Day, and Chris Barlow. Digital twin: Enabling technologies, challenges and open research. IEEE Access, 8:108952–108971, 2020. [49] W Bruce Fye. A history of the origin, evolution, and impact of electrocardiography. The American Journal of Cardiology, 73(13):937–949, 1994. [50] Eduardo Gil, Michele Orini, Raquel Bailon, José María Vergara, Luca Mainardi, and Pablo Laguna. Photoplethysmography Pulse Rate Variability as A Surrogate Measurement of Heart Rate Variability During Non-stationary Conditions. Physiological Measurement, 2010. [51] Edward Glaessgen and David Stargel. The digital twin paradigm for future NASA and US Air Force vehicles. In 53rd AIAA/ASME/ASCE/AHS/ASC structures, structural dynamics and materials conference 20th AIAA/ASME/AHS adaptive structures conference 14th AIAA, page 1818, 2012. [52] Ary L Goldberger, Luis AN Amaral, Leon Glass, Jeffrey M Hausdorff, Plamen Ch Ivanov, Roger G Mark, Joseph E Mietus, George B Moody, Chung-Kang Peng, and H Eugene Stanley. PhysioBank, PhysioToolkit, and PhysioNet: Components of A New Research Resource for Complex Physiologic Signals. Circulation, 2000. [53] P. J. Green and B. W. Silverman. Nonparametric Regression and Generalized Linear Models. Chapman and Hall, 1990. [54] Michael Grieves. Digital twin: manufacturing excellence through virtual factory replication. White paper, 1:1–7, 2014. [55] Michael Grieves and John Vickers.
Digital twin: Mitigating unpredictable, undesirable emergent behavior in complex systems, pages 85–113. Springer, 2017. [56] Albinas Grunovas, Eugenijus Trinkunas, Alfonsas Buliuolis, Eurelija Venskaityte, and Jonas Poderys. Cardiovascular response to breath-holding explained by changes of the indices and their dynamic interactions. Biological Systems: Open Access, 2016. [57] Alessandro R Guazzi, Mauricio Villarroel, João Jorge, Jonathan Daly, Matthew C Frise, Peter A Robbins, and Lionel Tarassenko. Non-contact Measurement of Oxygen Saturation with An RGB Camera. Biomed. Opt. Express, Sep. 2015. [58] Hadi Habibzadeh, Karthik Dinesh, Omid Rajabi Shishvan, Andrew Boggio-Dandry, Gaurav Sharma, and Tolga Soyata. A Survey of Healthcare Internet of Things (HIoT): A Clinical Perspective. IEEE Internet of Things J., 2020. [59] Adi Hajj-Ahmad, Ravi Garg, and Min Wu. Instantaneous frequency estimation and localization for ENF signals. In Proc. 4th Annu. Summit and Conf. (APSIPA). IEEE, Dec. 2012. [60] John Hampton and Joanna Hampton. The ECG Made Easy E-Book. Elsevier Health Sciences, 2019. [61] Awni Y Hannun, Pranav Rajpurkar, Masoumeh Haghpanahi, Geoffrey H Tison, Codie Bourn, Mintu P Turakhia, and Andrew Y Ng. Cardiologist-level Arrhythmia Detection and Classification in Ambulatory Electrocardiograms Using a Deep Neural Network. Nature Medicine, 25(1):65–69, 2019. [62] Awni Y Hannun, Pranav Rajpurkar, Masoumeh Haghpanahi, Geoffrey H Tison, Codie Bourn, Mintu P Turakhia, and Andrew Y Ng. Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nature Medicine, 25(1):65–69, 2019. [63] Simon S Haykin. Adaptive filter theory. Pearson Education India, 2008. [64] Lara J. Herbert and Iain H. Wilson. Pulse oximetry in low-resource settings. Breathe, 9(2):90–98, 2012. [65] Jason S Hoffman, Varun Viswanath, Xinyi Ding, Matthew J Thompson, Eric C Larson, Shwetak N Patel, and Edward Wang.
Smartphone camera oximetry in an induced hypoxemia study. arXiv preprint arXiv:2104.00038, 2021. [66] Holter monitor. https://www.hopkinsmedicine.org/health/treatment-tests-and-therapies/holter-monitor. Accessed: 2022-07-14. [67] How to use the Blood Oxygen app on Apple Watch Series 6. https://support.apple.com/en-us/HT211027. Accessed: 2021-05-17. [68] International Organization for Standardization. Particular requirements for basic safety and essential performance of pulse oximeter equipment, 2011. [69] Luca Iozzia, Luca Cerina, and Luca Mainardi. Relationships between heart-rate variability and pulse-rate variability obtained from video-PPG signal using ZCA. Physiological Measurement, Sep. 2016. [70] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning. Springer, 2013. [71] In Cheol Jeong and Joseph Finkelstein. Introducing contactless blood pressure assessment using a high speed video camera. Journal of Medical Systems, Apr. 2016. [72] Zhuolin Jiang, Zhe Lin, and Larry S Davis. Label Consistent K-SVD: Learning a Discriminative Dictionary for Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013. [73] Anders Johansson. Neural Network for Photoplethysmographic Respiratory Rate Monitoring. Medical and Biological Engineering and Computing, 2003. [74] Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMIC-III, A Freely Accessible Critical Care Database. Scientific Data, 2016. [75] Anand Kumar Joshi, Arun Tomar, and Mangesh Tomar. A Review Paper on Analysis of Electrocardiograph (ECG) Signal for the Detection of Arrhythmia Abnormalities. International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering, 2014. [76] KardiaMobile: Check in on your heart from home. https://store.kardia.com/products/kardiamobile.
Accessed: 2022-01-25. [77] Walter Karlen, Srinivas Raman, J Mark Ansermino, and Guy A Dumont. Multiparameter Respiratory Rate Estimation from the Photoplethysmogram. IEEE Trans. Biomed. Eng., 60(7):1946–1953, 2013. [78] Emroz Khan, Forsad Al Hossain, Shiekh Zia Uddin, S Kaisar Alam, and Md Kamrul Hasan. A Robust Heart Rate Monitoring Scheme Using Photoplethysmographic Signals Corrupted by Intense Motion Artifacts. IEEE Transactions on Biomedical Engineering, 63(3):550–562, 2016. [79] Paul Kligfield, Leonard S Gettes, James J Bailey, Rory Childers, Barbara J Deal, E William Hancock, Gerard Van Herpen, Jan A Kors, Peter Macfarlane, David M Mirvis, et al. Recommendations for the standardization and interpretation of the electrocardiogram: part I: the electrocardiogram and its technology. A scientific statement from the American Heart Association Electrocardiography and Arrhythmias Committee, Council on Clinical Cardiology; the American College of Cardiology Foundation; and the Heart Rhythm Society. Endorsed by the International Society for Computerized Electrocardiology. Journal of the American College of Cardiology, 49(10):1109–1127, 2007. [80] Lingqin Kong, Yuejin Zhao, Liquan Dong, Yiyun Jian, Xiaoli Jin, Bing Li, Yun Feng, Ming Liu, Xiaohua Liu, and Hong Wu. Non-contact detection of oxygen saturation based on visible light imaging device using ambient light. Opt. Exp., Jul. 2013. [81] John K Kruschke. Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 142(2):573, 2013. [82] John K Kruschke. Rejecting or accepting parameter values in Bayesian estimation. Advances in Methods and Practices in Psychological Science, 2018. [83] John K Kruschke. Bayesian analysis reporting guidelines. Nature Human Behaviour, pages 1–10, 2021. [84] John K Kruschke and Torrin M Liddell. Bayesian data analysis for newcomers. Psychonomic Bulletin & Review, 25(1):155–177, 2018. [85] John K Kruschke and Torrin M Liddell.
The Bayesian new statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review, 25(1):178–206, 2018. [86] Aparna Kumari, Sudeep Tanwar, Sudhanshu Tyagi, and Neeraj Kumar. Fog computing for Healthcare 4.0 environment: Opportunities and challenges. Computers and Electrical Engineering, 72:1–13, 2018. [87] Zachary Mcbride Lazri, Qiang Zhu, Mingliang Chen, Min Wu, and Quanzeng Wang. Detecting essential landmarks directly in thermal images for remote body temperature and respiratory rate measurement with a two-phase system. IEEE Access, 10:39080–39094, 2022. [88] Jingshan Li and Pascale Carayon. Health Care 4.0: A vision for smart and connected health care. IISE Transactions on Healthcare Systems Engineering, 11(3):171–180, 2021. [89] Kai Li, Zhengming Ding, Sheng Li, and Yun Fu. Discriminative Semi-coupled Projective Dictionary Learning for Low-resolution Person Re-identification. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018. [90] Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research, Apr. 2018. [91] Xiaobai Li, Jie Chen, Guoying Zhao, and Matti Pietikainen. Remote heart rate measurement from face videos under realistic situations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4264–4271, 2014. [92] Yuenan Li, Xin Tian, Qiang Zhu, and Min Wu. A Lightweight Neural Network for Inferring ECG and Diagnosing Cardiovascular Diseases from PPG. arXiv preprint arXiv:2012.04949, 2020. Under preparation for journal submission. [93] Zhicheng Li, Hong Huang, and Satyajayant Misra. Compressed Sensing via Dictionary Learning and Approximate Message Passing for Multimedia Internet of Things. IEEE Internet of Things J., 2017. [94] Tong Liu, Yujuan Si, Dunwei Wen, Mujun Zang, and Liuqi Lang.
Dictionary Learning for VQ Feature Extraction in ECG Beats Classification. Expert Systems with Applications, 2016. [95] Ying Liu, Lin Zhang, Yuan Yang, Longfei Zhou, Lei Ren, Fei Wang, Rong Liu, Zhibo Pang, and M Jamal Deen. A novel cloud-based framework for the elderly healthcare services using digital twin. IEEE Access, 7:49088–49101, 2019. [96] Zhiyuan Lu, Xiang Chen, Zhongfei Dong, Zhangyan Zhao, and Xu Zhang. A Prototype of Reflection Pulse Oximeter Designed for Mobile Healthcare. IEEE Journal of Biomed. Health Informat., Aug. 2015. [97] Julien Mairal, Jean Ponce, Guillermo Sapiro, Andrew Zisserman, and Francis R Bach. Supervised Dictionary Learning. In Advances in Neural Information Processing Systems, 2009. [98] Angshul Majumdar and Rabab Ward. Robust Greedy Deep Dictionary Learning for ECG Arrhythmia Classification. In IEEE International Joint Conference on Neural Networks (IJCNN), 2017. [99] Joshua Mathew*, Xin Tian*, Chau-Wai Wong, Simon Ho, Donald Milton, and Min Wu. Remote Blood Oxygen Estimation From Videos Using Neural Networks. arXiv preprint arXiv:2107.05087, 2021. Submitted for journal publication (* for equal contribution). [100] Daniel McDuff, Sarah Gontarek, and Rosalind Picard. Remote measurement of cognitive stress via heart rate variability. In 2014 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pages 2957–2960. IEEE, 2014. [101] Monitor your heart rate with Apple Watch. https://support.apple.com/en-us/HT204666. Accessed: 2022-07-29. [102] Yunyoung Nam, Bersain A Reyes, and Ki H Chon. Estimation of respiratory rates using the built-in microphone of a smartphone or headset. IEEE Journal of Biomedical and Health Informatics, Sep. 2015. [103] Angela Navarrete-Opazo and Gordon S Mitchell. Therapeutic potential of intermittent hypoxia: a matter of dose. American Journal of Physiology-Regulatory, Integrative and Comparative Physiology, 307(10):R1181–R1197, 2014.
[104] Andrea Nemcova, Ivana Jordanova, Martin Varecka, Radovan Smisek, Lucie Marsanova, Lukas Smital, and Martin Vitek. Monitoring of heart rate, blood oxygen saturation, and blood pressure using a smartphone. Biomedical Signal Processing and Control, May 2020. [105] Masataka Nishiga, Dao Wen Wang, Yaling Han, David B Lewis, and Joseph C Wu. COVID-19 and cardiovascular disease: from basic mechanisms to clinical perspectives. Nature Reviews Cardiology, 17(9):543–558, 2020. [106] Meir Nitzan, Ayal Romem, and Robert Koppel. Pulse oximetry: Fundamentals and technology update. Medical Devices (Auckland, NZ), 7:231, 2014. [107] Xuesong Niu, Shiguang Shan, Hu Han, and Xilin Chen. RhythmNet: End-to-end heart rate estimation from face via spatial-temporal representation. IEEE Trans. on Image Processing, Oct. 2019. [108] Optical Absorption of Hemoglobin. https://omlc.org/spectra/hemoglobin/. Accessed: 2021-03-09. [109] Nobuyuki Otsu. A Threshold Selection Method from Gray-level Histograms. IEEE Trans. Syst., Man, and Cybernet., Jan. 1979. [110] Jiapu Pan and Willis J. Tompkins. A Real-time QRS Detection Algorithm. IEEE Transactions on Biomedical Engineering, 1985. [111] Neeraj Paradkar and Shubhajit Roy Chowdhury. Cardiac Arrhythmia Detection Using Photoplethysmography. In 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pages 113–116. IEEE, 2017. [112] Judea Pearl. Causality. Cambridge University Press, 2009. [113] Marco A. F. Pimentel, Alistair E. W. Johnson, Peter H. Charlton, Drew Birrenkott, Peter J. Watkinson, Lionel Tarassenko, and David A. Clifton. Toward a robust estimation of respiratory rate from pulse oximeters. IEEE Transactions on Biomedical Engineering, 64(8):1914–1923, 2017. [114] Annette Plüddemann, Matthew Thompson, Carl Heneghan, and Christopher Price. Pulse Oximetry in Primary Care: Primary Care Diagnostic Technology Update. British Journal of General Practice, May 2011.
[115] Ming-Zher Poh, Daniel J McDuff, and Rosalind W Picard. Advancements in noncontact, multiparameter physiological measurements using a webcam. IEEE Transactions on Biomedical Engineering, Oct. 2010. [116] R Package for BEST: Bayesian Estimation Supersedes the t-Test. https://CRAN.R-project.org/package=BEST. Accessed: 2021-09-30. [117] Natasa Reljin, Gary Zimmer, Yelena Malyuta, Yitzhak Mendelson, Chad E Darling, and Ki H Chon. Detection of blood loss in trauma patients using time-frequency analysis of photoplethysmographic signal. In 2016 IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI), pages 118–121. IEEE, 2016. [118] Natasa Reljin, Gary Zimmer, Yelena Malyuta, Kirk Shelley, Yitzhak Mendelson, David J Blehar, Chad E Darling, and Ki H Chon. Using support vector machines on photoplethysmographic signals to discriminate between hypovolemia and euvolemia. PLoS One, 13(3):e0195087, 2018. [119] Alessandra Rosa and Roberto Cesar Betini. Noncontact SpO2 Measurement Using Eulerian Video Magnification. IEEE Trans. Instrum. Meas., May 2019. [120] Anna Rosiek and Krzysztof Leksowski. The Risk Factors and Prevention of Cardiovascular Disease: The Importance of Electrocardiogram in the Diagnosis and Treatment of Acute Coronary Syndrome. Therapeutics and Clinical Risk Management, 2016. [121] Gregory A Roth, George A Mensah, Catherine O Johnson, Giovanni Addolorato, Enrico Ammirati, Larry M Baddour, Noël C Barengo, Andrea Z Beaton, Emelia J Benjamin, and Catherine P Benziger. Global burden of cardiovascular diseases and risk factors, 1990–2019: update from the GBD 2019 study. Journal of the American College of Cardiology, 76(25):2982–3021, 2020. [122] Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. Toward causal representation learning. Proceedings of the IEEE, 109(5):612–634, 2021.
[123] Christopher G Scully, Jinseok Lee, Joseph Meyer, Alexander M Gorbach, Domhnull Granquist-Fraser, Yitzhak Mendelson, and Ki H Chon. Physiological Parameter Monitoring from Optical Recordings with A Mobile Phone. IEEE Trans. Biomed. Eng., Jul. 2011. [124] Hooman Sedghamiz. BioSigKit: A Matlab Toolbox and Interface for Analysis of BioSignals. Journal of Open Source Software, 2018. [125] Hooman Sedghamiz and Daniele Santonocito. Unsupervised Detection and Classification of Motor Unit Action Potentials in Intramuscular Electromyography Signals. In IEEE E-health and Bioengineering Conference (EHB), 2015. [126] Servier Medical Art. https://smart.servier.com/?s=heart. Accessed: 2022-07-14. [127] John W Severinghaus. Takuo Aoyagi: Discovery of pulse oximetry. Anesthesia & Analgesia, Dec. 2007. [128] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014. [129] Dangdang Shao, Chenbin Liu, Francis Tsow, Yuting Yang, Zijian Du, Rafael Iriya, Hui Yu, and Nongjian Tao. Noncontact Monitoring of Blood Oxygen Saturation Using Camera and Dual-wavelength Imaging System. IEEE Trans. Biomed. Eng, Sep. 2015. [130] Niraj Shenoy, Rebecca Luchtel, and Perminder Gulani. Considerations for Target Oxygen Saturation in COVID-19 Patients: Are We Under-shooting? BMC Medicine, Dec. 2020. [131] M. Celeste Simon and Brian Keith. The Role of Oxygen Availability in Embryonic Development and Stem Cell Function. Nature Reviews Molecular Cell Biology, Apr. 2008. [132] Kwanghyun Sohn, Faisal M Merchant, Omid Sayadi, Dheeraj Puppala, Rajiv Doddamani, Ashish Sahani, Jagmeet P Singh, E Kevin Heist, Eric M Isselbacher, and Antonis A Armoundas. A novel point-of-care smartphone based system for monitoring the cardiac and respiratory systems. Scientific Reports, Mar. 2017. [133] Radim Špetlík, Vojtech Franc, and Jiří Matas. Visual heart rate estimation with convolutional neural network.
In British Machine Vision Conf., Newcastle, UK, Sep. 2018. [134] Steven R Steinhubl, Jill Waalen, Alison M Edwards, Lauren M Ariniello, Rajesh R Mehta, Gail S Ebner, Chureen Carter, Katie Baca-Motes, Elise Felicione, Troy Sarich, et al. Effect of a home-based wearable continuous ECG monitoring patch on detection of undiagnosed atrial fibrillation: the mSToPS randomized clinical trial. JAMA, 320(2):146–155, 2018. [135] Yu Sun and Nitish Thakor. Photoplethysmography revisited: from contact to noncontact, from point to imaging. IEEE Transactions on Biomedical Engineering, Sep. 2015. [136] Zhiyuan Sun, Qinghua He, Yuandong Li, Wendy Wang, and Ruikang K Wang. Robust non-contact peripheral oxygenation saturation measurement using smartphone-enabled imaging photoplethysmography. Biomed. Opt. Exp., 12(3):1746–1760, Mar. 2021. [137] M Suresh and Urmila Natarajan. Healthcare 4.0: Recent advances and futuristic research avenues. Materials Today: Proceedings, 2021. [138] Take an ECG with the ECG app on Apple Watch. https://support.apple.com/en-us/HT208955. Accessed: 2022-01-25. [139] Lionel Tarassenko, Mauricio Villarroel, Alessandro Guazzi, João Jorge, DA Clifton, and Chris Pugh. Non-contact Video-based Vital Sign Monitoring Using Ambient Light and Auto-regressive Models. Physiol. Meas, Mar. 2014. [140] İsmail Tayfur and Mustafa Ahmet Afacan. Reliability of smartphone measurements of vital parameters: A prospective study using a reference method. The American J. Emergency Medicine, 37(8):1527–1530, Aug. 2019. [141] Jason Teo. Early Detection of Silent Hypoxia in COVID-19 Pneumonia Using Smartphone Pulse Oximetry. Journal of Medical Systems, Aug. 2020. [142] Xin Tian, Chau-Wai Wong, Sushant M Ranadive, and Min Wu. A Multi-Channel Ratio-of-Ratios Method for Noncontact Hand Video Based SpO2 Monitoring Using Smartphone Cameras. IEEE Journal of Selected Topics in Signal Processing, 16(2):197–207, 2022. [143] Xin Tian, Qiang Zhu, Yuenan Li, and Min Wu.
Cross-Domain Joint Dictionary Learning for ECG Reconstruction from PPG. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 936–940, 2020. [144] Xin Tian, Qiang Zhu, Yuenan Li, and Min Wu. Cross-domain Joint Dictionary Learning for ECG Inference from PPG. arXiv preprint arXiv:2101.02362, 2021. Submitted for journal publication. [145] Martin J Tobin, Franco Laghi, and Amal Jubran. Why COVID-19 Silent Hypoxemia is Baffling to Physicians. American Journal of Respiratory and Critical Care Medicine, Aug. 2020. [146] Joel A Tropp and Anna C Gilbert. Signal Recovery from Random Measurements via Orthogonal Matching Pursuit. IEEE Transactions on Information Theory, 2007. [147] Hsin-Yi Tsai, Kuo-Cheng Huang, and J Andrew Yeh. No-contact oxygen saturation measuring technology for skin tissue and its application. IEEE Instrum. Meas. Magazine, Sep. 2016. [148] Sergey Tulyakov, Xavier Alameda-Pineda, Elisa Ricci, Lijun Yin, Jeffrey F Cohn, and Nicu Sebe. Self-adaptive matrix completion for heart rate estimation from face videos under realistic conditions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2396–2404, 2016. [149] Mark Van Gastel, Sander Stuijk, and Gerard De Haan. New Principle for Measuring Arterial Blood Oxygenation, Enabling Motion-Robust Remote Monitoring. Scientific Reports, Dec. 2016. [150] Mark van Gastel, Wim Verkruysse, and Gerard de Haan. Data-driven Calibration Estimation for Robust Remote Pulse-oximetry. Applied Sciences, Jan. 2019. [151] JP Varshney. Electrocardiography in Veterinary Medicine. Springer, 2020. [152] Wim Verkruysse, Lars O Svaasand, and J Stuart Nelson. Remote plethysmographic imaging using ambient light. Opt. Exp., Dec. 2008. [153] Adriana N Vest, Giulia Da Poian, Qiao Li, Chengyu Liu, Shamim Nemati, Amit J Shah, and Gari D Clifford. An Open Source Benchmarked Toolbox for Cardiovascular Waveform and Interval Analysis. Physiological Measurement, 2018.
[154] Khuong Vo, Emad Kasaeyan Naeini, Amir Naderi, Daniel Jilani, Amir M Rahmani, Nikil Dutt, and Hung Cao. P2E-WGAN: ECG waveform synthesis from PPG with conditional Wasserstein generative adversarial networks. In Proceedings of the 36th Annual ACM Symposium on Applied Computing, pages 1030–1036, 2021. [155] Shenlong Wang, Lei Zhang, Yan Liang, and Quan Pan. Semi-coupled Dictionary Learning with Applications to Image Super-resolution and Photo-sketch Synthesis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012. [156] Wenjin Wang, Albertus C den Brinker, Sander Stuijk, and Gerard De Haan. Algorithmic principles of remote PPG. IEEE Trans. on Biomedical Eng., Sep. 2016. [157] Wenjin Wang, Sander Stuijk, and Gerard De Haan. A novel algorithm for remote photoplethysmography: Spatial subspace rotation. IEEE Transactions on Biomedical Engineering, Dec. 2015. [158] Larry Wasserman. All of Statistics: A Concise Course in Statistical Inference, volume 26. Springer, 2004. [159] John G Webster. Design of Pulse Oximeters. CRC Press, Oct. 1997. [160] Taiyang Wu, Fan Wu, Chunkai Qiu, Jean-Michel Redouté, and Mehmet Rasit Yuce. A Rigid-Flex Wearable Health Monitoring Sensor Patch for IoT-Connected Healthcare Applications. IEEE Internet of Things Journal, 2020. [161] Jian Xu, Chun Qi, and Zhiguo Chang. Coupled K-SVD Dictionary Training for Super-resolution. In IEEE International Conference on Image Processing (ICIP), 2014. [162] Jianchao Yang, Zhaowen Wang, Zhe Lin, Scott Cohen, and Thomas Huang. Coupled Dictionary Training for Image Super-resolution. IEEE Transactions on Image Processing, 2012. [163] Jianchao Yang, John Wright, Thomas S Huang, and Yi Ma. Image Super-resolution via Sparse Representation. IEEE Transactions on Image Processing, 2010. [164] Mengyue Yang, Furui Liu, Zhitang Chen, Xinwei Shen, Jianye Hao, and Jun Wang. CausalVAE: Disentangled representation learning via neural structural causal models.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9593–9602, 2021. [165] Gülendam Hakverdioğlu Yönt, Esra Akin Korhan, and Berna Dizer. The effect of nail polish on pulse oximetry readings. Intensive and Critical Care Nursing, Apr. 2014. [166] Yue Yu, Jie Chen, Tian Gao, and Mo Yu. DAG-GNN: DAG structure learning with graph neural networks. In International Conference on Machine Learning, pages 7154–7163. PMLR, 2019. [167] Gaobo Zhang, Zhen Mei, Yuan Zhang, Xuesheng Ma, Benny Lo, Dongyi Chen, and Yuanting Zhang. A Noninvasive Blood Glucose Monitoring System Based on Smartphone PPG Signal Processing and Machine Learning. IEEE Transactions on Industrial Informatics, 16(11):7209–7218, 2020. [168] Zheng Zhang, Yong Xu, Jian Yang, Xuelong Li, and David Zhang. A Survey of Sparse Representation: Algorithms and Applications. IEEE Access, 2015. [169] Zhilin Zhang, Zhouyue Pi, and Benyuan Liu. TROIKA: A General Framework for Heart Rate Monitoring Using Wrist-type Photoplethysmographic Signals During Intensive Physical Exercise. IEEE Trans. Biomed. Eng., 2014. [170] Yufeng Zheng, Hongyu Wang, and Yingguang Hao. Mobile application for monitoring body temperature from facial images using convolutional neural network and support vector machine. Mobile Multimedia/Image Processing, Security, and Applications, April 2020. [171] Qiang Zhu. Robust and Analytical Cardiovascular Sensing. PhD thesis, University of Maryland, College Park, 2020. [172] Qiang Zhu, Mingliang Chen, Chau-Wai Wong, and Min Wu. Adaptive multi-trace carving based on dynamic programming. In 2018 52nd Asilomar Conference on Signals, Systems, and Computers, pages 1716–1720. IEEE, 2018. [173] Qiang Zhu, Mingliang Chen, Chau-Wai Wong, and Min Wu. Adaptive Multi-Trace Carving for Robust Frequency Tracking in Forensic Applications. IEEE Trans. Inf. Forensics Security, May 2020. [174] Qiang Zhu, Xin Tian, Chau-Wai Wong, and Min Wu.
ECG Reconstruction via PPG: A Pilot Study. In IEEE EMBS International Conference on Biomedical & Health Informatics (BHI), Chicago, IL, May 2019. [175] Qiang Zhu, Xin Tian, Chau-Wai Wong, and Min Wu. Learning Your Heart Actions From Pulse: ECG Waveform Reconstruction From PPG. IEEE Internet of Things Journal, 8(23):16734–16748, 2021. [176] Qiang Zhu, Chau-Wai Wong, Chang-Hong Fu, and Min Wu. Fitness heart rate measurement using face videos. In IEEE International Conference on Image Processing (ICIP), Sep. 2017.