TY - JOUR
T1 - A comparison study on infant-parent voice diarization
AU - Zhu, Junzhe
AU - Hasegawa-Johnson, Mark
AU - McElwain, Nancy L.
N1 - Funding Information:
Thanks to Jiahao Xu from the University of Sydney for help with the server. This work was supported by funding from the National Institute on Drug Abuse (R34DA050256-01), the National Institute of Mental Health (R21MH112578-01) and the National Institute of Food and Agriculture, U.S. Department of Agriculture (ILLU-793-339).
Publisher Copyright:
© 2021 IEEE
PY - 2021
Y1 - 2021
N2 - We design a framework for studying prelinguistic child voice from 3 to 24 months of age, based on state-of-the-art diarization algorithms. Our system consists of a time-invariant feature extractor, a context-dependent embedding generator, and a classifier. We study the effect of swapping out different components of the system, as well as changing the loss function, to find the best-performing configuration. We also present a multiple-instance learning technique that allows us to pre-train our parameters on larger datasets with coarser segment boundary labels. We found that our best system achieved 43.8% DER on the test dataset, compared to 55.4% DER achieved by the LENA software. We also found that using a convolutional feature extractor instead of log-mel features significantly improves the performance of neural diarization.
AB - We design a framework for studying prelinguistic child voice from 3 to 24 months of age, based on state-of-the-art diarization algorithms. Our system consists of a time-invariant feature extractor, a context-dependent embedding generator, and a classifier. We study the effect of swapping out different components of the system, as well as changing the loss function, to find the best-performing configuration. We also present a multiple-instance learning technique that allows us to pre-train our parameters on larger datasets with coarser segment boundary labels. We found that our best system achieved 43.8% DER on the test dataset, compared to 55.4% DER achieved by the LENA software. We also found that using a convolutional feature extractor instead of log-mel features significantly improves the performance of neural diarization.
KW - Child Speech
KW - Language Development
KW - Multiple Instance Learning
KW - Speaker Diarization
KW - Transfer Learning
KW - Voice Activity Detection
UR - http://www.scopus.com/inward/record.url?scp=85115056870&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85115056870&partnerID=8YFLogxK
U2 - 10.1109/ICASSP39728.2021.9413538
DO - 10.1109/ICASSP39728.2021.9413538
M3 - Conference article
AN - SCOPUS:85115056870
SN - 1520-6149
VL - 2021-June
SP - 7178
EP - 7182
JO - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
JF - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
T2 - 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2021
Y2 - 6 June 2021 through 11 June 2021
ER -