|Alternative Title||Depression Recognition with Audios Collected under Natural Environment|
|Place of Conferral||Institute of Psychology, Chinese Academy of Sciences|
|Keyword||speech recognition; depression; natural setting; age; region; speech duration|
The incidence of depression has been rising year by year, while traditional diagnostic methods are susceptible to subjective factors and depend on the patient's cooperation. An automated depression recognition model can be built by applying machine learning algorithms to the acoustic features of patients' speech. Because the speech samples are not restricted to particular text content, such a method could be deployed quickly in primary healthcare institutions or used by patients themselves on a self-service device, which would help considerably with the early diagnosis of and intervention in depression. Previous studies were based on audio collected in laboratory environments, and some even designed interviews specifically to elicit certain emotions in the subjects. Such designs add to the complexity of the experiment and the cost of depression recognition. In addition, because data from depressed patients are difficult to collect, previous studies relied on small samples, which limited the analysis of covariates such as age, region, and audio duration, as well as the generalizability of the results. Domestic research on speech-based depression recognition is still at an early stage and needs further development.
To address these drawbacks (small sample size, artificial recording environment, and limited covariate analysis), this study centers on 2 main topics:
I. We use Chinese speech material collected in a natural environment to study speech-based discrimination between depressed and healthy subjects.
When comparing audio features between depressed and healthy subjects, we found evidence consistent with previous studies, such as differences in F0, loudness, and MFCC. In the binary classification tests, accuracy exceeded 60% on all 7 demographic questions and reached 70% on some of them. This confirms that it is feasible to recognize depression from audio collected in a natural environment.
II. The audio comes from more than 1600 subjects, which makes covariate analysis possible.
First, the correlation between subjects' age and region and their audio features was examined in 2 experiments. We then reorganized the samples by splitting them by age, region, or both, to obtain several more homogeneous sub-datasets, and ran the classification on each split. Classification results were better for subjects from the south of China than for those from the north, and better for younger subjects (30-44) than for older ones (45-60). These 2 findings may be caused by phonetic differences between southern and northern dialects, or by physical differences in voice quality among the groups. The best accuracy was 74.83%, on samples from Jiangsu province.
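The splitting step can be illustrated with a minimal Python sketch. It is not the thesis code: the record layout (`age`, `region`, `label`), the coarse `south`/`north` region values, and the helper names are assumptions made for illustration; only the two age bands come from the abstract.

```python
from collections import defaultdict

# Age bands as reported in the abstract; the names "younger"/"older" are illustrative.
AGE_BANDS = {"younger": (30, 44), "older": (45, 60)}


def age_band(age):
    """Return the name of the age band containing `age`, or None if outside both."""
    for name, (low, high) in AGE_BANDS.items():
        if low <= age <= high:
            return name
    return None


def split_samples(samples, by=("age", "region")):
    """Group samples into sub-datasets keyed by age band, region, or both."""
    groups = defaultdict(list)
    for sample in samples:
        key = []
        if "age" in by:
            band = age_band(sample["age"])
            if band is None:
                continue  # skip subjects outside the studied age range
            key.append(band)
        if "region" in by:
            key.append(sample["region"])
        groups[tuple(key)].append(sample)
    return dict(groups)


# Hypothetical records; label 1 = depressed, 0 = healthy.
samples = [
    {"id": 1, "age": 32, "region": "south", "label": 1},
    {"id": 2, "age": 41, "region": "south", "label": 0},
    {"id": 3, "age": 50, "region": "north", "label": 1},
    {"id": 4, "age": 58, "region": "south", "label": 0},
    {"id": 5, "age": 25, "region": "north", "label": 0},
]

subsets = split_samples(samples)               # split by age band and region
by_age = split_samples(samples, by=("age",))   # split by age band only
```

A classifier would then be trained and evaluated separately on each value of `subsets`, so that results can be compared across the more homogeneous groups.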
A series of experiments was also carried out to analyze the effect of sample size and audio duration. Sample size was correlated with classification results in some cases, and longer audio clips, which yield more features, could also improve classification.
All classifications in our study are based on a decision-fusion machine learning model that uses the SVM algorithm for classification. A naive Bayes classifier fuses the results of 12 classifiers and outputs the final prediction. The 12 classifiers run with different feature reduction algorithms, which select features differently; decision fusion balances them and yields a more stable result at an upper-middle level of performance.
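The fusion step can be sketched as a Bernoulli naive Bayes model over the binary decisions of the base classifiers. This is an illustrative sketch under stated assumptions, not the thesis implementation: the function names, the Laplace smoothing parameter `alpha`, and the toy data are hypothetical, and 3 base classifiers stand in for the 12 described above.

```python
import math


def fit_fusion(decisions, labels, alpha=1.0):
    """Fit a Bernoulli naive Bayes model over base-classifier decisions.

    decisions: list of rows, each a list of 0/1 votes (one per base classifier).
    labels:    list of 0/1 true labels, one per row.
    alpha:     Laplace smoothing constant (an assumption of this sketch).
    """
    n_classifiers = len(decisions[0])
    model = {}
    for cls in (0, 1):
        rows = [d for d, y in zip(decisions, labels) if y == cls]
        prior = (len(rows) + alpha) / (len(labels) + 2 * alpha)
        # Smoothed estimate of P(vote_j = 1 | class = cls) for each base classifier j.
        p_one = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_classifiers)
        ]
        model[cls] = (prior, p_one)
    return model


def fuse(model, votes):
    """Return the class with the highest log-posterior given the base votes."""
    scores = {}
    for cls, (prior, p_one) in model.items():
        score = math.log(prior)
        for vote, p in zip(votes, p_one):
            score += math.log(p if vote == 1 else 1.0 - p)
        scores[cls] = score
    return max(scores, key=scores.get)


# Toy held-out decisions: classifier 0 is reliable, classifiers 1 and 2 are noisy.
decisions = [
    [1, 1, 0], [1, 0, 1], [1, 1, 1],   # true label 1
    [0, 0, 1], [0, 1, 0], [0, 0, 0],   # true label 0
]
labels = [1, 1, 1, 0, 0, 0]

model = fit_fusion(decisions, labels)
fused_a = fuse(model, [1, 0, 0])  # reliable classifier votes 1, the noisy ones vote 0
fused_b = fuse(model, [0, 1, 1])  # reliable classifier votes 0, the noisy ones vote 1
```

In this toy example the fused decision follows the reliable classifier even when it is outvoted, which is the kind of balancing that a learned fusion layer offers over a plain majority vote.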
Unlike the full recordings used in past studies, the audio clips in our study were extracted from interview recordings, which breaks the integrity of the audio to some degree. Consequently, speech rate and pause duration features were not used in our experiments. A single attempt at denoising did not improve the prediction results. In future studies, besides audio integrity and denoising, sample sizes should be balanced across age, region, and gender so that the effect of covariates is minimized. Audio samples should not be too short; at least 5 seconds is preferable. Fusing audio features with text, body gesture, and facial features is also a promising way to improve recognition accuracy.
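The two data-preparation recommendations above (a minimum clip duration and balanced group sizes) could be applied roughly as follows. This is a hedged sketch: the clip record layout, the `filter_and_balance` helper, and balancing by random downsampling are assumptions for illustration; only the 5-second threshold comes from the abstract.

```python
import random
from collections import defaultdict

MIN_DURATION_S = 5.0  # minimum clip length suggested in the abstract


def filter_and_balance(clips, group_key, min_duration=MIN_DURATION_S, seed=0):
    """Drop short clips, then downsample every group to the smallest group's size."""
    kept = [c for c in clips if c["duration_s"] >= min_duration]
    groups = defaultdict(list)
    for clip in kept:
        groups[group_key(clip)].append(clip)
    if not groups:
        return []
    target = min(len(members) for members in groups.values())
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    balanced = []
    for key in sorted(groups):
        balanced.extend(rng.sample(groups[key], target))
    return balanced


# Hypothetical clip records.
clips = [
    {"id": 1, "duration_s": 3.2, "region": "south"},
    {"id": 2, "duration_s": 6.0, "region": "south"},
    {"id": 3, "duration_s": 8.5, "region": "south"},
    {"id": 4, "duration_s": 7.1, "region": "north"},
    {"id": 5, "duration_s": 5.0, "region": "south"},
]

balanced = filter_and_balance(clips, group_key=lambda c: c["region"])
```

Passing a different `group_key`, for example one that returns a tuple of age band, region, and gender, would balance across several covariates at once.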
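|Sui Xiaoyun (隋小芸). 自然情境语音识别抑郁症的研究 [Depression Recognition with Audios Collected under Natural Environment] [D]. Institute of Psychology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, 2017.|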
|Files in This Item:|
|隋小芸-硕士学位论文.pdf (9725KB)||Thesis||Restricted access||CC BY-NC-SA||Application Full Text|
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.