Subtopic Deep Dive
Speech Emotion Recognition
Research Guide
What is Speech Emotion Recognition?
Speech Emotion Recognition (SER) extracts acoustic features from speech signals, such as prosody and spectral patterns, and uses them to classify emotions such as anger, happiness, or sadness.
SER models are typically trained on datasets such as IEMOCAP and EmoDB. Deep learning approaches, including CNN-LSTM networks (Zhao et al., 2018, 971 citations) and DNNs combined with extreme learning machines (Han et al., 2014, 790 citations), dominate recent advances; more than ten papers from 2008-2020 exceed 600 citations each.
Why It Matters
SER enables emotionally aware virtual assistants by detecting user frustration in voice interactions (Mirsamadi et al., 2017). Call centers use it for real-time agent performance analytics via emotion classification from customer calls (Khalil et al., 2019). Automotive systems integrate SER for driver stress monitoring to enhance safety (Shu et al., 2018).
Key Research Challenges
Feature Extraction Variability
Acoustic features like MFCCs vary across speakers and languages, reducing model generalization (Zhao et al., 2018). Deep embeddings help but require large labeled datasets (Han et al., 2014). No universal feature set exists for all emotions.
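To make the feature-extraction step concrete, below is a minimal NumPy sketch of frame-level acoustic features (log-energy as a prosodic proxy, zero-crossing rate, and spectral centroid as a spectral proxy). Frame sizes assume 16 kHz audio with 25 ms frames and 10 ms hops; production systems typically compute MFCCs with toolkits such as openSMILE or librosa rather than by hand.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Slice a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def frame_features(x, sr=16000):
    """Per-frame log-energy, zero-crossing rate, and spectral centroid."""
    frames = frame_signal(x)
    win = frames * np.hanning(frames.shape[1])
    energy = np.log(np.sum(win ** 2, axis=1) + 1e-10)       # prosodic proxy
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    spec = np.abs(np.fft.rfft(win, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], 1.0 / sr)
    centroid = (spec @ freqs) / (np.sum(spec, axis=1) + 1e-10)  # spectral proxy
    return np.stack([energy, zcr, centroid], axis=1)        # (frames, 3)

# toy usage: one second of noise at 16 kHz in place of real speech
feats = frame_features(np.random.default_rng(0).standard_normal(16000))
```

The speaker- and language-dependence noted above shows up directly in such features: pitch range, energy dynamics, and spectral balance all shift across voices, which is why normalization and learned embeddings matter.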
Dataset Imbalance Issues
Datasets like IEMOCAP suffer from class imbalance, skewing models toward dominant classes such as neutral (Khalil et al., 2019). Synthetic data augmentation has been explored but can introduce noise (Mirsamadi et al., 2017).
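A common first-line remedy for the imbalance described above is random oversampling of minority classes before training. The sketch below (NumPy, toy data standing in for IEMOCAP-style labels) resamples every class up to the majority-class count; it is illustrative only, not a substitute for the augmentation strategies discussed in the literature.

```python
import numpy as np

def oversample(X, y, rng=None):
    """Randomly oversample each class up to the majority-class count."""
    rng = rng or np.random.default_rng(0)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=target, replace=True)
        for c in classes
    ])
    return X[idx], y[idx]

# toy imbalanced set: 90 "neutral" (0) vs 10 "angry" (1) utterance features
X = np.arange(100, dtype=float).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)
Xb, yb = oversample(X, y)   # both classes now appear 90 times each
```

Oversampling duplicates minority examples rather than creating new ones, so it avoids the synthesis noise mentioned above at the cost of increased overfitting risk.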
Real-World Noise Robustness
Models trained on clean speech degrade in noisy environments such as phone calls or vehicle cabins (Shu et al., 2018). Local attention mechanisms improve robustness, but at increased computational cost (Mirsamadi et al., 2017).
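The attention mechanism referenced above can be understood as weighted pooling over time: emotionally salient frames receive higher weights than silence or noise. The NumPy sketch below implements generic softmax attention pooling over mock frame-level RNN outputs; the matrix shapes and random inputs are illustrative assumptions, not the exact formulation of Mirsamadi et al. (2017).

```python
import numpy as np

def attention_pool(H, W, u):
    """Soft attention over time: weights = softmax(u . tanh(H W)),
    output = attention-weighted sum of frame vectors."""
    scores = np.tanh(H @ W) @ u             # one score per frame, shape (T,)
    scores -= scores.max()                  # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ H, alpha                 # utterance vector (D,), weights (T,)

rng = np.random.default_rng(0)
T, D, A = 50, 8, 4                          # frames, feature dim, attention dim
H = rng.standard_normal((T, D))             # stand-in for frame-level RNN outputs
W = rng.standard_normal((D, A))
u = rng.standard_normal(A)
utt, alpha = attention_pool(H, W, u)
```

Because the weights sum to one, noisy frames can be effectively down-weighted, which is the intuition behind the robustness gains reported for local attention.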
Essential Papers
Emotion recognition based on physiological changes in music listening
Jonghwa Kim, Elisabeth André · 2008 · IEEE Transactions on Pattern Analysis and Machine Intelligence · 1.1K citations
Little attention has been paid so far to physiological signals for emotion recognition compared to audiovisual emotion channels such as facial expression or speech. This paper investigates the pote...
Speech emotion recognition using deep 1D & 2D CNN LSTM networks
Jianfeng Zhao, Xia Mao, Lijiang Chen · 2018 · Biomedical Signal Processing and Control · 971 citations
A Review of Emotion Recognition Using Physiological Signals
Lin Shu, Jinyan Xie, Mingyue Yang et al. · 2018 · Sensors · 848 citations
Emotion recognition based on physiological signals has been a hot topic and applied in many areas such as safe driving, health care and social security. In this paper, we present a comprehensive re...
Social signal processing: Survey of an emerging domain
Alessandro Vinciarelli, Maja Pantić, Hervé Bourlard · 2008 · Image and Vision Computing · 795 citations
Speech emotion recognition using deep neural network and extreme learning machine
Kun Han, Dong Yu, Ivan Tashev · 2014 · 790 citations
Speech emotion recognition is a challenging problem partly because it is unclear what features are effective for the task. In this paper we propose to utilize deep neural networks (DNNs) to extract...
Emotion recognition using multi-modal data and machine learning techniques: A tutorial and review
Jianhua Zhang, Zhong Yin, Peng Chen et al. · 2020 · Information Fusion · 733 citations
Automatic speech emotion recognition using recurrent neural networks with local attention
Seyedmahdad Mirsamadi, Emad Barsoum, Cha Zhang · 2017 · 711 citations
Automatic emotion recognition from speech is a challenging task which relies heavily on the effectiveness of the speech features used for classification. In this work, we study the use of deep lear...
Reading Guide
Foundational Papers
Start with Han et al. (2014, 790 citations) for a DNN feature-extraction baseline, then Kim & André (2008, 1051 citations) for multimodal context including speech.
Recent Advances
Study Zhao et al. (2018, 971 citations) on CNN-LSTM networks and Mirsamadi et al. (2017, 711 citations) on local attention for state-of-the-art results.
Core Methods
Core techniques: prosodic/spectral features fed to CNN-LSTM (Zhao et al., 2018), end-to-end DNNs (Han et al., 2014), local attention RNNs (Mirsamadi et al., 2017).
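One recurring pattern in these methods is collapsing frame-level model outputs into a single utterance-level representation. The sketch below computes utterance statistics (max, min, mean, fraction above a threshold per class) from mock frame-level posteriors, in the spirit of the DNN-plus-statistics pipeline of Han et al. (2014); the logits are random stand-ins, and the threshold value is an illustrative assumption.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def utterance_stats(frame_logits, thresh=0.2):
    """Collapse frame-level class posteriors into utterance-level statistics:
    per-class max, min, mean, and fraction of frames above a threshold."""
    p = softmax(frame_logits)               # (frames, classes)
    return np.concatenate([
        p.max(axis=0),
        p.min(axis=0),
        p.mean(axis=0),
        (p > thresh).mean(axis=0),
    ])                                      # shape (4 * classes,)

rng = np.random.default_rng(0)
logits = rng.standard_normal((120, 4))      # mock DNN outputs: 120 frames, 4 emotions
feat = utterance_stats(logits)              # 16-dim utterance representation
```

A fixed-length vector like this can then feed any utterance-level classifier, such as the extreme learning machine used by Han et al.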
How PapersFlow Helps You Research Speech Emotion Recognition
Discover & Search
Research Agent uses searchPapers and citationGraph on 'speech emotion recognition' to map the 790-citation Han et al. (2014) DNN work to successors such as the Zhao et al. (2018) CNN-LSTM. exaSearch uncovers IEMOCAP dataset papers; findSimilarPapers expands from the Mirsamadi et al. (2017) local attention models.
Analyze & Verify
Analysis Agent runs readPaperContent on Zhao et al. (2018) to extract CNN-LSTM architectures, then verifyResponse with CoVe checks accuracy against IEMOCAP benchmarks. runPythonAnalysis replays feature extraction in NumPy sandbox; GRADE scores evidence strength for prosody claims (e.g., 85% accuracy in Han et al., 2014).
Synthesize & Write
Synthesis Agent detects gaps such as noise robustness after Mirsamadi et al. (2017) and flags contradictions in feature efficacy (Han et al., 2014 vs. Khalil et al., 2019). Writing Agent applies latexEditText for SER equations, latexSyncCitations for 10+ papers, and latexCompile for an arXiv-ready review; exportMermaid renders end-to-end CNN-LSTM pipeline diagrams.
Use Cases
"Reimplement Zhao et al. 2018 CNN-LSTM on IEMOCAP with Python code"
Research Agent → searchPapers('Zhao 2018 CNN LSTM') → paperExtractUrls → Code Discovery (paperFindGithubRepo → githubRepoInspect) → Analysis Agent → runPythonAnalysis (NumPy replays model, outputs accuracy plot).
"Write LaTeX review of SER deep learning advances 2014-2020"
Research Agent → citationGraph('Han 2014') → Synthesis → gap detection → Writing Agent → latexEditText (intro/methods) → latexSyncCitations (10 papers) → latexCompile → PDF with emotion classification diagram.
"Find GitHub repos benchmarking SER on EmoDB dataset"
Research Agent → exaSearch('EmoDB speech emotion benchmarks') → findSimilarPapers(Khalil 2019) → Code Discovery → paperFindGithubRepo → githubRepoInspect (extracts eval scripts, metrics for 4 emotions).
Automated Workflows
Deep Research workflow scans 50+ SER papers via searchPapers → citationGraph, outputs structured report ranking Zhao et al. (2018) by impact. DeepScan applies 7-step analysis: readPaperContent on Mirsamadi et al. (2017) → runPythonAnalysis on attention → GRADE → CoVe verification. Theorizer generates hypotheses like 'local attention outperforms global for noisy SER' from Han et al. (2014) and successors.
Frequently Asked Questions
What is Speech Emotion Recognition?
SER classifies emotions from speech acoustics using features like prosody and MFCCs on datasets such as IEMOCAP (Zhao et al., 2018).
What are key methods in SER?
Deep methods include CNN-LSTM (Zhao et al., 2018), DNN-ELM (Han et al., 2014), and local attention RNNs (Mirsamadi et al., 2017).
What are foundational SER papers?
Han et al. (2014, 790 citations) introduced DNN features; Kim & André (2008, 1051 citations) compared speech to physiological signals.
What are open problems in SER?
Challenges include cross-corpus generalization, noise robustness, and few-shot learning for rare emotions (Khalil et al., 2019; Shu et al., 2018).
Research Emotion and Mood Recognition with AI
PapersFlow provides specialized AI tools for Psychology researchers. Here are the most relevant for this topic:
Systematic Review
AI-powered evidence synthesis with documented search strategies
AI Literature Review
Automate paper discovery and synthesis across 474M+ papers
Find Disagreement
Discover conflicting findings and counter-evidence
Deep Research Reports
Multi-source evidence synthesis with counter-evidence
See how researchers in Social Sciences use PapersFlow
Field-specific workflows, example queries, and use cases.
Start Researching Speech Emotion Recognition with AI
Search 474M+ papers, run AI-powered literature reviews, and write with integrated citations — all in one workspace.
See how PapersFlow works for Psychology researchers
Part of the Emotion and Mood Recognition Research Guide