Deep learning and NLP analysis of 79,000+ risk disclosures from SEC 10-K filings, revealing patterns in corporate risk communication.
Combining traditional ML with modern NLP to understand how companies communicate risk
SEC 10-K filings contain critical risk information, but manually analyzing thousands of disclosures is impractical. This project automates the classification and analysis of risk factors, enabling scalable insights into corporate risk communication patterns.
On average, risk disclosures from different companies share 51% semantic similarity but only 21% lexical overlap. This 2.4x gap shows that companies use different words to express similar risk concepts, a form of "paraphrased boilerplate" that is invisible to traditional word-matching text analysis.
15 years of SEC 10-K filings from publicly traded companies
Comparing traditional ML models with transformer-based approaches
TF-IDF ensemble outperforms fine-tuned DistilBERT (71.2% vs 57.8%).
This counterintuitive result arises because the filings are extremely long (48K characters on average) and far exceed DistilBERT's 512-token input limit, so the transformer sees only a truncated portion of each document. The TF-IDF ensemble captures more context because its bag-of-words representation covers the full text.
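A back-of-the-envelope calculation makes the truncation effect concrete. The sketch below is not part of the project's pipeline. It assumes roughly 4 characters per token, a common rule of thumb for English subword tokenizers and not a figure from this analysis. The function name `visible_fraction` is hypothetical.

```python
# Rough estimate of how much of a filing a 512-token model can read.
AVG_DOC_CHARS = 48_000   # from the text: average filing length in characters
MAX_TOKENS = 512         # from the text: transformer input limit
CHARS_PER_TOKEN = 4.0    # ASSUMPTION: rule-of-thumb average, not measured here


def visible_fraction(doc_chars, max_tokens=MAX_TOKENS,
                     chars_per_token=CHARS_PER_TOKEN):
    """Estimated fraction of a document that survives truncation."""
    if doc_chars <= 0:
        return 1.0  # an empty document is trivially fully visible
    estimated_tokens = doc_chars / chars_per_token
    return min(1.0, max_tokens / estimated_tokens)


fraction = visible_fraction(AVG_DOC_CHARS)
print(f"Estimated tokens per filing: {AVG_DOC_CHARS / CHARS_PER_TOKEN:,.0f}")
print(f"Share visible to a 512-token model: {fraction:.1%}")
```

Under that assumption the transformer reads only a few percent of an average filing, while TF-IDF features are computed over every word.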
SHAP analysis reveals which words matter most for each risk category
In regulated industries like finance, model decisions must be interpretable. SHAP values provide legally defensible explanations for each classification, showing exactly which words influenced the model's decision.
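To illustrate what a per-word explanation looks like, here is a minimal sketch for the simplest case: a linear classifier over word features. Under a feature-independence assumption, the SHAP value of a feature is its weight times the difference between its value in the document and its average value in the background data. The weights, feature values, and the `linear_shap` helper are hypothetical. The project's actual model and SHAP explainer may differ.

```python
def linear_shap(weights, x, background_mean):
    """Per-feature contributions for a linear model f(x) = b + sum(w_i * x_i).

    Assuming independent features, the SHAP value of feature i is
    w_i * (x_i - E[x_i]); the values sum to f(x) - f(E[x]).
    """
    return {
        term: w * (x.get(term, 0.0) - background_mean.get(term, 0.0))
        for term, w in weights.items()
    }


# HYPOTHETICAL numbers for one risk category (not from the project).
weights = {"litigation": 2.0, "cybersecurity": -1.0, "regulatory": 0.5}
background_mean = {"litigation": 0.1, "cybersecurity": 0.2, "regulatory": 0.3}
x = {"litigation": 0.6, "regulatory": 0.3}  # TF-IDF values for one disclosure

contributions = linear_shap(weights, x, background_mean)

# Rank words by the size of their influence on this prediction.
ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
for term, value in ranked:
    print(f"{term:>14}: {value:+.3f}")
```

A positive value pushes the prediction toward the category and a negative value pushes it away, which is what makes the ranking readable as an explanation.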
Sentence embeddings reveal hidden patterns in risk communication
Lexical similarity (21%): average word overlap between companies, measuring exact vocabulary matches.
Semantic similarity (51%): average overlap in meaning, capturing paraphrased content that uses different words.
Semantic similarity is about 2.4 times the lexical similarity (51% vs 21%). This means companies express similar risk concepts using different vocabulary, a sophisticated form of boilerplate that traditional word-overlap analysis misses entirely.
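The two measures can be sketched as follows. This is an illustration only: it assumes lexical similarity is a Jaccard overlap of word sets and semantic similarity is the cosine between sentence embeddings. The example sentences and the three-dimensional vectors are hypothetical stand-ins. In practice the vectors would come from a sentence embedding model.

```python
import math
import re


def jaccard(text_a, text_b):
    """Lexical similarity: shared words divided by all distinct words."""
    tokens_a = set(re.findall(r"[a-z]+", text_a.lower()))
    tokens_b = set(re.findall(r"[a-z]+", text_b.lower()))
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def cosine(u, v):
    """Semantic similarity: cosine of the angle between two embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Two hypothetical disclosures that say nearly the same thing in other words.
a = "Cyber attacks could disrupt our operations"
b = "Security breaches may interrupt our business"

# HYPOTHETICAL embeddings standing in for a sentence embedding model's output.
emb_a = [0.8, 0.1, 0.6]
emb_b = [0.7, 0.2, 0.7]

lexical = jaccard(a, b)
semantic = cosine(emb_a, emb_b)
print(f"lexical:  {lexical:.2f}")
print(f"semantic: {semantic:.2f}")
```

The sentences share only one word, so the lexical score is low, while nearby embedding vectors give a high semantic score. Averaged over company pairs, that is the kind of gap described above.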
21 distinct risk themes automatically identified from the corpus
BERTopic discovered industry-specific risk patterns (Oil & Gas, Real Estate, Biotech) that the manual taxonomy doesn't capture. Only 14% of discovered topics align strongly with predefined categories, revealing new dimensions of risk disclosure.
A comprehensive NLP pipeline combining multiple techniques