NLP Portfolio Project

SEC Risk Factor Intelligence

Deep learning and NLP analysis of 79,000+ risk disclosures from SEC 10-K filings, revealing patterns in corporate risk communication.

79,415
Risk Paragraphs Analyzed
13,970
Unique Companies
15
Years of SEC Filings
71.2%
Classification F1 Score
Executive Summary

What This Analysis Reveals

Combining traditional ML with modern NLP to understand how companies communicate risk

01
The Challenge
Understanding SEC Risk Disclosures

SEC 10-K filings contain critical risk information, but manually analyzing thousands of disclosures is impractical. This project automates the classification and analysis of risk factors, enabling scalable insights into corporate risk communication patterns.

02
Key Discovery
Semantic vs. Lexical Gap

Companies share 51% semantic similarity but only 21% lexical overlap. This 2.4x gap reveals that companies use different words to express similar risk concepts—a form of "paraphrased boilerplate" invisible to traditional text analysis.

Data

Dataset Overview

15 years of SEC 10-K filings from publicly traded companies

Risk Categories Distribution
Filings Over Time
Classification

Multi-Class Risk Classification

Comparing traditional ML models with transformer-based approaches

Model Performance Comparison (F1 Score)
Key Finding

TF-IDF ensemble outperforms fine-tuned DistilBERT (71.2% vs 57.8%).
This counterintuitive result stems from document length: SEC risk sections average roughly 48K characters, far beyond DistilBERT's 512-token input limit, so the transformer sees only a small slice of each filing. The TF-IDF bag-of-words representation, by contrast, covers the full document.
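A minimal sklearn-only sketch of the TF-IDF + ensemble approach. The documents and labels below are hypothetical toy examples, and the XGBoost and LightGBM members are omitted to keep the sketch dependency-light:

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the labeled risk paragraphs (hypothetical examples).
docs = [
    "A cyberattack could compromise customer data and disrupt operations.",
    "Security breaches of our information systems may expose us to liability.",
    "Changes in tax regulation could materially affect our results.",
    "New government regulations may increase our compliance costs.",
]
labels = ["cyber", "cyber", "regulatory", "regulatory"]

# TF-IDF features feed a hard-voting ensemble; the full project also adds
# XGBoost and LightGBM, omitted here to stay dependency-light.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    VotingClassifier(
        estimators=[
            ("logreg", LogisticRegression(max_iter=1000)),
            ("svm", LinearSVC()),
            ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ],
        voting="hard",
    ),
)
model.fit(docs, labels)
print(model.predict(["A data breach could harm our reputation."]))
```

Because TF-IDF has no input-length ceiling, every token of a 48K-character filing contributes to the feature vector — the structural advantage over a truncating transformer.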

Explainability

What Drives Each Classification?

SHAP analysis reveals which words matter most for each risk category

Top Features by Risk Category (SHAP Values)
Why Explainability Matters

In regulated industries like finance, model decisions must be interpretable. SHAP values provide legally defensible explanations for each classification, showing exactly which words influenced the model's decision.

Semantic Analysis

Beyond Word Matching

Sentence embeddings reveal hidden patterns in risk communication

Lexical (TF-IDF) vs Semantic (SBERT) Similarity

TF-IDF (Lexical)

21.3%

Average word overlap between companies. Measures exact vocabulary matches.

SBERT (Semantic)

51.0%

Average meaning overlap. Captures paraphrased content with different words.

The 2.4x Gap Explained

The semantic similarity is 2.4x higher than lexical similarity. This means companies express similar risk concepts using different vocabulary—a sophisticated form of boilerplate that traditional analysis misses entirely.

Unsupervised Discovery

Topics Discovered by BERTopic

21 distinct risk themes automatically identified from the corpus

Discovered Topic Distribution
Unsupervised Insights

BERTopic discovered industry-specific risk patterns (Oil & Gas, Real Estate, Biotech) that the manual taxonomy doesn't capture. Only 14% of discovered topics align strongly with predefined categories, revealing new dimensions of risk disclosure.

Methodology

Technical Approach

A comprehensive NLP pipeline combining multiple techniques

1
Text Classification
TF-IDF vectorization with ensemble of Logistic Regression, SVM, Random Forest, XGBoost, and LightGBM
2
Transformers
Fine-tuned DistilBERT for comparison, demonstrating when traditional ML outperforms deep learning
3
Sentence Embeddings
SBERT (all-MiniLM-L6-v2) for semantic similarity analysis and boilerplate detection
4
Topic Modeling
BERTopic for unsupervised discovery of latent themes using UMAP + HDBSCAN clustering
5
Explainability
SHAP (SHapley Additive exPlanations) for interpretable feature importance analysis
6
Visualization
Interactive Plotly dashboards and comprehensive matplotlib visualizations
Skills

NLP Competencies Demonstrated

Multi-class text classification at scale
Transformer fine-tuning (DistilBERT)
Sentence embeddings (SBERT)
Unsupervised topic modeling (BERTopic)
Model explainability (SHAP)
Ensemble methods and hyperparameter tuning
Memory-efficient data processing (PyArrow)
Interactive data visualization (Plotly)