NLP Portfolio Project

SEC Risk Factor Intelligence

Deep learning and NLP analysis of 79,000+ risk disclosures from SEC 10-K filings, revealing patterns in corporate risk communication.

79,415
Risk Paragraphs Analyzed
13,970
Unique Companies
15
Years of SEC Filings
71.2%
Classification F1 Score
Executive Summary

What This Analysis Reveals

Combining traditional ML with modern NLP to understand how companies communicate risk

01
The Challenge
Understanding SEC Risk Disclosures

SEC 10-K filings contain critical risk information, but manually analyzing thousands of disclosures is impractical. This project automates the classification and analysis of risk factors, enabling scalable insights into corporate risk communication patterns.

02
Key Discovery
Semantic vs. Lexical Gap

Companies share 51% semantic similarity but only 21% lexical overlap. This 2.4x gap reveals that companies use different words to express similar risk concepts—a form of "paraphrased boilerplate" invisible to traditional text analysis.

Data

Dataset Overview

15 years of SEC 10-K filings from publicly traded companies

Risk Categories Distribution
Filings Over Time
Classification

Multi-Class Risk Classification

Comparing traditional ML models with transformer-based approaches

Model Performance Comparison (F1 Score)
Key Finding

TF-IDF ensemble outperforms fine-tuned DistilBERT (71.2% vs 57.8%).
This counterintuitive result stems from document length: SEC risk sections average roughly 48K characters, far beyond DistilBERT's 512-token input limit, so the transformer sees only a small slice of each filing. The TF-IDF bag-of-words representation, by contrast, covers the full document.
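A minimal sklearn-only sketch of the TF-IDF + ensemble approach. The documents and labels below are hypothetical toy examples, and the XGBoost and LightGBM members are omitted to keep the sketch dependency-light:

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the labeled risk paragraphs (hypothetical examples).
docs = [
    "A cyberattack could compromise customer data and disrupt operations.",
    "Security breaches of our information systems may expose us to liability.",
    "Changes in tax regulation could materially affect our results.",
    "New government regulations may increase our compliance costs.",
]
labels = ["cyber", "cyber", "regulatory", "regulatory"]

# TF-IDF features feed a hard-voting ensemble; the full project also adds
# XGBoost and LightGBM, omitted here to stay dependency-light.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    VotingClassifier(
        estimators=[
            ("logreg", LogisticRegression(max_iter=1000)),
            ("svm", LinearSVC()),
            ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ],
        voting="hard",
    ),
)
model.fit(docs, labels)
print(model.predict(["A data breach could harm our reputation."]))
```

Because TF-IDF has no input-length ceiling, every token of a 48K-character filing contributes to the feature vector — the structural advantage over a truncating transformer.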

Explainability

What Drives Each Classification?

SHAP analysis reveals which words matter most for each risk category

Top Features by Risk Category (SHAP Values)
Why Explainability Matters

In regulated industries like finance, model decisions must be interpretable. SHAP values provide legally defensible explanations for each classification, showing exactly which words influenced the model's decision.

Semantic Analysis

Beyond Word Matching

Sentence embeddings reveal hidden patterns in risk communication

Lexical (TF-IDF) vs Semantic (SBERT) Similarity

TF-IDF (Lexical)

21.3%

Average word overlap between companies. Measures exact vocabulary matches.

SBERT (Semantic)

51.0%

Average meaning overlap. Captures paraphrased content with different words.

The 2.4x Gap Explained

The semantic similarity is 2.4x higher than lexical similarity. This means companies express similar risk concepts using different vocabulary—a sophisticated form of boilerplate that traditional analysis misses entirely.

Unsupervised Discovery

Topics Discovered by BERTopic

21 distinct risk themes automatically identified from the corpus

Discovered Topic Distribution
Unsupervised Insights

BERTopic discovered industry-specific risk patterns (Oil & Gas, Real Estate, Biotech) that the manual taxonomy doesn't capture. Only 14% of discovered topics align strongly with predefined categories, revealing new dimensions of risk disclosure.

Methodology

Technical Approach

A comprehensive NLP pipeline combining multiple techniques

1
Text Classification
TF-IDF vectorization with ensemble of Logistic Regression, SVM, Random Forest, XGBoost, and LightGBM
2
Transformers
Fine-tuned DistilBERT for comparison, demonstrating when traditional ML outperforms deep learning
3
Sentence Embeddings
SBERT (all-MiniLM-L6-v2) for semantic similarity analysis and boilerplate detection
4
Topic Modeling
BERTopic for unsupervised discovery of latent themes using UMAP + HDBSCAN clustering
5
Explainability
SHAP (SHapley Additive exPlanations) for interpretable feature importance analysis
6
Visualization
Interactive Plotly dashboards and comprehensive matplotlib visualizations
Skills

NLP Competencies Demonstrated

Multi-class text classification at scale
Transformer fine-tuning (DistilBERT)
Sentence embeddings (SBERT)
Unsupervised topic modeling (BERTopic)
Model explainability (SHAP)
Ensemble methods and hyperparameter tuning
Memory-efficient data processing (PyArrow)
Interactive data visualization (Plotly)