Narrative Miners: Uncover the Stories That Drive Markets¶
Introduction¶
This notebook demonstrates how advanced narrative mining reveals evolving market stories across multiple document types. We will track the “AI Bubble Concerns” narrative as it emerges and evolves across news, earnings calls, and regulatory filings – highlighting the difference between public discourse and corporate communications.
The bigdata-research-tools package provides three specialized classes for narrative mining:
NewsNarrativeMiner: Analyzes web-based news content
TranscriptsNarrativeMiner: Examines earnings call and event transcripts
FilingsNarrativeMiner: Explores SEC Filings from EDGAR
Each Narrative Miner follows the same workflow:
Define narrative labels which encompass a theme
Retrieve content using BigData’s search capabilities
Label content with LLMs to identify narrative matches
Analyze the results to reveal patterns and insights
Setup and Imports¶
Below is the Python code required for setting up our environment and importing necessary libraries.
from IPython.display import display, HTML, IFrame
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from datetime import datetime
import os
from scipy.ndimage import gaussian_filter1d
import plotly
import plotly.graph_objects as go
import warnings
import plotly.io as pio
pio.renderers.default = 'notebook'
plotly.offline.init_notebook_mode()
from bigdata_client import Bigdata
from bigdata_client.daterange import RollingDateRange
from bigdata_research_tools.miners import (
    NewsNarrativeMiner,
    FilingsNarrativeMiner,
    TranscriptsNarrativeMiner
)
from bigdata_client.models.sources import Source
Define Output Paths¶
We define the output paths for our narrative mining results.
# Define output file paths for our results
output_dir = "output"
os.makedirs(output_dir, exist_ok=True)
news_results_path = f"{output_dir}/ai_bubble_news.xlsx"
transcripts_results_path = f"{output_dir}/ai_bubble_transcripts.xlsx"
filings_results_path = f"{output_dir}/ai_bubble_filings.xlsx"
visualization_path = f"{output_dir}/ai_bubble_narratives.html"
Load Environment Variables¶
Make sure you have added your BigData API credentials to a .env file. Then load them as follows:
# Load environment variables for BigData credentials
from dotenv import load_dotenv
load_dotenv('.env')
BIGDATA_USERNAME = os.getenv("BIGDATA_USERNAME")
BIGDATA_PASSWORD = os.getenv("BIGDATA_PASSWORD")
bigdata = Bigdata(BIGDATA_USERNAME, BIGDATA_PASSWORD)
Define Narrative Labels¶
We define specific narratives related to the AI bubble concerns:
ai_bubble_narratives = [
    "Tech valuations have detached from fundamental earnings potential",
    "AI investments show classic signs of irrational exuberance",
    "Market is positioning AI as revolutionary without proven ROI",
    "Current AI investments may not generate predicted financial returns",
    "Tech CEOs acknowledge AI implementation challenges amid high expectations",
    "Analysts are questioning the timeline for AI-driven profits",
    "Companies are spending billions on unproven AI technology",
    "AI infrastructure costs are rising but revenue gains remain uncertain",
    "Venture capital is flooding AI startups at unsustainable valuations",
    "Regulatory concerns could derail AI market growth projections",
    "Public discourse about AI capabilities exceeds technical realities",
    "AI talent acquisition costs have created an unsustainable bubble",
    "Corporate executives privately express concerns about AI ROI timelines",
    "AI market projections rely on aggressive and unproven assumptions",
    "Industry veterans drawing parallels to previous tech bubbles"
]
Configure the Narrative Miners¶
Create narrative miners for each document type. In this example, we select CNBC as the news source.
# Configure the Narrative Miners
# Choose CNBC as a news source
tech_news_sources = bigdata.knowledge_graph.find_sources("CNBC")
tech_news_ids = [source.id for source in tech_news_sources if "CNBC" in source.name]
# Common parameters across narrative miners
common_params = {
    "theme_labels": ai_bubble_narratives,
    "llm_model": "openai::gpt-4o-mini",
    "start_date": "2023-08-01",
    "end_date": "2025-03-28",
    "rerank_threshold": 0.7  # Use reranking for better relevance
}
# Create narrative miners for each document type
news_miner = NewsNarrativeMiner(sources=tech_news_ids, **common_params)
transcripts_miner = TranscriptsNarrativeMiner(sources=None, fiscal_year=2024, **common_params)
filings_miner = FilingsNarrativeMiner(sources=None, fiscal_year=2024, **common_params)
Run Narrative Mining Across Sources¶
Execute the narrative mining processes for news, earnings call transcripts, and SEC filings:
# Run Narrative Mining Across Sources
print("Mining news narratives...")
news_results = news_miner.mine_narratives(
    document_limit=100,
    freq='W',  # Weekly frequency
    export_to_path=news_results_path
)
print("Mining earnings call transcripts...")
transcripts_results = transcripts_miner.mine_narratives(
    document_limit=100,
    freq='M',  # Monthly frequency (earnings are quarterly)
    export_to_path=transcripts_results_path
)
print("Mining SEC filings...")
filings_results = filings_miner.mine_narratives(
    document_limit=100,
    freq='M',  # Monthly frequency (filings are quarterly)
    export_to_path=filings_results_path
)
Load and Process Results¶
Load the exported Excel files, clean the data, and display a summary.
# Load and Process Results
def load_results(file_path, source_type):
    """
    Load and clean narrative mining results.

    Parameters:
        file_path (str): Path to the Excel file containing mining results.
        source_type (str): Type of data source (News, Earnings Call, SEC Filing).

    Returns:
        pd.DataFrame: Cleaned dataframe with a source type label.
    """
    df = pd.read_excel(file_path, header=1).reset_index(drop=True)
    df = df.drop(columns=[col for col in df.columns if 'Unnamed' in str(col)])
    df['Date'] = pd.to_datetime(df['Date'])
    df['Source_Type'] = source_type  # Add source type column
    print(f"Loaded {len(df)} narrative records from {source_type}")
    return df
# Load results from all three document types with labeling
news_df = load_results(news_results_path, "News Media")
transcripts_df = load_results(transcripts_results_path, "Earnings Calls")
filings_df = load_results(filings_results_path, "SEC Filings")
# Create a summary of the dataset sizes
source_summary = pd.DataFrame({
    'Source Type': ['News Media', 'Earnings Calls', 'SEC Filings'],
    'Record Count': [len(news_df), len(transcripts_df), len(filings_df)],
    'Date Range': [
        f"{news_df['Date'].min().strftime('%Y-%m-%d')} to {news_df['Date'].max().strftime('%Y-%m-%d')}",
        f"{transcripts_df['Date'].min().strftime('%Y-%m-%d')} to {transcripts_df['Date'].max().strftime('%Y-%m-%d')}",
        f"{filings_df['Date'].min().strftime('%Y-%m-%d')} to {filings_df['Date'].max().strftime('%Y-%m-%d')}"
    ],
    'Unique Narratives': [
        news_df['Label'].nunique(),
        transcripts_df['Label'].nunique(),
        filings_df['Label'].nunique()
    ]
})
# Display the summary table
display(source_summary)
# Display samples from each source
print("\n======= SAMPLE NEWS NARRATIVES =======")
display(news_df[['Date', 'Headline', 'Label', 'Quote']].head(3))
print("\n======= SAMPLE EARNINGS CALL NARRATIVES =======")
display(transcripts_df[['Date', 'Headline', 'Label', 'Quote']].head(3))
print("\n======= SAMPLE SEC FILING NARRATIVES =======")
display(filings_df[['Date', 'Headline', 'Label', 'Quote']].head(3))
Loaded 189 narrative records from News Media
Loaded 12 narrative records from Earnings Calls
Loaded 190 narrative records from SEC Filings
| | Source Type | Record Count | Date Range | Unique Narratives |
|---|---|---|---|---|
| 0 | News Media | 189 | 2023-08-11 to 2025-03-27 | 14 |
| 1 | Earnings Calls | 12 | 2023-12-05 to 2025-02-12 | 6 |
| 2 | SEC Filings | 190 | 2023-08-02 to 2025-03-28 | 8 |
======= SAMPLE NEWS NARRATIVES =======
| | Date | Headline | Label | Quote |
|---|---|---|---|---|
| 0 | 2023-08-11 | Investors are 'overconfident' about the impact... | Market is positioning AI as revolutionary with... | Market participants are "overconfident" about ... |
| 1 | 2023-08-24 | JPMorgan says AI's 'democratization' could put... | Current AI investments may not generate predic... | However, this does not guarantee sustained ear... |
| 2 | 2023-08-24 | Nvidia's blowout earnings report shows chipmak... | Current AI investments may not generate predic... | He said that price suggests a multiple of 13 t... |
======= SAMPLE EARNINGS CALL NARRATIVES =======
| | Date | Headline | Label | Quote |
|---|---|---|---|---|
| 0 | 2023-12-05 | Yext, Inc.: Q3 2024 Earnings Call | Analysts are questioning the timeline for AI-d... | I think we're four to six to eight quarters aw... |
| 1 | 2024-05-06 | Varonis Systems, Inc.: Q1 2024 Earnings Call | Tech CEOs acknowledge AI implementation challe... | Thanks for taking my question. So, Yaki, you t... |
| 2 | 2024-05-09 | Kakao Corp.: Q1 2024 Earnings Call | Tech CEOs acknowledge AI implementation challe... | Even global tech giants, despite significant c... |
======= SAMPLE SEC FILING NARRATIVES =======
| | Date | Headline | Label | Quote |
|---|---|---|---|---|
| 0 | 2023-08-02 | KALTURA INC files FORM 10-Q for Q2, FY 2023 on... | Regulatory concerns could derail AI market gro... | These efforts, including the introduction of n... |
| 1 | 2023-08-02 | ADVANCED MICRO DEVICES INC files FORM 10-Q for... | Current AI investments may not generate predic... | Moreover, our investments in new products and ... |
| 2 | 2023-08-03 | Missfresh Ltd files FORM 20-F for FY 2022 on A... | Current AI investments may not generate predic... | We invested significant sums in expanding and ... |
Narrative Analysis Functions¶
Define functions to prepare the narrative time series data and calculate overall source scores.
# Narrative Analysis Functions
def prepare_narrative_data(df, freq='W'):
    """
    Prepare narrative data for visualization by creating time series of narrative
    counts, converting them to z-scores, and applying smoothing.
    """
    pivot_df = pd.pivot_table(df, index='Date', columns='Label', aggfunc='size', fill_value=0)
    resampled_df = pivot_df.resample(freq).sum()
    # Calculate z-scores for each narrative. The index is set up front so the
    # scalar fallback below broadcasts across every row (assigning a scalar to
    # an index-less DataFrame would create an empty column).
    zscore_df = pd.DataFrame(index=resampled_df.index)
    for column in resampled_df.columns:
        mean = resampled_df[column].mean()
        std = resampled_df[column].std()
        if std == 0:
            zscore_df[column] = 0.0
        else:
            zscore_df[column] = (resampled_df[column] - mean) / std
    # Apply smoothing using a Gaussian filter
    smoothed_df = pd.DataFrame(index=zscore_df.index)
    for column in zscore_df.columns:
        smoothed_df[column] = gaussian_filter1d(zscore_df[column].fillna(0).values, sigma=2)
    return smoothed_df
def calculate_source_scores(df):
    """
    Calculate overall narrative scores (z-scores) across all narratives by source.
    """
    date_counts = df.groupby('Date').size()
    weekly_counts = date_counts.resample('W').sum().fillna(0)
    mean = weekly_counts.mean()
    std = weekly_counts.std()
    if std == 0:
        zscore = weekly_counts * 0
    else:
        zscore = (weekly_counts - mean) / std
    smoothed = gaussian_filter1d(zscore.fillna(0).values, sigma=2)
    return pd.Series(smoothed, index=zscore.index)
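As a quick sanity check on the z-score and smoothing steps above, the self-contained snippet below applies the same transformation (weekly resample, z-score, Gaussian filter with `sigma=2`) to synthetic daily counts with an artificial burst, and shows that the burst surfaces as the maximum of the smoothed series:

```python
import numpy as np
import pandas as pd
from scipy.ndimage import gaussian_filter1d

# Synthetic daily mention counts: baseline of 1/day with a 10x burst
dates = pd.date_range("2024-01-01", periods=90, freq="D")
counts = pd.Series(1.0, index=dates)
counts.iloc[40:50] = 10.0  # simulated narrative spike in mid-February

weekly = counts.resample("W").sum()
zscore = (weekly - weekly.mean()) / weekly.std()
smoothed = gaussian_filter1d(zscore.values, sigma=2)

# The burst weeks surface as the maximum of the smoothed series
print(weekly.index[int(np.argmax(smoothed))].date())
```

The z-scores make narratives with very different base rates comparable on one axis, and the Gaussian filter trades temporal precision for readability of the trend lines.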
Creating Narrative Visualizations¶
Define functions to create a comparative visualization across sources and a narrative breakdown chart for news media.
def visualize_cross_source_narratives():
    """Create a comparative visualization of narrative prevalence across sources."""
    # Prepare data for each source
    news_score = calculate_source_scores(news_df)
    transcript_score = calculate_source_scores(transcripts_df)
    filing_score = calculate_source_scores(filings_df)
    # Align indices across all sources
    all_dates = sorted(set(news_score.index) |
                       set(transcript_score.index) |
                       set(filing_score.index))
    # Create dataframe with aligned dates
    comparison_df = pd.DataFrame(index=all_dates)
    comparison_df['News Media'] = news_score
    comparison_df['Earnings Calls'] = transcript_score
    comparison_df['SEC Filings'] = filing_score
    comparison_df = comparison_df.sort_index().ffill().fillna(0)
    # Create visualization
    fig = go.Figure()
    # Add traces for each source
    source_colors = {
        'News Media': '#FF6B6B',
        'Earnings Calls': '#4ECDC4',
        'SEC Filings': '#6A0572'
    }
    for source, color in source_colors.items():
        fig.add_trace(
            go.Scatter(
                x=comparison_df.index,
                y=comparison_df[source],
                mode='lines',
                name=source,
                line=dict(width=3, color=color),
                hovertemplate=(
                    f"<b>{source}</b><br>" +
                    "Date: %{x|%B %d, %Y}<br>" +
                    "Intensity: %{y:.2f}<extra></extra>"
                )
            )
        )
    # Find the peak of each series for annotation placement
    peak_news = comparison_df['News Media'].idxmax()
    peak_earnings = comparison_df['Earnings Calls'].idxmax()
    peak_filings = comparison_df['SEC Filings'].idxmax()
    # Create annotations with fixed arrows
    annotations = [
        # Peak news annotation
        dict(
            x=peak_news,
            y=comparison_df.loc[peak_news, 'News Media'],
            text="Peak news coverage<br>of AI bubble concerns",
            showarrow=True,
            arrowhead=2,
            ax=-140,
            ay=0,
            bgcolor="rgba(255, 255, 255, 0.8)",
            bordercolor="#FF6B6B",
            borderwidth=2,
            font=dict(size=10),
            xanchor="left"
        ),
        # Earnings calls annotation
        dict(
            x=peak_earnings,
            y=comparison_df.loc[peak_earnings, 'Earnings Calls'],
            text="Executives address<br>bubble concerns on<br>earnings calls",
            showarrow=True,
            arrowhead=2,
            ax=-60,
            ay=-40,
            bgcolor="rgba(255, 255, 255, 0.8)",
            bordercolor="#4ECDC4",
            borderwidth=2,
            font=dict(size=10),
            xanchor="right"
        )
    ]
    # Customize the layout
    fig.update_layout(
        title={
            'text': 'AI Bubble Narrative: Media vs. Corporate Communications',
            'y': 0.95,
            'x': 0.5,
            'xanchor': 'center',
            'yanchor': 'top',
            'font': dict(size=18, color='#1f1f1f')
        },
        xaxis=dict(
            title='',
            gridcolor='rgba(100, 100, 100, 0.2)',
            tickangle=-45,
            tickformat='%b %Y',
            tickfont=dict(color='#1f1f1f', size=10),
            showgrid=True
        ),
        yaxis=dict(
            title=dict(text='Narrative Intensity (z-score)', font=dict(color='#1f1f1f')),
            tickfont=dict(color='#1f1f1f'),
            gridcolor='rgba(100, 100, 100, 0.2)',
            zerolinecolor='rgba(0, 0, 0, 0.4)',
            range=[-0.8, 3.5],
            automargin=True
        ),
        hovermode='closest',
        legend=dict(
            orientation='h',
            yanchor='top',
            y=-0.2,
            xanchor='center',
            x=0.5,
            font=dict(size=12, color='#1f1f1f'),
            bgcolor='rgba(255,255,255,0.8)'
        ),
        annotations=annotations,
        margin=dict(l=50, r=50, t=70, b=120),
        template='plotly',
        plot_bgcolor='rgba(255,255,255,1)',
        paper_bgcolor='rgba(255,255,255,1)',
        height=600,
        showlegend=True
    )
    # Add horizontal reference line at z-score zero
    fig.add_shape(
        type='line',
        x0=comparison_df.index.min(),
        y0=0,
        x1=comparison_df.index.max(),
        y1=0,
        line=dict(
            color='#666666',
            width=1,
            dash='dash'
        )
    )
    # Add time period selectors
    fig.update_xaxes(
        rangeselector=dict(
            buttons=list([
                dict(count=1, label="1m", step="month", stepmode="backward"),
                dict(count=3, label="3m", step="month", stepmode="backward"),
                dict(count=6, label="6m", step="month", stepmode="backward"),
                dict(step="all")
            ]),
            bgcolor='rgba(150,150,150,0.2)'
        )
    )
    # SEC filings annotation, added separately so it anchors below the peak
    fig.add_annotation(
        x=peak_filings,  # X position at peak date
        y=comparison_df.loc[peak_filings, 'SEC Filings'],
        text="Peak mentions in<br>SEC filings",
        showarrow=True,
        arrowhead=2,
        ax=0,
        ay=60,
        bgcolor="rgba(255, 255, 255, 0.8)",
        bordercolor="#6A0572",
        borderwidth=2,
        font=dict(size=10),
        xanchor="center",
        yanchor="top"  # Anchor at top of text box
    )
    return fig
Create and Save Visualizations¶
Generate and view the visualizations:
# Create and Save Visualizations
warnings.filterwarnings("ignore", message=".*'method'.*", category=FutureWarning)
# Create the comparative source visualization
fig = visualize_cross_source_narratives()
fig.show()
# Create the narrative breakdown visualization
fig = visualize_news_narrative_breakdown()
fig.show()
Extract and Print Key Insights¶
Extract key insights from the narrative mining data and display them.
# Key Insights from Narrative Mining
def extract_narrative_insights():
    """
    Extract key insights from our narrative mining data.
    """
    news_score = calculate_source_scores(news_df)
    transcript_score = calculate_source_scores(transcripts_df)
    filing_score = calculate_source_scores(filings_df)
    peak_news_month = news_score.idxmax().strftime('%B %Y')
    peak_transcript_month = transcript_score.idxmax().strftime('%B %Y')
    peak_filing_month = filing_score.idxmax().strftime('%B %Y')
    news_narrative_counts = news_df['Label'].value_counts()
    transcript_narrative_counts = transcripts_df['Label'].value_counts()
    filing_narrative_counts = filings_df['Label'].value_counts()
    top_news_narrative = news_narrative_counts.index[0]
    top_transcript_narrative = transcript_narrative_counts.index[0]
    top_filing_narrative = filing_narrative_counts.index[0]
    total_news_mentions = len(news_df)
    total_transcript_mentions = len(transcripts_df)
    total_filing_mentions = len(filings_df)
    # Estimate the lag between each news peak and the nearest SEC filing peak
    news_peaks = news_score.nlargest(3)
    filing_peaks = filing_score.nlargest(3)
    avg_lag_days = []
    for news_date in news_peaks.index:
        closest_filing_date = min(filing_peaks.index, key=lambda x: abs((x - news_date).days))
        avg_lag_days.append((closest_filing_date - news_date).days)
    avg_lag = np.mean(avg_lag_days)
    return {
        "peak_news_month": peak_news_month,
        "peak_transcript_month": peak_transcript_month,
        "peak_filing_month": peak_filing_month,
        "top_news_narrative": top_news_narrative,
        "top_transcript_narrative": top_transcript_narrative,
        "top_filing_narrative": top_filing_narrative,
        "total_news_mentions": total_news_mentions,
        "total_transcript_mentions": total_transcript_mentions,
        "total_filing_mentions": total_filing_mentions,
        "avg_lag_days": int(avg_lag)
    }
insights = extract_narrative_insights()
print("## AI Bubble Narrative Key Insights\n")
print(f"Peak month for news coverage: {insights['peak_news_month']}")
print(f"Peak month for earnings call mentions: {insights['peak_transcript_month']}")
print(f"Peak month for regulatory filing mentions: {insights['peak_filing_month']}")
print(f"\nDominant narrative in news: \"{insights['top_news_narrative']}\"")
print(f"Dominant narrative in earnings calls: \"{insights['top_transcript_narrative']}\"")
print(f"Dominant narrative in regulatory filings: \"{insights['top_filing_narrative']}\"")
print(f"\nTotal narrative mentions in news: {insights['total_news_mentions']}")
print(f"Total mentions in earnings calls: {insights['total_transcript_mentions']}")
print(f"Total mentions in regulatory filings: {insights['total_filing_mentions']}")
print(f"\nAverage lag between news coverage peaks and SEC filing peaks: {insights['avg_lag_days']} days")
## AI Bubble Narrative Key Insights
Peak month for news coverage: February 2025
Peak month for earnings call mentions: May 2024
Peak month for regulatory filing mentions: March 2025
Dominant narrative in news: "Companies are spending billions on unproven AI technology"
Dominant narrative in earnings calls: "Tech CEOs acknowledge AI implementation challenges amid high expectations"
Dominant narrative in regulatory filings: "Current AI investments may not generate predicted financial returns"
Total narrative mentions in news: 189
Total mentions in earnings calls: 12
Total mentions in regulatory filings: 190
Average lag between news coverage peaks and SEC filing peaks: 35 days
Conclusion¶
The Narrative Miners reveal important patterns in how the AI bubble narrative evolved across information sources:
Timing and Magnitude: SEC filings show the most dramatic spike in AI bubble concerns in March 2025, reaching the highest intensity on the chart. Notably, earnings call mentions peaked well before both news media coverage and SEC filings, suggesting corporate discussions may serve as a leading indicator.
Narrative Evolution: The data reveals a sequential pattern in which concerns surface first in earnings calls, are amplified by news media coverage, and ultimately culminate in formal SEC filings. This progression is particularly evident in the 2024-2025 period, where the May 2024 peak in earnings call discussions of AI bubble concerns preceded both the subsequent peaks in media coverage and the dramatic spike in SEC filings.
This multi-source narrative mining approach provides valuable insights into market sentiment and the evolution of key narratives—a powerful tool for market analysis.
Enjoy exploring and extending your narrative analysis!