Data Journey: How Our Statistics Are Built

This page explains the complete journey of Spotify music data-from raw metadata to the analytics you see on this site. We'll take you through every stage of the pipeline with real examples from our database.

Raw Data Sources: Three Spotify Databases

This analysis uses three Spotify datasets archived by Anna's Archive, a community-driven open library that preserves music metadata and research datasets.

The Big Picture: Our Data Pipeline

Why Multiple Stages?

Filtering ensures focus on meaningful data (popular tracks, major languages)
Normalization handles inconsistencies and duplicates
Denormalization makes queries fast
Pre-aggregation pre-computes expensive calculations
PostgreSQL makes it all accessible via web queries

Stage 1: Source Data - Three Raw Databases

We start with three Spotify datasets (from Anna's Archive):

Database 1: `spotify_clean.sqlite3`

The Foundation: All track metadata

Example Track:

"Blinding Lights" by The Weeknd
  ID: 3n3Ppam7vgaVa1iaRUc9Lp
  Popularity: 95/100
  Duration: 3 min 20 sec (200,000 ms)
  Album: "After Hours" (2019)

Database 2: `spotify_clean_track_files.sqlite3`

The Language Layer: Language + popularity annotations

Adds language metadata for ~18 million tracks:

Track ID                    Language    Popularity
3n3Ppam7vgaVa1iaRUc9Lp    → English  → 86/100

Languages Available: English, French, Spanish, German, Japanese, Polish

Database 3: `spotify_clean_audio_features.sqlite3`

The Sound Analysis: 14 numerical features per track

Track ID    Tempo   Danceability  Energy  Valence  Acousticness
...         103     0.855         0.730   0.486    0.026

What These Mean:

Danceability (0.0-1.0): How dance-friendly is the track?
Energy (0.0-1.0): Intensity & activity level
Valence (0.0-1.0): Musical positivity (major vs. minor keys)
Acousticness (0.0-1.0): Acoustic instruments vs. electronic
Tempo (BPM): Beats per minute
And 9 more: loudness, speechiness, instrumentalness, liveness, key, mode, time signature, etc.

Coverage Note: We have audio features for ~350,000 tracks (3%). For the rest, these columns are empty.

Stage 2: Filters - Narrowing Focus

We apply three main filters to keep data focused and clean:

Details of Each Filter:

Filter 1: Language

Rule: Keep only tracks in: en, fr, es, de, ja, pl

Why? Focus on major global languages.

Impact: ~58.6M tracks → ~18M tracks (31% remain)

Filter 2: Popularity

Rule: Keep only tracks where popularity > 5 (scale 0-100)

Why? Remove obscure/test tracks, focus on real releases.

Impact: ~18M tracks → ~10M tracks (55% remain)

Filter 3: Release Year

Rule: Keep years in the range [1970, 2025]

Why? Focus on recorded music era. Historical music from before 1970 is sparse anyway.

Extracted From: Album release dates (standardized format)

Stage 3: Data Enrichment - Joining Across Databases

Now we connect everything:

Result: A unified view of each track with all its properties combined.

Example Row After Enrichment:

Property	Value
Track ID	3n3Ppam7vgaVa1iaRUc9Lp
Track Name	Blinding Lights
Artists	The Weeknd
Language	English
Release Year	2019
Popularity	86
Genres	pop, dark wave, synth-pop, electronic
Tempo	103 BPM
Danceability	0.855
Energy	0.730
Valence	0.486

Stage 4: Processing Step 1 - Top 2000 Genres

The Problem: Spotify has 100,000+ genre labels. This creates sparsity.

Our Solution: Rank genres by frequency, keep only top 2,000.

Genre Coverage:

Top 50 genres: Account for ~60% of all tracks
Top 500 genres: Account for ~90% of all tracks
Top 2,000 genres: Account for ~95% of all tracks
Remaining 98,000+ genres: Niche, sparse data

Example Top Genres:

Pop (~500K tracks)
Rock (~300K tracks)
Indie (~200K tracks)
Alternative (~180K tracks)
Electronic (~150K tracks) ... and so on

Result: A clean, queryable genre list without the data sparsity of ultra-niche labels.

Stage 5: Processing Step 2 - Denormalized Track Details

We create the track_details table: One row per track, containing everything.

This is the source of truth for:

Feature visualizations (danceability by year/genre)
Individual track searches
Audio characteristic distributions

Table Schema:

- track_id (unique identifier)
- track_name, artists, album_name
- language_code, release_year
- track_popularity, genres
- danceability, energy, valence, acousticness, ...
- tempo, loudness, key, mode, ...
(20 columns total)

Size: 9,879,454 rows (~2.4 GB)

Indexing: Fast queries on language, year, genre

Stage 6: Processing Step 3 - Pre-aggregated Tables

For performance, we pre-compute three aggregations:

Table 3A: `agg_genre_language_year`

Grouped By: Genre + Language + Year

What It Stores:

Genre     | Language | Year | Track Count | Avg Popularity | Avg Danceability
Pop       | English  | 2020 | 892         | 68.5           | 0.71
Rock      | English  | 2015 | 456         | 52.1           | 0.58
Hip Hop   | English  | 2020 | 1,243       | 71.2           | 0.79
Electronic| French   | 2019 | 78          | 45.3           | 0.73

Size: 70,615 rows (~8 MB)

Why Pre-aggregate? A dashboard chart grouping 10 million tracks would take seconds to compute. Pre-aggregation makes it instant.

Table 3B: `agg_genre_year`

Grouped By: Genre + Year (aggregated across all languages)

Size: 28,861 rows (~3 MB)

Table 3C: `agg_language_year`

Grouped By: Language + Year (aggregated across all genres)

Sample Data:

Language  | Year | Track Count | Avg Popularity
English   | 2020 | 8,234       | 62.1
French    | 2020 | 1,456       | 58.3
Spanish   | 2020 | 987         | 59.2
German    | 2020 | 456         | 57.1
Japanese  | 2020 | 234         | 52.8
Polish    | 2020 | 89          | 51.3

Size: 337 rows (~88 KB) - Small enough to load in full!

Stage 7: Load Into PostgreSQL

All processed tables are copied from SQLite to PostgreSQL (accessible at kerboul.me:15433).

Why? SQLite is great for processing, but PostgreSQL is better served over the web with concurrent queries.

Stage 8: Query and Display

Data loaders (*.json.js files) query PostgreSQL and format results for visualizations:

// Example: Get pop music trends
SELECT genre, release_year, SUM(track_count) as total_tracks
FROM agg_genre_language_year
WHERE genre = 'pop' AND language_code = 'en'
GROUP BY genre, release_year
ORDER BY release_year

Results power charts like:

Stacked area charts (genre trends over time)
Donut charts (current genre distribution)
Line charts (feature evolution)
Language breakdowns

ID: 3n3Ppam7vgaVa1iaRUc9Lp
Name: "Blinding Lights"
Popularity: 95
Duration: 200,000 ms
Album: "After Hours" (Release: 2019-11-29)
Explicit: No

Language Database:

Track: 3n3Ppam7vgaVa1iaRUc9Lp
Language: English
Popularity: 86

Audio Features:

Tempo: 103.33 BPM
Danceability: 0.855 (↑ Very danceable!)
Energy: 0.730 (↑ High energy)
Valence: 0.486 (→ Neutral mood)
Acousticness: 0.0255 (↓ Mostly electronic)

After Genre Top 2000 Filter:

Artist (The Weeknd) has genres: [pop, dark wave, synth-pop, electronic]

All in top 2,000? Yes

Final `track_details` Row:

track_id: 3n3Ppam7vgaVa1iaRUc9Lp
track_name: Blinding Lights
artists: The Weeknd
album_name: After Hours
language_code: en
release_year: 2019
track_popularity: 86
genres: pop | dark wave | synth-pop | electronic
danceability: 0.855
energy: 0.730
valence: 0.486
acousticness: 0.026
tempo: 103.33
... (other audio features)

In `agg_genre_language_year`:

genre: pop
language_code: en
release_year: 2019
track_count: ... (includes this track)
avg_popularity: ... (includes 86 in the average)
avg_danceability: ... (includes 0.855)

Track names, artists, release years
Popularity scores
Language identification
Genre membership
Audio features (Spotify's official measurements)

Limitations:

Audio Features: Only 3% coverage (350K of 9.9M tracks)
Language Ambiguity: Bilingual tracks assigned to single language
Genre Sparsity: Niche genres (> 2000) are dropped
Historical Gaps: Pre-1980 data is thin (~1% of total)
Inconsistent Metadata: Old records may have incomplete info