Spotify Has 12 Years of My Data. I Just Took It Back.
You know the moment. You sit down to focus, your headphones go on, you tap your “Deep Work” playlist, and Spotify cues up a track with a vocal hook so catchy your brain turns sideways and refuses to write the thing you sat down to write.
Or you’re driving, you want something with edge, and the algorithm hands you the same six tracks it’s been pushing for a month.
Or it’s six in the evening, you’re trying to wind down, and Spotify comes in hot with a track that wants you to start something instead of finish it.
The recommender isn’t bad. It’s just generic. It looks at the smear of everything you’ve played in the last six months and serves you the average. It doesn’t know which mood you’re in right now. It doesn’t know that the chill jazz playlist it built you sits at 12% completion because you love the idea of chill jazz but rarely the actual execution. It doesn’t know that the song you’ve played 40 times — the one it thinks is your favorite — is one you’ve skipped 38 times in the first eight seconds.
So I’m building one that does.
$ cat thesis.md
The thesis
This article isn’t about Spotify. Not really.
This article is about the difference between an AI assistant that generally helps and one that actually knows you — and I’m going to argue that the difference isn’t about which model you’re using, or how clever your prompts are, or whether you bought the premium tier. The difference is whether you’ve given the AI access to the data that already exists about you.
Spotify has twelve years of behavioral data about my listening habits. So does every other service I use. Your inbox does. Your calendar does. Your git history does. Your fitness tracker does. The data has been there the whole time, sitting in someone else’s silo, generating exactly zero value for the AI that could actually do something with it.
You can pull most of it back. The exports exist. The APIs exist. What’s missing for most people is the architectural choice to actually do it — to treat your behavioral data as something you own and structure for your own use, rather than something each vendor holds in their own format for their own purposes.
Context isn’t a feature you bolt on later. Context is an architecture choice — and it compounds.
The Spotify project is the demonstration case. The architecture underneath is the actual subject.
$ git status --honest
The bet I’m placing
The honest part: I’m publishing this as it works, not after.
Spotify’s data export is supposed to take up to thirty days. Mine arrived in a fraction of that. The surprise was on the other side — in February 2026, Spotify tightened their catalog API for new developer apps, and the enrichment scripts I wrote (the ones that fill in track duration so the percent-played math actually works) needed a rework to be polite about how often they call. Pareto math on my data: 907 tracks make up a third of my total plays. That’s the slice I’m enriching first.
What’s already built and tested is the architecture — the database, the ingest pipeline, the enrichment layer, the queries that turn play events into actual taste signals. The pipeline works. The catalog metadata is filling in tier by tier. The first real “tracks I actually love” queries are running this week.
I’m publishing now because the idea is bigger than my Spotify project. The data is just the medium. The principle is what matters: give your AI the right context, structurally and durably, and it can do work for you that no amount of clever prompting can match.
Let me show you what I mean.
$ query layer1.sql
What this unlocks (Layer 1: just Spotify)
With the dump ingested and the database filling in, here’s the kind of question I can ask:
Show me tracks I let finish more than 90% of the time, from albums released between 1990 and 1999, that I’ve played at least three times this year. Order by how recently I last heard them.
That’s a one-line natural language query against my own listening history. It generates a playlist. It’s specific in a way that Spotify’s interface can’t even formulate, because Spotify’s interface is built for browsing — not for interrogating your own taste.
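Under the hood, a question like that compiles down to ordinary SQL. Here’s a minimal sketch against a toy in-memory database — the table and column names (`tracks`, `plays`, `finished`, `release_year`) are illustrative, not the project’s actual schema:

```python
import sqlite3

# Toy schema standing in for the real one; names are assumptions.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tracks (uri TEXT PRIMARY KEY, name TEXT, release_year INTEGER);
CREATE TABLE plays  (track_uri TEXT, played_at TEXT, finished INTEGER);
""")
con.executemany("INSERT INTO tracks VALUES (?,?,?)", [
    ("t:1", "nineties-track", 1995),
    ("t:2", "later-track", 2004),
])
con.executemany("INSERT INTO plays VALUES (?,?,?)", [
    ("t:1", "2026-01-05", 1), ("t:1", "2026-02-11", 1),
    ("t:1", "2026-03-02", 1), ("t:2", "2026-03-01", 1),
])

rows = con.execute("""
SELECT t.name,
       COUNT(*)         AS plays_this_year,
       MAX(p.played_at) AS last_heard
FROM plays p JOIN tracks t ON t.uri = p.track_uri
WHERE t.release_year BETWEEN 1990 AND 1999
  AND p.played_at >= '2026-01-01'
GROUP BY t.uri
HAVING COUNT(*) >= 3
   AND AVG(p.finished) > 0.9   -- "let finish more than 90% of the time"
ORDER BY last_heard DESC
""").fetchall()
```

The point isn’t the SQL itself — it’s that every clause maps one-to-one onto a phrase in the natural-language question.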
A few more I’m planning to run as the enrichment fills in:
The no-words-while-coding mix.
“From my own top three hundred most-played tracks, give me only the ones with no vocals or where the vocals are minimal enough to ignore. Sort by tempo, ascending, so I can ramp up over the morning.”
The back-button discovery list.
“Show me every song I’ve hit back-button on more than twice, but that I don’t have on any playlist.” Hitting the back button mid-song is the rarest signal in the dataset and the most truthful — your conscious self never chose to enshrine these tracks, but your listening behavior keeps voting for them.
The committed-but-forgotten list.
“Find tracks I used to play to completion regularly five years ago, but haven’t played in the last eighteen months.” Old loves I forgot about. Spotify wants me to discover new things; what I want is to be reminded of what I already knew I loved.
The mood-aware road trip generator.
“Build me a four-hour driving playlist of tracks I let finish more than 80% of the time, weighted toward upbeat tempos in the first hour and mellower tracks toward the end. Skip anything tagged for working hours.”
As of this week, I have enough data to actually run these. The first one — top tracks ranked by engagement, not by play count — produced a result I was hoping for and slightly dreading: my most-played track sits below 40% finish rate. Three of the top ten land below 50%. By raw play count, those are my “favorites.” By engagement, they’re songs Spotify queues confidently and I keep dismissing almost two times out of three. The metric works. The data is honest. And my taste, as it turns out, is meaningfully different from what Spotify believed it to be.
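The “engagement, not play count” ranking is simple enough to show in miniature. The `reason_end` values below are the real ones from the export; the table shape and track names are invented:

```python
import sqlite3

# A track with the highest play count can still rank last on engagement.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE plays (track TEXT, reason_end TEXT)")
con.executemany("INSERT INTO plays VALUES (?,?)",
    [("overexposed", "fwdbtn")] * 7       # skipped 7 times...
  + [("overexposed", "trackdone")] * 3    # ...finished only 3
  + [("quiet-favorite", "trackdone")] * 4)

rows = con.execute("""
SELECT track,
       COUNT(*)                      AS play_count,
       AVG(reason_end = 'trackdone') AS finish_rate
FROM plays
GROUP BY track
ORDER BY finish_rate DESC
""").fetchall()
# By play count "overexposed" wins; by finish rate it drops to last.
```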
None of these require an AI that “knows my soul.” They require an AI with structured access to my own behavioral history.
That’s the foundation. Now watch what happens when you stack the next layer on top.
$ query layer2.sql
Layer 2: AI that knows what I’m doing right now
The Spotify archive tells the AI what I’ve listened to. It does not tell the AI what I’m doing right now. That’s a different data source — and it’s already sitting in tools I use every day. Claude Code knows which repo I have open. My calendar knows I have a customer call at 2pm. My git history knows I’ve been deep in a YAML config for the last forty minutes.
Each of those is a small signal. Stacked together, they answer “what is John doing right now?” more accurately than I’d answer it about myself.
Now imagine the music AI has access to those signals.
Don’t queue up the high-energy aggressive thing twenty minutes before a customer call.
John’s deep in a YAML config — give him the no-words mix, sorted by lowest energy. He needs flow, not motivation.
It’s 6:47pm on a Friday. The last commit was four hours ago. He’s done. Wind-down mode.
Nothing here is magic. Each decision is a calendar API call, a terminal-history check, or a “which window is focused right now?” lookup, combined with the engagement data the AI already has from Spotify. The intelligence isn’t in any single signal. The intelligence is in connecting them.
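As a sketch of how little machinery this takes — every input below is a stand-in for a real lookup (a calendar API call, a git log check, a focused-window query), and the mode names and thresholds are my own invention:

```python
from datetime import datetime, timedelta

def pick_mode(now, next_meeting=None, last_commit=None, focused_file=None):
    # First match wins; order encodes priority.
    if next_meeting is not None and next_meeting - now <= timedelta(minutes=20):
        return "calm-pre-call"        # nothing aggressive before a customer call
    if focused_file is not None and focused_file.endswith((".yml", ".yaml")):
        return "no-words-low-energy"  # deep in config: flow, not motivation
    if now.hour >= 18 and last_commit is not None \
            and now - last_commit >= timedelta(hours=4):
        return "wind-down"            # evening, last commit hours ago: done
    return "high-engagement-default"
```

Each branch is one of the scenarios above, verbatim. The assistant’s job is just to feed it real signals.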
This is achievable today. The APIs exist. What’s missing, for most people, is simply that nobody has connected them.
$ query layer3.sql
Layer 3: AI that knows what kind of week I’m having
The third layer is where it gets interesting. My inbox knows the shape of my week. My Slack knows when there’s been an incident. My health metrics know whether I slept well. My location knows whether I’m at the desk, in the car, or on a hike.
You don’t need an AI that “feels” anything to act on this. You need an AI with rules you’ve written:
If the inbox has three escalation threads and Slack has been quiet for an hour, John’s probably heads-down dealing with something. Don’t pick anything new — pull from his comfort tracks, the ones he commits to every time.
If the calendar shows back-to-back meetings ending in five minutes, queue the post-meeting decompression set. Five minutes of instrumental, then ramp to whatever’s next on the agenda.
If the heart rate is elevated and the location says “trail,” he’s hiking. The Spotify data already knows what works there — high-engagement tracks at moderate-to-high tempo, vocal-heavy is fine because his brain isn’t doing language work.
Each individual rule is boring. None of them require any AI sophistication. They’re if-this-then-that logic with a small amount of personal taste encoded into them.
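Because the rules are boring, they can live as data rather than code — adding one becomes an edit, not a redeploy. A sketch, with signal names and thresholds entirely invented:

```python
# Rules as (name, predicate, playlist) triples; first match wins.
RULES = [
    ("heads-down",
     lambda s: s["escalation_threads"] >= 3 and s["slack_idle_min"] >= 60,
     "comfort-tracks"),
    ("post-meeting",
     lambda s: s["meeting_ends_in_min"] is not None
               and s["meeting_ends_in_min"] <= 5,
     "decompression-set"),
    ("on-trail",
     lambda s: s["heart_rate"] > 110 and s["location"] == "trail",
     "hiking-high-engagement"),
]

def choose_playlist(signals, default="default-mix"):
    for _name, predicate, playlist in RULES:
        if predicate(signals):
            return playlist
    return default
```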
The point isn’t that the AI is smart. The point is that the AI has access to enough context to apply boring rules in non-obvious combinations.
$ tail -f friction.log
The friction is real (and getting worse)
I want to flag something that happened while I was building this, because it’s part of the actual story.
Two days into the enrichment pass — the part where my scripts call Spotify’s catalog API to fill in track durations and artist genres — Spotify started rate-limiting me. Not for everything. Just the bulk-lookup calls, on a freshly registered developer app, fetching metadata for tracks that are already in my own listening history. That tightening landed in February 2026 as part of a broader pattern: Spotify, like most large services, is steadily reducing what third-party developers can pull from their catalog at scale.
The fix wasn’t hard. Slow it down.
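Concretely, “slow it down” looks something like this. Spotify’s 429 responses do carry a `Retry-After` header (in seconds); everything else here — the helper’s name, the `fetch` callable returning `(status, headers, body)`, the pacing values — is my own sketch:

```python
import time

def fetch_politely(fetch, max_retries=5, min_interval=0.5, sleep=time.sleep):
    """Call fetch(); on a 429, honor Retry-After and try again."""
    for _attempt in range(max_retries):
        status, headers, body = fetch()
        if status != 429:
            sleep(min_interval)  # pace ourselves even on success
            return body
        # Honor the server's ask, plus a second of slack.
        sleep(int(headers.get("Retry-After", "1")) + 1)
    raise RuntimeError(f"still rate-limited after {max_retries} attempts")
```

Injecting `sleep` makes the politeness testable without actually waiting.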
But the broader signal is the part worth absorbing: every year, the friction to get your own data out of someone else’s silo grows a little. Not because of bad intent — Spotify has legitimate reasons to throttle API access. But the cumulative effect is real. The Spotify dump that arrived this week may not be available in the same form five years from now. The enrichment endpoints that worked yesterday may not work tomorrow. The exports that your fitness tracker, your calendar provider, and your inbox host all advertise today are quietly under more pressure each year, as those companies decide AI-training-on-their-data is something they want to control.
That’s the second half of the bet I’m placing. Not just “I want my AI to know me.” Also: “I want to do this while I still can.”
$ cat mirror.md
The mirror
Here’s the bonus that nobody talks about.
Once the AI has access to all of this, it doesn’t just use the context to make better decisions. It can also reflect the context back at you in ways you’d never notice on your own.
John, you’ve started skipping every track over four minutes for the last six weeks. That’s new. Want me to look at what changed?
Your committed-listen rate on instrumental music dropped 30% in October. You also added forty-six calendar events that month. Probably worth noticing.
You’ve hit back-button on the same album eleven times in the last two weeks. That’s a record for any single album in your archive. You might genuinely love this one.
Your “winding down” listening has shifted thirty minutes earlier over the last six months. You’re going to bed earlier without realizing it.
This is the part of giving an AI context that I genuinely didn’t expect to find compelling, and it’s the one I now think might be the most valuable. The AI as mirror. Not as oracle — it doesn’t predict anything. It just notices, in the way a thoughtful friend who’s known you for years notices, except it has access to behavioral data no actual friend would ever have.
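The first mirror observation above is just a query over the same archive. A toy version — the table shape is illustrative, the `reason_end` values are the real ones, and the February cutoff stands in for “the last six weeks”:

```python
import sqlite3

# Skip rate on tracks over four minutes: recent weeks vs. everything before.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE plays (played_at TEXT, duration_ms INTEGER, reason_end TEXT)")
con.executemany("INSERT INTO plays VALUES (?,?,?)", [
    ("2026-01-10", 300000, "trackdone"),  # earlier: long tracks finished
    ("2026-01-15", 280000, "trackdone"),
    ("2026-03-01", 300000, "fwdbtn"),     # recently: long tracks skipped
    ("2026-03-05", 280000, "fwdbtn"),
    ("2026-03-08", 310000, "fwdbtn"),
])

rows = con.execute("""
SELECT (played_at >= '2026-02-01') AS recent,
       AVG(reason_end = 'fwdbtn')  AS skip_rate
FROM plays
WHERE duration_ms > 4 * 60 * 1000
GROUP BY recent
ORDER BY recent
""").fetchall()
```

A jump in `skip_rate` between the two buckets is exactly the kind of thing a thoughtful friend would mention.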
The data archive makes the mirror possible. No archive, no mirror.
$ man context
Why this is bigger than Spotify
The trap people fall into is letting the AI live in the moment. Every conversation starts from scratch. The AI knows nothing about you that you haven’t typed in the last fifteen minutes. That’s a brand-new colleague every single time, and you’re the one paying the onboarding tax.
I’ve written about this elsewhere — why I treat my AI context like infrastructure, and the practical companion piece on how to give an AI a memory. The Spotify project is one expression of that thesis. The data is just the medium.
Give your AI the right context, structurally and durably, and you stop talking to a chatbot. You start working with an assistant.
$ tree music-data/
For the curious: how the Spotify piece actually works
Built on SQLite. The whole archive is a single file. Idempotent ingest pipeline — re-running on the same input produces zero duplicates. Spotify’s Web API fills in the metadata the dump leaves out, including track duration (which is what makes the percent-played math possible). There’s a labeling layer for editorial judgments Spotify can’t make.
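The idempotency comes from a natural key plus `INSERT OR IGNORE` — re-running the ingest on the same dump adds nothing. A minimal sketch; the field names `ts`, `ms_played`, and `spotify_track_uri` match the export, while the table layout is illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE plays (
    played_at TEXT    NOT NULL,
    track_uri TEXT    NOT NULL,
    ms_played INTEGER,
    UNIQUE (played_at, track_uri)  -- natural key: one play per timestamp+track
)""")

def ingest(con, records):
    con.executemany(
        "INSERT OR IGNORE INTO plays VALUES (?,?,?)",
        [(r["ts"], r["spotify_track_uri"], r["ms_played"]) for r in records])

dump = [{"ts": "2026-03-01T09:15:00Z",
         "spotify_track_uri": "spotify:track:abc",
         "ms_played": 182000}]
ingest(con, dump)
ingest(con, dump)  # second run is a no-op: zero duplicates
count = con.execute("SELECT COUNT(*) FROM plays").fetchone()[0]
```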
The data Spotify gives you (and what makes it useful)
You can request your full streaming history from Spotify’s privacy settings, and in most jurisdictions they’re legally obligated to give it to you. The wait can be up to thirty days. It comes back as JSON.
Each play record includes a ms_played field — how many milliseconds you actually listened — and a reason_end field — why the play ended (trackdone, fwdbtn, backbtn, endplay, logout). Together those two fields tell you the truth that play count alone hides. A song queued four times and skipped in eight seconds each time is not a hit. It’s four rejections. The data has known this all along; Spotify just doesn’t expose that interpretation to you, and your AI can’t reach the data without a project like this.
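Turning those two fields into a verdict is a one-function job. `ms_played` and `reason_end` are the real field names from the export; the ten-second rejection threshold and the label names are my own choices:

```python
import json

def classify(record):
    """Label one play record from the streaming-history export."""
    if record["reason_end"] == "fwdbtn" and record["ms_played"] < 10_000:
        return "rejection"   # skipped almost immediately
    if record["reason_end"] == "trackdone":
        return "commit"      # listened to the end
    return "partial"

raw = json.dumps([
    {"ts": "2026-03-01T09:15:00Z", "ms_played": 8000,   "reason_end": "fwdbtn"},
    {"ts": "2026-03-01T09:16:00Z", "ms_played": 215000, "reason_end": "trackdone"},
])
labels = [classify(r) for r in json.loads(raw)]
```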
The schema in three sentences
A normalized model with separate tracks, artists, and albums tables, plus a polymorphic plays fact table that handles tracks, podcast episodes, and audiobook chapters in one place. Every entity has a Spotify URI as its stable identity once enrichment runs. The interesting machinery — engagement bucketing, percent played, label inheritance — lives in SQL views, computed at query time so they’re always current.
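Here’s what “derived metrics live in views” looks like in miniature. The view name, column names, and bucket thresholds are illustrative, not the project’s actual schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tracks (uri TEXT PRIMARY KEY, duration_ms INTEGER);
CREATE TABLE plays  (track_uri TEXT, ms_played INTEGER);

-- Percent played and the engagement bucket are computed at query time,
-- so they always reflect the latest enrichment. No stored derived data.
CREATE VIEW play_engagement AS
SELECT p.track_uri,
       1.0 * p.ms_played / t.duration_ms AS pct_played,
       CASE
         WHEN 1.0 * p.ms_played / t.duration_ms >= 0.9 THEN 'committed'
         WHEN 1.0 * p.ms_played / t.duration_ms >= 0.3 THEN 'sampled'
         ELSE 'rejected'
       END AS engagement
FROM plays p JOIN tracks t ON t.uri = p.track_uri;
""")
con.execute("INSERT INTO tracks VALUES ('spotify:track:abc', 200000)")
con.execute("INSERT INTO plays  VALUES ('spotify:track:abc', 190000)")
row = con.execute(
    "SELECT pct_played, engagement FROM play_engagement").fetchone()
```

Note the view only produces rows once enrichment has filled in `duration_ms` — which is why the percent-played math depends on the catalog lookups.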
The full design rationale will live in a companion article on the schema, publishing shortly. It’s open source; the repository link is below.
$ ls install/
Two ways in
Two install paths. Pick one.
Easy path: let your AI do it
Download setup-with-ai.md. Open it in Claude Code, Cursor, Cowork, or ChatGPT desktop. Say “set this up for me.”
The AI does:
- Clones the repo and sets up the environment
- Walks you through Spotify’s “create a developer app” screen
- Configures the connection
- Initializes the database and verifies it works
You do:
- Click “approve” on the Spotify login
- Request your Spotify data dump (the wait can be a few weeks)
Manual path: do it yourself
Clone the repo and follow INSTALL.md. Plain terminal commands, no AI involvement. If you’ve spent any time in InfoSec you already know which path you’ll pick.
$ git log future/
Where this goes
Six months from now, I expect a few things to be true.
I expect the database will surface patterns I genuinely didn’t know about myself — the years I unconsciously drifted into a genre I never thought of as “mine,” the artists I returned to without realizing it, the years I burned out on something and never came back. I expect the playlists I generate from this database will sit at much higher avg-percent-played than the ones Spotify builds for me, because they’ll be drawn from songs I’ve already proven I commit to. I expect the contrast between “Spotify’s idea of me” and “the data’s idea of me” to be larger than I’d guess.
The next piece of this — the one I’ll write when it exists — is a small bridge service that lets my AI query this database directly the way it queries Spotify’s catalog today. Spotify’s connector knows the catalog. It doesn’t know me. The bridge is the layer that turns “I have a database” into “my AI knows my taste.” That article comes when the bridge does.
After that come the other layers. The calendar integration. The git-state awareness. The email-shape signals. Each one a small piece. Each one easy in isolation. The article when the compounding starts to actually feel like something — that one I’m looking forward to writing.
The first real queries are running this week. The next piece of this is being written as the data fills in.
$ exit
One last thing
The most surprising part of this whole project — to me — wasn’t building the database. The database was a weekend.
The surprising part was realizing that twelve years of my listening behavior had been sitting in someone else’s data center the whole time — and that getting it out, even for personal use, has friction that grows year over year. The same is true of nearly every other service I use. The data is there. The architecture for my AI to actually use it is the part I have to build.
Oh — and the first thing the data showed me, by the way, was that Spotify was even more wrong about me than I thought.
The data is yours. It always was.
Go get it — while you still can.
