Shostack + Friends Blog

 

Adventures in LLM Coding

Exploring LLM-driven coding as I get ready for Archimedes.

A DFD from Code.

Archimedes is the most concentrated medical device security event I’ve attended, and this year, their Health Care Security Week is smack dab in the middle of the two weekends of the New Orleans Jazz Fest. Which, really, only makes it more attractive. In this post, I want to share a bit about using modern tools to prepare for that, and also some of the lessons I learned along the way. Oh, and let you know seats are available in a training there, and I’ve hidden a discount code in this post.

Now, until I had an excuse to make this year’s festival, I didn’t really grasp just how enormous a festival it is. Eight days. The Rolling Stones. Queen Latifah. Bonnie Raitt shows up on the fourth line of the festival poster. It’s ... intimidating. One of the things I really love about a good music festival is the chance to discover bands, so I figured there’d be a way to explore the music, like the sampler torrents SXSW used to drop. Alas, no...

This turns out to be a useful opportunity to explore how LLMs can help me code. Based on some comments from Simon Willison, I decided to delve in with Anthropic’s recent Claude 3 models.

Writing code

A robot sorta playing a thing
like a slide trombone, as drawn by an AI

The first hurdle was deciding on a music player. I wanted to work with Apple Music, mostly for privacy reasons. So I asked Claude to write me AppleScript that would drive Apple Music via a CSV, and ran into permissions issues that I was never quite able to debug. (I think I needed an Apple Developer account.) I spent a few hours bouncing between AppleScript and Python, which was a mistake. I should have stayed in one tool or abandoned it faster.

I took an unfortunate detour through getting Claude to move the client secrets into their own file. It never quite got that right, and as it turns out, that was important, even though I never leaked my secrets.
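For readers who want to see what "secrets in their own file" can look like, here is a minimal sketch. It is not the code from the gist: the JSON layout, the key names, and the function name `load_client_secrets` are all assumptions made for illustration.

```python
import json
from pathlib import Path

# Hypothetical key names; the post does not show how the real file was laid out.
REQUIRED_KEYS = ("client_id", "client_secret", "redirect_uri")


def load_client_secrets(path):
    """Read API credentials from a JSON file kept apart from the source code."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    # Fail loudly on a missing or empty value instead of sending a blank credential.
    missing = [key for key in REQUIRED_KEYS if not data.get(key)]
    if missing:
        raise ValueError(f"{path} is missing: {', '.join(missing)}")
    return {key: data[key] for key in REQUIRED_KEYS}
```

Keeping the file out of version control (for example, by listing it in .gitignore) is the other half of the job, and the sketch does nothing about that.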

Once I came to expect that I’d need a developer account to automate Apple Music, I decided to swap to Spotify. In an hour or two, I got working code that would take a comma-separated list of artists, add the latest album when there was a precise match for an artist, and prompt me if something else came back. Without further ado, here are 8 playlists, one per day:

Producing these is about 5 minutes of work each, with most of that being checking fuzzy matches. (See below.) I can focus my attention on the exception handling, not fiddly select, paste, scroll, select, right click, fiddle through to the playlist, repeat 100x.
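The flow described above (split the artist list, accept only an exact name match, take the latest album, and queue everything else for a human) can be sketched roughly as follows. This is an illustration, not the gist's code. The `client` argument is assumed to behave like a spotipy client, with `search` and `artist_albums` methods returning dictionaries shaped as I understand the Spotify Web API responses to be; the function names are hypothetical.

```python
def pick_exact_artist(query, search_response):
    """Return the artist whose name equals the query (ignoring case), else None."""
    wanted = query.strip().casefold()
    for artist in search_response["artists"]["items"]:
        if artist["name"].strip().casefold() == wanted:
            return artist
    return None


def latest_album(albums_response):
    """Return the album with the greatest release_date string, or None if empty."""
    items = albums_response["items"]
    if not items:
        return None
    # Assumes ISO-style dates ("2022", "2022-04", "2022-04-22"), which sort as text.
    return max(items, key=lambda album: album["release_date"])


def plan_playlist(artist_csv, client):
    """Split names into (artist, album) pairs to add and names needing review."""
    to_add, needs_review = [], []
    for name in (part.strip() for part in artist_csv.split(",")):
        if not name:
            continue
        found = client.search(q=name, type="artist", limit=5)
        artist = pick_exact_artist(name, found)
        if artist is None:
            needs_review.append(name)  # something else came back: ask a human
            continue
        album = latest_album(client.artist_albums(artist["id"], album_type="album"))
        if album is None:
            needs_review.append(name)
        else:
            to_add.append((artist["name"], album["name"]))
    return to_add, needs_review
```

Separating the "plan" from the step that actually modifies a playlist makes it possible to eyeball the review list before anything is changed. Note that splitting on commas will mangle any artist whose name contains one.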

What I learned

The code is at GitHub (as a Gist, I don’t plan to maintain it) and includes a program that will delete an album all at once. Some albums were bad matches, and sometimes the album Spotify chose was... not good. The Spotify search API is biased, and gives different results from the web interface. The API will give you a popular artist with a name that’s similar to a less well known artist. I could probably get the code adjusted to deal with that, but what’s the right algorithm to handle a purported match like this:

Searched for 'Terence Blanchard featuring The E-Collective & Turtle Island Quartet', found 'Terence Blanchard'.
Latest album: 'Perry Mason: Season 2 (Soundtrack from the HBO® Series)'

It turns out that the Perry Mason album is Terence Blanchard’s work, even though his name is not in the title. There are a lot of special cases here, which gives me some sympathy for the API’s biases. But other times, the search simply went wrong: searching for “Patrice Fisher & Arpa with guests from Martinique” missed “Patrice Fisher” and the albums she’d done with Arpa, and instead returned “Patrice Rushen.”
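This is not an answer to the "right algorithm" question, but one cheap heuristic separates the two cases above: check whether the returned artist name appears, word for word, inside what was searched for. A sketch, with a hypothetical function name:

```python
import re


def _words(name):
    """Lowercase a name and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", name.casefold())


def looks_plausible(query, returned_name):
    """True when the returned name's words occur as a contiguous run in the query."""
    q, r = _words(query), _words(returned_name)
    if not r:
        return False
    return any(q[i:i + len(r)] == r for i in range(len(q) - len(r) + 1))


print(looks_plausible(
    "Terence Blanchard featuring The E-Collective & Turtle Island Quartet",
    "Terence Blanchard",
))  # True
print(looks_plausible(
    "Patrice Fisher & Arpa with guests from Martinique",
    "Patrice Rushen",
))  # False
```

It says nothing about whether the album is the right one, and it would wrongly accept a short artist name that happens to be embedded in a longer, unrelated one, so a human check is still needed.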

Another sort of issue was that Claude routinely lost details of what I’d told it to do. So why use Claude versus an IDE plugin? I really like the conversational-refinement approach to programming. I can talk about what I’d like and get code. You can also get code by programming by comment and letting a Copilot-style interface make suggestions, but that approach doesn’t handle errors and iteration the same way. I didn’t have to learn the Spotify API at all or worry about details of its data structures.

The big reveal

Which brings me to the big reveal: I have no idea what this code does. I mean, I’ve read it, largely before I pasted it into a local buffer and ran it, but the spotipy library? Well, yeah. Good question! So, recognizing that I’m not sure what the code does and that I have a stochastic parrot, why not see what the parrot can do? So I asked Claude to help me threat model the code it had written.

Some of what it did seemed credible. Like many people, I’ve been exploring how to make this work, and one of my lessons has been to ask for mermaid diagrams. So I did, and what came out... well, the best answer is the image that headlines this post, and I thought we were doing pretty well.

But then, as I was moving the files to a new directory, I discovered something: my authentication was broken. And as I debugged, I learned that there was a .cache file with some JSON access token in it... and that file isn’t mentioned in my data flow diagram. I had found myself drawn into the tool! Despite all my skepticism and work understanding LLMs, I got drawn into thinking it was giving me OK answers.

In fact, you can lead an LLM to water, but you can’t make it think. Even when prompted specifically to do so, I couldn’t make it delve into dependencies. I could not get it to tell me about the .cache file. (And who knows what else?) So I started in with semgrep, but that’s a story for another day.
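Since the model wouldn't surface the .cache file, a blunt alternative is to look at the working directory directly. The sketch below lists files that look like token caches and reports which token-like keys they hold, without printing the values. The `.cache*` file-name pattern and the "key contains the word token" test are assumptions based only on the description above, and `find_token_caches` is a hypothetical name.

```python
import json
from pathlib import Path


def find_token_caches(directory):
    """Return (file name, token-like keys) for JSON files matching .cache*."""
    findings = []
    for path in sorted(Path(directory).glob(".cache*")):
        if not path.is_file():
            continue
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # unreadable or not JSON: not what we are looking for
        if not isinstance(data, dict):
            continue
        # Report key names only, so the secret values never reach the terminal.
        token_keys = sorted(key for key in data if "token" in key.lower())
        if token_keys:
            findings.append((path.name, token_keys))
    return findings
```

Anything this turns up is a data store that belongs on the data flow diagram, and probably in .gitignore as well.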

Join us in New Orleans

Learning, exploring, and getting better at what we do are important themes of Archimedes. They’re all crucial parts of how we secure medicine. We need to experiment, explore and report on how LLM coding works, because it’s an attractive idea: I got more done because of it than not. How we use LLMs to deliver safe and effective medical treatments and devices is an open question. Closely related is how we use LLMs to create those devices. In both cases, the answer may well be that they’re not ready for prime time.

For today, let me say that you’ll learn a lot at Archimedes. Maybe you’ll even add in the training that I’m doing there. Maybe you’ll use discount code DISCARCH at checkout and get a surprise discount? (Hey, this blog post is all about surprises and reveals! And it turns out that’s not very well hidden, but ya know?)