Feb 11, 2025
This last month has been fascinating. I guess LLMs have finally resonated with me on a deeper level. It wasn’t like I woke up one day and suddenly everything was different; rather, their impact on me is growing non-linearly, forcing me to rewire my brain.
I know there are probably tons of blog posts by the newly converted. I’m not trying to offer any grand insights; I’m just documenting my process and current ideas.
I’ve been a typical sceptic of Copilot and similar tools. Sure, it’s nice to generate boilerplate and throw-away scripts, but that’s a minor part of what we do all day, right? I even took a break from using them for many months, and I’ve had serious qualms about their use in some areas outside coding.
After messing around with copilot.lua in Neovim, I tried Cursor. Their vision and what they’ve already built opened my eyes, especially the shadow workspaces, Tab, and the rule files. At the same time, a critical mass of friends and peers were building new products on top of these models: things I highly respect and can see massive value in.
Since then I’ve actively been looking for how I can use LLMs — beyond the auto-complete and chat interaction modes, beyond making me a slightly more productive developer. Don’t get me wrong, I love being alone coding for hours in a cozy room. It’s great. But I’m also curious to see how far I can push this myself, and of course how far and where the industry goes.
One project that I’ve been working on goes under the working name site2doc. It’s a tool that converts entire websites into EPUB and PDF books, mostly because I want it for myself: I’d like to read online material, typeset minimalistically and beautifully, offline on my ebook reader. It turns out others want that too. There are great tools for converting single pages, but not for entire sites.
My main problem is that the web is highly unstructured and diverse. To be frank, a lot of sites have really bad markup: no titles at all, identical titles across pages, <h1> elements used inconsistently. The list goes on. This makes it very difficult for site2doc to generate a useful table of contents.
A friend suggested using LLMs to extract the information, and I experimented with using screenshots as input to classify pages. Both Claude 3.5 Sonnet and Gemini 2.0 Flash performed well, but I haven’t been able to generalize the approach across many sites. There’s just too much variability in how websites are structured, and I’m not sure how to handle it. I’m open to suggestions!
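For the curious, the core of the experiment is a single vision call per page. Here’s a minimal sketch against the raw Anthropic Messages API, assuming a page.png screenshot; the categories in the prompt are just an example, not what site2doc actually sends:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

public class ClassifyPage {
    public static void main(String[] args) throws Exception {
        // One screenshot per page; page.png is a stand-in.
        String png = Base64.getEncoder()
                .encodeToString(Files.readAllBytes(Path.of("page.png")));

        // Request body built by hand to keep the sketch dependency-free;
        // a real tool would use a JSON library or an SDK.
        String body = """
                {"model": "claude-3-5-sonnet-20241022",
                 "max_tokens": 256,
                 "messages": [{
                   "role": "user",
                   "content": [
                     {"type": "image",
                      "source": {"type": "base64", "media_type": "image/png", "data": "%s"}},
                     {"type": "text",
                      "text": "Classify this page as one of: content, index, navigation, other. Also give its title."}]}]}
                """.formatted(png);

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.anthropic.com/v1/messages"))
                .header("x-api-key", System.getenv("ANTHROPIC_API_KEY"))
                .header("anthropic-version", "2023-06-01")
                .header("content-type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}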
The other project, temporarily named converge, is a bit closer to what everyone else is doing: using LLMs for programming. It’s an agent that, given some file or module, covers it with a generated test suite, and then goes on to optimize it. The key idea is that the original code is the source of truth. The particular optimization goal could be performance, resource usage, readability, safety, or robustness. So far I’ve focused only on performance, partly because evaluation is straightforward.
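For a rough idea of the shape of the thing, here’s the loop reduced to a compilable sketch; every type and method name is a simplified stand-in for the real moving parts:

public class ConvergeLoop {
    interface Llm {
        String generateTests(String source);
        String fixTests(String tests, String source);
        String optimize(String source, String benchmarkReport);
    }

    interface Sandbox {
        boolean testsPass(String tests, String source);
        String benchmark(String source);        // e.g. a timing report
        double seconds(String benchmarkReport); // parsed wall time
    }

    static String run(Llm llm, Sandbox sandbox, String original, int maxAttempts) {
        // The original code is the source of truth: if the generated tests
        // fail against it, the *tests* get repaired, never the code.
        String tests = llm.generateTests(original);
        while (!sandbox.testsPass(tests, original)) {
            tests = llm.fixTests(tests, original);
        }

        String baseline = sandbox.benchmark(original);
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            String candidate = llm.optimize(original, baseline);
            if (sandbox.testsPass(tests, candidate)
                    && sandbox.seconds(sandbox.benchmark(candidate))
                            < sandbox.seconds(baseline)) {
                return candidate; // optimization succeeded
            }
        }
        return original; // no verified improvement found
    }
}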
Going beyond example-based test suites, I’ve been thinking about how property-based testing (PBT) might fit in. The obvious approach is to have the LLM generate property tests rather than examples. I don’t know how well this would work, or whether the LLM can generate meaningful properties.
A more interesting way is to generate an oracle property that compares the behavior of new code generated by the LLM to the original code: optimized(x) == original(x), where x is some generated input. This provides a rigorous way to verify that optimizations preserve the original behavior. I’m curious to see how PBT’s shrinking could guide the LLM to iteratively fix the generated code.
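Sketched with jqwik, the oracle property for something like the Achilles example could look as follows. The isAchilles reference here is a plain reimplementation so the snippet is self-contained; in converge, original would be the untouched code and optimized the LLM’s rewrite:

import java.util.function.IntPredicate;

import net.jqwik.api.ForAll;
import net.jqwik.api.Property;
import net.jqwik.api.constraints.IntRange;

class OracleProperty {
    // Stand-ins: both sides point at the same reference implementation
    // here, so the sketch compiles and trivially passes.
    static final IntPredicate original = OracleProperty::isAchilles;
    static final IntPredicate optimized = OracleProperty::isAchilles;

    @Property
    boolean optimizedAgreesWithOriginal(
            @ForAll @IntRange(min = 1, max = 1_000_000) int n) {
        // The oracle: for every generated n the rewrite must agree with
        // the original. On failure, jqwik shrinks n to a minimal
        // counterexample, which could be handed back to the LLM.
        return optimized.test(n) == original.test(n);
    }

    // n is an Achilles number iff it is powerful but not a perfect power.
    static boolean isAchilles(int n) {
        return isPowerful(n) && !isPerfectPower(n);
    }

    // Every prime factor must occur with exponent >= 2.
    static boolean isPowerful(int n) {
        for (int p = 2; (long) p * p <= n; p++) {
            if (n % p == 0) {
                int e = 0;
                while (n % p == 0) { n /= p; e++; }
                if (e < 2) return false;
            }
        }
        return n == 1; // a leftover factor would be a prime with exponent 1
    }

    // True if n = b^e for some b >= 2, e >= 2 (or n == 1).
    static boolean isPerfectPower(int n) {
        for (int b = 2; (long) b * b <= n; b++) {
            long p = (long) b * b;
            while (p < n) p *= b;
            if (p == n) return true;
        }
        return n == 1;
    }
}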
Another random idea: have the LLM explain existing code in natural language, then generate new code based only on the description. Run the old and new code side-by-side, and see how they differ, functionally and non-functionally.
I’ve only run converge on toy examples and snippets so far. I’m sure there are major challenges in applying it to larger code bases. Here’s what it currently does to Achilles numbers in Rosetta Code:
» converge -input AchillesNumbers.java
time=2025-02-11T09:35:44.877+01:00 level=INFO msg="test suite created"
time=2025-02-11T09:35:45.271+01:00 level=WARN msg="tests failed" attempt=1
time=2025-02-11T09:35:53.308+01:00 level=INFO msg="test suite modified"
time=2025-02-11T09:35:53.709+01:00 level=WARN msg="tests failed" attempt=2
time=2025-02-11T09:36:02.457+01:00 level=INFO msg="test suite modified"
time=2025-02-11T09:36:02.885+01:00 level=INFO msg="tests passed"
time=2025-02-11T09:36:09.508+01:00 level=INFO msg="increasing benchmark iterations" duration=0.038339998573064804 attempt=0
time=2025-02-11T09:36:15.056+01:00 level=INFO msg="increasing benchmark iterations" duration=0.2295980006456375 attempt=1
time=2025-02-11T09:36:21.225+01:00 level=INFO msg="increasing benchmark iterations" duration=0.5998150110244751 attempt=2
time=2025-02-11T09:36:28.010+01:00 level=INFO msg="benchmark run"
time=2025-02-11T09:36:35.737+01:00 level=INFO msg="code optimized" attempt=0
time=2025-02-11T09:38:19.094+01:00 level=INFO msg="tests failed"
...
time=2025-02-11T09:38:33.798+01:00 level=INFO msg="code optimized" attempt=9
optimization succeeded: 1.870953s -> 1.337585s (-28.51%)
That took about three minutes. It does all sorts of tricks to make the code faster, but one that caught my eye was the conversion from HashMap<Integer, Boolean> to HashSet<Integer>, and finally to BitSet.
Here’s part of the diff:
...
public class AchillesNumbers {
- private Map<Integer, Boolean> pps = new HashMap<>();
+ private final BitSet pps = new BitSet();
+ private final BitSet achillesCache = new BitSet();
+ private static final byte[] SMALL_PRIMES = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47};
+ private static final int[] POWERS_OF_TEN = {1, 10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000};
+ private static final short[] SQUARES = new short[317];
+ private static final int[] CUBES = new int[47];
+ private static final short[] TOTIENTS = new short[1000];
... }
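Why the rewrite wins: the cached values are dense, small, non-negative integers, so a BitSet stores one bit per value in a packed long[] with no boxing and no hashing. Here’s a sketch of the idea, reading pps as a perfect-power cache (my interpretation, not converge’s actual output):

import java.util.BitSet;

class PerfectPowers {
    private final BitSet pps = new BitSet();

    PerfectPowers(int limit) {
        // Precompute every b^e <= limit with e >= 2, once, up front.
        for (long b = 2; b * b <= limit; b++) {
            for (long p = b * b; p <= limit; p *= b) {
                pps.set((int) p);
            }
        }
    }

    boolean isPerfectPower(int n) {
        return pps.get(n); // a single bit test
    }
}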
I’m surprised to find myself as excited as I am now. I did not see it coming! Just in the last few days, I’ve realized how much I need to unlearn in order to make better use of what these models have learned.
I was implementing the control flow for the converge tool, using Claude to generate various bits of code. Then I realized: hey, maybe it should be the other way around? Claude plans the control flow, and my tool just provides the ways of interacting with the environment (modifying source files, running tests, etc.). It’s not a revelation, but an example of how one might need to think differently.
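Concretely, the inversion means converge would only declare environment actions as tools and then execute whatever tool_use blocks Claude sends back, instead of hard-coding the sequence itself. A sketch of such a declaration for the Messages API; the tool names and schemas are made up:

public class ConvergeTools {
    // Passed as the "tools" parameter of an Anthropic Messages API call.
    // Claude decides when to call which tool; the host just executes.
    static final String TOOLS_JSON = """
            [{"name": "modify_file",
              "description": "Overwrite a source file with new contents",
              "input_schema": {
                "type": "object",
                "properties": {"path": {"type": "string"},
                               "contents": {"type": "string"}},
                "required": ["path", "contents"]}},
             {"name": "run_tests",
              "description": "Run the test suite and report the results",
              "input_schema": {"type": "object", "properties": {}}}]
            """;
}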
Even more down to earth: saying “do X for me” rather than asking “how do I do X?”. Instead of asking for some chunk of code, I tell it to solve the problem for me. Of course, I still review the changes. Cursor and Cody have both been great at changing how I think.
What other habits and thought patterns might need to change? I don’t know how programming will look in the future, but I’m actively working on keeping an open mind and hopefully playing a small role in shaping it.
Comment on Hacker News or X.