AI is forcing us to write good code

(bits.logic.inc)

90 points | by sgk284 6 hours ago

19 comments

AuthAuth 1 hour ago
>Statement about how AI is actually really good and we should rely on it more. Doesn't cover any downsides.
>CEO of an AI company
Many such cases
[-]
- heliumtera 30 minutes ago
  the fantastic machine will be 10^23x more productive than all of us combined, they will give it all away for 20 dollars a month and these people will be left without anything to sell. then, they will leave. so technically AI will force the world to heal, actually he is correct.
pgroves 1 hour ago
This is sort of why I think software development might be the only real application of LLMs outside of entertainment. We can build ourselves tight little feedback loops that other domains can't. I somewhat frequently agree on a plan with an LLM and a few minutes or hours later find out it doesn't work and then the LLM is like "that's why we shouldn't have done it like that!". Imagine building a house from scratch and finding out that it was using some American websites to spec out your electric system and not noticing the problem until you're installing your Canadian dishwasher.
[-]
- toxic72 1 hour ago
  It's more like you're installing the dishwasher and the dishwasher itself yells at you "I told you so" ;)
  [-]
  - lvspiff 1 hour ago
    I think of it as you say "install dishwasher" and its plan looks like it has all the steps, but as it builds it out you somehow end up hiring a maid and buying a drying rack.
danieka 1 hour ago
I thought that the article would be about if we want AI to be effective, we should write good code.
What I notice is that Claude stumbles more on code that is illogical, unclear or has bad variable names. For example if a variable is named "iteration_count" but actually contains a sum that will "fool" AI.
So keeping the code tidy gives the AI clearer hints on what's going on which gives better results. But I guess that's equally true for humans.
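To make the naming example concrete, here is a minimal Python sketch. The function names and data are hypothetical, made up purely for illustration: the first version accumulates a sum under a name that promises a counter, the second does the same work under an honest name.
```python
def total_misleading(values):
    # Misleading: the name promises a loop counter, but it accumulates a sum.
    iteration_count = 0
    for v in values:
        iteration_count += v
    return iteration_count


def total_clear(values):
    # Same behavior, but the name matches what the variable holds.
    running_total = 0
    for v in values:
        running_total += v
    return running_total
```
Both functions behave identically; only the second lets a reader (human or model) trust the name without tracing the loop.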
[-]
- CharlieDigital 38 minutes ago
  What I find works really well: scaffold the method signature and write your intent in the comment for the inputs, outputs, and any mutations/business logic + instructions on approach.
  The LLM has a very high chance of one-shotting this and doing it well.
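  A rough sketch of what such a scaffold might look like in Python. Everything here (`LineItem`, `order_total_cents`, the discount rule) is a hypothetical example, not taken from any real codebase: the signature and docstring are what the human writes, and the body is the part one would hand to the LLM.
```python
from dataclasses import dataclass


@dataclass
class LineItem:
    sku: str
    unit_price_cents: int
    quantity: int


def order_total_cents(items: list[LineItem], discount_percent: int = 0) -> int:
    """Return the order total in cents.

    Inputs: line items, plus an integer discount between 0 and 100.
    Output: the discounted total, rounded down to a whole cent.
    Mutations: none; `items` must not be modified.
    Approach: sum price * quantity, then apply the discount with integer math.
    """
    # The body below is what one would ask the LLM to fill in from the docstring.
    if not 0 <= discount_percent <= 100:
        raise ValueError("discount_percent must be between 0 and 100")
    subtotal = sum(item.unit_price_cents * item.quantity for item in items)
    return subtotal * (100 - discount_percent) // 100
```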
  [-]
  - Philip-J-Fry 5 minutes ago
    This is what I tend to do. I still feel like my expertise in architecting the software and abstractions is like 10x better than I've seen an LLM do. I'll ask it to do X, and then ask it to do Y, and then ask it to do Z, and it'll give you the most junior looking code ever. No real thought on abstractions, maybe you'll just get the logic split into different functions if you're lucky. But no big picture thinking, even if I prompt it well it'll then create bad abstractions that expose too much information.
    So eventually it gets to the point where I'm basically explaining to it what interfaces to abstract, what should be an implementation detail and what can be exposed to the wider system, what the method signatures should look like, etc.
    So I had a better experience when I just wrote the code myself at a very high level. I know what the big picture look of the software will be. What types I need, what interfaces I need, what different implementations of something I need. So I'll create them as stubs. The types will have no fields, the functions will have no body, and they'll just have simple comments explaining what they should do. Then I ask the LLM to write the implementation of the types and functions.
    And to be fair, this is the approach I have taken for a very long time now. But when a new more powerful model is released, I will try and get it to solve these types of day to day problems from just prompts alone and it still isn't there yet.
    It's one of the biggest issues with LLM first software development from what I've seen. LLMs will happily just build upon bad foundations and getting them to "think" about refactoring the code to add a new feature takes a lot of prompting effort that most people just don't have. So they will stack change upon change upon change and sure, it works. But the code becomes absolutely unmaintainable. LLM purists will argue that the code is fine because it's only going to be read by an LLM but I'm not convinced. Bad code definitely confuses the LLMs more.
- sleepy_keita 1 hour ago
  Humans can work with these cases better though because they have access to better memory. Next time you see "iteration_count", you'll know that it actually has a sum, while a new AI session will have to re-discover it from scratch. I think this will only get better as time goes on, though.
  [-]
  - charcircuit 26 minutes ago
    You are underestimating how lazy humans can be. Humans are going to skim code, scroll down into the middle of some function and assume iteration count means iteration count. AI on the other hand will have the full definition of the function in its context every time.
  - rsyring 57 minutes ago
    Or you immediately rename it to avoid the need to remember? :)
mkozlows 2 hours ago
I like this. "Best practices" are always contingent on the particular constellation of technology out there; with tools that make it super-easy to write code, I can absolutely see 100% coverage paying off in a way that doesn't for human-written code -- it maximizes what LLMs are good at (cranking out code) while giving them easy targets to aim for with little judgement.
(A thing I think is under-explored is how much LLMs change where the value of tests is. Back in the artisan hand-crafted code days, unit tests were mostly useful as scaffolding: Almost all the value I got from them was during the writing of the code. If I'd deleted the unit tests before merging, I'd've gotten 90% of the value out of them. Whereas now, the AI doesn't necessarily need unit tests as scaffolding as much as I do, _but_ having them put in there makes future agentic interactions safer, because they act as reified context.)
[-]
- Waterluvian 2 hours ago
  It might depend on the lifecycle of your code.
  The tests I have for systems that keep evolving while being production critical over a decade are invaluable. I cannot imagine touching a thing without the tests. Many of which reference a ticket they prove remains fixed: a sometimes painfully learned lesson.
  [-]
  - zmgsabst 56 minutes ago
    Also the lifecycle of your system, eg, I’ve maintained projects that we no longer actively coded, but we used the tests to ensure that OS security updates, etc didn’t break things.
altmanaltman 2 hours ago
Wouldn't a better title be "How we're forcing AI to write good code (because it's normally not that good in general, which is crazy, given how many resources it's sucking, that we need to add an extra layer on top of it and use it to get anything decent)"
[-]
- Aerolfos 2 hours ago
  Don't forget "we're obligated to try and sell it so here's an ai generated article to fill up our quota because nobody here wanted to actually sit down and write it"
  [-]
  - sgk284 1 hour ago
    FWIW all of the content on our eng blog is good ol' cage-free grass-fed human-written content.
    (If the analogy, in the first paragraph, of a Roomba dragging poop around the house didn't convince you)
- add-sub-mul-div 2 hours ago
  Then it wouldn't be effective advertising/vanity blogging from some self-promoting startup.
cube00 21 minutes ago
I can't reconcile how the CEO of an AI startup is, on one hand, pushing "100% Percent [sic] Code Coverage" while also selling the idea of "Less than 60 seconds to production" on their product (which is linked in the first screen-full of the blog post so it's not like these are personal thoughts).
If 100% code coverage is a good thing, you can't tell me anyone (including parallel AI bots) is going to do this correctly and completely for a given use case in 60 seconds.
I don't mind it being fast, but to sell it as 60 second fast while trying to give the appearance you support high quality and correct code isn't possible.
brynary 2 hours ago
Strong agreement with everything in this post.
At Qlty, we are going so far as to rewrite hundreds of thousands of lines of code to ensure full test coverage, end-to-end type checking (including database-generated types).
I’ll add a few more:
1. Zero thrown errors. These effectively disable the type checker and act as goto statements. We use neverthrow for Rust-like Result types in TypeScript.
2. Fast auto-formatting and linting. An AI code review is not a substitute for a deterministic result in sub-100ms to guarantee consistency. The auto-formatter is set up as a post-tool use Claude hook.
3. Side-effect free imports and construction. You should be able to load all the code files and construct an instance of every class in your app without a network connection spawning. This is harder than it sounds and without it you run into all sorts of trouble with the rest.
4. Zero mocks and shared global state. By mocks, I mean mocking frameworks which override functions on existing types or globals. These effectively are injecting lies into the type checker.
Shout out to tsgo, which has dramatically lowered our type checking latency. As the tok/sec of models keeps going up, all the time is going to get bottlenecked on tool calls (read: type checking and tests).
With this approach we now have near 100% coverage with a test suite that runs in under 1,000ms.
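For readers who have not seen the "errors as values" style from item 1, here is a minimal sketch of the idea in Python. It is not neverthrow's API and not the TypeScript code described above; `Ok`, `Err`, `Result`, and `parse_port` are hypothetical names chosen for illustration.
```python
from dataclasses import dataclass
from typing import Generic, TypeVar, Union

T = TypeVar("T")
E = TypeVar("E")


@dataclass(frozen=True)
class Ok(Generic[T]):
    value: T


@dataclass(frozen=True)
class Err(Generic[E]):
    error: E


# A Result is either a success value or an error value; nothing is raised.
Result = Union[Ok[T], Err[E]]


def parse_port(text: str) -> Result[int, str]:
    """Parse a TCP port, returning an Err value instead of raising."""
    if not (text.isascii() and text.isdigit()):
        return Err(f"not a number: {text!r}")
    port = int(text)
    if not 1 <= port <= 65535:
        return Err(f"out of range: {port}")
    return Ok(port)
```
Because the failure case is part of the return type, a type checker can flag callers that ignore it, which is the property the comment is after.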
[-]
- frio 46 minutes ago
  A TypeScript test suite that offers 100% coverage of "hundreds of thousands" of lines of code in under 1 second doesn't pass the sniff test.
  [-]
  - brynary 34 minutes ago
    We're at 100k LOC between the tests and code so far, running in about 500-600ms. We have a few CPU intensive tests (e.g. cryptography) which I recently moved over to the integration test suite.
    With no contention for shared resources and no async/IO, it's just function calls running on Bun (JavaScriptCore) which measures function calling latency in nanoseconds. I haven't measured this myself, but the internet seems to suggest JavaScriptCore function calls can run in 2 to 5 nanoseconds.
    On a computer with 10 cores, fully concurrent, that would imply 10 billion nanoseconds of CPU time in one wall clock second. At 5 nanoseconds per function call, that would imply a theoretical maximum of 2 billion function calls per second.
    Real world is not going to be anywhere close to that performance, but where is the time going otherwise?
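    The back-of-envelope numbers above can be checked with a few lines of Python. The constants are the assumptions stated in the comment (10 cores, 5 ns per call), not measurements.
```python
CORES = 10                      # assumed core count from the comment
NS_PER_SECOND = 1_000_000_000   # nanoseconds in one wall-clock second
NS_PER_CALL = 5                 # upper end of the quoted 2-5 ns estimate

# Total CPU time available across all cores in one wall-clock second.
cpu_ns_per_wall_second = CORES * NS_PER_SECOND

# Theoretical ceiling on function calls if every nanosecond were spent calling.
max_calls_per_second = cpu_ns_per_wall_second // NS_PER_CALL

print(f"{cpu_ns_per_wall_second:,} ns of CPU time per wall-clock second")
print(f"{max_calls_per_second:,} function calls per second (theoretical ceiling)")
```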
- ManuelKiessling 2 hours ago
  I‘m on the same page as you, I‘m investing into DX and test coverage and quality tooling like crazy.
  But the weird thing is: those things have always been important to me.
  And it has always been a good idea to invest in those, for my team and me.
  Why am I doing this 200% now?
  [-]
  - monatron 1 hour ago
    If you're like me you're doing it to establish a greater level of trust in generated code. It feels easier to draw out the hard guard-rails and have something fill out the middle -- giving both you, and the models, a reference point or contract as to what's "correct"
  - ManuelKiessling 1 hour ago
    Answering myself: maybe I feel much more urgency and motivation for this in the age of AI because the effects can be felt so much more acutely and immediately.
  - mkozlows 1 hour ago
    Because a) the benefits are bigger, and b) the effort is smaller. When something gets cheaper and more valuable, do more of it.
  - 0x696C6961 1 hour ago
    For me it's because coworkers are pumping out horrible slop faster than ever before.
jillesvangurp 1 hour ago
This goes in the right direction. It could go further though. Types are indeed nice. So, why use a language where using those is optional? There are many reasons but many of those have to do with people and their needs/wants rather than tool requirements. AI agents benefit from good tool feedback, so maybe switch to languages and frameworks that provide plenty of that and quickly. Switching used to be expensive. Because you had to do a lot of the work manually. That's no longer true. We can make LLMs do all of the tedious stuff.
Including using more rigidly typed languages, making sure things are covered with tests, using code analysis tools to spot anti patterns and addressing all the warnings, etc. That was always a good idea but we now have even fewer excuses to skip all that.
sandblast2 1 hour ago
The expertise in software engineering typical in these promptfondling companies shines through this blog post.
Surely they know 100% code coverage is not a magical bullet because the code flow and the behavior can differ depending on the input. Just because you found a few examples which happen to hit every line of code you didn't hit every possible combination. You are living in a fool's paradise which is not a surprise because only fools believe in LLMs. You are looking for a formal proof of the codebase which of course no one does because the costs would be astronomical (and LLMs are useless for it which is not at all unique because they are useless for everything software related but they are particularly unusable for this).
[-]
- SR2Z 46 minutes ago
  It's a bold claim that LLMs are useless for formal verification when people have been hooking them up to proof assistants for a while. I think that it's probably not a terrible idea; the LLM might make some mistakes in the spec but 99% of the time there are a lot of irrelevant details that it will do a serviceable job with.
badgersnake 3 hours ago
I’m increasingly finding that the type of engineer that blogs is not the type of engineer anyone should listen to.
[-]
- yoyohello13 2 hours ago
  The value of the blog post is negatively correlated to how good the site looks. Mailing list? Sponsors? Fancy Title? Garbage. Raw HTML dumped on a .xyz domain, Gold!
  [-]
  - userbinator 1 hour ago
    on a .xyz domain
    That's a negative correlation signal for me (as are all the other weird TLDs that I have not seen besides SEO spam results and perhaps the occasional HN submission.) On the other hand, .com, .net, and .org are a positive signal.
  - llmslave2 2 hours ago
    The exception is a front end dev, since that's their bread and butter.
- sgk284 3 hours ago
  Can you say more? I see a lot of teams struggling with getting AI to work for them. A lot of folks expect it to be a little more magical and "free" than it actually is. So this post is just me sharing what works well for us on a very seasoned eng team.
  [-]
  - imron 2 hours ago
    As someone who struggles to realise productivity gains with AI (see recent comment history) I appreciate the article.
    100% coverage for AI generated code is a very different value proposition than 100% coverage for human generated code (for the reasons outlined in the article).
  - justatdotin 1 hour ago
    it is MUCH easier for solo devs to get agents to work for them than it is for teams to get agents to work for them.
- cube00 7 minutes ago
  Even some of the comments here can't help name dropping their own startups for no actual reason.
- throwatdem12311 1 hour ago
  It's just veiled marketing for their company.
- observationist 3 hours ago
  Badgersnake's corollary to Gell-Mann amnesia?
- iamjs 3 hours ago
  I find that this idea of restricting degrees of freedom is absolutely critical to being productive with agents at scale. Please enlighten us as to why you think this is nonsense
  [-]
  - mrkeen 3 hours ago
    Wearing seatbelts is critical for drunk-driving.
    All praise drunk-driving for increased seatbelt use.
    [-]
    - llmslave2 2 hours ago
      Finally something I can get behind.
jaredcwhite 2 hours ago
I'm sad programmers lacking a lot of experience will read this and think it's a solid run-down of good ideas.
[-]
- SoKamil 1 hour ago
  I’m more afraid that some manager will read this and impose rules on their team. On the surface one might think that having more test coverage is universally good and won’t consider trade offs. I have a gut feeling that Goodhart’s Law accelerated with AI is a dangerous mix.
- manmal 2 hours ago
  What’s bad about them? We make things baby-safe and easy to grasp and discover for LLMs. Understandability and modularity will improve.
- zem 1 hour ago
  "fast, ephemeral, concurrent dev environments" seems like a superb idea to me. I wish more projects would do it, it lowers the barrier to contributions immensely.
  [-]
  - frio 40 minutes ago
    Yeah, this is something I'd like more of outside of Agentic environments; in particular for working in parallel on multiple topics when there are long-running tasks to deal with (eg. running slow tests or a bisect against a checked out branch -- leaving that in worktree 1 while writing new code in worktree 2).
    I use devenv.sh to give me quick setup of individual environments, but I'm spending a bit of my break trying to extend that (and its processes) to easily run inside containers that I can attach Zed/VSCode remoting to.
    It strikes me that (as the article points out) this would also be useful for using Agents a bit more safely, but as a regular old human it'd also be useful.
- baobun 2 hours ago
  Could you be more specific in your feedback please.
  [-]
  - jaredcwhite 2 hours ago
    100% test coverage, for most projects of modest size, is extremely bad advice.
    [-]
    - CuriouslyC 1 hour ago
      Pre-agents, 100% agree. Now, it's not a bad idea, the cost to do it isn't terrible, though there's diminishing returns as you get >90-95%.
      [-]
      - marcosdumay 34 minutes ago
        LLMs don't make bad tests any less harmful. Nor do they write good tests for the stuff people mostly can't write good tests for.
      - pca006132 40 minutes ago
        The problem is that it is natural to have code that is unreachable. Maybe you are trying to defend against potential cases that may be there in the future (e.g., things that are not yet implemented), or algorithms written in a general way but only used in a specific way. 100% test coverage requires removing these, and can hurt future development.
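        One possible middle ground, sketched here in Python under the assumption that coverage.py is the measuring tool: its default exclusion marker `# pragma: no cover` keeps a defensive branch in the code while leaving it out of the coverage report. `Shape` and `corner_count` are hypothetical names for illustration.
```python
from enum import Enum


class Shape(Enum):
    CIRCLE = "circle"
    SQUARE = "square"


def corner_count(shape: Shape) -> int:
    if shape is Shape.CIRCLE:
        return 0
    if shape is Shape.SQUARE:
        return 4
    # Defensive guard for members added later; unreachable today, so it is
    # excluded from the coverage report instead of being deleted.
    raise AssertionError(f"unhandled shape: {shape}")  # pragma: no cover
```
        Whether exclusions like this still count as "100% coverage" is a policy question for the team, not something the tool decides.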
    - bdangubic 2 hours ago
      laziness? unprofessionalism? both? or something else?
      [-]
      - spc476 27 minutes ago
        You forgot difficult. How do you test a system call failure? How do you test a system call failure when the first N calls need to pass? Be careful how you answer, some answers technically fall into the "undefined behavior" category (if you are using C or C++).
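        In a language with runtime patching, the "first N calls pass, then fail" case can be sketched as below. This is Python rather than C or C++, so it sidesteps rather than answers the undefined-behavior concern; `write_all` is a hypothetical function, while `unittest.mock.patch` and its `side_effect` list are standard library features.
```python
import errno
import os
from unittest import mock


def write_all(fd: int, chunks: list[bytes]) -> int:
    """Write every chunk to fd; return total bytes written, or -1 on OSError."""
    total = 0
    for chunk in chunks:
        try:
            total += os.write(fd, chunk)
        except OSError:
            return -1
    return total


def test_third_write_fails() -> None:
    # The first two os.write calls "succeed"; the third raises ENOSPC.
    outcomes = [3, 3, OSError(errno.ENOSPC, "No space left on device")]
    with mock.patch("os.write", side_effect=outcomes) as fake_write:
        assert write_all(99, [b"abc", b"def", b"ghi"]) == -1
        assert fake_write.call_count == 3
```
        The file descriptor 99 is never touched because `os.write` is replaced for the duration of the test.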
      - rvz 2 hours ago
        all of the above.
sublinear 10 minutes ago
What? We're already so far down the list of things to try with AI that we're saying hallucinated tests are better than no tests at all?
Seems actively harmful, and the AI hype died out faster than I thought it would.
> Agents will happily be the Roomba that rolls over dog poop and drags it all over your house
There it is, folks!
bgwalter 5 hours ago
https://logic.inc/
"Ship AI features and tools in minutes, not weeks. Give Logic a spec, get a production API—typed, tested, versioned, and ready to deploy."
[-]
- travisgriggs 3 hours ago
  https://en.wikipedia.org/wiki/Drinking_the_Kool-Aid
- bgwalter 1 hour ago
  Someone is downvoting everything again. It seems to be a cronjob, always around the same time.
mrits 3 hours ago
Author should ask AI to write a small app with 100% code coverage that breaks in every path except what is covered in the tests.
[-]
- thih9 3 hours ago
  Example output if anyone else is curious:
```
    def fragile(x):
        lst = [None]
        lst[x - 42]
        return "ok"
    
    def test_fragile():
        assert fragile(42) == "ok"
```
- sgk284 3 hours ago
  I never claim that 100% coverage has anything to do with code breaking. The only claim made is that anything less than 100% does guarantee that some piece of code is not automatically exercised, which we don't allow.
  It's a footnote on the post, but I expand on this with:
```
  100% coverage is actually the minimum bar we set. We encourage writing tests for as many scenarios as is possible, even if it means the same lines getting exercised multiple times. It gets us closer to 100% path coverage as well, though we don’t enforce (or measure) that
```
  [-]
  - nicoburns 9 minutes ago
    > I never claim that 100% coverage has anything to do with code breaking.
    But what I care about is code breaking (or rather, it not breaking). I'd rather put effort into ensuring my test suite does provide a useful benefit in that regard, rather than measure an arbitrary target which is not a good measure of that.
  - reactordev 2 hours ago
    I feel this comment is lost on those who have never achieved it and gave up along the journey.
  - a3w 1 hour ago
    Brakes in cars here in Germany are integrated with less than 50 % coverage in the final model testing that goes to production.
    Seems like even if people could potentially die, industry standards are not really 100% realistic. (Also, redundancy in production is more of a solution than having some failures and recalls, which are solved with money.)
  - xcskier56 2 hours ago
    SimpleCov in ruby has 2 metrics, line coverage and branch coverage. If you really want to be strict, get to 100% branch coverage. This really helps you flesh out all the various scenarios
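    The gap between the two metrics is easy to show with a small sketch. This uses Python rather than Ruby and SimpleCov, and `shipping_cents` is a hypothetical function invented for the example.
```python
def shipping_cents(is_member: bool) -> int:
    fee = 500
    if is_member:
        fee = 0
    return fee


def test_member_ships_free() -> None:
    # Executes every line above, so line coverage reports 100%,
    # but the is_member=False branch (skipping the if body) is never taken.
    assert shipping_cents(is_member=True) == 0


def test_non_member_pays_fee() -> None:
    # Adding this test exercises the remaining branch.
    assert shipping_cents(is_member=False) == 500
```
    With coverage.py, branch measurement is turned on with `coverage run --branch`; with only the first test, line coverage would be complete while the branch report would still show a missed path.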
block_dagger 1 hour ago
I stopped reading at “static typing.” That is not what “good code” always looks like.
[-]
- kurtis_reed 33 minutes ago
  Ok
devhouse 5 hours ago
[dead]
jennyholzer3 3 hours ago
I don't know about all this AI stuff.
How are LLMs going to stay on top of new design concepts, new languages, really anything new?
Can LLMs be trained to operate "fluently" with regards to a genuinely new concept?
I think LLMs are good for writing certain types of "bad code", i.e. if you're learning a new language or trying to quickly create a prototype.
However to me it seems like a security risk to try to write "good code" with an LLM.
[-]
- sgk284 2 hours ago
  I suspect it will still fall on humans (with machine assistance?) to move the field forward and innovate, but in terms of training an LLM on genuinely new concepts, they tend to be pretty nimble on that front (in my experience).
  Especially with the massive context windows modern LLMs have. The core idea that the GPT-3 paper introduced was (summarizing):
```
  A sufficiently large language model can perform new tasks it has never seen using only a few examples provided at inference time, without any gradient updates or fine-tuning.
```
- rabf 2 hours ago
  You do realise they can search the web? They can read documentation and api specs?
- manmal 2 hours ago
  They are retrained every 12-24 months and constantly getting new/updated reinforcement learning layers. New concepts are not the problem. The problem is outdated information in the training data, like only crappy old Postgres syntax in most of the Stackoverflow body.
  [-]
  - Aerolfos 1 hour ago
    > They are retrained every 12-24 months and constantly getting new/updated reinforcement learning layers
    This is true now, but it can't stay true, given the enormous costs of training. Inference is expensive enough as is, the training runs are 100% venture capital "startup" funding and pretty much everyone expects them to go away sooner or later
    Can't plan a business around something that volatile
    [-]
    - 0x696C6961 1 hour ago
      You don't need to retrain the whole thing from scratch every time.