I am in support. In general he was really good with names, I thought; they always had an otherworldly flair while being easy to pronounce. Skaffen-Amtiskaw, Anaplian, Elethiomel...
I'm a huge fan of property-based testing; I've built some runners before, and I think it can be great for UI things too, so I'm very happy to see this coming around more.
Something I couldn't see was how those examples actually work; there are no actions specified. Do they watch a user, default to randomly hitting the keyboard, or neither, so that you need to specify some actions to take?
What about rerunning things?
Is there shrinking?
edit - a suggestion for examples: host a basic UI on a static page that is broken in a way the test can find, like a button that triggers a notification but doesn't actually enforce a limit of 5 notifications.
Hey, yeah the default specification includes a set of action generators that are picked from randomly. If you write a custom spec you can define your own action generators and their weights.
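For readers unfamiliar with the technique, weighted action generators can be sketched like this (hypothetical code, not Bombadil's actual API; the `Action` shape, weights, and generator kinds are made up for illustration):

```typescript
// Sketch only -- not Bombadil's actual API. The idea: each action
// generator gets a weight, and the runner picks a generator with
// probability proportional to its weight.
type Action = { kind: string; payload?: string };

interface WeightedGenerator {
  weight: number;
  generate: (rng: () => number) => Action;
}

const generators: WeightedGenerator[] = [
  { weight: 5, generate: () => ({ kind: "click" }) },
  {
    weight: 3,
    generate: (rng) => ({
      kind: "type",
      payload: String.fromCharCode(97 + Math.floor(rng() * 26)),
    }),
  },
  { weight: 1, generate: () => ({ kind: "navigateBack" }) },
];

function pickAction(gens: WeightedGenerator[], rng: () => number): Action {
  const total = gens.reduce((sum, g) => sum + g.weight, 0);
  let r = rng() * total;
  for (const g of gens) {
    if (r < g.weight) return g.generate(rng);
    r -= g.weight;
  }
  // Floating-point edge case: fall back to the last generator.
  return gens[gens.length - 1].generate(rng);
}
```

A custom spec would then tune the weights (or add app-specific generators) to steer exploration toward the interesting parts of the UI.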
Rerunning things: nothing built for that yet, but I do have some design ideas. Repros are notoriously shaky in testing like this (unless run against a deterministic app, or inside Antithesis), but I think Bombadil should offer best-effort repros if it can at least detect and warn when things diverge.
Shrinking: also nothing there yet. I'm experimenting with a state machine inference model as an aid to shrinking. It connects to the prior point about shaky repros, but I'm cautiously optimistic. Because the speed of browser testing isn't great, shrinking is also hard to do within reasonable time bounds.
For re-running, I assume you want to do this all on a review app with a snapshot of the DB, so you start with a clean app state.
Should be pretty easy to make it deterministic if you follow that precondition.
(How I had my review apps wired up: I dumped the staging DB nightly and containerized it. I believe Neon etc. make it easy to do this kind of thing.)
Ages ago I wired up something much more basic than this for a Python API using Hypothesis, and made the state machine explicit as part of the action generator (with the transitions library). What do you think about modeling state machines in your tests? (I suppose one risk is that you don't want to copy the state machine implementation from inside the app, but a nice fluent builder for simple state machines in tests could be a win.)
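The fluent-builder idea could look something like this (a hypothetical sketch, not any existing library; the states and actions are invented for illustration):

```typescript
// Hypothetical fluent builder for a small test-side state machine.
// The machine constrains which actions are valid to generate from each
// state, without copying the app's real state machine implementation.
class TestStateMachine {
  private transitions = new Map<string, Map<string, string>>();
  constructor(private state: string) {}

  on(from: string, action: string, to: string): this {
    if (!this.transitions.has(from)) this.transitions.set(from, new Map());
    this.transitions.get(from)!.set(action, to);
    return this;
  }

  // Actions an action generator is allowed to pick right now.
  validActions(): string[] {
    return [...(this.transitions.get(this.state)?.keys() ?? [])];
  }

  apply(action: string): void {
    const next = this.transitions.get(this.state)?.get(action);
    if (next === undefined) {
      throw new Error(`invalid action ${action} in state ${this.state}`);
    }
    this.state = next;
  }

  current(): string {
    return this.state;
  }
}

// e.g. a made-up login flow model:
const machine = new TestStateMachine("loggedOut")
  .on("loggedOut", "login", "loggedIn")
  .on("loggedIn", "logout", "loggedOut")
  .on("loggedIn", "openSettings", "settings")
  .on("settings", "close", "loggedIn");
```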
That's true, clean app state gets you far. And that's something I'm going to add to Bombadil once it gets an ability to run many tests (broad exploration, reruns, shrinking), i.e. something in the spec where you can supply reset hooks, maybe just bash commands.
Regarding state machines: yeah, it can often become a mirror that's as complex as the system you're testing, if the system has a large, complicated surface. If, on the other hand, the API is simple and encapsulates a lot of complexity (like Ousterhout's "Deep Modules"), state machine specs and model-based testing make more sense. Testing a key-value store is a great example of this.
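A minimal sketch of what model-based testing of a key-value store looks like (hypothetical code; `ToyStore` stands in for the real system under test, and a plain `Map` serves as the trivially correct model):

```typescript
// Model-based testing sketch: run the same (typically randomly
// generated) operations against the system under test and a trivially
// correct model, and check they agree after every step.
type Op =
  | { kind: "put"; key: string; value: number }
  | { kind: "delete"; key: string };

class ToyStore {
  private data: Record<string, number> = {};
  put(key: string, value: number): void {
    this.data[key] = value;
  }
  remove(key: string): void {
    delete this.data[key];
  }
  get(key: string): number | undefined {
    return this.data[key];
  }
}

function agreesWithModel(ops: Op[]): boolean {
  const sut = new ToyStore();
  const model = new Map<string, number>();
  for (const op of ops) {
    if (op.kind === "put") {
      sut.put(op.key, op.value);
      model.set(op.key, op.value);
    } else {
      sut.remove(op.key);
      model.delete(op.key);
    }
    // Invariant: every key the model knows about reads back identically.
    for (const [key, value] of model) {
      if (sut.get(key) !== value) return false;
    }
  }
  return true;
}
```

The spec stays small because the model carries the correctness argument; the PBT runner only has to generate operation sequences.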
How effective is property based testing in practice? I would assume it has no trouble uncovering things like missing null checks or an inverted condition because you can cover edge cases like null, -1, 0, 1, 2^n - 1 with relatively few test cases and exhaustively test booleans. But beyond that, if I have a handful of integers, dates, or strings, then the state space is just enormous and it seems all but impossible to me that blindly trying random inputs will ever find any interesting input. If I have a condition like (state == "disallowed") or (limit == 4096) when it should have been 4095, what are the odds that a random input will ever pass this condition and test the code behind it?
Microsoft had a remotely similar tool named Pex [1], but instead of randomly generating inputs, it instrumented the code so it could also be executed symbolically, and then used their Z3 theorem prover to systematically find inputs that make each encountered condition either true or false, thereby incrementally exploring all possible execution paths. If I remember correctly, it then generated a unit test for each discovered input with the corresponding output, and you could then judge whether the output was what you expected.
An important step with property-based testing and similar techniques is writing your own generators for your domain objects. I have used it to incredible effect for many years in projects.
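As a concrete illustration of domain-specific generators (and of the `limit == 4096` worry upthread), here is a hedged sketch of a boundary-biased integer generator. Nothing here is taken from a real PBT library, though most libraries do something similar internally:

```typescript
// Sketch of a custom domain generator. Uniform sampling over a large
// range would almost never hit a boundary like 4095/4096, so this
// generator spends half its samples on caller-supplied edge values and
// their immediate neighbors.
function boundaryBiasedInt(
  edges: number[],
  min: number,
  max: number,
  rng: () => number = Math.random
): number {
  if (edges.length > 0 && rng() < 0.5) {
    const base = edges[Math.floor(rng() * edges.length)];
    const offset = [-1, 0, 1][Math.floor(rng() * 3)];
    return base + offset;
  }
  return min + Math.floor(rng() * (max - min + 1));
}

// e.g. for the hypothetical off-by-one at 4096 mentioned above:
const samples = Array.from({ length: 2000 }, () =>
  boundaryBiasedInt([0, 4096, 2 ** 31 - 1], -1_000_000, 1_000_000)
);
```

With this bias, values like 4095 and 4096 each show up with a few percent probability per draw, so a run of a few hundred cases reliably exercises the boundary while the uniform half still explores the rest of the range.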
I work at Antithesis now so you can take that with a grain of salt, but everything changed for me over a decade ago when I started applying PBT techniques broadly and widely. I have found so many bugs that I wouldn't have otherwise found until production.
I've been doing property-based testing for years for frontend stuff. The hardest part is that there is so much between the test inputs and the application under test that 50% of the time I find problems in the frontend test frameworks/libs and not in our code.
And sometimes you find errors in code that absolutely should never have errors: I found an (as of yet not-root-caused) error in sqlite (no crash or coredump, just returns the wrong data, and only when using sqlite in ram-only-mode). Had to move to postgres for that reason alone. This is part of the reason why I have a strong anti-library bias (and I sound like a lunatic to most colleagues because they "have never had a problem" with $favorite_library -- to which my response is: "how do you _know_?"[0], which often makes me sound like I'm being unreasonably difficult).
Sometimes, only thing you can do is let the plague spread, and hope that the people who survive start showering and washing their hands.
[0]: I once interviewed at a company that sold a kind of on-prem VM hosting and storage product. They were shipping a physical machine with Linux and a custom filesystem (so not ZFS), and they bragged about how their filesystem was very fast, much faster than ZFS or Btrfs on SSDs. I asked them if they were allowed to tell me how they achieved such consistent numbers. They listed a few things, one of which was: "we disabled block-level check-summing". I asked: "how do you prevent corruption?". They said: "we only enable check-summing during the nightly tests". So, a little unsettled, I asked: "you do not do _any_ check-summing at any point in production?". They replied: "Exactly. It's not necessary". So, throwing caution to the wind (at this point I did not care to get the job), I asked: "And you've never had data corruption in production?". They said: "Never. None". To which I replied: "But how do you _know_?". My no-longer-future-coworker thought for a few seconds, and realization flashed across his face. This was a company that had actual customers on 2 continents and was pulling in at least millions per year. They were probably silently corrupting customer data while promising that they were the solution -- a hospital selling snake oil, while thinking it really is medicine.
PBT allows us to test more combinations without writing hundreds of tests. Yes, it's about user flow inside a single module of our gigantic application.
I use quicktheories (Java) and generate a deterministic random test scenario, then I generate input values and run the tests. This way I can create tests that should fail or succeed but differ in the steps executed and in their order, with "random" input.
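The seed-derived-scenario idea sketched in TypeScript (hypothetical code; quicktheories itself is Java, and `scenarioFromSeed`, `STEPS`, and the PRNG choice are all my own illustration):

```typescript
// Sketch: derive an entire test scenario from one integer seed, so a
// failing run can be reproduced exactly by re-running with that seed.
// mulberry32 is a tiny deterministic PRNG, used here for reproducibility.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const STEPS = ["create", "update", "delete", "rename"] as const;

function scenarioFromSeed(seed: number, length: number): string[] {
  const rng = mulberry32(seed);
  return Array.from({ length }, () => STEPS[Math.floor(rng() * STEPS.length)]);
}
```

The payoff is the reporting side: a failure report only needs to contain the seed (and scenario length), not the whole action log, to be replayable.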
This looks genuinely awesome! I've been thinking about how to do good property based testing on UIs, and this elegantly solves that problem — I love the language they've designed here. It really feels like model checking or something.
I especially like that it's a single executable according to the docs.
I recently evaluated other testing tools/frameworks, and if you're not already running the npm-dependency-hell shitshow for your projects, most tools will pull in at least 100 dependencies.
I might be old-fashioned, but that's just too much for my taste. I love single-purpose tools with limited scope, like esbuild or now this.
From a project management perspective, the 5 examples don't help me understand how/why I might switch from Playwright/Cypress to this framework. It seems like Bombadil is a much lower-level test framework focusing on DOM properties, but in the "Why Bombadil?" introduction you say "maintaining suites of Playwright or Cypress tests takes a lot of work" ... I'd like it if there were an example showing how this is true, perhaps a 1:1 example of Playwright vs Bombadil for testing something such as notifications clearing when I click clear. Basically, beefing up the examples with real-world ones that Playwright users might have written is a good way to foster adoption.
This is a great point. Bombadil _is_ also tied to the DOM much like those tools, but as you focus on providing just a set of generators (which can be largely the defaults already supplied by Bombadil), you get a lot of testing from a small spec. You might need to specify some parts in terms of DOM selectors and such, and that has coupling, but I think the power-to-weight ratio is a lot better because of PBT.
All I can think of is "the Token Ring had no power over him" but then I realized that "token ring" has a completely different meaning now in the age of AI.
Hey Oskar, great project, and it looks promising. I would be curious to hear what is still work-in-progress for Bombadil.
It's helpful to know what the tool maintainers see as upcoming or incomplete work. It also saves a consultant like me a lot of time to evaluate new tools for clients if I also know the limitations before diving in. Maybe a section in the manual for "What Bombadil can't do".
Good feedback! Short answer: a lot of stuff remains. It's a very new project and I've been trying to cover the basics. There's a ton to do around better state space exploration, reporting/debugging (working on this now!), integration with other tools and platforms like CI, etc. But a living section in the README or the manual for "planned but not yet built" probably makes sense.
Sure, server-side or client-side generated UIs tend to have a lot more interesting complexity to test. But I do want to bring up that with the specification language being TypeScript, you can validate some basic properties even for statically generated sites. I wrote a spell checker for Bombadil that uses https://github.com/wooorm/nspell and a custom dictionary and found a bunch of old typos on my own blog.
I'd say it does! Bombadil is very new but its predecessor has found complicated and very real bugs in my work projects, and in the paper we wrote about it, we found bugs in more than half of the TodoMVC example apps: https://arxiv.org/abs/2203.11532
"Just Another Victim Of The Ambient Morality" is one of my favorites.
So occasionally I got mails from "some colleague on behalf of Sauron" back then.
Jokes aside, great project and documentation (manual)! Getting started was really simple, and I quickly understood what it can and cannot do.
Thanks for the questions and feedback!
If you're curious about it, here's a very detailed spec for TodoMVC in Bombadil: https://github.com/owickstrom/bombadil-playground/blob/maste... It's still work-in-progress but pretty close to the original Quickstrom-flavored spec.
[1] https://www.microsoft.com/en-us/research/publication/pex-whi...
Will give this a try, soon.
Nice name, now who is he?
Great work!
Ok I will see myself out
(Yes I know it's actually from the Tolkien book)