Thanks for the shoutout! Tiny clarification: in the experiment, it wasn't the debugging itself that was tricky. The bug was actually trivial: a Python program tries to access a nonexistent column and spits out an error saying exactly that. The trickiness is that the person or agent doing the debugging needs to appreciate that the task is basically a trick question: you can't fix it without knowing which column the code was meant to refer to. So the only correct thing for a coding agent to do is ask the clarifying question. Changing the code to make the error shut up, with no basis for that being the correct behavior, is a particularly insidious failure.
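For anyone curious what that kind of setup looks like, here's a minimal sketch; the data and column names are made up for illustration, not the exact code from the experiment:

```python
import pandas as pd

# Toy data standing in for the experiment's setup (these column names are hypothetical).
df = pd.DataFrame({"revenue": [100, 250], "cost": [40, 90]})

try:
    # The buggy line: it references a column that doesn't exist.
    print(df["profit_margin"])
except KeyError as err:
    # pandas says exactly what's wrong: KeyError: 'profit_margin'
    print(f"KeyError: {err}")

# The trap: nothing in the code says what "profit_margin" was *meant* to be.
# A "fix" that just silences the error (say, defaulting the value to 0)
# makes the crash go away with no basis for that being the intended behavior.
```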
I'd actually love to understand why exactly my results differed from those in the IEEE article. I did have to make a lot of guesses in replicating their setup. So I'm not sure if there are details of their setup or of mine that made one or the other less realistic. I included everything needed to replicate my results in the appendix of my post. It would be great to see more replication attempts.
Thanks for the clarification! Maybe my wording wasn't the best, but I was glad to see *some* test of the claim; the idea that debugging has gotten worse with newer models seems obviously wrong to me. I'm not just talking from my own experience; it's not like debugging is so different from whatever is being tested by SWE-bench or any of the other benchmarks. The IEEE example is one in which the newer models do something kinda like cheating, which is definitely a problem, but I find it hard to believe that this is a general "models are getting worse" problem, even if maybe you can find the one weird trick where older models avoid a trap that newer models fall into.
Beautifully said. I guess the reason I was motivated to try replicating it was out of worry that it's not just One Weird Trick, as you say, but a symptom of something that could get worse as models get smarter.
(It's like the more general problem, where an AI trained to make people happy will, when it's still dumber than people, maximize its reward by being as helpful and honest as possible. But then, at some point, if it gets smart enough, it maximizes its reward by manipulating you and telling you what you want to hear. Will that backfire on it, you ask? Not if it's so much smarter that it knows it won't be detected!)
But so far it seems to be closer to how you put it: the models are overall getting drastically more capable and more useful as coding assistants. It might even be fair to say that the ambiguous regression with the IEEE example was a fluke. That's my feeling at this point. But it will be worth keeping a close eye on.
Thanks for the mention!
I published a short commentary on the RAND report this morning. It centers on the report's deep dive into what "tacit knowledge" is, its history, etc.
https://blog.stephenturner.us/p/tacit-knowledge-biosecurity-rand
Haha you beat me to it; I actually wrote an essay on Michael Polanyi for a philosophy of science course when I was in college (called "Science as a Social Endeavor") that I'm hoping to dust off at some point and update with the AI angle.
I look forward to it! I'll link to it when you post.
This was my first exposure to Polanyi and the long history of tacit knowledge.