The biggest issue with vibe coding

Most experienced software engineers will tell you that the majority of their time is not spent writing new code.

It is spent debugging the existing code.

Vibe coding is an amazing way to create new code effectively, but AI-based systems still struggle with debugging.

So the real vibe would be vibe debugging, not vibe coding!

I wanted to understand a little more about the challenges of self-debugging systems.

But instead of highly polished, perfect code (it is never like that), I tried to play with something closer to a real-life example of software development:

  1. Code works but is not perfect – yes, in real life, as systems grow and new code is added over time, there is always some technical debt.
  2. There are existing tests that you can run to validate the code.
  3. Software engineers analyze the test results, debug the code, and fix the issues.

I will try to automate step 3 and see what challenges the AI encounters (and what effective strategies I find to resolve them).

(stick with me—even if you're not a software engineer—because I'll share interesting insights that will help you craft better AI prompts, even for non-coding tasks!)

The simple self-debugging sandbox I created looked like this:


[Diagram: the self-debugging sandbox flow]

So let's play with the 'Debug and Create New Version of Code' box.
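To make the loop concrete, here is a minimal sketch of the test-running side of that sandbox, assuming a simple input/expected-output test format. The helper name run_tests and the test-case structure are my own placeholders, not the actual code from the experiment; the important part is that each failure records the inputs, the expected result, and what the function actually returned, because that is exactly what gets pasted into the debugging prompts below.

# Hypothetical sketch of the sandbox's test runner (not the actual project code).
# Each test case holds the input arguments and the expected output; each failure
# records what the function actually returned so it can be fed to the debugger.

def run_tests(func, test_cases):
    failures = []
    for case in test_cases:
        actual = func(*case["args"])
        if actual != case["expected"]:
            failures.append({
                "args": case["args"],
                "expected": case["expected"],
                "actual": actual,
            })
    return failures  # empty list means all tests pass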

Here are the experiments and results

Experiment 1: Just ask a simple prompt

I started with a simple prompt asking the AI to debug the code and update it:

Debug provided code and create updated version of code that will solve issues from failing tests.

here is code:
{code}

here is failed test results of function 'segment_text_pro'
{failed_test}

{code} is the entire code I'm running, and {failed_test} contains information about the input parameters, the expected result, and the actual result of the failing test case.
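For reference, here is roughly how such a single-shot prompt can be filled in and sent. call_llm is a hypothetical stand-in for whatever chat-completion client you use, not a real library call:

# Hypothetical sketch of the single-shot debugging call from experiment 1.

PROMPT_TEMPLATE = """Debug provided code and create updated version of code that will solve issues from failing tests.

here is code:
{code}

here is failed test results of function 'segment_text_pro'
{failed_test}
"""

def debug_once(code, failed_test, call_llm):
    prompt = PROMPT_TEMPLATE.format(code=code, failed_test=failed_test)
    return call_llm(prompt)  # expected to return the updated code as text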

Unfortunately, the updated versions were not even close to fixing the issues. The AI followed a completely wrong path: it claimed it needed to fix code inside a branch of an 'if' condition even though there was no way that path could have been executed.

n = 3

if n > 3:
    # AI: "I want to fix this code!!!!"  (this branch never runs when n = 3)
    ...
else:
    # AI: "the defect is definitely not here"  (but this is the only path that executes)
    ...

I even tried a more advanced model, but it did not help at all.

Experiment 2: Give me some steps

I decided to give the AI some steps it could follow while debugging. Here is the prompt I used:

What return statement actually return value in failing case? 
Follow these steps:
1. return only the line that finishes the code
2. propose code changes that will fix that specific issue  

here is code:
{code}

here are tests results of function 'segment_text_pro'
{tests}        

That did not fix the bug, but I noticed the AI finally started to focus on the right part of the code.

So the next experiment was:

Experiment 3: Here is the solution

I gave the AI the exact location of the code it should focus on:

This element of the code is wrong:
     if n == len(segments['text']):
        return segment_simple_text(input_string, n)        

That was significant progress. Even though the test was still failing, the AI clearly moved in the right direction.

It fixed the right place in the code. Yet the solution was not perfect: it would have required another debugging iteration.

The fact that I literally told the AI where to look is not a scalable solution either, but the insight was important: divide the work into simple steps instead of one big "fix it" task.

And if you think about it for a second, that is exactly what a software engineer would do: start by understanding which path was executed, why that one, and where to look for the root cause.

Experiment 4: Baby steps

Now debugging is divided into two steps:

Step 1: Identify what return statement was executed in failed case. 
Step 2: Knowing that program executed to that path and returned wrong value, focus on that part of code and fix the bug        

The bug was still not fixed, but identification of the executed path worked perfectly.

Experiment 5: From baby steps to walking

Step 1: Identify what return statement was executed in failed case. 
Step 2: Explain why that code was executed for the failing test 
Step 3: Based on that explanation fix the issue for the failing case        

Dividing the flow into three separate steps (each one a separate prompt) significantly improved the outcome.
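Roughly, the chain can be wired up like this; call_llm is again a hypothetical stand-in for the model client, and the exact prompt wording below is paraphrased from the steps above:

# Hypothetical sketch of the 3-step flow from experiment 5.
# Each step is a separate prompt; the previous answer is fed into the next one.

def debug_in_steps(code, failed_test, call_llm):
    step1 = call_llm(
        "Step 1: Identify what return statement was executed in the failed case.\n\n"
        "here is code:\n" + code + "\n\nfailing test:\n" + failed_test
    )
    step2 = call_llm(
        "Step 2: Explain why that code was executed for the failing test.\n\n"
        "here is code:\n" + code + "\n\nfailing test:\n" + failed_test
        + "\n\nreturn statement identified in step 1:\n" + step1
    )
    new_code = call_llm(
        "Step 3: Based on that explanation fix the issue for the failing case. "
        "Return the full updated code.\n\n"
        "here is code:\n" + code + "\n\nexplanation from step 2:\n" + step2
    )
    return new_code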

That structure achieved results similar to experiment 3, but unlike experiment 3 it got there automatically.

However, the results were not always right: in some cases it fixed the bug, in others the solution was wrong.

Experiment 6: Rethink what you just did

Knowing that the flow from experiment 5 gave quite good results, I wanted to see what would happen if it iterated, repeating the same process until the issue was fixed.

  • If the new code fixes the bug: we are done!
  • If the new code causes a regression (the pass rate went down): revert to the previous version.
  • If the new code does not break anything but the bug is still not fixed: run the tests with the new code, debug, and create a new fix.


[Diagram: the iterative debug, test, and revert loop]
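As a rough sketch, that loop could look like the code below. It reuses the hypothetical run_tests and debug_in_steps helpers from the earlier sketches, and build_func is an assumed helper that turns a code string back into a callable segment_text_pro (for example by exec-ing it in a fresh namespace); none of this is the actual experiment code.

# Hypothetical sketch of the iteration loop from experiment 6.

def self_debug(code, test_cases, build_func, call_llm, max_iterations=3):
    best_code = code
    best_failures = run_tests(build_func(best_code), test_cases)
    for _ in range(max_iterations):
        if not best_failures:
            break  # the bug is fixed, we are done
        new_code = debug_in_steps(best_code, str(best_failures[0]), call_llm)
        new_failures = run_tests(build_func(new_code), test_cases)
        if len(new_failures) > len(best_failures):
            continue  # regression: discard the new version and try again
        best_code, best_failures = new_code, new_failures  # keep the progress, iterate
    return best_code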


The results were amazing: after no more than three iterations, it was able to fix the issue correctly.

I mentioned before that the initial fix exposed another issue, but the AI diagnosed it in the next iteration and fixed it as well.

The tests passed, but there was one significant issue with the code. I will get back to that later, because there is one more experiment I tried:

Experiment 7: Let's make it harder now

Knowing that this specific test case was now passing, I created a new one. It was really tricky (yes, after spending so many years testing software, I'm always looking for a new corner case).

So far the AI had only had to debug the main function. For this new test case, it had to figure out that one of the subfunctions was returning an incorrect value and move its debugging efforts to a different function. On top of that, even though the fix was very simple, it required connecting two facts from different places in the code.

It failed miserably.

It kept focusing on the main function and either produced the same code changes over and over or simply tried to rewrite the entire code.

It was clear that it needed a hint to also consider how the other functions behave.

Here is the updated flow:

Step 1: Identify what return statement was executed in failed case. 
Step 2: Explain why that code was executed for the failing test 
           Things to consider while debugging:
           - what if functions used for preparing data are not working correctly?
Step 3: Based on that explanation proposed 3 ideas how code can be updated to produce correct result for this failing case:
Step 4: select one of the ideas for code fixes and fix the issue        

Two things changed here:

  • A hint to consider other functions: this immediately extended the analysis to all functions in the code.
  • Splitting step 3 into two steps, idea generation and actual fixing: without the ideation phase, it looped over and over around the same wrong solutions. The ideation step helped it move forward with different ideas (see the sketch after this list).
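Here is what the ideation split (steps 3 and 4) can look like as two separate prompts, again with the hypothetical call_llm stand-in rather than the real experiment code:

# Hypothetical sketch of the ideation and selection steps from experiment 7.

def fix_with_ideation(code, explanation, call_llm):
    ideas = call_llm(
        "Based on that explanation propose 3 ideas how the code can be updated "
        "to produce the correct result for this failing case.\n\n"
        "here is code:\n" + code + "\n\nexplanation:\n" + explanation
    )
    return call_llm(
        "Select one of the ideas for code fixes and fix the issue. "
        "Return the full updated code.\n\n"
        "here is code:\n" + code + "\n\nideas:\n" + ideas
    )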

These changes led to a correct identification of the issue. The AI was able to identify and explain where the issue was in the code, but it was not able to produce a correct fix.

I decided to stop at this point and summarize what I've learned so far—it's already a lot, don't you think?

So here they are.

Insights

  1. A multistep approach is much more effective than a single step. Even an advanced model was not able to beat a simple model running a multistep process. This is in line with the current trend of agentic AI, where bigger tasks are divided into smaller ones.
  2. There's a major issue with the generated code (the one I promised to revisit). While the code itself was correct, it used Python's built-in functions instead of the custom functions specifically designed for that purpose. Although the output remained the same, the new code was harder to read. The whole point of those custom functions was to keep the main function cleaner and more readable, but now the same logic was duplicated in two places (see the small illustration after this list).
  3. Instead of fixing just the failing element of the code, the AI rewrote other functions that should not have been changed. Yes, the improvements were correct, but I would rather have isolated, controlled changes than a redesign of the entire code.
  4. The AI was good at drawing conclusions about the code and explaining the flow. Dividing the work into steps helps with that reasoning.
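To make insight 2 concrete, here is a purely hypothetical illustration (not the actual project code) of the difference between reusing a custom helper and inlining the same logic with built-ins:

# Hypothetical illustration of insight 2 (not the actual project code).

def clean_segments(segments):
    # Custom helper: strip whitespace and drop empty segments.
    return [s.strip() for s in segments if s.strip()]

def split_text_original(text):
    # Original style: the main function stays short and readable.
    return clean_segments(text.split("."))

def split_text_generated(text):
    # AI-generated style: same output, but the helper's logic is now
    # duplicated inline with built-ins instead of being reused.
    return [s.strip() for s in text.split(".") if s.strip()]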

Limitations

  1. Relatively small code base (approximately 200 lines of code and 5 functions in total). I was able to fit the whole thing into the context window for each prompt; a bigger code base will be more challenging.
  2. No advanced reasoning. Each flow just followed predefined steps; I still need to try letting the AI plan the debugging steps by itself.
  3. This was exploratory testing rather than solid proof that these steps will work at a larger scale. Consider them hints for how to build an AI self-debugger rather than a 100% proven process.


Do you want to play with this self-debugging AI?

Reply 'yes' in the comments.

I can put it online and give you private access so you can try the experimental code.








