Tool Calling: Magic or Gimmick?

Beyond the AI agent hype, tool calling is a fragile API chain that costs more than you realize.

May 13, 2026 · Andrew · 5 min read

I remember staring blankly at my API cost management dashboard last week, wondering how a tiny agent script managed to burn through $43.50 in a single afternoon. As it turns out, giving AI the freedom to “press buttons” isn’t cheap—and it’s certainly not as magical as social media hype would have you believe.

What is Tool Calling, really?

Practically speaking, tool calling isn’t about the AI suddenly growing hands to type or directly querying your database. It is simply a contract between you and the language model: you describe the functions you have, and the model replies with a structured request to call one of them.

I used to think that once a tool was configured, the AI would directly execute tasks like sending emails or reading files. But after three months of real-world use with GPT-5.2, I learned that things are much more primitive: the AI simply returns a block of JSON-formatted text. Your code receives that JSON and parses it, and it is your server that actually executes the corresponding function.

If that JSON is malformed, or if the AI hallucinates a non-existent parameter, any code that trusts it without validation will crash.
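To make that flow concrete, here is a minimal sketch of the receiving side. The `send_email` tool, the `TOOLS` registry, and the `{"name": ..., "arguments": ...}` shape are assumptions for illustration; real providers wrap the call in their own response format.

```python
import json

def send_email(to: str, subject: str) -> str:
    # Hypothetical tool; a real one would call a mail service.
    return f"queued email to {to}: {subject}"

# Registry mapping the tool names the model may emit to real functions.
TOOLS = {"send_email": send_email}

def dispatch(raw_model_output: str) -> str:
    """Parse the model's JSON text and run the matching function ourselves."""
    try:
        call = json.loads(raw_model_output)
    except json.JSONDecodeError as exc:
        return f"error: malformed JSON ({exc.msg})"
    if not isinstance(call, dict):
        return "error: expected a JSON object"
    name = call.get("name")
    func = TOOLS.get(name) if isinstance(name, str) else None
    if func is None:
        return f"error: unknown tool {name!r}"
    try:
        return func(**call.get("arguments", {}))
    except TypeError as exc:
        # Raised when the model invents or omits a parameter.
        return f"error: bad arguments ({exc})"
```

The model never touches `send_email`. It only produces the string passed to `dispatch`, and every failure path returns an error message instead of raising.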

The Exorbitant Price of Convenience

Context Window Bloat

Every tool you give the AI comes with a description (schema) that has to be sent along with the prompt on every request. If you have 10 tools, your prompt suddenly carries thousands of extra tokens on every single chat turn, whether or not any tool is actually used.

Back in March, I tested an agent designed to crawl stock data. When I added four auxiliary tools, the input token count skyrocketed. Instead of responding in 3 seconds, the system began taking 14.8 seconds for even the simplest request.
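A rough way to see the overhead before the bill arrives is to estimate how many tokens the schemas alone add per conversation. This is only a sketch: the 4-characters-per-token ratio is a crude assumption, `get_stock_quote` is a made-up tool, and real tokenizers and provider formats will give different numbers.

```python
import json

def estimate_tokens(text: str) -> int:
    # Crude heuristic (assumption): roughly 4 characters per token.
    return max(1, len(text) // 4)

def schema_overhead(tools: list[dict], turns: int) -> int:
    """Estimated input tokens spent on tool schemas over a conversation."""
    per_turn = sum(estimate_tokens(json.dumps(t)) for t in tools)
    return per_turn * turns

# Hypothetical tool definition, used only to have something to measure.
example_tool = {
    "name": "get_stock_quote",
    "description": "Return the latest price for a ticker symbol.",
    "parameters": {
        "type": "object",
        "properties": {"ticker": {"type": "string"}},
        "required": ["ticker"],
    },
}

one = schema_overhead([example_tool], turns=20)
five = schema_overhead([example_tool] * 5, turns=20)
print(one, five)  # overhead grows linearly with tools and with turns
```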

Architectural Risks

Entrusting control flow to a language model is a risky move. You are replacing if/else statements that run in 1 millisecond with a network call that takes 3 seconds, costs money, and fails unpredictably. This is the kind of clunky design I mentioned in System Architecture Thinking: Don’t Fall into the Trap. Overusing AI for fixed logic only adds junk to your system.
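One way to keep fixed logic out of the model is to route known inputs with ordinary code and only fall back to the LLM for everything else. The sketch below assumes two made-up commands (`/price`, `/help`) and a caller-supplied `ask_model` function; none of these names come from a real API.

```python
from typing import Callable, Optional

def route_fixed(message: str) -> Optional[str]:
    """Plain if/else routing for inputs with a known, rigid shape."""
    text = message.strip().lower()
    if text.startswith("/price "):
        return "get_stock_quote"
    if text == "/help":
        return "show_help"
    return None  # not a fixed command

def route(message: str, ask_model: Callable[[str], str]) -> str:
    """Use hardcoded logic when possible; call the model only for the rest."""
    tool = route_fixed(message)
    if tool is not None:
        return tool
    return ask_model(message)  # slow, billed, and may be wrong
```

With this split, the model is only paid for on the messy inputs where its flexibility actually helps.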

Comparison Table

Criteria	Hardcoded Logic (Standard Code)	Tool Calling (AI Agent)	Notes
Latency	< 0.1 seconds	2 - 15 seconds	LLM latency depends on network and provider load
Cost	Nearly $0	A few cents to a few dollars / run	Schema tokens are billed every call
Reliability	100% (if code is correct)	~85-95%	AI can hallucinate parameters
Flexibility	Low (only understands rigid input)	Extremely high	Handles messy natural language well

Practical Implementation Without Breaking the Bank

1. Group and Minimize Schemas

Don’t throw a tool with 15 optional parameters at the AI. Split it into focused tools, and better yet, merge chains of small steps into a single call. I once optimized a personal agent flow by forcing it to combine text processing steps, reducing 28 cumbersome API calls to just 3 network requests. Costs dropped by 90%.
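As an illustration of aggregating small steps, the sketch below exposes one `clean_text` tool instead of three. The individual steps are hypothetical stand-ins, not the actual steps from my agent; the point is that the chaining happens in local code, so the model makes one call instead of one per step.

```python
import re

def strip_urls(text: str) -> str:
    return re.sub(r"https?://\S+", "", text)

def normalize_whitespace(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()

def truncate(text: str, limit: int) -> str:
    return text[:limit]

def clean_text(text: str, limit: int = 200) -> str:
    """The single tool exposed to the model; the steps are chained locally."""
    return truncate(normalize_whitespace(strip_urls(text)), limit)
```

Only `clean_text` needs a schema, which also trims the tokens sent on every turn.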

2. Always Have Fallback Logic

Don’t place absolute trust in the JSON returned by Claude Sonnet 4.6 or GPT-5. No matter how good the model is, you must write careful try/catch code. If the AI calls a tool with a malformed email address, your code should block it immediately and ask the AI to fix the error, rather than blindly calling a third-party API.
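Here is a minimal sketch of that guard for an email tool. The parameter names (`to`, `subject`), the deliberately loose regex, and the shape of the returned dict are all assumptions; the idea is that a rejected call produces a message you can hand back to the model as the tool result, and the third-party API is never reached.

```python
import re

# Loose sanity check (assumption), not a full RFC-compliant validator.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
ALLOWED_PARAMS = {"to", "subject"}

def validate_send_email_args(args: dict) -> list[str]:
    """Return a list of problems; an empty list means the call may proceed."""
    errors = [f"unexpected parameter: {key}" for key in args if key not in ALLOWED_PARAMS]
    to = args.get("to")
    if not isinstance(to, str) or not EMAIL_RE.match(to):
        errors.append("'to' must be a valid email address")
    return errors

def handle_tool_call(args: dict, send) -> dict:
    """Validate first; only call the real sender when the arguments are clean."""
    errors = validate_send_email_args(args)
    if errors:
        # Feed this back to the model as the tool result so it can retry.
        return {"status": "rejected", "errors": errors}
    send(args["to"], args.get("subject", ""))
    return {"status": "sent"}
```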

3. Test with Smaller Models First

You don’t need massive models to test function-calling logic. You can set up a simulation environment right on your local machine. Re-read the post Ollama: Don’t Rush to Quit GPT-5.2 to Run Local LLMs to learn how to offload simple routing tasks to your own computer for free.
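Before wiring up any model at all, the routing code itself can be exercised with a stub. In the sketch below, `stub_model` returns canned JSON in place of a small local model, and the tool names are invented; swapping in a real local model would mean replacing only that one function.

```python
import json
from typing import Callable

def run_turn(user_message: str, model: Callable[[str], str], tools: dict) -> str:
    """One agent turn: ask the model, then execute the tool it names."""
    call = json.loads(model(user_message))
    return tools[call["name"]](**call["arguments"])

def stub_model(user_message: str) -> str:
    # Canned replies standing in for a small local model (assumption).
    if "weather" in user_message.lower():
        return json.dumps({"name": "get_weather", "arguments": {"city": "Hanoi"}})
    return json.dumps({"name": "echo", "arguments": {"text": user_message}})

# Hypothetical tools; real ones would do actual work.
TOOLS = {
    "get_weather": lambda city: f"weather for {city}",
    "echo": lambda text: text,
}
```

Because `run_turn` takes the model as a parameter, the same function can be tested for free with the stub and later pointed at a paid model.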

Frequently Asked Questions

Can AI run code by itself?

No. AI only predicts and outputs text. The execution environment (like your Python server or Node.js) is what actually runs the code based on the AI’s instructions.

Which model currently handles tool calling the best?

Claude Opus 4.6 is currently doing a great job at following complex JSON schemas with fewer hallucinations than GPT-5.2. However, Gemini 3.1 Pro’s speed is superior if you only need to call simple tools with fewer branches.

How do I prevent the AI from calling the wrong tool?

Write very clear descriptions for your tools. Don’t name a function process_data. Name it extract_user_email_from_text. The more specific the function name and description, the less likely the AI is to get confused.
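To show the difference, here are the two names side by side as tool definitions, plus a tiny lint check. The dict layout follows the common JSON Schema style, but exact field names vary by provider, and `lint_tool` with its word list and threshold is my own illustrative heuristic, not a standard.

```python
vague_tool = {
    "name": "process_data",
    "description": "Processes data.",
    "parameters": {"type": "object", "properties": {"data": {"type": "string"}}},
}

specific_tool = {
    "name": "extract_user_email_from_text",
    "description": (
        "Find the first email address in free-form text and return it, "
        "or an empty string if none is present."
    ),
    "parameters": {
        "type": "object",
        "properties": {"text": {"type": "string", "description": "Raw user message."}},
        "required": ["text"],
    },
}

GENERIC_WORDS = {"process", "data", "handle", "do", "run", "stuff"}

def lint_tool(tool: dict, min_description_words: int = 8) -> list[str]:
    """Flag names and descriptions too generic to guide the model."""
    warnings = []
    if all(part in GENERIC_WORDS for part in tool["name"].split("_")):
        warnings.append("name uses only generic words")
    if len(tool.get("description", "").split()) < min_description_words:
        warnings.append("description is too short")
    return warnings
```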

Conclusion

Tool calling is a fantastic bridge between messy natural language and the world of structured data. But it is by no means a silver bullet. It’s like hiring a very smart intern who occasionally gets sleepy: they can handle difficult tasks, but you absolutely should not hand them the keys to the vault without double-checking every single invoice.
