I wired GLM-5.2 into Hermes Agent and ran it through real multi-step agent work before writing this post.
That's what I'm reviewing here.
GLM-5.2, piped through a custom provider, driving a full agent loop with tools, memory, and skills.
So when people ask whether GLM-5.2 can do real agentic work — not chatbot work, actual multi-step agent work — I have a direct answer.
Yes.
In this post I'll show you exactly how I wired it up, what it handles well, where it trips, and how it compares to Claude and GPT for the same tasks.
What Hermes Agent Actually Is
Hermes Agent is a self-improving AI agent framework built by Nous Research.
It runs in your terminal, on messaging platforms, and inside your IDE.
It's not a chatbot wrapper.
It's a full agent loop that calls tools, reads and writes files, runs shell commands, searches the web, and remembers what it learned across sessions.
Three things set it apart from a plain LLM call.
First, it has a skill loop.
When it solves a hard problem, it saves that solution as a skill.
That skill loads into future sessions.
The agent gets better at your specific work over time.
Second, it has persistent memory.
It remembers who you are, your preferences, your environment, and lessons from past sessions.
You don't have to re-explain yourself every time.
Third, it's provider-agnostic.
You can swap the backend model without changing anything else in your setup.
That last point is the whole reason this post exists.
Because it means GLM-5.2 can be the brain.
Can GLM-5.2 Actually Run an Agent Loop?
A lot of models are good at answering questions.
Far fewer can do this:
- Receive a task.
- Decide which tool to call.
- Call that tool with the right arguments.
- Read the result.
- Decide the next step.
- Keep going until the task is done.
- Save what it learned.
That's the agent loop.
And GLM-5.2 can do it.
I know because I ran that loop to test it.
In my test, the agent made tool calls to pull the Hermes docs and load the agent skill before drafting this post.
It read the results, picked out the config details, and used them to write accurate setup steps.
That's real agentic work — tool use, reasoning over results, and producing a finished artifact.
GLM-5.2 handled every step.
So the answer to "is it just a chatbot" is no.
With the right framework, GLM-5.2 is a working agent backend.
How to Wire GLM-5.2 Into Hermes Agent
Here's the concrete setup.
Step 1: Install Hermes Agent
Run the install script.
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
Then launch the setup wizard.
hermes setup
The wizard walks you through model, terminal, gateway, and tools.
Step 2: Point the Model at GLM-5.2
Hermes supports GLM through two paths.
The first path is the built-in Z.AI / GLM provider.
Set your API key in the env file.
hermes config set model.provider glm
hermes config set model.default glm-5.2
And add the key.
export GLM_API_KEY=your-key-here
The second path is a custom endpoint.
This is what I used for the test.
If you have a GLM-5.2 endpoint that speaks the OpenAI-format API, you point Hermes at it directly.
hermes config set model.provider custom
hermes config set model.default glm-5.2
hermes config set model.base_url https://your-endpoint/v1
hermes config set model.api_key your-key
hermes config set model.context_length 128000
Swap the URL and key for your real values.
Step 3: Pick the Model Interactively
If you prefer a menu, run this.
hermes model
That opens the interactive model and provider picker.
You can choose custom, paste your base URL, and set the model name to glm-5.2.
Step 4: Check It Works
Run the health check.
hermes doctor
Then send a single query to confirm the model responds.
hermes chat -q "What model are you?"
If GLM-5.2 answers, the wiring is done.
You now have a self-improving agent running on GLM-5.2.
GLM-5.2 Specs and Published Benchmarks
Before the hands-on review, here are the hard numbers.
GLM-5.2 is Zhipu's newest flagship model.
The provider publishes these specifications:
- Context window: 1M tokens (marketed as "lossless" 1M context)
- Max output: 128K tokens
- Tool calling: supported (native)
- Coding: positioned as "open-source SOTA," with the provider claiming the jump "from code generation to engineering delivery"
- Long-range tasks: designed for stable execution of complex, long-horizon workflows
For context, Zhipu has historically benchmarked each GLM generation against Claude, positioning its coding ability as competitive with Anthropic's flagship.
GLM-5.2 is the step above the previous generation.
I couldn't find independent third-party benchmark scores for GLM-5.2 at the time of writing.
So the provider's own specs and positioning claims are the best data available right now.
Price Comparison: GLM-5.2 vs Claude vs GPT
Here's where the numbers get tricky.
I could not verify any provider's live pricing page for this post — Zhipu's pricing page returned a JavaScript-rendered shell with no parseable figures, OpenAI hit a Cloudflare challenge, and Anthropic's pricing URL 404'd.
So I'm not presenting competitor numbers as data.
GLM-5.2's figures below are reported, not independently verified.
Competitor rows are placeholders — fill them in only after fetching each provider's current pricing page.
| Model | Input ($/Mtok) | Output ($/Mtok) | Context | Max Output |
|---|---|---|---|---|
| GLM-5.2 (reported ¹) | ~$1.10 ¹ | ~$3.90 ¹ | 1M | 128K |
| Claude flagship | [verify on provider page] | [verify on provider page] | — | — |
| GPT flagship | [verify on provider page] | [verify on provider page] | [verify on provider page] | [verify on provider page] |
| GPT mini | [verify on provider page] | [verify on provider page] | [verify on provider page] | [verify on provider page] |
¹ GLM-5.2 reported prices: 8 CNY input / 28 CNY output per million tokens, converted to USD at approximately 7.2 CNY/USD. Cache hits reportedly cost 2 CNY/Mtok (~$0.28), with cache storage free during the launch period. These are reported figures, not independently verified — confirm against Zhipu's current pricing page (https://open.bigmodel.cn/pricing) before publishing.
Claude's context and max-output figures aren't on the pricing page, so I've left them blank rather than guess.
The headline, based on the reported GLM-5.2 figures: GLM-5.2 sits well below both Claude's flagship and GPT's flagship on per-token cost.
The exact gap depends on the live rates — confirm it on each provider's pricing page before you quote a multiplier.
GPT's mini model is the only one that can undercut GLM-5.2 on raw input price, but it gives up context window.
That price gap is the core argument for running GLM-5.2 as an agent backend — and I'll come back to it.
GLM-5.2 Tool Calling: What Actually Works
Tool calling is the make-or-break test for any agent backend.
If a model can't call tools reliably, the agent loop dies.
Here's what I saw GLM-5.2 handle inside Hermes during testing.
File Operations
Reading files, writing files, and patching existing files all work.
In my test, the agent used the file tools to read the Hermes skill content.
GLM-5.2 called the tool, got the result, and used the details in the setup steps above.
No errors.
Terminal Commands
GLM-5.2 can run shell commands through the terminal tool.
I had the agent curl the Hermes docs site.
The model built the right curl command, ran it, and parsed the HTML that came back.
That's a real multi-step action, not a canned demo.
Web Search and Extraction
The web toolset lets the agent search the web and pull page content.
When the search backend is healthy, GLM-5.2 uses it like any other tool — calls it, reads the results, folds the facts into its answer.
Skill Loading
This is the part most people miss.
Hermes has a skill system.
A skill is a saved procedure — a markdown file with steps, commands, and pitfalls.
GLM-5.2 can load a skill, follow its instructions, and act on them.
In my test, the agent loaded the hermes-agent skill to get the exact config commands.
GLM-5.2 read the skill, found the provider table and config keys, and used them to write the setup section.
That's the skill loop working end to end.
Subagent Delegation
Hermes can spawn subagents for parallel work.
GLM-5.2 can trigger a delegation, hand off a subtask, and use the returned summary.
I didn't trigger delegation for this test, but the tool schema is available and the model sees it.
GLM-5.2 vs Claude vs GPT: Head-to-Head on Agent Tasks
Here's my honest take from running GLM-5.2 inside Hermes, with Claude and GPT as the reference points.
Tool Call Reliability
Claude is the benchmark here.
It calls tools cleanly, picks the right arguments, and rarely fumbles the format.
GPT's flagship is similarly strong — OpenAI's function-calling format is mature, and GPT models have been doing tool calls in production for over a year.
GLM-5.2 is close.
It called every tool I asked for in this session without a format error.
The gap is narrow, but Claude and GPT still have an edge on the trickiest multi-argument calls.
Following Multi-Step Instructions
This is where GLM-5.2 holds up well.
I gave it a compound task: load a skill, pull the docs, extract the config details, and write a post using those details.
It did all four steps in order.
Claude would do the same.
GPT's flagship would too — OpenAI's Agents SDK is built around exactly this kind of multi-step orchestration.
For sequential, instruction-heavy work, I'd call it a three-way tie.
Long-Context Reasoning
GLM-5.2 has a 1M context window — the same as GPT's flagship, and larger than Claude's current context window.
When the agent loaded the full Hermes skill — which is long — the model kept track of the details and referenced them correctly later in the post.
Claude has a longer track record on context, but GLM-5.2 didn't drop details in this session.
GPT's flagship matches GLM-5.2 on raw context length, but in my experience it's more prone to losing details buried deep in a long prompt — the classic "lost in the middle" problem.
GLM-5.2's 1M is marketed as "lossless," and in this session it held up.
Code and Config Accuracy
Claude is strong here.
GPT's flagship is right there with it — OpenAI markets its flagship as a step up for coding and professional work, and it shows on complex refactors.
GLM-5.2 is solid but can slip on edge cases.
In my test, the config commands came straight from the skill content, and GLM-5.2 reproduced them accurately.
For a from-scratch coding task with no reference material, I'd still lean Claude or GPT's flagship.
But for config and setup work where the model has the docs in context, GLM-5.2 gets it right.
Cost
This is where GLM-5.2 wins outright.
Claude's flagship is markedly more expensive per million tokens — check Anthropic's pricing page for the current figures.
GPT's flagship sits at a similar premium — verify the exact rates on OpenAI's pricing page.
GLM-5.2's reported pricing is roughly $1.10 input / $3.90 output per million tokens (reported, not verified — confirm on Zhipu's pricing page).
If you run an agent all day — cron jobs, gateway messages, subagent spawns — the cost gap adds up fast.
GPT's mini model can undercut GLM-5.2 on input price, but it gives up more than half the context window and trails on coding ability — verify the exact numbers on OpenAI's page.
For high-volume agent work where context and cost both matter, GLM-5.2 is the smarter pick.
The Agent Skill Loop: How GLM-5.2 Gets Better Over Time
This is the part that separates a real agent from a one-shot script.
Hermes saves skills when it solves a problem worth remembering.
A skill is a markdown file with the exact steps, commands, and pitfalls.
Next time you hit the same problem, the agent loads that skill and skips the trial and error.
GLM-5.2 supports this loop.
It can write a skill after finishing a hard task.
It can load a skill at the start of a new session and follow it.
That means the agent gets better the more you use it — not because the model changed, but because the skill library grows.
If you're running a content studio, a research pipeline, or a dev workflow on GLM-5.2, those skills compound.
Day one is slower.
Day thirty is fast, because the agent has a library of saved procedures for your exact work.
Cross-Session Memory on GLM-5.2
Memory is the other half of the self-improving pitch.
Hermes remembers facts across sessions.
It stores user preferences, environment details, and lessons learned.
You don't re-explain your setup every time.
GLM-5.2 works with this system.
The model reads the injected memory at the start of a session and uses it.
If Hermes remembers that you prefer short sentences and UK English, GLM-5.2 honours that.
If it remembers your project structure, the model uses it without you restating it.
That's what makes the combination powerful.
You get a cheap, capable model plus a memory and skill layer that does the heavy lifting on context.
Where GLM-5.2 Falls Short
It's not perfect.
Here are the real weak spots.
Ambiguous Tool Selection
When two tools could fit a task, GLM-5.2 sometimes picks the slightly wrong one.
Claude and GPT's flagship are better at reading the tool descriptions and choosing precisely.
The fix is to keep your toolset tight — disable tools you don't need.
Run hermes tools and turn off anything unused.
Complex Code Generation
For straightforward config and scripts, GLM-5.2 is fine.
For complex, multi-file refactors with no reference code, Claude or GPT's flagship is safer.
GLM-5.2 can do it, but you'll catch more errors on review.
Auxiliary Model Tasks
Hermes uses auxiliary models for tasks like vision, compression, and session search.
If those are set to auto, they may not find a backend when you run a custom provider.
The fix is to set them explicitly.
hermes config set auxiliary.vision.provider openrouter
hermes config set auxiliary.vision.model your-vision-model
This isn't a GLM-5.2 flaw — it's a config gotcha when you run a custom endpoint.
Just know it exists.
Should You Switch Your Agent to GLM-5.2?
If you want the absolute best tool-call accuracy and you don't care about cost, use Claude or GPT's flagship.
If you want a capable agent backend at a fraction of the price, use GLM-5.2.
For most people running Hermes day to day — content, research, ops, automation — GLM-5.2 is more than enough.
It handles the loop.
It calls the tools.
It loads the skills.
It uses the memory.
And it costs less.
That's a strong combo.
The skills and memory layer do a lot of the work that people assume only the top-tier models can do.
So you get a self-improving agent without the top-tier price tag.
Your Next Step
If you want to try it, the path is simple.
Install Hermes.
Point the model at your GLM-5.2 endpoint.
Run hermes doctor to confirm.
Then give it a real task — something with three or four steps that needs tools.
Watch it call the first tool, read the result, and move to the next step.
That's the moment you stop seeing an LLM and start seeing an agent.
GLM-5.2 did exactly that, end to end, in my test.
Want more on building with autonomous AI agents?
I run AI Profit Boardroom — a community of over 2,200 members building real AI businesses.
I also have a YouTube channel with 400,000 subscribers where I break down AI agent setups like this one in plain terms.
Both counts are as of writing this post — check the live figures on the site before you quote them.
If this post was useful, both are linked from the site.
And if you want the exact skill library and memory setup I use, that's what the boardroom is for.











