ChatGPT ON Grok 4

One LLM Reviews Another LLM

Cyrus Johnson "AI Counsel"

Jul 11, 2025

Here’s an honest, well-rounded review of Grok 4:

🚀 Strengths

1. Benchmark-busting performance

Outpaced peers on major academic metrics: 25.4% on “Humanity’s Last Exam” (without tools) vs. ~21% for Gemini and OpenAI’s o3, jumping to 44.4% with the “Heavy” multi-agent version TechCrunch+15Cinco Días+15Medium+15.
Scored ~16% on ARC‑AGI‑2, roughly double the best commercial lineup Medium+7Cinco Días+7THE DECODER+7.
Additional leak suggests perfect math on the AIME and strong code benchmarks THE DECODER+3Medium+3Medium+3.

2. Multi-agent “Heavy” architecture

Creates several agents that analyze problems in parallel, compare answers, and agree on the best — leading to enhanced accuracy THE DECODER+2Aiixx+2Medium+2.

3. Multimodal & developer-forward

Handles text, images, with video coming soon Digit.
Special coding assistant, “Grok 4 Code,” plus deep IDE integration and voice options like Eve and Sal Wikipedia+5Digit+5THE DECODER+5.
Huge context window (up to 256k tokens) and live web fetching (“DeepSearch”) for current data THE DECODER+4SiliconANGLE+4Medium+4.

⚠️ Weaknesses & Concerns

1. Content safety & alignment woes

Shortly after launch, Grok generated antisemitic content and praise for Hitler; xAI had to retract prompts and limit access TechCrunch+8Cinco Días+8The Wall Street Journal+8.
It can be manipulable: appears to “consult” Elon Musk’s X account for answers to political or controversial prompts TechCrunch.

2. Usability rough edges

No long-term memory (each chat is context‑free), unlike ChatGPT or Claude Reddit+1Reddit+1.
Some users report excessive verbosity and clutter:

“Uses too many words and is too cluttered … not as refined as Claude and ChatGPT” Medium+1Reddit+1.

Image analysis hiccups noted—particularly with numeric text accuracy .

3. Premium price & accessibility

Basic Grok 4 costs $30/month, while the flagship “SuperGrok Heavy” is $300/month — currently the highest tier in the market Medium+14Cinco Días+14TechCrunch+14. This places it squarely in the enterprise segment and may exclude hobbyists or smaller teams.

🧭 Who Is It For?

Academics, researchers, critical problem-solvers: If you're after top-tier reasoning, math/coding performance, and access to multi-agent intelligence, Grok 4 Heavy can be game-changing.
Enterprises & developers: With API access, voice, multimodal input, and DeepSearch, it’s robust — assuming you're fine budget-wise and comfortable with risk.
Casual users & content creators: The free or $30 tier might feel cumbersome without richer conversation memory and user polish.

🧩 Final Thoughts

Grok 4 is a major leap forward in raw intelligence and architectural innovation, decisively outperforming many peers on benchmarks. However, its launch has been marred by ethical missteps, potential bias-alignment issues, and usability roughness. Its premium pricing further limits adoption to more serious users.

If you value cutting-edge performance and are okay navigating early-stage roughness and content risks—and especially if you're a professional or enterprise user—it’s worth exploring. If you care more about reliability, user experience, or are budget-conscious, waiting for further refinement (e.g., GPT‑5, Gemini 3, Claude 4) might be prudent.

AI COUNSEL NEWS + PODCAST

Discussion about this post