This review is based on documented features, verified pricing, and community sentiment — not hands-on testing. See how we research →
AIToolGrade uses Claude (Anthropic) for content production. Claude Opus 4.7/4.8 is a direct competitor to MiniMax M3. We have applied our standard research methodology — documented features, verified pricing, community sentiment — and have not received compensation from MiniMax. Where M3's vendor-reported benchmarks compare it favorably to Claude, we have flagged that those figures are not yet independently verified.
RELATED REVIEWS
DeepSeek V4 Review 2026 — Frontier Coding Benchmarks at 1/30th the Cost → Kimi Code Review 2026 — Open-Source Claude Code Alternative → Claude Code Review 2026 — Anthropic's Agentic Coding Tool →MiniMax M3 is an open-weight frontier language model launched June 1, 2026 by MiniMax, a Shanghai-based AI lab. The pitch is unusually specific for a model release: M3 is positioned as the first open-weight model to do three things simultaneously — code at a frontier level, hold a million tokens of context, and accept native multimodal input across text, image, and video. Plenty of models do one or two of those. M3 claims all three in a single open-weight system, and that combination is what makes it worth a serious look rather than a passing one.
The headline coding number is 59.0% on SWE-Bench Pro, the hardest of the real-world coding benchmarks — narrowly ahead of GPT-5.5's 58.6%. Carry the obligatory caveat with that figure everywhere it appears: it's vendor-reported, measured under MiniMax's own evaluation setup, and not yet independently confirmed. The architecture underneath is the genuinely new part. MiniMax Sparse Attention (MSA) brings back sparse attention the company had deliberately stripped out of the previous M2 generation, this time with a lightweight index branch that scans tokens and decides which key-value blocks actually need attention. The result MiniMax reports is 15.6x faster decoding and 9.7x faster prefill at 1M context versus M2 — a direct attack on the cost of running long-context inference.
Then there's the price. At $0.60/M input tokens (with a launch-week promo at $0.30/M), M3 costs roughly 12x less than Claude Opus 4.7 at what MiniMax claims is comparable benchmark performance. The open weights — under an MIT-style license — are expected around June 10-11, 2026 — verify current availability on HuggingFace before planning self-hosted deployments. Once they land, full self-hosting is on the table for anyone with the infrastructure. If the benchmarks hold up to independent scrutiny, that is one of the strongest cost-performance propositions in the open-weight category. The conditional in that sentence is the whole review.
M3 is an API-first frontier model, and the fit is sharpest for developers who actually need its specific combination rather than any one piece. The clearest match is multimodal agentic work. If your workload involves reading screenshots, parsing diagrams, processing short video, and then writing or fixing code off that input, M3 is unusual: it does all of that natively in a single open-weight model, with no separate vision model bolted on. That combination — multimodal input plus frontier coding plus a million-token window — doesn't have a direct open-weight equivalent right now.
It also fits high-volume agentic pipelines where token cost is the binding constraint. At $0.60/M input against Claude Opus 4.7's $15, the per-call math changes what's economically sane to run — batch refactors, long-running agent loops, large-corpus analysis. And for teams that need full data sovereignty, the open weights (once they ship) make self-hosting a real option: every token stays inside your own environment, API cost drops to infrastructure-only. Researchers evaluating the MSA sparse-attention architecture are another natural audience — the technical contribution here is the kind of thing worth studying directly. So are teams already running DeepSeek or Kimi who want to compare their open-weight options head to head.
The misfits are worth naming just as plainly. Enterprise teams with strict US or EU data-residency requirements face the same Shanghai-hosting concerns covered below — the same gate that applies to DeepSeek V4 and Kimi Code. Anyone whose production decision depends on independently verified benchmarks should wait; at the time of this review, the numbers are MiniMax's own. Developers who want native IDE integration won't find it — M3 is an API and a set of weights, not a VS Code extension. And if you needed the weights at launch, they weren't there: the HuggingFace release was expected around June 10-11, 2026 — verify current availability on HuggingFace before planning self-hosted deployments. If you want a polished, turnkey agentic coding tool today, Claude Code or Cursor are the more practical starting points.
The shape is lopsided in an instructive way. Features and Value for Money both sit at 9.5 — the feature set (59% SWE-Bench Pro, 1M context, MSA, native multimodal, desktop operation) is among the most complete in the open-weight category, and at $0.60/M input with a self-host option, the price-performance is hard to argue with on paper. Ease of Use (7.0) and Integration (7.5) sit in the mid-7s: the OpenAI-compatible API and OpenRouter access ease onboarding, but it's API-first with no IDE extension, and self-hosting demands real infrastructure. Support & Documentation drags hardest at 5.5 — this is a ten-day-old model, the docs are still developing, support runs through a Chinese-company structure, and the community is only beginning to form. The 7.9 overall deliberately holds the line: the feature and value scores would push higher, but unverified benchmarks, data-residency exposure, and new-model maturity cap it. We expect the number to move once independent verification arrives — likely upward, in 60 to 90 days.
Every number in this section carries the same asterisk: the scores are vendor-reported, measured under each company's own evaluation setup, and may not be directly comparable. Independent verification is pending for M3 specifically. That isn't a throwaway disclaimer — it's the single most important fact about M3 at the time of this review. A 59% SWE-Bench Pro result that edges GPT-5.5 is a meaningful claim if it survives third-party testing, and a marketing line if it doesn't. We're reporting the claim and the uncertainty together because that's the honest state of the evidence.
| Benchmark | MiniMax M3 | Kimi K2.6 | DeepSeek V4-Pro | Claude Opus 4.7 |
|---|---|---|---|---|
| SWE-Bench Pro | 59.0% | 58.6% | — | — |
| SWE-Bench Verified | — | 80.2% | 80.6% | 87.6% |
| Terminal-Bench 2.1 | 66.0% | — | — | — |
| BrowseComp | 83.5 | 86.3 (swarms) | — | — |
| Context window | 1M tokens | 1M tokens | 1M tokens | 1M tokens |
| Input price / M | $0.60 | $0.60 | $0.435 | $15.00 |
| Open weight | ✓ | ✓ Apache 2.0 | ✓ MIT | ✗ |
| Multimodal | ✓ Text/Image/Video | ✗ | ✗ | ✓ Text/Image |
Read the table as a profile, not a leaderboard. M3 and Kimi K2.6 report SWE-Bench Pro within half a point of each other, but the benchmarks don't line up cleanly across models — M3 leans on Pro and Terminal-Bench, the others on Verified — so cross-model comparison is approximate at best. The two facts that do hold up regardless of evaluation noise are price and modality: M3 matches Kimi on input cost while undercutting Claude Opus 4.7 by roughly 12x, and it's the only model in this group accepting native video input. Its 66% on Terminal-Bench 2.1 and 83.5 on BrowseComp point to real agentic and web-research capability, but again — vendor-reported, awaiting confirmation. The TechTimes analysis is worth weighing here too: it argues M3 trails Claude Opus 4.8 by meaningful margins on directly comparable agent evaluations, a reminder that "beats GPT-5.5 on SWE-Bench Pro" is one slice of a larger picture.
Pricing is verified June 2026 and splits into three paths: MiniMax's own standard API, OpenRouter access, and self-hosting the open weights. The launch week carried a 50%-off promotion that's worth understanding before you build a cost model on it — the promo rate reverts to standard, and your projections should use the standard figure.
Standard pricing: $0.60/M input, $2.40/M output. Launch promotional rate (first week): $0.30/M input, $1.20/M output — verify current rate at platform.minimax.io before committing to cost projections.
| Access path | Input / M | Output / M | Notes |
|---|---|---|---|
| Standard API (platform.minimax.io) | $0.60 | $2.40 | Standard rate |
| Launch promo (first week) | $0.30 | $1.20 | 50% off — reverts to standard |
| Via OpenRouter | $0.30–$0.60 | $1.20–$2.40 | Promo / standard; routing redundancy |
| Self-hosted (open weights) | Infrastructure only | Weights on HuggingFace ~June 11, 2026 | |
The cost case is the cleanest part of the M3 story. At $0.60/M standard input, it's roughly 12x cheaper than Claude Opus 4.7's $15/M, and it matches Kimi K2.6 on input price while sitting just above DeepSeek V4-Pro's $0.435/M. For high-volume agentic workloads, that order-of-magnitude gap against the closed frontier alternatives is the number that drives adoption. The MiniMax Code subscription product — a coding-workflow tool built on M3 — gives a managed path for teams that don't want to wire up the raw API. And the self-hosting route, once the weights land, drops the per-token fee to zero entirely; you trade API spend for server and infrastructure cost. One planning note: the promotional $0.30/M is a launch lever, not the baseline. Model your spend on $0.60/M input and $2.40/M output, and treat anything cheaper as upside.
MiniMax M3 launched June 1, 2026. It's the first MiniMax model to use the MSA sparse-attention architecture — the M2 generation had removed sparse attention, and M3 brings it back with a lightweight index branch (15.6x faster decoding, 9.7x faster prefill at 1M context vs M2). It's positioned as the first open-weight model to combine frontier coding, a 1M context, and native multimodal input. The open weights are expected around June 10-11, 2026 on HuggingFace. M3 reports 59% SWE-Bench Pro, edging GPT-5.5 — vendor-reported, independent verification pending.
59% SWE-Bench Pro coding. M3 reports 59.0% on SWE-Bench Pro, the most demanding real-world coding benchmark, narrowly ahead of GPT-5.5's 58.6%. SWE-Bench Pro measures resolution of genuinely hard GitHub issues rather than synthetic puzzles, so a frontier-adjacent score there is the headline capability claim. The caveat travels with the number: vendor-reported, independent verification pending.
1M-token context window. A single call can hold an entire codebase, a full documentation set, or a long research corpus. For agents reasoning across large systems, that's the difference between whole-repo context and chunking compromises — and it's a capability M3 shares with Kimi, DeepSeek, and Claude at the top of the category.
MiniMax Sparse Attention (MSA). The architectural centerpiece. A lightweight index branch scans tokens and selects which key-value blocks actually need attention, rather than attending to everything. MiniMax reports 15.6x faster decoding and 9.7x faster prefill at 1M context versus M2. Notably, M2 had removed sparse attention; M3 brings it back in a more refined form. Independent developers have called this a genuine contribution rather than a marketing claim — it targets the real cost of long-context inference.
Native multimodal input. Text, image, and video in a single model, with no separate vision component. In the open-weight category this is currently unique — Kimi and DeepSeek are text-only, and even Claude Opus 4.7's multimodal stops at text and image. For coding-adjacent tasks that involve screenshots, diagrams, or short clips, the inputs aren't limited to text.
Desktop computer operation. M3 can operate a desktop computer for agentic workflows — comparable in scope to Claude Cowork and Google Antigravity. This pushes M3 beyond a pure API model into agentic territory, where it can take actions in a real environment rather than only returning text.
66% Terminal-Bench 2.1. Strong performance on realistic shell tasks — the kind of work an agentic coding tool actually does at the command line. Combined with the coding scores, it points to a model built for agent loops, not just chat completion. Vendor-reported, like the rest.
83.5 BrowseComp (vendor-reported). Autonomous browsing and web-research capability. For agents that need to gather information from the live web as part of a task, this is the relevant signal — though Kimi's swarm configuration reports a higher 86.3 on the same benchmark.
Open weights. Released under an MIT-style license, self-hostable, with weights expected around June 10-11, 2026 — verify current availability on HuggingFace before planning self-hosted deployments. Self-hosting carries no per-token fee and keeps every token inside your environment — the path to full data sovereignty. The catch at review time: the weights hadn't actually shipped yet.
OpenAI-compatible API. A drop-in replacement for most OpenAI API calls. Migrating an existing pipeline — whether it currently targets OpenAI, Anthropic, or anything OpenAI-shaped — is usually a single endpoint change rather than a rewrite.
MiniMax Code. A subscription product built on M3 for coding workflows, giving teams a managed path into the model without wiring up the raw API. It's the productized front door for developers who want the coding capability without the integration work.
Open-weight, OpenAI-compatible, multimodal, and priced from $0.60/M input. Call the standard API, route through OpenRouter, or self-host the weights once they ship.
Visit MiniMax →These are the three open-weight reference points the target audience actually weighs against each other. Kimi K2.6 leads on agent swarms and MCP compatibility; DeepSeek V4-Pro leads on raw cost efficiency; M3's distinguishing move is folding multimodal input and a 1M context into the same open-weight coding model. The table lays out where each one earns its place.
| MiniMax M3 | Kimi K2.6 | DeepSeek V4-Pro | |
|---|---|---|---|
| Primary strength | Multimodal + coding + 1M context | Agent swarms + MCP | Cost efficiency + coding |
| SWE-Bench | 59% Pro (vendor) | 80.2% Verified | 80.6% Verified |
| Multimodal | ✓ Text/Image/Video | ✗ | ✗ |
| Context window | 1M tokens | 1M tokens | 1M tokens |
| Architecture | MSA sparse attention | MoE 1T params | MoE |
| License | Open weight (MIT-style) | Apache 2.0 | MIT |
| Input price / M | $0.60 ($0.30 promo) | $0.60 | $0.435 |
| Chinese company | ✓ | ✓ | ✓ |
| Desktop operation | ✓ | ✗ | ✗ |
| Independent verification | Pending | ✓ | ✓ |
| Best for | Multimodal agentic workflows | Parallel coding agents | High-volume API workloads |
The pattern is clean. If you need multimodal input or desktop operation in an open-weight model, M3 is the only one of the three that offers it — that's its lane, and nothing here competes in it. If you want maximum parallelism and a frictionless migration off Claude Code, Kimi's swarms and MCP compatibility win. If you want the lowest raw token cost for high-volume API workloads, DeepSeek edges it on price. The one row that should give a cautious buyer pause is the last technical line: M3's benchmarks are still pending independent verification, while Kimi's and DeepSeek's have been third-party confirmed. For a production decision, that's not a tiebreaker — it's a reason to evaluate M3 in parallel and commit once the numbers are confirmed.
This is the part of the evaluation that has nothing to do with the benchmark and everything to do with whether you can deploy it. MiniMax is a Shanghai-based lab, and for US and EU organizations that raises the same concern that applies to DeepSeek and Kimi: China's 2017 National Intelligence Law raises data-access concerns for hosted use — the law requires Chinese companies to cooperate with government intelligence requests. For any deployment involving sensitive code, proprietary data, or regulated information, that's a hard gate, not a soft preference — the same framing the TechTimes analysis applied, and the same one we apply here.
These are legitimate questions, not a reason to dismiss the tool — and the distinction matters. A regulated enterprise handling customer data or proprietary source under EU residency rules has a genuine blocker on the hosted API. A solo developer, a startup, or a team working non-sensitive code faces a far lower bar. And the open-weight release changes the calculus for anyone with infrastructure: self-hosting keeps every token inside your own environment, which neutralizes the data-routing concern entirely — at the cost of running the model yourself, and contingent on the weights actually shipping as scheduled around June 11. OpenRouter and other third-party hosts also provide routing paths outside MiniMax's own servers, a middle option worth investigating when the tool fits but the default endpoint doesn't.
The honest framing is the same one we applied to DeepSeek V4 and Kimi Code: treat data residency as a hard gate to clear before the cost savings matter, not as a footnote. If your compliance posture rules out a Chinese-hosted API and you can't self-host, the 12x price advantage is irrelevant — the tool isn't deployable for you. If your posture allows it, or you can route around it via self-hosting or a non-China host, the cost case stands on its own. Apply the same compliance review you would to any Chinese-developed open-weight model.
MiniMax M3 is the clearest evidence yet that open-weight models are pushing into territory the closed frontier used to own outright. MiniMax took a frontier-adjacent coding model (59% SWE-Bench Pro, vendor-reported), wrapped it in a genuinely new sparse-attention architecture that runs 15.6x faster at 1M context, added native text/image/video input and desktop operation, and priced the underlying API at roughly a twelfth of Claude Opus 4.7. On paper, the combination of capabilities per dollar is among the strongest the open-weight category has produced in 2026. The combination is the achievement — no other open-weight model folds multimodal input, a million-token context, and frontier coding into one system.
The reasons to hold back are real and specific, and none of them is about whether the architecture is interesting. The benchmarks are MiniMax's own, measured on MiniMax's setup, and not yet independently verified — and the SWE-Bench Pro lead over GPT-5.5 needs third-party confirmation before anyone bets production on it. The open weights weren't shipped at launch, which means the architecture and safety behavior were unverifiable at review time, pending the ~June 11 HuggingFace release. There's no native IDE extension. It's a ten-day-old model with unproven production reliability. And the data-residency question for a Shanghai-hosted API is a genuine gate for regulated organizations — clear it before the savings mean anything, because if you can't, they don't.
So the recommendation is conditional and specific. Best for: developers who need native multimodal processing in a single open-weight model; teams running high-volume agentic workflows where $0.60/M versus $15/M changes the math; organizations comfortable with self-hosting who want full data sovereignty once the weights ship; researchers evaluating the MSA architecture; and teams already on DeepSeek or Kimi comparing open-weight options. Not for: enterprise teams with strict US/EU data-residency requirements; production deployments that need independently verified benchmarks before commitment; developers who need native IDE integration; or anyone who needed the weights at launch. Evaluate M3 seriously now, wait for independent verification before you commit production to it, and self-host if data sovereignty is non-negotiable. This 7.9 reflects the June 2026 state — unverified benchmarks, data-residency exposure, new-model maturity — and we expect to revisit it, likely upward, in 60 to 90 days as third-party verification arrives. If you want absolute verified capability or a turnkey IDE assistant today, Claude Code and Cursor remain the more practical picks.