
AI INFRASTRUCTURE - THOUGHT LEADERSHIP
By Erol Ozguner
Brilliant algorithms and bold ambitions are not enough. After years on the front lines of urban digital transformation, I have seen too many promising AI initiatives collapse under the weight of an underestimated question: what hardware is actually running this thing?
70% AI PILOTS STALL AT SCALE-UP | 5x COST GAP: WRONG VS RIGHT CHIPSET
Chipset selection is not a procurement decision — it is an architectural one.
A Lesson from the Field
A few years ago, my team launched what we believed would be Istanbul's flagship real-time traffic prediction system. We had secured funding, assembled a talented data science team, and partnered with one of Europe's leading AI vendors. The neural network architecture was elegant. The training data — years of sensor readings, GPS traces, and incident reports — was plentiful. We were confident.
Six months later, the system was barely functional. Not because the algorithms were wrong. Not because the data was insufficient. The system failed because the infrastructure underneath it was never designed to carry that kind of load. We were running inference on general-purpose servers meant for database workloads. Latency that should have been measured in milliseconds stretched to seconds. By the time the model predicted a jam, the jam had already dissolved — or worsened beyond rescue. The project was quietly shelved.
That experience reshaped how I think about smart city AI. I came to understand that in this field, the gap between a proof of concept and a production system is almost always an infrastructure gap, not an algorithm gap.
"Approximately 70% of enterprise AI pilot programmes never reach production scale. In almost every post-mortem I have reviewed, the root cause was not model quality—it was infrastructure mismatch."
—Gartner AI Implementation Survey, 2024; reflects the author's field observations across 12 municipal deployments
Why Smart Cities Are a Uniquely Demanding Environment
Smart city AI is not like enterprise AI. A bank's fraud-detection model can tolerate a 300-millisecond response. A traffic management system commanding signal timings across thousands of intersections cannot. An energy-grid balancing AI that takes two seconds to react to a load spike may cause a brownout before it acts. The physics of the city impose hard latency budgets that expose every weakness in your compute stack.
In my experience overseeing Istanbul's digital infrastructure — a metropolitan area of 16 million people with more than 4,000 monitored intersections — I routinely encountered three categories of AI failure, all traceable to hardware choices: latency failures (the model is correct but too slow), throughput failures (the model works for 100 simultaneous queries, not 100,000), and memory failures (the model runs in the lab but exhausts GPU memory in production with real-world batch sizes). Each failure mode demands a different hardware remedy.
The Chipset Landscape: A Practical Map
The AI accelerator market has matured dramatically. City technology leaders are no longer limited to a single vendor. Below is an honest comparative view of the platforms I have evaluated across smart city use cases.
Table 1 — AI Chipset Comparison for Smart City Workloads (2025–2026)
PLATFORM | BEST-FIT USE CASE | KEY ADVANTAGE | KEY CONSIDERATION
NVIDIA GH200 / H100 (NVLink, 144 GB HBM3e, FP8 Tensor Cores) | LLM inference, city-wide video analytics, digital twin simulation | Unified CPU+GPU memory; broadest software ecosystem (CUDA, TensorRT-LLM) | Premium acquisition cost; power draw at full load
Google TPU v5e / v5p (custom ASIC, bfloat16, TF-native) | Large-scale training (traffic forecasting, demand prediction models) | Exceptional training throughput per dollar within Google Cloud | Cloud-only; TensorFlow dependency; data sovereignty risk
AMD Instinct MI300X (192 GB HBM3, 5.3 TB/s bandwidth) | Large-model inference on-premise (70B+ LLMs, citizen service AI) | Largest HBM capacity on the market; competitive price; maturing ROCm 6.x stack | Smaller ecosystem than CUDA; fewer pre-optimised operator libraries
Intel Gaudi 2 / Gaudi 3 (RoCE fabric, FP8/BF16, x86-native) | Cost-sensitive inference: chatbots, permit processing, benefit portals | Best cost-per-inference in class; strong EU data residency; x86 familiarity | Smaller community; requires model re-optimisation for peak gains
AWS Trainium 2 / Inferentia 2 (AWS-native, Neuron SDK) | Cloud bursting, seasonal spikes, model experimentation | Deep AWS integration; pay-as-you-go; rapid provisioning | Vendor lock-in; Neuron SDK learning curve; not for sovereign deployments
Matching Hardware to the Use Case
Real-time traffic and emergency dispatch demand sub-100-millisecond inference at the edge. Large data-centre GPUs are the wrong answer here. The right answer is distributed edge inference — NVIDIA Jetson Orin or AMD Kria SoCs mounted close to the sensors, with a central H100 or GH200 cluster handling model training and updates offline. In Istanbul, centralising inference added 80 milliseconds of network round-trip — enough to make "real-time" meaningless.
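The latency arithmetic above can be sketched in a few lines. The ~80 ms network round-trip is the figure observed in Istanbul; the per-platform inference times are illustrative assumptions, not measurements:

```python
# Back-of-envelope latency budget check for edge vs. centralised inference.
# Inference times below are hypothetical; only the ~80 ms round-trip
# reflects the Istanbul observation cited in the text.

def end_to_end_latency_ms(inference_ms: float, network_rtt_ms: float) -> float:
    """Total time from sensor reading to actionable prediction."""
    return inference_ms + network_rtt_ms

LATENCY_BUDGET_MS = 100  # hard budget for real-time traffic control

# Edge: Jetson-class device next to the sensor, negligible network hop.
edge = end_to_end_latency_ms(inference_ms=35, network_rtt_ms=2)

# Central: faster inference on a data-centre GPU, but an 80 ms round-trip.
central = end_to_end_latency_ms(inference_ms=25, network_rtt_ms=80)

print(f"edge:    {edge:.0f} ms (within budget: {edge <= LATENCY_BUDGET_MS})")
print(f"central: {central:.0f} ms (within budget: {central <= LATENCY_BUDGET_MS})")
```

Even with a faster accelerator in the data centre, the network hop alone blows the budget — which is the core argument for placing inference at the edge.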
City-wide video surveillance and public safety analytics generate enormous memory-bandwidth demands. Streaming thousands of camera feeds through an AI pipeline simultaneously requires hardware with massive parallel throughput. NVIDIA H100 (3.35 TB/s HBM3 bandwidth) and AMD MI300X (5.3 TB/s across 192 GB HBM3) are genuinely competitive here. In my evaluations, the MI300X's memory-capacity advantage translated directly into lower cost-per-camera at high stream counts — though NVIDIA's software maturity remained a practical advantage for teams without specialised ML-ops capability.
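A crude first-order sizing illustrates why bandwidth dominates here: if the pipeline is memory-bandwidth-bound, the ceiling on simultaneous streams scales roughly with HBM bandwidth. The bandwidth figures are from the comparison above; the per-stream demand is a hypothetical placeholder:

```python
# First-order, bandwidth-bound stream ceiling per accelerator.
# PER_STREAM_GBPS is an assumed effective demand per camera stream,
# chosen purely for illustration.

PER_STREAM_GBPS = 2.0

platforms = {
    "NVIDIA H100": 3350,  # GB/s HBM3 bandwidth (per the comparison table)
    "AMD MI300X":  5300,  # GB/s HBM3 bandwidth (per the comparison table)
}

for name, bandwidth_gbps in platforms.items():
    ceiling = bandwidth_gbps / PER_STREAM_GBPS
    print(f"{name}: ~{ceiling:.0f} streams (bandwidth-bound ceiling)")
```

Real pipelines are also bound by decode throughput and model compute, so treat this as a sizing heuristic, not a capacity plan.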
Energy grid optimisation is fundamentally a training problem: models must be continuously retrained on consumption patterns, weather data, and generation forecasts. Google TPU clusters deliver outstanding training throughput for TensorFlow-native workloads. For cities already invested in Google Cloud infrastructure, TPUs can reduce model retraining cycles by 40 to 60 percent compared with equivalent GPU configurations — a compelling case where cloud residency is permissible.
Citizen-facing services — chatbots, permit processing, social benefit eligibility — require always-on inference. This is where Intel Gaudi and on-premise AMD configurations make compelling economic cases. Running a 70-billion-parameter language model on four MI300X cards costs roughly one-fifth the equivalent NVIDIA configuration at list price. For a municipal budget serving hundreds of thousands of daily queries, that arithmetic is decisive.
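The arithmetic is worth making explicit. Using normalised hardware costs (real list prices vary; only the "roughly one-fifth" ratio comes from the text) and an assumed three-year amortisation, the per-query cost gap tracks the acquisition ratio directly:

```python
# Illustrative amortised cost per 1,000 queries. Costs are normalised
# (nvidia_config = 1.0), not real prices; only the ~1/5 ratio is from
# the article. Amortisation period and query volume are assumptions.

def cost_per_1k_queries(hardware_cost: float, daily_queries: int,
                        amortisation_years: float = 3.0) -> float:
    total_queries = daily_queries * 365 * amortisation_years
    return hardware_cost / total_queries * 1000

DAILY_QUERIES = 300_000  # "hundreds of thousands of daily queries"

nvidia_config = 1.0              # normalised reference configuration
amd_config = nvidia_config / 5   # ~one-fifth at list price, per the text

ratio = (cost_per_1k_queries(nvidia_config, DAILY_QUERIES)
         / cost_per_1k_queries(amd_config, DAILY_QUERIES))
print(f"Reference config costs ~{ratio:.0f}x more per 1,000 queries")
```

The point is not the absolute numbers but that for always-on inference, acquisition-cost ratios flow straight through to cost-per-query, because the query volume is the same either way.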
The Data Sovereignty Dimension
Smart city AI processes some of the most sensitive data imaginable: citizen identities, movement patterns, health indicators, financial vulnerability. The data sovereignty question is not peripheral to hardware selection — it is central to it. Cloud-native chipsets such as AWS Trainium and Google TPUs are extraordinarily capable, but they bind your workload to a foreign jurisdiction. European and national data-residency regulations are tightening, not relaxing.
My strong recommendation: design your architecture with a sovereign core — on-premise accelerator infrastructure that keeps sensitive inference local — and reserve cloud chipsets for non-sensitive training workloads where data can be appropriately anonymised. This is not an either/or choice; it is a hybrid strategy that balances capability with compliance.
A Decision Framework for City Leaders
When evaluating AI infrastructure, I apply four questions in sequence. First: What is my latency budget? If it is under 100 ms, compute must be at the edge. Second: What is my throughput requirement? If concurrency exceeds hundreds of simultaneous requests, memory bandwidth dominates — favour HBM3-based platforms. Third: Where must my data reside? If the answer is sovereign, prioritise on-premise accelerators. Fourth: What is my team's operational capability? An exotic chipset with a thin software ecosystem will cost more in engineering hours than it saves in hardware cost. Match platform sophistication to team maturity.
The cities that get this right do not necessarily buy the most powerful hardware. They buy the right hardware, correctly sized, correctly placed, and correctly supported. In more than a decade of witnessing municipal AI programmes succeed and fail, I have never seen a project collapse because the algorithm was too simple. I have seen dozens collapse because infrastructure was the last thing anyone considered.
The Path Forward
We are entering a period when the quality gap between AI chipset vendors is narrowing rapidly. AMD, Intel, and the hyperscalers are producing genuinely competitive alternatives to NVIDIA's dominant position. For city technology leaders, this is excellent news: more options, better pricing, and less vendor dependency. But the diversity of choice also raises the stakes for getting the selection right.
My call to action is simple: treat AI infrastructure as a first-class architectural decision, not a procurement afterthought. Engage your infrastructure team at the same moment you engage your data science team. Define your latency, throughput, and sovereignty constraints before you evaluate a single vendor. Budget not just for acquisition, but for operation and upgrade — because the models will evolve faster than any committee can approve a new server rack.
The city of the future will be powered by intelligent systems. Whether those systems actually work — at scale, in real time, within budget, and in compliance with citizens' rights — will depend less on the brilliance of the algorithms than on the solidity of the infrastructure beneath them. That is the infrastructure imperative. And it starts with a conversation about chips.
About the Author
Erol Ozguner is a technology executive and smart city strategist with more than 25 years of experience leading large-scale digital transformation and urban technology programmes across Europe, the Middle East, and Central Asia. With a background in telecommunications, AI, smart infrastructure, and public sector innovation, he has built a strong international reputation for delivering citizen-focused technology outcomes at scale.
Throughout his career, Erol has led major digital infrastructure and smart city initiatives across both the private and public sectors. He spent 16 years in senior leadership roles with Turkcell, including director-level positions across Turkey, Georgia, Ukraine, and Belarus, before serving as Chief Information Officer for Istanbul Metropolitan Municipality, where he oversaw digital transformation programmes supporting more than 16 million citizens.
Erol holds engineering and business qualifications from Yildiz Technical University, Kocaeli University, and Istanbul Bilgi University, and has completed postgraduate studies in machine learning and artificial intelligence through Yale University and Massachusetts Institute of Technology.
His work has earned 3 global championship awards, 21 international industry prizes, and recognition by IDC Turkey as the country’s leading technology executive in 2023. Erol is also an active international speaker and contributor on topics including smart cities, AI, urban innovation, and digital transformation.
For more information on Smart Cities Council programs:

