backend & ml infrastructure · alexandria, egypt

I build the systems that keep models running in production.

Founding engineer at Nover. I run GPU inference infrastructure on AWS, build distributed systems in Go and Python, and occasionally trace deep learning all the way back to statistical mechanics.

fig. 0 — a live 2D Ising model (Metropolis–Hastings), taken from my Physics-AI Bridge research. Drag the temperature through the critical point Tc ≈ 2.27 and watch order emerge from noise. This isn't a looping gif — it's computing in your browser right now.

§0 — brief

Hey, I'm Yehia.

I'm a final-year Electrical Engineering (Electronics & Communications) student at Alexandria University who didn't wait for graduation to build real things.

Right now I'm the founding engineer at Nover, an AI image-generation startup, where I built the production backend — fine-tuning diffusion models on H100s, serving them from a GPU fleet on AWS, and owning everything in between. We're raising pre-seed.

Before that, I spent a year and a half as a software engineer at a US neuromodulation research company, building a real-time EEG "digital twin" pipeline — streaming brainwave data from OpenBCI hardware through Hilbert transforms into coupled-oscillator models — and authored a research paper on Kuramoto-based EEG synchronization.

On the side, I led Mind Cloud, a 70+ member robotics organization, from a 9th-place finish at the European Rover Challenge to 2nd and 3rd place at UGVC — migrating the whole stack to ROS 2 along the way.

I like building things that are hard. I like shipping them even more.

status: open to remote software & ML-infrastructure roles
currently: founding engineer @ nover
based in: alexandria, egypt (utc+2)
email: yehyaheya@gmail.com
elsewhere: github · linkedin

§1 — current work

Production inference at Nover

python · aws · gpu inference · diffusion models — 2025–present — nover.studio ↗

Nover generates images for creative teams. Every generation request flows through infrastructure I own end to end: an API gateway, a distributed job queue, an autoscaled GPU worker fleet, and global content delivery — all on AWS. The specifics stay under wraps until we launch; the responsibility doesn't.

Beyond serving, I run full fine-tunes of our proprietary diffusion models on rented H100s, and own the release infrastructure — including the pipeline that delivered our waitlist launch announcement, with throttling and resume capability built in.

FROM THE TRENCHES

Our heaviest generation workloads kept hitting a memory-bound failure mode that looked like a GPU problem. It wasn't — the bug lived in the request path. Fixed the path, then upgraded the fleet anyway. The full story comes after launch.

24/7 · production gpu fleet on aws
H100 · fine-tuning runs
pre-seed · and building
[diagram: Nover inference architecture. nodes: client · api gateway · job queue · gpu workers (autoscaled) · inference engine · model store · cdn delivery · h100 fine-tunes → new checkpoints. legend: solid = request path · dashed = model weights]
fig. 1 — the shape of the system. labels kept deliberately generic — we're pre-launch.

§2 — case study

Go distributed queue

go · redis · docker · prometheus — 2025 — repo ↗

A fault-tolerant task queue built on the Reliable Queue Pattern: at-least-once delivery through atomic Redis LMOVE operations, multi-tier priority queues, delayed scheduling via sorted sets, batch ingestion, and dead-letter queues for poison messages.

The interesting part is failure. A lease-based reaper watches for workers that died mid-task and atomically reclaims their work, and the whole system is instrumented with native Prometheus metrics. I validated it the only way that counts — automated chaos tests that kill workers at random while the queue is under load.
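
The claim, ack, and reap cycle described above can be modeled in a few lines. This is a minimal in-memory sketch of the pattern, not the code from the repo: the class name `ReliableQueue`, its method names, and the `lease_seconds` and `max_attempts` parameters are all assumptions made for illustration, and plain Python containers stand in for the Redis lists that LMOVE would operate on.

```python
import time
from collections import deque


class ReliableQueue:
    """In-memory model of the reliable queue pattern (hypothetical, not Redis-backed)."""

    def __init__(self, lease_seconds=5.0, max_attempts=3):
        self.pending = deque()      # tasks waiting for a worker
        self.processing = {}        # task_id -> (worker_id, lease_deadline)
        self.attempts = {}          # task_id -> number of times the task was claimed
        self.dead_letter = []       # poison tasks that exhausted their attempts
        self.lease_seconds = lease_seconds
        self.max_attempts = max_attempts

    def enqueue(self, task_id):
        self.pending.append(task_id)

    def claim(self, worker_id, now=None):
        """Move one task from pending to processing in a single step."""
        now = time.monotonic() if now is None else now
        if not self.pending:
            return None
        task_id = self.pending.popleft()
        self.attempts[task_id] = self.attempts.get(task_id, 0) + 1
        self.processing[task_id] = (worker_id, now + self.lease_seconds)
        return task_id

    def ack(self, task_id, worker_id):
        """Remove a finished task, but only if this worker still owns it."""
        owner = self.processing.get(task_id)
        if owner is None or owner[0] != worker_id:
            return False
        del self.processing[task_id]
        return True

    def reap(self, now=None):
        """Reclaim tasks whose lease expired: retry them or dead-letter them."""
        now = time.monotonic() if now is None else now
        expired = [t for t, (_, deadline) in self.processing.items() if deadline <= now]
        for task_id in expired:
            del self.processing[task_id]
            if self.attempts[task_id] >= self.max_attempts:
                self.dead_letter.append(task_id)
            else:
                self.pending.append(task_id)
        return expired
```

A task is never dropped between the two containers: it is either pending, owned under a lease, or dead-lettered, which is what gives at-least-once delivery when a worker dies mid-task.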

WHAT BROKE

Chaos testing exposed a race between a slow worker renewing its lease and the reaper reclaiming the same task — two consumers, one job. The fix was making check-and-extend a single atomic Lua script on Redis, closing the window entirely.
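
The shape of that fix can be shown without Redis. In this hedged sketch a `threading.Lock` stands in for the atomicity Redis gives a Lua script, and the names `LeaseTable`, `extend_if_owner`, and `reclaim_if_expired` are hypothetical. The point is that the ownership check and the deadline write happen as one step, so the reaper can never slip in between them.

```python
import threading


class LeaseTable:
    """Toy lease store; the lock plays the role of Redis running one script atomically."""

    def __init__(self):
        self._lock = threading.Lock()
        self._leases = {}  # task_id -> (owner, deadline)

    def grant(self, task_id, owner, now, ttl):
        with self._lock:
            self._leases[task_id] = (owner, now + ttl)

    def extend_if_owner(self, task_id, owner, now, ttl):
        """Check ownership and expiry, then extend, as one indivisible step."""
        with self._lock:
            lease = self._leases.get(task_id)
            if lease is None or lease[0] != owner or lease[1] <= now:
                return False  # lease lost: the worker must stop working on the task
            self._leases[task_id] = (owner, now + ttl)
            return True

    def reclaim_if_expired(self, task_id, now):
        """Reaper side: take the task back only if its lease has really expired."""
        with self._lock:
            lease = self._leases.get(task_id)
            if lease is None or lease[1] > now:
                return False
            del self._leases[task_id]
            return True
```

A slow worker that calls `extend_if_owner` after the reaper has reclaimed the task gets `False` back and knows to abandon the job, instead of both consumers believing they own it.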

[diagram: Distributed queue architecture. nodes: producers · redis (pending → processing, delayed zset, priority tiers) · worker 1 · worker 2 · worker n · reaper (lua) · dlq. notes: reaper reclaims dead leases · chaos tests kill workers at random under load]
fig. 2 — reliable queue pattern with lease reaper and dead-letter path for poison messages.

§3 — case study

Real-time crypto arbitrage engine

python · kafka · apache flink · streamlit · docker — 2025 — repo ↗

A streaming pipeline that watches BTC trade on two exchanges at once and spots the moments they disagree. WebSocket feeds from Coinbase and Binance land in Kafka; Flink tumbling windows align the two streams in time and compute the spread; an alerter pushes live notifications to Discord while a Streamlit dashboard charts spreads and alert history.

The system is decomposed into four services — producer, stream processor, alerter, dashboard — each independently deployable and orchestrated with Docker Compose, so any piece can fail or restart without taking down the pipeline.

DESIGN NOTE

Two exchanges tick at different rates, so comparing "latest price vs latest price" lies to you. Tumbling windows force both streams onto the same clock before the spread is computed — the alignment, not the math, is the hard part.
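
The alignment idea fits in a short pure-Python sketch, with no Flink involved. Everything here is an assumption for illustration: the function name `tumbling_spreads`, the `(timestamp, exchange, price)` tick shape, the one-second default window, and the choice to use the last price per exchange in each window (the real job may aggregate differently).

```python
from collections import defaultdict


def tumbling_spreads(ticks, window_seconds=1.0):
    """Bucket (timestamp, exchange, price) ticks into fixed windows and compute spreads.

    Only windows in which both exchanges traded produce a spread.
    """
    windows = defaultdict(dict)  # window index -> {exchange: (timestamp, price)}
    for ts, exchange, price in ticks:
        index = int(ts // window_seconds)
        last = windows[index].get(exchange)
        # Keep the latest tick per exchange inside each window.
        if last is None or ts >= last[0]:
            windows[index][exchange] = (ts, price)

    spreads = {}
    for index in sorted(windows):
        seen = windows[index]
        if "coinbase" in seen and "binance" in seen:
            spreads[index] = seen["coinbase"][1] - seen["binance"][1]
    return spreads
```

A window where only one exchange ticked yields no spread at all, which is the behavior you want: a missing comparison is better than a comparison against a stale price.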

[diagram: Arbitrage pipeline architecture. nodes: coinbase ws · binance ws · kafka · flink tumbling windows · alerter · dashboard · discord. notes: spread > threshold · four services · docker compose]
fig. 3 — two exchange feeds aligned in Flink windows; alerts fan out to Discord and the live dashboard.

§4 — case study

NetProbe: real-time traffic analyzer

c++20 · npcap · dear imgui — 2025 — repo ↗

A high-performance packet sniffer in modern C++. NetProbe captures live traffic through Npcap, performs deep inspection of TCP/IP headers, and extracts TLS SNI — so you can see which hosts a machine is talking to even when the payload is encrypted — rendering everything in a real-time Dear ImGui interface.
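
SNI is readable because the ClientHello travels in cleartext before encryption starts. The sketch below walks the standard ClientHello layout in Python to show the idea. It is not NetProbe's C++ parser, the function name `extract_sni` is hypothetical, and it assumes the whole ClientHello sits in a single, unfragmented TLS record.

```python
import struct


def extract_sni(record: bytes):
    """Return the SNI host name from a TLS ClientHello record, or None."""
    try:
        if record[0] != 0x16 or record[5] != 0x01:   # handshake record, ClientHello
            return None
        pos = 5 + 4          # skip record header (5 bytes) and handshake header (4 bytes)
        pos += 2 + 32        # client version and random
        pos += 1 + record[pos]                                  # session id
        pos += 2 + struct.unpack_from("!H", record, pos)[0]     # cipher suites
        pos += 1 + record[pos]                                  # compression methods
        ext_end = pos + 2 + struct.unpack_from("!H", record, pos)[0]
        pos += 2
        while pos + 4 <= ext_end:
            ext_type, ext_len = struct.unpack_from("!HH", record, pos)
            pos += 4
            if ext_type == 0x0000:                   # server_name extension
                # Layout: list length (2), name type (1), name length (2), name bytes.
                name_type = record[pos + 2]
                name_len = struct.unpack_from("!H", record, pos + 3)[0]
                if name_type != 0 or pos + 5 + name_len > len(record):
                    return None
                return record[pos + 5:pos + 5 + name_len].decode("ascii")
            pos += ext_len                           # skip extensions we do not need
    except (IndexError, struct.error, UnicodeDecodeError):
        return None          # truncated or malformed input
    return None
```

Returning `None` on anything malformed matters for a sniffer: captured traffic is untrusted input, and a parser that raises on a truncated packet would take the capture pipeline down with it.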

Packets arrive in bursts far faster than any UI can draw. The architecture is a multi-threaded producer–consumer pipeline using std::jthread: a capture thread feeds a lock-guarded buffer, parser workers drain it, and the render thread never blocks on the network.
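
The same three-stage shape can be sketched with Python's standard `queue` and `threading` modules. This is an illustration of the pattern only, not the C++ `std::jthread` implementation: `run_pipeline`, the `STOP` sentinel, and the use of `len(packet)` as a stand-in for header parsing are all assumptions.

```python
import queue
import threading
import time


def run_pipeline(packets, parser_count=3, buffer_size=64):
    """Push packets through capture -> parsers -> render without blocking the renderer."""
    raw = queue.Queue(maxsize=buffer_size)   # capture thread -> parsers (bounded)
    parsed = queue.Queue()                   # parsers -> render loop
    STOP = object()                          # sentinel telling a parser to exit

    def capture():
        for packet in packets:
            raw.put(packet)                  # blocks only the capture side when full
        for _ in range(parser_count):
            raw.put(STOP)

    def parse():
        while True:
            packet = raw.get()
            if packet is STOP:
                break
            parsed.put(len(packet))          # stand-in for real header parsing

    threads = [threading.Thread(target=capture)]
    threads += [threading.Thread(target=parse) for _ in range(parser_count)]
    for t in threads:
        t.start()

    rendered = []
    # Render loop: take whatever is ready and never wait on the network side.
    while any(t.is_alive() for t in threads) or not parsed.empty():
        try:
            rendered.append(parsed.get_nowait())
        except queue.Empty:
            time.sleep(0.001)                # nothing ready: a real UI would redraw the last frame
    for t in threads:
        t.join()
    return rendered
```

The only blocking calls live on the capture and parser threads. The render loop uses `get_nowait`, so a burst on the wire slows ingestion at worst and never stalls drawing.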

DESIGN NOTE

The rule that shaped everything: the thread that draws pixels must never wait on the thread that reads the wire. Decoupling them is the difference between a tool engineers trust during a traffic spike and one that freezes exactly when it matters.

[diagram: NetProbe threading architecture. nodes: nic · capture thread (npcap) · buffer · parser × 3 · imgui render thread. notes: render thread never blocks on capture · std::jthread pipeline]
fig. 4 — producer–consumer pipeline decoupling packet ingestion from the 60 fps render loop.

§5 — research

Physics → AI: the bridge

python · numpy — 2025, with omar hosney — repo ↗

Deep learning didn't appear from nowhere — a surprising amount of it is statistical mechanics wearing a different hat. This research project traces that lineage explicitly, with working code at every step: from the 2D Ising model through Hopfield networks toward Neural Network Gaussian Processes.

Phase 1 is a fully vectorized Ising simulation — the one running at the top of this page — that reproduces the critical temperature to within 0.05% of Onsager's exact solution. Phase 2 maps Ising dynamics onto Hopfield networks as learnable energy models, recovering 68% strict pattern recall at 25% corruption, with a full LaTeX write-up.
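
For reference, Onsager's exact result for the square lattice is Tc = 2 / ln(1 + √2) ≈ 2.269 in units where J = k_B = 1. The sketch below is a plain-Python, single-spin-flip Metropolis sweep, deliberately simpler than the vectorized NumPy simulation in the repo. The function names and the periodic boundary conditions are assumptions.

```python
import math
import random

# Onsager's exact critical temperature for the 2D square lattice (J = k_B = 1).
T_CRITICAL = 2.0 / math.log(1.0 + math.sqrt(2.0))


def metropolis_sweep(spins, temperature, rng):
    """One Metropolis sweep over an L x L lattice with periodic boundaries."""
    size = len(spins)
    for _ in range(size * size):
        i, j = rng.randrange(size), rng.randrange(size)
        neighbours = (
            spins[(i + 1) % size][j] + spins[(i - 1) % size][j]
            + spins[i][(j + 1) % size] + spins[i][(j - 1) % size]
        )
        delta_e = 2 * spins[i][j] * neighbours   # energy cost of flipping spin (i, j)
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if delta_e <= 0 or rng.random() < math.exp(-delta_e / temperature):
            spins[i][j] = -spins[i][j]


def magnetization(spins):
    """Absolute magnetization per spin, |m|."""
    size = len(spins)
    return abs(sum(map(sum, spins))) / (size * size)
```

Well below `T_CRITICAL` an ordered lattice stays ordered and |m| stays near 1. Well above it, thermal flips win and |m| decays toward 0.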

WHY IT MATTERS

Energy landscapes, temperature, phase transitions — these aren't metaphors in machine learning, they're the actual math. Understanding a model as a physical system is the difference between tuning hyperparameters by superstition and knowing why they work.
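
The correspondence is concrete enough to write down: a Hopfield network's energy has the same form as the Ising energy, with learned couplings in place of a uniform J. The following is a minimal sketch using the textbook Hebbian rule and zero-temperature updates. The function names are hypothetical, and this is not the project's Phase 2 code.

```python
def train_hebbian(patterns):
    """Hebbian weights W_ij = (1/N) * sum over patterns of x_i * x_j, zero diagonal."""
    n = len(patterns[0])
    weights = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += p[i] * p[j] / n
    return weights


def energy(weights, state):
    """Hopfield energy E = -1/2 * sum_ij W_ij s_i s_j, the Ising form with couplings W."""
    n = len(state)
    return -0.5 * sum(
        weights[i][j] * state[i] * state[j] for i in range(n) for j in range(n)
    )


def recall(weights, state, sweeps=5):
    """Zero-temperature asynchronous updates: each spin aligns with its local field."""
    state = list(state)
    for _ in range(sweeps):
        for i in range(len(state)):
            field = sum(w * s for w, s in zip(weights[i], state))
            state[i] = 1 if field >= 0 else -1
    return state
```

Each update can only lower the energy or leave it unchanged, so recall is descent on the energy landscape: a corrupted input rolls downhill into the nearest stored pattern.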

[diagram: From Ising model to neural networks. stages: 2d ising model (Tc within 0.05%) · hopfield nets (68% recall @ 25%) · nngp (phase 3, next). notes: same energy · statistical mechanics → modern deep learning, one proof at a time]
fig. 5 — the lineage this project walks, with working code and write-ups at each stage.

§6 — also built

More from the workshop

§7 — toolbox

What I work with

LANGUAGES

python · go · c++ · java
typescript · sql · c

BACKEND & INFRA

fastapi · spring boot · redis
kafka · flink · docker · prometheus
aws: ecs · s3 · efs · cloudfront · ses

ML & AI SYSTEMS

pytorch · comfyui · lora fine-tuning
diffusion inference · modal/h100
hugging face · dsp

§8 — contact

Building something interesting? I'd love to hear about it.