Yehia Gewily — field notes

§0 — brief

Hey, I'm Yehia.

I'm a final-year Electrical Engineering (Electronics & Communications) student at Alexandria University who didn't wait for graduation to build real things.

Right now I'm the founding engineer at Nover, an AI image-generation startup, where I built the production backend — fine-tuning diffusion models on H100s, serving them from a GPU fleet on AWS, and owning everything in between. We're raising pre-seed.

Before that, I spent a year and a half as a software engineer at a US neuromodulation research company, building a real-time EEG "digital twin" pipeline — streaming brainwave data from OpenBCI hardware through Hilbert transforms into coupled-oscillator models — and authored a research paper on Kuramoto-based EEG synchronization.

On the side, I led Mind Cloud, a 70+ member robotics organization, from a 9th-place finish at the European Rover Challenge to 2nd and 3rd place at UGVC — migrating the whole stack to ROS 2 along the way.

I like building things that are hard. I like shipping them even more.

status: open to remote software & ML-infrastructure roles
currently: founding engineer @ nover
based in: alexandria, egypt (utc+2)
email: yehyaheya@gmail.com
elsewhere: github · linkedin

§1 — current work

Production inference at Nover

python · aws · gpu inference · diffusion models — 2025–present — nover.studio ↗

Nover generates images for creative teams. Every generation request flows through infrastructure I own end to end: an API gateway, a distributed job queue, an autoscaled GPU worker fleet, and global content delivery — all on AWS. The specifics stay under wraps until we launch; the responsibility doesn't.

Beyond serving, I run full fine-tunes of our proprietary diffusion models on rented H100s, and own the release infrastructure — including the pipeline that delivered our waitlist launch announcement, with throttling and resume capability built in.

FROM THE TRENCHES

Our heaviest generation workloads kept hitting a memory-bound failure mode that looked like a GPU problem. It wasn't — the bug lived in the request path. Fixed the path, then upgraded the fleet anyway. The full story comes after launch.

24/7 production gpu fleet on aws

H100 fine-tuning runs

pre-seed and building

fig. 1 — the shape of the system. labels kept deliberately generic — we're pre-launch.

§2 — case study

Go distributed queue

go · redis · docker · prometheus — 2025 — repo ↗

A fault-tolerant task queue built on the Reliable Queue Pattern: at-least-once delivery through atomic Redis LMOVE operations, multi-tier priority queues, delayed scheduling via sorted sets, batch ingestion, and dead-letter queues for poison messages.

The interesting part is failure. A lease-based reaper watches for workers that died mid-task and atomically reclaims their work, and the whole system is instrumented with native Prometheus metrics. I validated it the only way that counts — automated chaos tests that kill workers at random while the queue is under load.
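
A minimal in-memory sketch of the pattern described above, not the project's Go code. It assumes a plain Python deque can stand in for the Redis pending list, a dict for the processing set, and a lock for the atomicity that a single Redis command provides. The names `ReliableQueue`, `claim`, `ack`, and `reap` are hypothetical, chosen only for illustration.

```python
import threading
from collections import deque


class ReliableQueue:
    """In-memory model of the Reliable Queue Pattern (illustrative only)."""

    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self.pending = deque()          # stands in for the Redis pending list
        self.processing = {}            # task -> lease deadline
        self._lock = threading.Lock()   # stands in for Redis command atomicity

    def push(self, task):
        with self._lock:
            self.pending.append(task)

    def claim(self, now):
        """Move one task from pending to processing in a single step."""
        with self._lock:
            if not self.pending:
                return None
            task = self.pending.popleft()
            self.processing[task] = now + self.lease_seconds
            return task

    def ack(self, task):
        """Remove a finished task; returns False if it was already reclaimed."""
        with self._lock:
            return self.processing.pop(task, None) is not None

    def reap(self, now):
        """Return tasks with expired leases to pending, as a reaper would."""
        with self._lock:
            expired = [t for t, deadline in self.processing.items() if deadline <= now]
            for task in expired:
                del self.processing[task]
                self.pending.append(task)
            return expired


queue = ReliableQueue(lease_seconds=30)
queue.push("resize-image-1")
task = queue.claim(now=0)         # a worker takes the task, then "dies"
reclaimed = queue.reap(now=31)    # the lease has expired, so the task is requeued
retry = queue.claim(now=31)       # a second worker receives the same task
```

The same task is handed out twice here, which is what at-least-once delivery means in practice: a task is never lost with a dead worker, so handlers have to tolerate duplicates.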

WHAT BROKE

Chaos testing exposed a race between a slow worker renewing its lease and the reaper reclaiming the same task — two consumers, one job. The fix was making check-and-extend a single atomic Lua script on Redis, closing the window entirely.
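
The race and its fix can be modelled in a few lines. This is a sketch under stated assumptions, not the project's Lua script: a Python lock stands in for the atomicity Redis gives a script, and `LeaseTable`, `acquire`, `extend`, and `reclaim_if_expired` are hypothetical names. The key property is that renewal checks the owner and the deadline and writes the new deadline in one indivisible step.

```python
import threading


class LeaseTable:
    """Illustrative model of atomic check-and-extend on task leases."""

    def __init__(self):
        self._leases = {}               # task -> (owner, deadline)
        self._lock = threading.Lock()   # stands in for script atomicity on Redis

    def acquire(self, task, owner, now, ttl):
        with self._lock:
            current = self._leases.get(task)
            if current is not None and current[1] > now:
                return False            # another worker holds a live lease
            self._leases[task] = (owner, now + ttl)
            return True

    def extend(self, task, owner, now, ttl):
        """Check ownership and expiry, then extend, as one atomic step."""
        with self._lock:
            current = self._leases.get(task)
            if current is None or current[0] != owner or current[1] <= now:
                return False            # lease lost: the worker must stop
            self._leases[task] = (owner, now + ttl)
            return True

    def reclaim_if_expired(self, task, now):
        """Reaper side: drop the lease only if it has really expired."""
        with self._lock:
            current = self._leases.get(task)
            if current is not None and current[1] <= now:
                del self._leases[task]
                return True
            return False


leases = LeaseTable()
leases.acquire("job-7", "worker-a", now=0, ttl=10)
on_time = leases.extend("job-7", "worker-a", now=5, ttl=10)    # deadline moves to 15
reaped = leases.reclaim_if_expired("job-7", now=16)             # worker-a was too slow
leases.acquire("job-7", "worker-b", now=16, ttl=10)
too_late = leases.extend("job-7", "worker-a", now=17, ttl=10)   # rejected
```

If the check and the write were two separate round trips, the reaper could reclaim the task between them, and the slow worker would then overwrite the new owner's lease.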

fig. 2 — reliable queue pattern with lease reaper and dead-letter path for poison messages.

§3 — case study

Real-time crypto arbitrage engine

python · kafka · apache flink · streamlit · docker — 2025 — repo ↗

A streaming pipeline that watches BTC trade on two exchanges at once and spots the moments they disagree. WebSocket feeds from Coinbase and Binance land in Kafka; Flink tumbling windows align the two streams in time and compute the spread; an alerter pushes live notifications to Discord while a Streamlit dashboard charts spreads and alert history.

The system is decomposed into four services — producer, stream processor, alerter, dashboard — each independently deployable and orchestrated with Docker Compose, so any piece can fail or restart without taking down the pipeline.

DESIGN NOTE

Two exchanges tick at different rates, so comparing "latest price vs latest price" lies to you. Tumbling windows force both streams onto the same clock before the spread is computed — the alignment, not the math, is the hard part.
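
The alignment idea can be shown without Flink. The sketch below is plain Python, not the Flink job itself, and it makes two assumptions the text above does not specify: each window is summarised by the last price seen in it, and a spread is only emitted for windows in which both exchanges traded. The function names and the sample ticks are hypothetical.

```python
def window_close_prices(trades, window_seconds):
    """Map each tumbling window start to the last price seen in that window."""
    closes = {}
    for timestamp, price in sorted(trades):
        window_start = int(timestamp // window_seconds) * window_seconds
        closes[window_start] = price   # later trades overwrite earlier ones
    return closes


def tumbling_spreads(trades_a, trades_b, window_seconds):
    """Spread (a minus b) per window, only where both exchanges traded."""
    closes_a = window_close_prices(trades_a, window_seconds)
    closes_b = window_close_prices(trades_b, window_seconds)
    shared = sorted(closes_a.keys() & closes_b.keys())
    return {start: closes_a[start] - closes_b[start] for start in shared}


# Hypothetical (timestamp_seconds, price) ticks; exchange A ticks more often.
exchange_a = [(0.2, 100.0), (0.7, 101.0), (1.1, 102.0), (1.9, 103.0), (2.5, 104.0)]
exchange_b = [(0.5, 100.5), (1.5, 101.5)]

spreads = tumbling_spreads(exchange_a, exchange_b, window_seconds=1)
# spreads == {0: 0.5, 1: 1.5}; window 2 is skipped because exchange B has no tick
```

A naive "latest vs latest" comparison at t = 2.5 would pair A's 104.0 with B's price from a full second earlier. The windowed version declines to report a spread for that interval.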

fig. 3 — two exchange feeds aligned in Flink windows; alerts fan out to Discord and the live dashboard.

§4 — case study

NetProbe: real-time traffic analyzer

c++20 · npcap · dear imgui — 2025 — repo ↗

A high-performance packet sniffer in modern C++. NetProbe captures live traffic through Npcap, performs deep inspection of TCP/IP headers, and extracts TLS SNI — so you can see which hosts a machine is talking to even when the payload is encrypted — rendering everything in a real-time Dear ImGui interface.

Packets arrive in bursts far faster than any UI can draw. The architecture is a multi-threaded producer–consumer pipeline using std::jthread: a capture thread feeds a lock-guarded buffer, parser workers drain it, and the render thread never blocks on the network.

DESIGN NOTE

The rule that shaped everything: the thread that draws pixels must never wait on the thread that reads the wire. Decoupling them is the difference between a tool engineers trust during a traffic spike and one that freezes exactly when it matters.
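
A rough Python analogue of that rule, offered as a sketch rather than a port of the C++ code. It assumes a bounded `queue.Queue` in place of the lock-guarded buffer, a single parser thread, and fake packet payloads. All names here are hypothetical. The render side only ever copies a small stats snapshot and never touches the packet queue.

```python
import queue
import threading

packets = queue.Queue(maxsize=1024)   # bounded buffer between capture and parsing
stats = {"parsed": 0, "bytes": 0}     # snapshot the render loop reads
stats_lock = threading.Lock()
STOP = object()                       # sentinel telling the parser to exit


def capture(n_packets):
    """Producer: stands in for the capture thread reading from the wire."""
    for i in range(n_packets):
        packets.put(b"x" * (40 + i % 10))   # fake packet payloads
    packets.put(STOP)


def parse():
    """Consumer: drains the buffer and publishes aggregate stats."""
    while True:
        packet = packets.get()
        if packet is STOP:
            break
        with stats_lock:
            stats["parsed"] += 1
            stats["bytes"] += len(packet)


def render_frame():
    """Render side: copy the latest snapshot; never waits on the packet queue."""
    with stats_lock:
        return dict(stats)


producer = threading.Thread(target=capture, args=(500,))
consumer = threading.Thread(target=parse)
producer.start()
consumer.start()

frames = [render_frame() for _ in range(60)]   # drawn while packets may still flow
producer.join()
consumer.join()
final = render_frame()
```

The render function does take a lock, but only for the time it takes to copy two integers, so a burst on the wire backs up the queue instead of stalling the frame loop.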

fig. 4 — producer–consumer pipeline decoupling packet ingestion from the 60 fps render loop.

§5 — research

Physics → AI: the bridge

python · numpy — 2025, with omar hosney — repo ↗

Deep learning didn't appear from nowhere — a surprising amount of it is statistical mechanics wearing a different hat. This research project traces that lineage explicitly, with working code at every step: from the 2D Ising model through Hopfield networks toward Neural Network Gaussian Processes.

Phase 1 is a fully vectorized Ising simulation (the one running at the top of this page) that reproduces the critical temperature to within 0.05% of Onsager's exact solution. Phase 2 maps Ising dynamics onto Hopfield networks as learnable energy models, reaching 68% strict pattern recall at 25% corruption, and comes with a full LaTeX write-up.
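
For scale, here is what "within 0.05% of Onsager's exact solution" means numerically. The closed form below is Onsager's result for the 2D square-lattice Ising model in units where J = k_B = 1. The helper names and the example estimate are hypothetical and are not measurements from the project.

```python
import math

# Onsager's exact critical temperature for the 2D square-lattice Ising model,
# in units where J = k_B = 1: T_c = 2 / ln(1 + sqrt(2)), roughly 2.269.
T_C_EXACT = 2.0 / math.log(1.0 + math.sqrt(2.0))


def relative_error(measured, exact=T_C_EXACT):
    """Fractional deviation of a measured T_c from the exact value."""
    return abs(measured - exact) / exact


def within_tolerance(measured, tolerance=0.0005):
    """True if the measurement is within 0.05% of Onsager's value."""
    return relative_error(measured) <= tolerance


# Hypothetical estimate, chosen only to illustrate the check.
example_estimate = 2.2700
passes = within_tolerance(example_estimate)
```

A 0.05% band around T_c is about ±0.0011 in these units, so an estimate has to land between roughly 2.2681 and 2.2703 to qualify.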

WHY IT MATTERS

Energy landscapes, temperature, phase transitions — these aren't metaphors in machine learning, they're the actual math. Understanding a model as a physical system is the difference between tuning hyperparameters by superstition and knowing why they work.
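
A toy example makes the "actual math" concrete. The sketch below is a deliberately tiny Hopfield network in pure Python, not the project's code, and it does not attempt to reproduce the recall figures quoted above. It assumes Hebbian weights, two hand-picked orthogonal patterns of 8 spins, and asynchronous sign updates. All names are hypothetical.

```python
def train_hebbian(patterns):
    """Hebbian weights: W[i][j] = sum over patterns of p[i] * p[j], zero diagonal."""
    n = len(patterns[0])
    weights = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += p[i] * p[j]
    return weights


def energy(weights, state):
    """Hopfield energy E = -1/2 * sum_ij W_ij s_i s_j (the Ising form, no field)."""
    n = len(state)
    total = sum(weights[i][j] * state[i] * state[j] for i in range(n) for j in range(n))
    return -0.5 * total


def recall(weights, state, max_sweeps=10):
    """Asynchronous updates: align each spin with its local field until stable."""
    state = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(state)):
            field = sum(weights[i][j] * state[j] for j in range(len(state)))
            new_spin = state[i] if field == 0 else (1 if field > 0 else -1)
            if new_spin != state[i]:
                state[i] = new_spin
                changed = True
        if not changed:
            break
    return state


stored = [
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
]
weights = train_hebbian(stored)

corrupted = list(stored[0])
corrupted[0] *= -1   # flip 2 of 8 spins (25% corruption)
corrupted[1] *= -1

recovered = recall(weights, corrupted)
```

In this toy case the corrupted state sits at energy 0 and the recovered pattern at -24, so recall is literally a descent into the nearest minimum of the energy landscape.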

fig. 5 — the lineage this project walks, with working code and write-ups at each stage.

§6 — also built

More from the workshop

Agent Mesh
go · react · redis
Distributed task orchestration engine sustaining 750+ RPS at sub-200 ms latency with atomic Redis distribution.

CerebralFlow
python · scipy
"Digital twin" framework for brain dynamics: coupled-oscillator simulation with surrogate-data statistical validation.

Intelligent diagnostic system
spring boot · flutter
Graduation project: telehealth platform streaming ESP8266 wearable data into Spring Boot with an ML diagnostic pipeline.

Rover navigation stack
ros 2 · docker
Mind Cloud's autonomy stack migrated to modular ROS 2: LiDAR processing, TF2 state management, fully containerized.

Pintos kernel
c · x86
Full OS coursework implementation: priority-donation scheduling, MLFQS, 13 syscalls. 100% on the official grading suite.

§7 — toolbox

What I work with

LANGUAGES

python · go · c++ · java
typescript · sql · c

BACKEND & INFRA

fastapi · spring boot · redis
kafka · flink · docker · prometheus
aws: ecs · s3 · efs · cloudfront · ses

ML & AI SYSTEMS

pytorch · comfyui · lora fine-tuning
diffusion inference · modal/h100
hugging face · dsp