Alessandro Di Stefano, PhD
Cloud-Native & Distributed Systems Engineer - DevOps, MLOps, Platform Engineering, SRE
+44 (0) 747 64 386 43 | [email protected] | aleskandro | aleskandro
My deeper mission is to bridge the rigor of academic research with the fast-paced, practical demands of industry—translating ideas into robust infrastructure that delivers value at scale. I hold a PhD in Distributed Computing, where I specialized in AIOps for PaaS systems, focusing on how AI-driven automation can support infrastructure decision-making and operational resilience.
Before joining Red Hat, I spent over five years as an independent consultant specializing in software architecture and design while pursuing my studies. I've mentored students at the University of Catania's Distributed Computing Lab, helping them build microservices-based Cloud-Native applications to run in Kubernetes clusters, and sharing the same spark that inspired me as a kid, when I played on a Commodore 64.
I'm driven by a deep belief in open collaboration, education, and a decentralized, censorship-resistant, and free internet. To me, Free and Open Source Software is more than code—it's a philosophy of transparency, empowerment, and collective progress.
When I'm not engineering systems, you'll find me hiking, climbing, attending live music concerts, or reading sci-fi and non-fiction books.
Experience
- Principal Software Engineer
- Red Hat Inc.
- 09/2025 - Now
- UK (Remote)
- Defining and tracking key performance indicators (KPIs) and service level objectives (SLOs) for large-scale distributed LLM inference services running on Kubernetes and OpenShift.
- Contributing to the performance roadmap for distributed AI/LLM inference, including multi-node and multi-GPU scaling studies, interconnect performance analysis, and competitive benchmarking.
- Optimizing distributed inference systems for throughput, latency, and resource efficiency by tuning scheduling, GPU utilization, and communication patterns across Kubernetes and OpenShift clusters.
- Designing and executing performance test plans and benchmarks to characterize system behavior, identify bottlenecks, and guide performance improvements through data-driven analysis and visualization.
- Technologies: vLLM, LLM-D, Open Data Hub, OpenShift AI, Kubernetes, eBPF, OpenTelemetry, Python AI & Data Science Ecosystem.
- Senior Software Engineer
- Red Hat Inc.
- 08/2021 - 08/2025
- UK (Remote)
- Researched and developed multi-architecture compute node features for OpenShift and Kubernetes, focusing on scheduling and autoscaling workloads across different architectures. Served as lead architect and engineer for the Multiarch Tuning Operator.
- Led the QE team responsible for OpenShift on Arm64 and multi-arch. Migrated upstream and downstream QE test suites and CI automation to support arm64 offerings on major platforms (AWS, Azure, on-prem bare metal) and multi-arch OpenShift clusters.
- Designed and maintained an internal network and automation software infrastructure to deploy ephemeral, parallel OpenShift clusters on Bare Metal, enabling efficient large-scale testing in on-prem data centers.
- Mentored and supported colleagues, sharing expertise in multi-arch enablement, Kubernetes, OpenShift, and automation, fostering a culture of collaboration, learning, and technical excellence.
- Active contributor to the OKD and Kubernetes communities, participating in sig-autoscaling, sig-scheduling, and sig-cluster-lifecycle, focusing on the development of multi-arch features.
- Technologies: GNU/Linux, Docker, Kubernetes, OpenShift, Golang, Python, Rust.
- Research Engineer (Contractor)
- Aucta Cognitio srl
- 08/2020 - 07/2021
- Italy (Remote)
- Designed a prototype decision support system for wind farm management to enhance operational efficiency, predictive maintenance, and resilience.
- Conducted in-depth research on KPIs and data visualization strategies, architecting a scalable microservices-based system using Golang, Python, and JS/Angular.
- Developed a data pipeline enabling Prometheus as a time-series warehouse, facilitating ingestion, monitoring, and long-term storage of SCADA system telemetry from wind turbines.
- Designed, trained, and deployed machine learning models for anomaly detection and predictive analytics, leveraging time-series forecasting, LSTMs, ARIMA, and statistical regression models to preemptively identify performance degradation, equipment failures, and operational inefficiencies.
- Developed and open-sourced a Prometheus backfiller, enabling historical data reconstruction to support time-series analysis and backtesting.
- Technologies: Linux, Kubernetes, GitLab, Golang, Python, Angular, Kafka, Prometheus, Machine Learning (LSTMs, ARIMA, Regression Models), Time-Series Analysis, SCADA Systems.
- Self-Employed
- Software Architecture & DevOps Consulting
- 01/2012 - 07/2021
- Italy (Remote)
- While pursuing and after completing my studies in computer engineering and distributed computing, I worked as a self-employed consultant for small businesses, end-user companies, and local technology firms, designing and implementing software architectures, DevOps workflows, and IT infrastructure for both on-premises and cloud-native environments.
- Architected and modernized distributed systems, leading projects such as migrating monolithic platforms to Kubernetes (Google Cloud, OKD) and adopting microservices patterns to improve scalability and reliability.
- Provided DevOps consulting for businesses, implementing CI/CD pipelines, infrastructure as code, and containerized deployment strategies, ensuring streamlined development and operational efficiency.
- Led system and network administration efforts, including LAN design, virtualization, security hardening, and hybrid cloud integration, deploying technologies like Proxmox, pfSense, Azure AD Connect, VMware, and Active Directory.
- Developed and deployed scalable solutions for real-time streaming services, medical IT infrastructures, and production networks, integrating Kafka, Elasticsearch, MinIO, and Prometheus for monitoring and data processing.
- Mentored development teams on best practices in software architecture, cloud computing, and distributed systems, fostering innovation and technical growth in client organizations.
- Technologies: GNU/Linux, Docker, Kubernetes, Golang, Python, Rust, Ruby on Rails, Ansible, Terraform, Proxmox, pfSense, Active Directory, Kafka, Elasticsearch, MinIO, Prometheus.
Education
- PhD in Distributed and Parallel Computing
- University of Catania
- 10/2018 - 11/2021
- Italy
- Conducted advanced research in distributed computing, focusing on AI-driven automation (AIOps) in Kubernetes to optimize workload orchestration, performance tuning, and resource efficiency in large-scale infrastructure.
- Developed machine learning models and time-series forecasting techniques for anomaly detection and automated failure remediation in cloud-native environments.
- Designed and implemented experimental frameworks to analyze the impact of AI-driven scheduling and scaling strategies on containerized workloads.
- Published research in peer-reviewed conferences and journals, contributing to the broader scientific community in cloud computing, AIOps, and Kubernetes-based architectures.
- Mentored and supported undergraduate and master's students on their theses as co-supervisor, and served as teaching assistant for the distributed computing class, providing technical guidance on distributed computing, distributed transactions, microservices design, DevOps, and Kubernetes.
- MSc in Computer Engineering
- University of Catania
- 10/2016 - 10/2018
- Italy
- BSc in Computer Engineering
- University of Catania
- 01/2011 - 07/2016
- Italy
- Research Engineer
- Queen Mary University of London
- 03/2018 - 10/2018
- UK
- Worked on the team responsible for Raphtory, an open-source distributed streaming graph processing middleware system.
- Designed and developed several Raphtory components in Scala: interfaces (traits) for the Spouts, implementations of the distributed partition manager with a focus on concurrent data structures and their behavior, and implementations of the live analysis manager based on the bulk synchronous parallel pattern.
Publications
MORA on the Edge: a testbed of Multiple Option Resource Allocation
W. Ouedraogo, A. Araldo, Al. Di Stefano, An. Di Stefano
2022 IEEE 11th International Conference on Cloud Networking (CloudNet)
Improving QoS through network isolation in PaaS
Al. Di Stefano, An. Di Stefano, G. Morana
Elsevier - Future Generation Computer Systems - Volume 131, June 2022, Pages 91-105
Prometheus and AIOps for the orchestration of Cloud-native applications in Ananke
Al. Di Stefano, An. Di Stefano, G. Morana
2021 IEEE 30th International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE)
EdgeMORE: Improving Resource Allocation with Multiple Options from Tenants
A. Araldo, Al. Di Stefano, An. Di Stefano
IEEE Consumer Communications & Networking Conference (CCNC), Las Vegas (USA) 2020
Resource Allocation for Edge Computing with Multiple Tenant Configurations
A. Araldo, Al. Di Stefano, An. Di Stefano
Proceedings of the 35th ACM/SIGAPP Symposium on Applied Computing, Brno, Czech Republic 2020
Ananke: A framework for Cloud-Native Applications smart orchestration
Al. Di Stefano, An. Di Stefano, G. Morana
2020 IEEE 29th International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE)
Scheduling communication-intensive applications on Mesos
Al. Di Stefano, An. Di Stefano, G. Morana
International Journal of Grid and Utility Computing (IJGUC), 2019
Coope4M: A Deployment Framework for Communication-Intensive Applications on Mesos
Al. Di Stefano, An. Di Stefano, G. Morana, D. Zito
2018 IEEE 27th International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE)
Certifications
- 3rd International Summer School on Deep Learning, Warsaw, Poland
- 01/2019
- VI Mediterranean School of Complex Networks, Salina, Italy
- 07/2019
- Lipari School on Network and Computer Sciences, Lipari, Italy
- 07/2017
- AngularJS certificate, University of Catania
- 01/2016
- Degree in Music Theory, Conservatory of Music "Vincenzo Bellini", Catania, Italy
- 01/2009
Societies
- Scout in the Italian Scout Association "Agesci", Italy
- 01/2000 - 01/2010
- Co-founder of the Scordia Linux User Group, Italy
- 01/2008 - 01/2012
- Hacktivist at Catania GNU/Linux User Group, Italy
- 01/2008 - 01/2017
- Hacktivist at Freaknet Medialab, Italy
- 01/2008 - 01/2017
- Scoutmaster in the Italian Scout Association "Agesci", Italy
- 01/2012 - 01/2020
- Volunteer at MOCA Olografix Camp Hackmeeting, Italy
- 08/2016 - 08/2016
Skills
Programming and Systems:
Python, Golang, Rust, C/C++, GNU/Linux, Docker, Kubernetes, OpenShift
DevOps and Cloud:
Ansible, Terraform, Jenkins, GitLab, AWS, Azure, Google Cloud, Bare Metal
Databases and Observability:
PostgreSQL, MySQL, MongoDB, Redis, Prometheus, Grafana, ELK Stack, Jaeger
Soft Skills:
Team Leadership, Mentoring, Public Speaking, Technical Writing