Aaryaman Katoch — Senior Cloud Security & Reliability Engineer

About Me

The engineer behind the uptime

I'm the person teams call when production needs to be bulletproof. Over the past 5+ years, I've built and hardened cloud platforms across GCP, AWS, and Azure for enterprises that can't afford outages: healthcare systems under HIPAA, financial services needing 99.99% availability, and fast-scaling startups breaking through traffic ceilings.

My day-to-day sits at the intersection of security, reliability, and velocity. I architect infrastructure that self-heals, write pipelines that ship code safely in minutes, and build observability stacks that catch issues before customers notice. When something goes wrong at 3 AM, I'm the one writing the blameless postmortem the next morning — and making sure it never happens again.

I'm as comfortable in a customer-facing room as I am in a terminal. I've led technical discovery workshops, designed reference architectures for C-suite stakeholders, and translated business pain points into cloud solutions that actually ship. I've directly driven $5M+ in signed engagements from proof-of-concept through production go-live, including FinOps initiatives that cut $2M+ in annual cloud spend by 30–40% through rightsizing, committed use discounts, spot/preemptible workloads, and idle resource cleanup.

I hold a Master's in Computer Science from Stevens Institute of Technology, backed by four Google Cloud Professional certifications spanning Architecture, DevOps, Data Engineering, and Network Engineering. I've also mentored a cohort of six junior engineers, all of whom were promoted to L2.

Security as Code

Compliance isn't a checkbox — it's woven into every Terraform module, every pipeline gate, and every IAM policy I write.

Toil Killer

If I do it twice, I automate it. GitOps workflows, SOAR playbooks, self-healing infra — manual is the enemy of reliable.

Force Multiplier

Grew 6 junior engineers into promoted L2s through pairing, code reviews, and blameless RCA culture. Teams I touch ship faster.

Cloud FinOps

Optimized $2M+ in annual cloud spend through rightsizing, committed use discounts, spot/preemptible workloads, and idle resource cleanup, delivering a 30–40% cost reduction without SLA degradation.

Trusted Technical Advisor

Led discovery workshops, designed reference architectures for C-suite stakeholders, and drove $5M+ in signed engagements from proof-of-concept through production go-live.

My Approach

How I think about reliability

01

SLOs Over Gut Feelings

Every system I own has clearly defined Service Level Objectives. Error budgets drive release decisions, not hunches. When the budget is healthy, we ship aggressively. When it's burning, we pause and stabilize.

02

Observability is Non-Negotiable

You can't fix what you can't see. I instrument everything (metrics, traces, structured logs) and build dashboards that tell a story. The goal is to detect anomalies before they become incidents, and resolve incidents before they become outages.

03

Automate the Boring, Own the Hard

Toil is the tax on your team's creativity. I relentlessly automate repetitive operational work — deployments, certificate rotations, scaling decisions, remediation — so engineers can focus on building, not babysitting.

04

Security Shifted Left, Not Bolted On

Security controls belong in the CI pipeline, not in a quarterly review. Policy-as-code, least-privilege IAM, encrypted-by-default, vulnerability scanning at build time — hardening happens before merge, not after breach.

05

Blameless Postmortems, Always

Incidents are learning opportunities, not blame games. Every outage gets a structured RCA focused on systemic causes — what controls failed, what monitoring gaps existed, and what changes prevent recurrence.

06

Infrastructure as Code or It Doesn't Exist

If it's not in a Terraform plan or a Helm chart, it's a liability. Reproducible, version-controlled, peer-reviewed infrastructure is the only kind I trust in production.

07

Every Dollar Should Earn Its Keep

Cloud spend without visibility is just waste. I build FinOps practices into every engagement — rightsizing underutilized resources, leveraging committed use discounts, scheduling non-critical workloads, and setting up billing alerts that keep stakeholders informed before costs spiral.

08

Start with the Customer's Problem

The best architecture starts with listening, not building. I invest time understanding a customer's business constraints, compliance landscape, and growth trajectory before touching a single config file. Technical excellence means nothing if it doesn't solve the real problem.

Career

Building reliability from the ground up

Senior Cloud Security & Reliability Engineer

Searce Inc

Houston, TX Jul 2025 — Present

Architected an Agentic AI-powered cloud migration platform that autonomously discovers, assesses, and migrates AWS, Azure, and on-prem workloads to Google Cloud, generating production-ready Terraform IaC and reducing multi-million-dollar migration timelines by 50% while hardening security posture and compliance readiness by 40%.
Spearheaded enterprise-wide HIPAA and SOC 2 compliance transformation for healthcare clients — implementing Zero Trust architecture, centralized IAM with least-privilege RBAC, encryption at rest and in transit, and automated audit logging. Achieved 100% audit readiness with zero critical findings across three consecutive annual assessments.
Engineered fully automated CI/CD pipelines (Jenkins, Argo CD, SonarQube, JFrog Artifactory) with SAST/DAST security gates, container image signing, and GitOps-driven promotion across dev/staging/prod Kubernetes clusters — cutting release cycles from 2 weeks to 2 days with zero-downtime deployments.
Optimized $2M+ annual cloud spend across enterprise accounts through FinOps practices — rightsizing, committed use discounts, spot/preemptible workloads, idle resource cleanup, and real-time billing alerts — delivering sustained 30–40% cost reduction without SLA degradation.
Championed pre-sales and solution architecture as the primary technical advisor to C-suite stakeholders, leading discovery workshops, designing reference architectures, and driving $5M+ in signed engagements from initial proof-of-concept through production go-live and ongoing optimization.

Agentic AI, Terraform, HIPAA, SOC 2, Zero Trust, Jenkins, Argo CD, Kubernetes, GitOps, FinOps, SAST/DAST

Cloud Reliability Engineer

Searce Inc

Houston, TX May 2023 — Jun 2025

Deployed a cloud-native intrusion prevention system (IPS) using NGFW appliances and Terraform IaC, integrating threat intelligence feeds, auto-remediation workflows, and policy-as-code enforcement to block 99% of malicious traffic with zero false positives in production.
Established a blameless postmortem culture and authored incident response runbooks aligned with SRE best practices, reducing MTTR by 35% and improving system availability to 99.95% through structured root cause analysis and preventive action tracking.
Mentored six junior engineers in cloud architecture, security best practices, and Terraform module development through pairing sessions and code reviews — promoting all to L2 within 12 months and improving team delivery velocity by 20%.

IPS/IDS, NGFW, Terraform, Policy-as-Code, SRE, Postmortems, Mentoring

Cloud Engineer

Searce Cosourcing Services Pvt. Ltd.

Pune, India Jan 2021 — Jul 2022

Designed and delivered 10+ proof-of-concept architectures on AWS & Azure, presenting technical feasibility and cost analysis to C-suite stakeholders — converting 5 prospects into signed engagements and growing the sales pipeline by 10%.
Conducted technical discovery sessions with prospective customers to map existing workloads, identify migration blockers, and scope end-to-end solution architectures — bridging the gap between sales and engineering delivery teams.
Configured continuous integration via GitHub Actions for tag-based and scheduled deployments with automated testing and artifact publishing, improving deployment efficiency by 70%, and added real-time status notifications in Slack.
Triaged production incidents through PagerDuty on-call rotation, performing real-time diagnostics with log correlation and metric analysis, coordinating cross-team response, and documenting root causes to prevent recurrence.

AWS, Azure, GitHub Actions, PagerDuty, Pre-Sales, Solution Architecture, On-Call

Projects

Things I've built and shipped

2025

Enterprise Cloud Migration to GCP

Refactored monolithic applications into containerized microservices on GKE with horizontal pod autoscaling, wired up blue/green deployments through Istio service mesh with full rollback capability, and automated multi-region GCP provisioning — all driven by GitLab CI and Terraform. Built a unified observability platform (ELK + Prometheus + Grafana) with custom alerting rules and SLO-based dashboards that reduced mean time to detect (MTTD) by 40% and accelerated incident response across all environments.

8x throughput gain (400 → 3,200 RPS)

60% faster releases via blue/green

40% faster detection (MTTD)

Terraform, Helm, Istio, GKE, ELK Stack, Prometheus, Grafana, SCC

2025 — 2026

Security Operations — SIEM & SOAR Platform

Directed end-to-end Google Security Operations (SIEM & SOAR) implementation for a regulated healthcare enterprise — designing log ingestion architecture, IAM strategy, and threat detection framework to process 50K+ EPS and 2+ TB/day of security telemetry. Integrated 30+ log sources via BindPlane with UDM normalization, authored 45+ YARA-L detection rules, 12 automated SOAR playbooks, and 8 SOC dashboards with real-time KPIs — reducing MTTR by 40% and enabling 24/7 threat visibility.

50K+ events ingested per second

2+ TB logs processed daily

40% faster response (MTTR)

Google SecOps, SIEM, SOAR, BindPlane, YARA-L, PagerDuty, UDM

2025

Disaster Recovery & Business Continuity

Provisioned automated backup, cross-region replication, and event-driven failover pipelines across Cloud SQL, Firestore, Firebase, and GCS, achieving 99.99% availability with zero data loss (RPO 0) and a sub-minute recovery time objective (RTO). Secured compliance posture by integrating Security Command Center and Cloud Monitoring for real-time vulnerability alerting, HIPAA/SOC 2 audit tracking, and self-healing auto-remediation workflows.

99.99% availability target

0 data loss (RPO)

Automatic cross-region failover

Cloud SQL, Firestore, Cloud Functions, Pub/Sub, SCC, SOC 2, HIPAA

2025

HIPAA & SOC 2 Compliance Transformation

Spearheaded enterprise-wide HIPAA and SOC 2 compliance transformation for healthcare clients — implementing Zero Trust architecture, centralized IAM with least-privilege RBAC, encryption at rest and in transit, and automated audit logging. Achieved 100% audit readiness with zero critical findings across three consecutive annual assessments.

IAM, HIPAA, SOC 2, Zero Trust, RBAC, Compliance

2022 — Present

Terraform Master Modules for GCP

Building and maintaining a library of reusable, opinionated Terraform modules that serve as the foundation for all GCP resource provisioning. Covers compute, networking, IAM, storage, and database resources with built-in security defaults, tagging conventions, and compliance guardrails — enabling any team to spin up production-grade infrastructure in minutes instead of days.

Terraform, GCP, IaC, Modules, HCL

2021

AWS to GCP Migration

Migrated customer-facing applications and their backing databases from AWS to Google Cloud Platform — handling networking, data transfer, DNS cutover, and validation. This was my first large-scale migration and where I developed the playbook and instincts I still use today.

AWS, GCP, Migration, Database, DNS

Lab & Personal Work

Side projects & academic exploration

Search Engine with Elasticsearch

Built a full-text search engine backed by Elasticsearch and Kibana, with custom analyzers, relevance tuning, and a query interface. Applied text mining techniques to optimize result ranking.

Elasticsearch, Kibana, Text Mining

NLP Tag Ranker (BERT)

Fine-tuned a BERT-based language model to automatically rank and predict relevant tags for programming questions — turning unstructured text into structured, searchable metadata.

BERT, NLP, Python, Transformers

Topic Discovery with pLSA

Implemented Probabilistic Latent Semantic Analysis with the EM algorithm to identify underlying programming languages in synthetic code snippets — a probabilistic approach to code classification.

pLSA, EM Algorithm, NLP, Python

TF-IDF vs BM25 Benchmark

Compared TF-IDF and BM25 retrieval algorithms on the large-scale LinkSO community Q&A dataset, analyzing precision, recall, and ranking quality for question-answer similarity matching.

Python, Jupyter, Information Retrieval

Airline Booking Platform

Full-stack flight booking web app with search, seat selection, and reservation management. Built with vanilla JavaScript on the frontend and MongoDB for persistent storage.

JavaScript, MongoDB, Bootstrap, Node.js

Web Music Player

Browser-based music player with playlist management, playback controls, and a responsive UI — a clean exercise in DOM manipulation and audio API integration.

HTML, CSS, JavaScript

Flight Departure Widget

Real-time flight departure board powered by a third-party API from RapidAPI. Displays live departure data with a clean, airport-style interface.

Node.js, REST API, HTML/CSS

Student Results DBMS

Database-driven web app for managing student records and academic results — featuring CRUD operations, search, and reporting. Built with Java, MySQL, and a web frontend.

Java, MySQL, JavaScript, HTML/CSS

Toolkit

Technologies in my daily rotation

Cloud & Infrastructure

GCP, AWS, Azure, Terraform, Ansible, Docker, Podman, Kubernetes, Helm, Istio, VMware, Linux, Bare Metal, Hybrid Cloud, GPU Clusters, Vagrant, Packer, PXE Provisioning

Security

SIEM / SOAR, NGFW, IDS / IPS, UEBA, EDR, XDR, WAF, DLP, Nessus, Qualys, Wireshark, Zero Trust, RBAC, LDAP / SSO, Secrets Management, API Key Governance, Network Segmentation, MITRE ATT&CK, OWASP

Compliance & Governance

HIPAA, SOC 2, NIST, PCI-DSS, ISO 27001, GDPR, YARA-L, Policy-as-Code, CIS Benchmarks, IAM, ITIL, Change Management

CI/CD & Automation

Jenkins, Argo CD, GitHub Actions, GitLab CI, Bitrise CI, Tekton, SonarQube, Skaffold, JFrog Artifactory, MLOps

Observability & Monitoring

Prometheus, Grafana, ELK Stack, Loki, Fluentd, PagerDuty, Datadog, Splunk, Alertmanager, Looker, OpenTelemetry, SLOs / SLIs, Capacity Planning

Databases & Storage

PostgreSQL, MySQL, MongoDB, Redis, Cassandra, DynamoDB, Neo4j, InfluxDB, Elasticsearch

Languages & Scripting

Python, Go, JavaScript, TypeScript, Bash / Shell, PowerShell, HCL (Terraform), Groovy, YAML, SQL

Networking

TCP/IP, UDP, DNS, DHCP, HTTP/HTTPS, TLS / mTLS, Load Balancing, VPN, CDN, VPC, NAT, Service Mesh, BGP, OSPF, RoCEv2, InfiniBand, Firewall Rules

Web & Frameworks

React, Node.js, Express.js, REST API, Spring Boot, HTML5, CSS3, Bootstrap, NPM, Tomcat

AI/ML & Platform Ops

AI Platforms, GPU Compute, NVIDIA / CUDA, MCP Servers, Model Serving, Inference Services, MLflow, Kubeflow, Runbooks, Playbooks

Credentials

4x Google Cloud Professional Certified

Professional Cloud Architect

Google Cloud

Designing and governing scalable, resilient, and secure cloud solutions end-to-end

Professional DevOps Engineer

Google Cloud

Building and operating continuous delivery systems with site reliability best practices

Professional Data Engineer

Google Cloud

Architecting data pipelines, processing systems, and machine learning workflows

Professional Network Engineer

Google Cloud

Engineering robust, secure, and high-performance network architectures at scale

Education

Where it started

M.S.

Master of Science in Computer Science

Stevens Institute of Technology

Hoboken, NJ

Class of 2024

Cloud Computing, Database Systems, Text Mining & NLP, Web Development

B.E.

Bachelor of Engineering in Computer Science

RV College of Engineering

Bengaluru, India

Class of 2021

Operating Systems, Algorithms, UNIX, Parallel Programming, Database Design

Contact

Let's talk

Looking for my next challenge in SRE, DevSecOps, Customer Engineering, Platform Engineering, or Solutions Architecture. I love helping customers solve hard infrastructure problems, whether that's in a pre-sales room or a production war room.

Email: aaryamantpkatoch@gmail.com
LinkedIn: linkedin.com/in/aaryaman-katoch
Phone: 551-254-8477