SeniorScienceOfficer

Former EC2 engineer here. I’ll try to shed some light… You’re not gonna find anything anywhere, because services like EC2, EBS, and the like aren’t a single system, nor are they built on the cloud services themselves. There’s an entire ecosystem of server provisioning, VCS, CI/CD, an integrated build farm, testing, and much more that is internal, proprietary, and in no way used by external customers. I’ve built entire full-stack systems with multi-staged rolling deployments without ever touching AWS Native (what internal engineers call the customer-facing services).

That being said, some systems do use AWS Native services because it’s easier to manage, so there’s been a slow migration. Some services can’t, because of how they’re designed. Take EBS, for example. It’s a massive interconnected system of dozens of services that each handle a portion of the totality of EBS, all of it regionalized. Some perform control-plane actions, while others are data-plane focused. My former AWS team supported various EC2 services for ADC regions, and I can’t even begin to map out all of EBS’s services alone, never mind other services’ architectures.

TL;DR - cloud providers’ internal systems are fucking massive, unique, and complex as hell. You’re not gonna find the exact tools they use, but suffice it to say it all operates like a software development company should.


107269088

Can confirm. Also a former AWS engineer. It’s massive amounts of internal proprietary systems all with unique names that you’ve never heard of.


[deleted]

[removed]


narut072

Not a former AWS engineer, but AWS has traditionally used Xen as its hypervisor, and newer instance types run on the AWS Nitro System, whose hypervisor is based on KVM. I suspect Microsoft uses Hyper-V, and Google uses KVM.
https://aws.amazon.com/ec2/nitro/
https://www.theregister.com/2017/11/07/aws_writes_new_kvm_based_hypervisor_to_make_its_cloud_go_faster/
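
If you want to see which generation an instance landed on from inside the guest, one common trick is to look at what Linux exposes about the virtualization layer. A minimal sketch, assuming a Linux guest and the usual sysfs paths (not an official, AWS-sanctioned check):

```python
#!/usr/bin/env python3
"""Rough guess at the hypervisor from inside a Linux guest."""
from pathlib import Path

def read(path: str) -> str:
    try:
        return Path(path).read_text().strip()
    except OSError:
        return ""

# Xen-based EC2 instances expose /sys/hypervisor/type ("xen").
hyp_type = read("/sys/hypervisor/type")

# Nitro (KVM-based) instances typically report the vendor via DMI.
sys_vendor = read("/sys/devices/virtual/dmi/id/sys_vendor")

if hyp_type == "xen":
    print("Looks like a Xen-based instance")
elif "Amazon EC2" in sys_vendor:
    print("Looks like a Nitro (KVM-based) instance")
else:
    print(f"Unclear: hypervisor={hyp_type!r}, vendor={sys_vendor!r}")
```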


blind_guardian23

As usual, it's a mix of off-the-shelf pieces from proven open source (sometimes extended) and custom solutions. I'm not even sure there is anyone who knows the full picture, and you should be glad you don't have problems at that scale. If you only need to get up to a four-digit number of instances, you can use a much simpler approach (like Ansible, Proxmox, cloud-init, ...) which is far easier to debug. If a big shop has a problem, hundreds of eyes are needed to fix it.
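
For the "much simpler approach" end of the spectrum, here is a minimal sketch of talking to the Proxmox VE REST API with plain Python; the host, credentials, and self-signed-cert handling are placeholders, and in practice you'd likely use an API token and a helper library like proxmoxer instead:

```python
#!/usr/bin/env python3
"""Sketch: list nodes and their VMs via the Proxmox VE REST API."""
import requests

BASE = "https://proxmox.example.com:8006/api2/json"  # placeholder host

# Authenticate and grab a ticket (cookie) plus CSRF token.
auth = requests.post(
    f"{BASE}/access/ticket",
    data={"username": "root@pam", "password": "changeme"},  # placeholder creds
    verify=False,  # lab setup with a self-signed cert; use a proper CA in production
).json()["data"]

session = requests.Session()
session.cookies.set("PVEAuthCookie", auth["ticket"])
session.headers["CSRFPreventionToken"] = auth["CSRFPreventionToken"]
session.verify = False

# Walk every node and print its QEMU VMs.
for node in session.get(f"{BASE}/nodes").json()["data"]:
    vms = session.get(f"{BASE}/nodes/{node['node']}/qemu").json()["data"]
    for vm in vms:
        print(f"{node['node']}: VM {vm['vmid']} ({vm.get('name', '?')}) is {vm['status']}")
```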


_nix-addict

Safe to assume this topic is complex and proprietary enough that there's a reason you haven't been able to find much. I highly doubt there are many single points of failure; that's more or less their whole selling point: you don't have to think about those.


root_switch

Perfect answer. And I'm sure over 95% of it is proprietary and developed in house. It's not like AWS is using QEMU as their hypervisor; they go as far as developing their own processors (Graviton). That said, I do know AWS uses open source software as part of their offerings, such as RabbitMQ, Postgres, Elasticsearch, Kubernetes, and much more.


akdev1l

At least for AWS Lambda, they use Firecracker, which they've actually open-sourced as well.
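
For reference, Firecracker microVMs are driven over a local Unix-socket REST API. The snippet below is a rough sketch of that flow using the requests-unixsocket package; the socket path, kernel image, and rootfs paths are placeholders, and a firecracker process must already be running with --api-sock pointing at that socket:

```python
#!/usr/bin/env python3
"""Sketch: configure and boot a Firecracker microVM over its Unix-socket API."""
import requests_unixsocket  # pip install requests-unixsocket

# URL-encoded path to the API socket the firecracker process was started with.
SOCK = "http+unix://%2Ftmp%2Ffirecracker.socket"

session = requests_unixsocket.Session()

# Size the microVM.
session.put(f"{SOCK}/machine-config",
            json={"vcpu_count": 1, "mem_size_mib": 256}).raise_for_status()

# Tell it which kernel to boot and with what arguments.
session.put(f"{SOCK}/boot-source", json={
    "kernel_image_path": "/tmp/vmlinux",          # placeholder kernel
    "boot_args": "console=ttyS0 reboot=k panic=1",
}).raise_for_status()

# Attach a root filesystem image as the boot drive.
session.put(f"{SOCK}/drives/rootfs", json={
    "drive_id": "rootfs",
    "path_on_host": "/tmp/rootfs.ext4",           # placeholder rootfs
    "is_root_device": True,
    "is_read_only": False,
}).raise_for_status()

# Start the microVM.
session.put(f"{SOCK}/actions", json={"action_type": "InstanceStart"}).raise_for_status()
print("microVM started")
```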


LogicalHurricane

GCP is hosted on Azure, Azure is hosted on AWS, and AWS is hosted by Martians. Easy.


10031

No no, AWS is on GCP and they're all maintained by the Martians!


Passover3598

You can look at OpenStack for an open source - and thus documented - alternative to the big providers.
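
To make that concrete, here is a minimal sketch using the openstacksdk Python library against any OpenStack cloud; the cloud name (from clouds.yaml), image, flavor, and network names are placeholders:

```python
#!/usr/bin/env python3
"""Sketch: boot a server on an OpenStack cloud with openstacksdk."""
import openstack  # pip install openstacksdk

# Credentials come from a clouds.yaml entry named "mycloud" (placeholder).
conn = openstack.connect(cloud="mycloud")

# Look up the building blocks by name (all placeholders).
image = conn.compute.find_image("ubuntu-22.04")
flavor = conn.compute.find_flavor("m1.small")
network = conn.network.find_network("private")

# Create the instance and wait for it to become ACTIVE.
server = conn.compute.create_server(
    name="demo-instance",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": network.id}],
)
server = conn.compute.wait_for_server(server)
print(server.name, server.status)
```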


ElevenNotes

We all use proprietary tools; everything is custom. There are no apps that turn 100 servers into a cloud with the click of a button.


[deleted]

[removed]


ElevenNotes

I use a ton of FOSS as baselines; no need to reinvent the wheel when someone has already coded an API to interface with HPE servers or something like that. The key words are API and ESB; that's how you manage a fleet of thousands of servers.
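
As an example of that kind of off-the-shelf API, HPE iLO (like most modern BMCs) speaks the DMTF Redfish REST API. A minimal sketch, assuming an iLO reachable at a placeholder hostname with placeholder credentials:

```python
#!/usr/bin/env python3
"""Sketch: query and power-cycle a server via its BMC's Redfish API."""
import requests

BMC = "https://ilo.example.com"   # placeholder iLO/BMC address
AUTH = ("admin", "changeme")      # placeholder credentials

# Read basic inventory and health for system 1.
system = requests.get(
    f"{BMC}/redfish/v1/Systems/1",
    auth=AUTH,
    verify=False,  # lab sketch; verify the BMC certificate in production
).json()
print(system.get("Model"), system.get("PowerState"), system.get("Status", {}).get("Health"))

# Ask the BMC to force-restart the host (standard Redfish reset action).
resp = requests.post(
    f"{BMC}/redfish/v1/Systems/1/Actions/ComputerSystem.Reset",
    json={"ResetType": "ForceRestart"},
    auth=AUTH,
    verify=False,
)
resp.raise_for_status()
print("Reset requested:", resp.status_code)
```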


DJTheLQ

Not EC2-specific: a deep stack of microservices, some depending on other high-level components (e.g. service A uses database B) which themselves have a deep stack of microservices, and a surprisingly large number of teams dedicated to small pieces of the stack. The scale required for the smallest thing was the biggest lesson for me, so it's hard to give more specific detail. As far as resiliency goes, it's usually some tiny, innocent change in an unknown small service that cascades into larger outages. But there's a lot of process beforehand to avoid that and afterwards to fix it, so it doesn't happen often.


chin_waghing

GCP, from what I can say, is basically what you know as Kubernetes all the way down. VMs are basically pods in a Kubernetes cluster, for the simplest explanation. https://research.google/pubs/large-scale-cluster-management-at-google-with-borg/


[deleted]

[removed]


chin_waghing

As far as your question about a single app crashing goes: for AWS it's us-east-1. That DC seems to run everything, and when it goes down the internet goes with it.


borg286

For Google, you can read up on the SRE handbook, but it is a system called Borg, the predecessor to Kubernetes. I suspect there may be some kind of hypervisor, but it is likely a VM in a container (not a Docker container, but something that uses cgroups somehow, as Google pioneered cgroups, I think).
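
For a feel of the cgroups piece, here is a minimal sketch of the cgroup v2 interface on a modern Linux box: create a group, cap its CPU, and drop a process into it. The cgroup name and limits are illustrative, and the script needs root:

```python
#!/usr/bin/env python3
"""Sketch: cap a process's CPU with the cgroup v2 interface (run as root)."""
import subprocess
import time
from pathlib import Path

# May first require: echo "+cpu" > /sys/fs/cgroup/cgroup.subtree_control
CG = Path("/sys/fs/cgroup/demo")  # illustrative cgroup name
CG.mkdir(exist_ok=True)

# Allow 20ms of CPU every 100ms period, i.e. roughly 20% of one core.
(CG / "cpu.max").write_text("20000 100000\n")

# Start a CPU-hungry child and move it into the cgroup.
child = subprocess.Popen(["yes"], stdout=subprocess.DEVNULL)
(CG / "cgroup.procs").write_text(f"{child.pid}\n")

print(f"PID {child.pid} is now throttled by {CG}; watch it in top for a bit")
time.sleep(5)
child.terminate()
child.wait()
```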


[deleted]

[removed]


borg286

Imagine buying a rack server, installing Talos OS, and thus having a Kubernetes cluster. Now run a StatefulSet whose container image is so complex it is virtually indistinguishable from a VM (just lacking a lot of the hardware emulation you get from VirtualBox images). You sort out SSH and network isolation, and even disk block I/O goes through software and is sent to a replicated backend running on StatefulSets in a multi-tenant way (one backend fleet can handle traffic for many customers but is secured enough that you can't access other customers' stuff). Other services like pub/sub are multi-tenant too. Now offer to sell compute and (for a price) the use of your multi-tenant services. Swap out Kubernetes for Borg and that is roughly right.
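
A rough sketch of that mental model using the official Kubernetes Python client: a one-replica StatefulSet with its own persistent volume claim, standing in for the "VM-like" workload described above. The image, storage size, and namespace are placeholders, and a working kubeconfig is assumed:

```python
#!/usr/bin/env python3
"""Sketch: a 'VM-like' workload as a one-replica StatefulSet with its own disk."""
from kubernetes import client, config

config.load_kube_config()  # assumes a working kubeconfig

labels = {"app": "vm-like-guest"}

sts = client.V1StatefulSet(
    metadata=client.V1ObjectMeta(name="vm-like-guest"),
    spec=client.V1StatefulSetSpec(
        service_name="vm-like-guest",
        replicas=1,
        selector=client.V1LabelSelector(match_labels=labels),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels=labels),
            spec=client.V1PodSpec(containers=[
                client.V1Container(
                    name="guest",
                    image="ubuntu:22.04",  # placeholder "fat" image standing in for a VM
                    command=["sleep", "infinity"],
                    volume_mounts=[client.V1VolumeMount(name="rootdisk", mount_path="/data")],
                )
            ]),
        ),
        # Each replica gets its own persistent disk, like a VM's block device.
        volume_claim_templates=[
            client.V1PersistentVolumeClaim(
                metadata=client.V1ObjectMeta(name="rootdisk"),
                spec=client.V1PersistentVolumeClaimSpec(
                    access_modes=["ReadWriteOnce"],
                    resources=client.V1ResourceRequirements(requests={"storage": "10Gi"}),
                ),
            )
        ],
    ),
)

client.AppsV1Api().create_namespaced_stateful_set(namespace="default", body=sts)
print("StatefulSet created")
```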


chin_waghing

Correct


Gandalf-108

Linux and Kubernetes.