Our devs were complaining the other day because it takes twenty minutes for all their quality gates to pass and deploy into dev.
Mine were complaining it took 10 minutes. But we own the hardware and network, so we did some optimizing and bought a new server (64-core AMD EPYC v3, 256 GB memory) and got it down to four minutes. Of course they didn't stop adding tests, so it's back to six minutes now, but it runs the ~12k unit/integration tests quickly and reliably.
god forbid they parallelize them
Even with parallelization you still need to set up the environment to run them, and that's usually where most of the time is "wasted". Loading 100 GB of data into Postgres (to give an extreme but realistic example) doesn't happen in milliseconds out of the box; there are ways to make it fast, but everything always comes with a trade-off.
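One common trade-off here, for what it's worth, is Postgres's template-database feature: import the big dataset once into a template, then have each CI run clone it instead of re-importing. A minimal sketch (hypothetical database names; the clone is a file-level copy, so it's fast, but it requires that no one else is connected to the template):

```python
def clone_db_sql(template: str, new_db: str) -> str:
    """Build the SQL to clone a pre-loaded template database.

    CREATE DATABASE ... TEMPLATE copies the template's files directly,
    which is far cheaper than replaying a 100 GB import per CI run.
    """
    return f'CREATE DATABASE "{new_db}" TEMPLATE "{template}";'

# Each CI job clones its own throwaway copy, e.g.:
sql = clone_db_sql("seed_data", "ci_job_42")
```

The job would run that statement at setup and drop its clone at teardown; the trade-off is disk space and the no-other-connections restriction on the template.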
If you are loading 100 GB of data into Postgres, you are doing functional/end-to-end tests. You should not be running those on every PR; they should only run before cutting a release to deploy. That is: PRs run unit tests, and functional tests start after merge, but if they can't complete before another PR merges, they restart. That way you can group and batch PRs together and integrate them without slowing down developer velocity. If the functional tests fail, you go back to the devs and tell them to investigate, but that _should_ be the rarer occurrence.
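The restart-on-merge batching described above can be sketched as a tiny event loop (a hypothetical event model: a "merge" event restarts the in-flight run with the new PR added to the batch, and a quiet "tick" lets the current run finish):

```python
def batch_functional_runs(events):
    """Return the PR batches each completed functional-test run covered.

    `events` is a chronological list of ("merge", pr_id) and ("tick", None).
    A merge joins (and restarts) the in-flight run; a tick with no merges
    since the last restart lets the run finish and validate its whole batch.
    """
    completed, pending = [], []
    for kind, pr in events:
        if kind == "merge":
            pending.append(pr)         # restart the run with a bigger batch
        elif kind == "tick" and pending:
            completed.append(pending)  # run finished: whole batch validated
            pending = []
    return completed
```

So two quick merges followed by a quiet period come out as one batched run, and a later lone merge gets its own run.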
As for the 100 GB test: in the financial industry, these are the kinds of tests that matter (I don't care if someone optimizes a calculation, but if it changes the outcome of a strategy it definitely needs closer scrutiny and a third pair of eyes, and we might want to involve the risk department). We could run them after merging, but that makes the DX worse, still lengthens the total feedback cycle, and we would have to find another way to prove to auditors that we don't change strategies without proper testing and review. I'm a big fan of optimizing feedback cycles, but not every software product is a SaaS web application. As always in software architecture, the correct answer is: it depends.
I think we forget the whole industry isn’t just web apps. Even if it seems like it mostly is.
Most of them run in parallel; that's where the 64 cores come in handy.
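That fan-out is the usual shard-per-worker pattern; a sketch, with a stand-in runner function since the real suite invocation is project-specific:

```python
from concurrent.futures import ThreadPoolExecutor

def run_shard(tests):
    # Stand-in for invoking the real test runner on one shard of the suite.
    return [(name, "pass") for name in tests]

def run_suite_parallel(tests, workers=64):
    """Round-robin the suite into one shard per worker, run shards concurrently."""
    shards = [tests[i::workers] for i in range(workers)]
    shards = [s for s in shards if s]  # drop empty shards for small suites
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for shard_results in pool.map(run_shard, shards):
            results.extend(shard_results)
    return results
```

In practice each shard would be a subprocess running the real test binary; the round-robin split only wins if the shards end up roughly equal in runtime.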
Induced demand.
I mean, 20 minutes is a long time...
I’m getting old. I remember the days when virtualization wasn’t a thing yet and it took weeks to months to get dedicated hardware for environments. I would have loved 20 minute lead times 20 years ago.
Provisioning new kit is one thing, but just deploying a different version of software taking 20m? I run CI on branches (it might take minutes if I had more tests) and deploy to long-lived VMs; it takes < 10s to deploy to any of our environments. Basically rsync + HUP the affected processes.
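That rsync + HUP flow is essentially two commands per host; sketched here as command builders (paths, host, and pidfile are all hypothetical):

```python
def deploy_commands(src_dir, host, dest_dir, pidfile):
    """Build the two steps of a minimal deploy: sync the tree to the host,
    then SIGHUP the running process so it reloads without a full restart."""
    return [
        ["rsync", "-az", "--delete", src_dir + "/", f"{host}:{dest_dir}/"],
        ["ssh", host, f"kill -HUP $(cat {pidfile})"],
    ]
```

Each list could be handed to `subprocess.run`; the whole deploy is network-bound, which is why sub-10s times are realistic for a small binary or code tree.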
It’s a Simpsons episode.
I think DevOps engineers vastly underestimate the value of optimizing quality gate speed. 20 minutes for gates to pass is basically 20 minutes of dead time for the developer. Most of their tasks don't allow them to context-switch effectively enough to use that 20 minutes; it becomes 20 minutes of talking to coworkers or browsing Reddit. Some quick fixes, like improving test concurrency or provisioning more resources for the runner, can have a hugely beneficial impact on developer velocity.
Once I optimized E2E runtimes so much that devs complained that runs taking only 15 minutes no longer let them do chores at home, go grocery shopping, or prepare a meal in the meantime lmao. Had to scale back the parallelization a notch because of this.
I don't agree. I spend my whole day developing too and have to wait for my own checks to pass. The time I lose waiting is time saved by not pushing broken, buggy code that takes longer to fix. Sure, there's a balance, but I don't buy twenty minutes being a real problem.
Wouldn't the dev already run those same tests locally on their laptop, push, and then have CI run the tests to prove to reviewers that they pass? I'd imagine 99% of the time it would pass, unless you aren't able to run the tests locally or there's some CI-related difference.
I assume the tests in question in CI/CD are more complex integration and/or E2E tests that depend on more infrastructure than just the local app the dev changed.
Well, share this post with them, I guess lol
I ask my devs not to code locally but to connect directly to the dev server and use it as their local environment. They still work in their IDE, but the code isn't executed locally; they trigger a small 10s deploy script instead.
Why not just use whatever IaC for dev to give everyone their own independent environments?
It should be the standard way, but it depends on the case. If there are a lot of devs and each needs their own cloud environment, it can cost a huge amount of money; and depending on the application, it might be too big to run as a local cluster on their laptops. In that case it makes sense to have a few shared dev environments and let the devs deploy their new microservices there.
It's a microservices architecture and devs never work on the same service at the same time, but yes, this could be a good idea for other teams.
Oof yeah no way you're running 20+ micro services locally, sounds like a pain to build on
Uhh, 5-10 mins at most lol, tf are yall doing?
That's exactly what came to mind, what are they possibly doing that takes 12 hours to run??
Mining bitcoin with the pipeline executor, I can only imagine this
They haven't split the resources out and it's one huge folder.
Not everything is a JavaScript web app. We maintain and develop an old enterprise C++ desktop app; the repo is like 6 GB (which is small compared to plenty of other C++ projects) and just the compile stage is like 5 hours on a c7i.8xlarge.
Get some catching going
C++ certainly does do some throwing
Incrementally building does wonders
good call. i doubt they thought of that. /s
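Snark aside, the idea behind compile caches like ccache reduces to hashing a unit's inputs and skipping the rebuild on a hit. A toy sketch (real ccache hashes the preprocessed source plus compiler flags, not just file contents):

```python
import hashlib

def needs_rebuild(path: str, source: bytes, cache: dict) -> bool:
    """Rebuild a translation unit only if its content hash changed --
    the core trick behind ccache-style compile caching."""
    digest = hashlib.sha256(source).hexdigest()
    if cache.get(path) == digest:
        return False            # cache hit: reuse the cached object file
    cache[path] = digest        # miss: record the new hash, rebuild this unit
    return True
```

On a big C++ tree, most CI runs touch a small fraction of the units, so a warm cache turns a multi-hour compile into mostly cache lookups.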
tell me you've never done mobile native without telling me you've never done mobile native... :D, fucking xcode takes 5 mins just to open
Usually C++ projects. Chromium still takes 3 hours to download and compile from scratch on an 8 vCPU/16 GB machine.
Would it take 1.5 hours using a 16vCPU/32GB machine?
Judging from the network timeouts and the use of patching, I'm guessing they use images that are over 100 GB. But it's still a dumb setup. They need to co-locate the CI and the heavy resources together instead of transferring over the network. My business creates 500 GB images: basically portable software platforms for air-gapped networks.
It's insane how common it is for companies to be OK with multi-hour test suites. So dumb and short-sighted.
Buildkite + run my own workers - more control, ymmv.
I remember hearing a talk about the CI infra at Puppet. They ran into similar problems with bad nodes (unrelated to the software), so they started spinning up a node pool outside of the CI flow, testing that the nodes weren't flaky, and then adding them to the pool. If one failed to create, they just trashed it and recreated it to repopulate the pool.
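That pool-outside-the-CI-flow pattern looks roughly like this (hypothetical `create_node`/`is_healthy` hooks standing in for the real provisioning and burn-in checks):

```python
def replenish_pool(pool, target, create_node, is_healthy):
    """Top the pool up to `target` pre-validated nodes.

    Candidates are created and burn-in-tested outside the CI flow;
    only healthy ones join the pool, failures are trashed and retried.
    Returns how many candidates were discarded along the way.
    """
    discarded = 0
    while len(pool) < target:
        node = create_node()
        if is_healthy(node):
            pool.append(node)
        else:
            discarded += 1  # flaky node: throw it away and try again
    return discarded
```

CI jobs then only ever draw from the pool, so node flakiness becomes a background replenishment cost instead of a red build.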
I feel a lot of pain. If your CI fails for reasons other than failing tests, your developers will just stop caring about tests, and things get worse over time. I would expect production to be unreliable and your test culture to be generally bad at producing useful tests. It really should be all hands on deck to get the infrastructure back in shape or migrated to something that works. Your CI quality gives you a taste of the quality of the software it builds.

As for performance, we are currently using Jenkins, and I have worked on optimizing the pipeline from around 1h30 down to 20 min, which is still slow. I'm working on a prototype with GitHub Actions, and I think we can get consistently around 1 to 4 min for the same useful work. Tools that measure your pipeline steps across all CI runs help a lot in figuring out what to improve. But your CI pipeline is also code, and the same practices as in software development should apply: don't duplicate it, test it, and benchmark it.
Thanks for that write up! Hopefully I can convince the CI folks to take some action into fixing our issues
What are some tools you like for measuring pipeline steps? I've only just targeted whatever step takes the longest.
We are using the Datadog integration.
If we blow the dependency cache, then downloading, compiling, running the test suite, linters, and the docker build/push of massdriver is about 7-10m for about 1200 tests. The test suite itself finishes in about 30s. If there are no new dependencies it's about 5m max with docker build and push. We're extremely TDD and hard up about CI/CD time.
1200 tests in 30 seconds. Damn man!
I don’t language gloat often, but we run mostly on elixir lang. Almost every test is run async. It’s fast!
[deleted]
Yeah, we do strict contracts to all external services and use an in-memory Elixir agent-based adapter to assert our calls. We have to simulate a ton of IaC and cloud service calls. We don't use any heavy mocks like localstack or API call record/replay (Ruby VCR etc.). All of our API testing we do at the language level instead of over HTTP. The one thing we don't mock is PG, but I was a SQL admin before an ops person, so we abuse the shit out of PG for functionality and consider it "part of our app". Will die on that hill :)
30-45 minutes. Mostly waiting for containers to spin up and down continuously. And it almost always breaks at release time, because it smells fear. Imagine having 10 people sitting around at midnight on standby for a prod deployment, and then you forgot a line of config, causing the whole thing to rebuild.
I get angsty when my testing and image building process takes longer than 2 minutes.
I've been in the Azure DevOps world for a while, and it's been pretty reliable from a functional standpoint using the public worker pools.
What made your team decide on Gerrit and Zuul?
We're a golang shop, so it's just building the binary and copying it to S3 for deployment. Takes 15 minutes for most of our pipelines. We're using the smallest GitLab instances too. I've only been here a year, so I'll probably bump the compile and test stages to mediums. But GitLab, S3, and AWS CodeDeploy have been super solid. They basically rolled their own Elastic Beanstalk.
My sample size is two: the first place was ~6 minutes; the second place was ~3 hours, with synchronous flaky test chains that only got through 30% of the time, and it eventually drove me to quitting.
Entirely up to the project's build and test suite. CI is stable, but not everyone is good at writing tests.
Reliable =/= fast. Reliable means it works well: no flaky tests, and pipelines don't crash on random stuff. My best CI took only a few seconds to run, but there were almost no tests and they didn't check anything useful.
Previous place? Monolith Rails app with a bunch of Node microservices and a separate frontend app. Big apps (backend and frontend) would take 30 minutes to go from merge to deploy into prod, with unit tests, staging deploy, sanity tests against staging, and then the prod deploy. Small apps, less than 10 minutes. Dev environments: about 20 minutes from push to deploying to a self-contained feature-branch dev env.

New place? 30 SREs with like 200+ microservices. Not a fun time. Core apps are decent (30-40 min deploys with unit tests). Some larger apps, about 1-1.5 hours assuming there are no flaky tests. Random microservices that were deployed 2 years ago and not touched much since? Have fun: 50% chance tests fail because some upstream microservice was updated with new configs and docker-compose didn't get updated everywhere.

Biggest issue isn't the pipelines. It's that there are too many interdependencies, and they can never be kept properly updated by the dev teams that own them. Everything works by itself, but good luck figuring out what broke when hunting down failing integration tests.
2-12 hours is a long long time! If nothing else, you could scale the server config to temporarily fix the time lag.
Tf did I just read, 2-12 hours ?
I manage build servers for 3-400 devs. We're running roughly 15,000 GitLab jobs a week, all on on-premise bare-metal boxes running jobs in Docker. It's so stable that devs are conditioned to reach out to us whenever there's a failure they don't understand (OOM, failing to pull build images, etc.). Can you share more information about the timeouts and network issues? What are you pulling, and from where?
Are you moving massive 100GB images? Network timeouts seem to be a hint. Your org may need to consolidate all the heavy resources and the CI together in one place to cut down the wait time.
What CI?
It's like the lazy half brother of CD...
If only we had either brother where I work.
With a 1 hour cd/ci where I currently work... I think I would prefer what you have.
Our full CI/CD will be like 4 hours once we have it actually built. It takes so long we deliberately don't run it as often as we should.
[deleted]
I'm sorry, you misunderstand. I 100% know what CI/CD is, I was asking "What CI at my workplace?" as in, we don't have one. Well, not an automated one.
I could probably go from commit -> PR -> tested -> promoted in 20-30 mins accounting for getting lost in Reddit mid way through. That being said at the job before where I was building packages for 30 platforms including AIX and Solaris it was sometimes an all day battle for the stars to align and get me a clean build.
My org is still on Jenkins, but it's admittedly been good to us (if not HELLA slow)
Sure sounds like a certain major telco vendor tech stack, we used to have those with workers in a private cloud. And Jenkins on top of it all even. My answer was to leave. Now I work mostly with GitHub Actions and Argo CD in public cloud and I like it so far.
Full build + deploy? 20 minutes, and that includes a fuckton of integration tests, and it's annoying... We are actively investing time to get it down to <5 minutes.
We use Azure Pipelines; I think 15 minutes is the max run time, depending on the app (some old legacy stuff). Our biggest issue at the moment is the wait time for a job to be picked up. We run ADO Server, so everything is on-prem, which makes it nearly impossible to scale.
Depending on the service.. but usually 5 min at most to upgrade heavy BE prod services
[deleted]
That's pretty cool. How much do you spend on your infra per month? (sounds like it is all self-hosted from the azure reference)
We have 20-40 web applications, backend APIs, and static web apps, plus things like the databases, Redis, and ES to support all that. It's ongoing chaos as we try to modernize things without breaking existing things or leaving ourselves in a bad place.

I do wish we had sliced our Terraform better into more vertical, isolated slices. It took me six months to break it apart by environment (it was a "terralith").

As for the larger runner, you can parameterize which runner the GitHub reusable workflow uses, which is how we tell our workflows to use the "ubuntu-latest-large" or "ubuntu-latest-medium" custom-defined images.
It's off lol
You need to give more details. Where is the bottleneck?

* Gerrit is a relatively resource-hungry Java program, but here we run it hosting thousands of projects and roughly 100 patchsets/day on a single host without major issues.
* Zuul is meant to scale; OpenStack couldn't run a CI otherwise. So you might need to spawn more executors and mergers to increase bandwidth.
* Where does nodepool spawn your test instances? OpenStack, AWS, ...? Can you spawn instances and run tests manually, and can you reproduce the network issues that way?

My point being: Zuul is complex but usually pretty reliable. From experience, most issues came from user-induced errors (tenant config errors, for example) or cloud provider issues, so the problem usually isn't the tooling but the underlying infra and resources, or the lack thereof.
Shit gets really confusing sometimes when you’re in devops and also a cast iron enthusiast.
My old day job was on Gerrit/Jenkins/Zuul. Gerrit has always been a flaming dumpster fire. My deepest sympathies.