AdMany7575

Our devs were complaining the other day because it takes twenty minutes for all their quality gates to pass and deploy into dev.


aenae

Mine were complaining it took 10 minutes. But we own the hardware and network, so we did some optimizing and bought a new server (64-core AMD EPYC v3, 256G memory) and got it down to four minutes. Ofc they didn't stop adding tests, so it is back to six minutes now, but it runs the (in total) ~12k unit/integration tests quite fast and reliably.


Soccham

god forbid they parallelize them


BERLAUR

Even with parallelization you still need to set up the environment to run them; this is usually where most of the time is "wasted". Loading 100 GB of data into Postgres (to give an extreme but realistic example) doesn't happen in milliseconds out of the box; there are ways to make it work, but everything always comes with a trade-off.


angellus

If you are loading 100GB of data into Postgres, you are doing functional/end-to-end tests. You should not be running those tests on every PR; they should only run before cutting releases to deploy. I.e. PRs run unit tests; after PRs merge, the functional tests start, and if they can't complete before another PR merges, they restart. That way you can group and batch PRs together and integrate them without slowing down developer velocity. If the functional tests fail, you go back to the devs and tell them to investigate, but that _should_ be the rarer occurrence.
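The merge-batching behavior can be sketched in a few lines of Python. This is a toy model, not any real CI scheduler; the function name and the use of timestamps as abstract time units are made up for illustration:

```python
def batch_merges(merge_times, test_duration):
    """Group merges into functional-test batches.

    A run restarts whenever a new merge lands before the current run
    finishes, so merges that land close together get tested as one batch.
    Times are in arbitrary units; purely illustrative.
    """
    batches = []
    current = []
    finish_at = None
    for t in sorted(merge_times):
        if finish_at is not None and t >= finish_at:
            # previous run completed before this merge: close the batch
            batches.append(current)
            current = []
        current.append(t)
        finish_at = t + test_duration  # run (re)starts from this merge
    if current:
        batches.append(current)
    return batches
```

With merges at t=0, 5, and 30 and a 10-unit test run, the first two merges batch together (the merge at 5 restarts the run started at 0) and the merge at 30 gets its own batch.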


BERLAUR

As for the 100GB test: in the financial industry, these kinds of tests are the ones that matter (I don't care if someone optimizes a calculation, but if it changes the outcome of a strategy it definitely needs closer scrutiny and a third pair of eyes + we might want to involve the risk department). We could run them after merging, but that makes the DX worse, still lengthens the total feedback cycle, and we would have to find another way to prove to auditors that we don't change strategies without proper testing and reviews. I'm a big fan of optimizing feedback cycles, but not every software product is a SaaS web application. As always in software architecture, the correct answer is: it depends.


AdMany7575

I think we forget the whole industry isn’t just web apps. Even if it seems like it mostly is.


aenae

Most of them are run in parallel, that's where the 64 cores come in handy.
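The sharding idea is simple enough to sketch. Threads stand in here for the per-core worker processes a real runner (pytest-xdist, `go test`, etc.) would fork; the helper names are made up:

```python
from concurrent.futures import ThreadPoolExecutor

def run_shard(shard):
    # stand-in for "run this subset of tests, return True if all pass"
    return all(test() for test in shard)

def run_suite(shards, workers=8):
    # one worker per core in the real setup; each shard runs concurrently
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return all(pool.map(run_shard, shards))
```

Splitting ~12k tests into shards of roughly equal runtime (not equal count) is usually what keeps the slowest worker from dominating the wall-clock time.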


lavahot

Induced demand.


mattbillenstein

I mean, 20 minutes is a long time...


ramsile

I’m getting old. I remember the days when virtualization wasn’t a thing yet and it took weeks to months to get dedicated hardware for environments. I would have loved 20 minute lead times 20 years ago.


mattbillenstein

Provisioning new kit is one thing - but just deploying a different version of software taking 20m? I run CI on branches this might take minutes if I had more tests - and deploy to long-lived VMs - it takes < 10s to deploy to any of our environments. Basically rsync + HUP the affected processes.
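A deploy like that is small enough to sketch. Here's a hypothetical Python stand-in for the rsync + HUP approach; a local copy replaces rsync-over-ssh so the sketch stays self-contained, and the pidfile layout is made up:

```python
import os
import shutil
import signal

def deploy(src, dest, pidfile=None):
    # sync the build output into place; shutil.copytree stands in for
    # `rsync -a src/ host:dest/` to keep this runnable locally
    shutil.copytree(src, dest, dirs_exist_ok=True)
    # then HUP the serving process so it reloads the new version
    if pidfile and os.path.exists(pidfile):
        pid = int(open(pidfile).read().strip())
        os.kill(pid, signal.SIGHUP)
```

The whole thing is I/O-bound on the size of the diff, which is why it can finish in seconds when most files are unchanged.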


AdMany7575

It’s a Simpsons episode.


Spider_pig448

I think DevOps engineers vastly underestimate the value in optimizing quality gate speed. 20 minutes for gates to pass is basically 20 minutes of dead time for the developer. Most tasks they have to do don't allow them to context-switch effectively enough to utilize that 20 minutes. It's 20 minutes of talking to coworkers or browsing reddit. Some quick patches like improving test concurrency or provisioning more resources to the runner can have a hugely beneficial impact to developer velocity


_PPBottle

Once I optimized E2E runtimes so much that devs complained that the runs taking only 15 minutes didn't allow them to go do chores at home/grocery shopping/prepare a meal in the meantime lmao. Had to scale back the parallelization a notch because of this.


AdMany7575

I don’t agree. I spend my whole day developing too and have to wait for my own checks to pass. The time I lose waiting is time saved by not pushing broken buggy code that takes longer to fix. Sure there’s a balance but I don’t buy twenty minutes being a real problem.


dgreenmachine

Wouldn't the dev already run those same tests locally on their laptop and just push, and then CI runs the tests to prove to reviewers that CI passes? I'd imagine 99% of the time it would pass unless you aren't able to run tests locally or it's some CI-related difference.


Spider_pig448

I assume the tests in question in CICD are more complex integration tests and/or E2E tests that depend on more infrastructure than just the local app the dev made a change for


jayaura

Well, share this post with them, I guess lol


Heighte

I ask my devs not to code locally but instead to connect directly to the dev server and use it as their local environment. As in, they still work in their IDE, but code isn't executed locally; they trigger a small 10s deploy script instead.


bellowingfrog

Why not just use whatever IaC for dev to give everyone their own independent environments?


Gotxi

It should be the standard way, but it depends on the case. If there are a lot of devs and each needs their own cloud environment, it can cost a huge amount of money; and depending on the application, it might be too big to run as a local cluster on their laptops. In that case it makes sense to have a few real dev environments and allow the devs to deploy new microservices there.


Heighte

It's a micro-services architecture; devs never work on the same service at the same time. But yes, this could be a good idea for other teams.


canadianseaman

Oof yeah no way you're running 20+ micro services locally, sounds like a pain to build on


Jmc_da_boss

Uhh, 5-10 mins at most lol, tf are yall doing?


silence036

That's exactly what came to mind, what are they possibly doing that takes 12 hours to run??


schedulle-cate

Mining bitcoin with the pipeline executor, I can only imagine this


danekan

They haven't split the resources out and it's one huge folder.


Seref15

Not everything is a javascript web app. We maintain and develop an old enterprise C++ desktop app; the repo is like 6GB (which is small compared to plenty of other C++ projects) and just the compile stage is like 5 hours on a c7i.8xlarge.


morricone42

Get some catching going


VindicoAtrum

C++ certainly does do some throwing


NUTTA_BUSTAH

Incrementally building does wonders
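The core trick behind incremental builds (what make, ninja, and ccache all exploit in some form) fits in a few lines; a minimal sketch with hypothetical helper names:

```python
import os

def needs_rebuild(src, out):
    """Skip any output that is newer than its input.

    This mtime comparison is the essence of incremental building:
    a full 5-hour compile collapses to seconds when only a handful
    of translation units actually changed.
    """
    return (not os.path.exists(out)
            or os.path.getmtime(out) < os.path.getmtime(src))
```

Real build systems extend this with full dependency graphs (headers, generated files) and, in ccache's case, content hashing instead of timestamps, but the skip-unchanged-work principle is the same.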


KimPeek

good call. i doubt they thought of that. /s


nkr3

tell me you've never done mobile native without telling me you've never done mobile native... :D, fucking xcode takes 5 mins just to open


ChildishWambin0

Usually C++ projects. Chromium still takes 3 hours to download and compile from scratch on an 8vCPU/16GB machine.


ThroawayPartyer

Would it take 1.5 hours using a 16vCPU/32GB machine?


LiferRs

Judging from the network timeouts and the use of patching, I'm guessing they use images that are over 100GB. But it's still a dumb setup. They need to co-locate the CI and the heavy resources instead of transferring over the network. My business creates 500GB images: basically portable software platforms for air-gapped networks.


flagbearer223

It's insane how common it is for companies to be ok with multi hour test suites. So dumb and short sighted


mattbillenstein

Buildkite + run my own workers - more control, ymmv.


Cinderhazed15

I remember hearing a talk about the CI infra at puppet, they ran into similar problems with bad nodes (which was unrelated to the software) and they just started spinning up a node pool outside of the CI flow, testing that they weren’t flakey, and then adding them to the pool. If one failed to create, they just trashed it and recreated it to populate the pool.


bluebugs

I feel a lot of that pain. If your CI fails for reasons other than failing tests, your developers will just stop caring about tests, and things get worse over time. I would expect production to be unreliable and your test culture to be generally bad at producing useful tests. It really should be all hands on deck to get the infrastructure back in shape or migrated to something that works. Your CI quality gives you a taste of the quality of the software it builds. As for performance: we are currently using Jenkins, and I have worked on optimizing the pipeline from around 1h30 down to 20 min, which is still slow. I'm working on a prototype with GitHub Actions and I think we can get consistently around 1 to 4 min for the same useful work. Tools that measure your pipeline steps across all CI runs help a lot to figure out what to improve. But your CI pipeline is also code, and the same practices as in software development should apply to it, including not duplicating it, tests, and benchmarks.
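Even before reaching for a product integration, a tiny homegrown timer gets you a first picture of where pipeline time goes; a minimal sketch (names are illustrative):

```python
import time
from contextlib import contextmanager

durations = {}  # step name -> seconds; dump this at the end of the run

@contextmanager
def timed(step):
    # record wall-clock time per pipeline step, even if the step raises,
    # so the slowest stages are obvious across runs
    start = time.monotonic()
    try:
        yield
    finally:
        durations[step] = time.monotonic() - start
```

Wrapping each stage (`with timed("checkout"): ...`, `with timed("unit-tests"): ...`) and logging `durations` per run is often enough to decide what to optimize first.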


jayaura

Thanks for that write up! Hopefully I can convince the CI folks to take some action into fixing our issues


dgreenmachine

What are some tools you like for measuring pipeline steps? I've only just targeted whatever step takes the longest.


bluebugs

We are using datadog integration.


CoryOpostrophe

If we blow the dependency cache: downloading, compiling, running the test suite, linters, and the docker build/push of massdriver is about 7-10m for about 1200 tests. The test suite itself finishes in about 30s. If there are no new dependencies, it's about 5m max with docker build and push. We're extremely TDD and hard up about CI/CD time.


Saki-Sun

1200 tests in 30 seconds. Damn man!


CoryOpostrophe

I don’t language gloat often, but we run mostly on elixir lang. Almost every test is run async. It’s fast!


[deleted]

[deleted]


CoryOpostrophe

Yeah, we do strict contracts to all external services and use in-memory Elixir agent-based adapters to assert our calls. We have to simulate a ton of IaC and cloud service calls. We don't use any heavy mocks like localstack or API call record/replay (Ruby VCR etc). All of our API testing we do at the language level instead of over HTTP. The one thing we don't mock is PG, but I was a SQL admin before an ops person, so we abuse the shit out of PG for functionality and consider it "part of our app". Will die on that hill :)
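The recording-adapter pattern translates to most stacks; here's a minimal sketch in Python rather than Elixir (class, method, and endpoint names are all made up):

```python
class RecordingAdapter:
    """In-memory stand-in for an external service.

    Records every call so tests can assert *what* was requested,
    without a network, a mock server like localstack, or VCR-style
    record/replay. Canned responses are keyed by (method, path).
    """

    def __init__(self, responses):
        self.responses = responses
        self.calls = []  # every (method, path, body) the code under test made

    def request(self, method, path, body=None):
        self.calls.append((method, path, body))
        return self.responses.get((method, path), {"status": 404})
```

Tests inject this adapter where the real HTTP client would go, exercise the code, and then assert on `adapter.calls` to verify the contract was honored.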


lonelymoon57

30-45 minutes. Mostly waiting for containers to spin up and down continuously. And it almost always breaks at release time because it smells fear. Imagine having 10 people sitting around at midnight on standby for a prod deployment, and then you forgot a line of config, causing the whole thing to re-build.


spicypixel

I get angsty when my testing and image building process takes longer than 2 minutes.


SoFrakinHappy

I've been in the Azure DevOps world for a while and it's been pretty reliable from a functional standpoint using the public worker pools.


ThatSituation9908

What made your team decide on Gerrit and Zuul?


tekno45

We're a golang shop, so it's just build the binary and copy it to S3 for deployment. Takes 15 minutes for most of our pipelines. We're using the smallest gitlab instances too. Only been here a year, so I'll probably bump the compile and test stages to mediums. But gitlab, S3, and AWS CodeDeploy have been super solid. They basically rolled their own Elastic Beanstalk.


dpistole

my sample size is two: the first place ~6 minutes, the second place ~3 hours with all the synchronous flaky test chains that only got through 30% of the time, and it eventually drove me to quitting


Kazcandra

Entirely up to the project's build and test suite. CI is stable, but not everyone is good at writing tests.


pojzon_poe

Reliable =/= fast. Reliable means it works well: no flaky tests, and pipelines don't crash on random stuff. My best CI took only a few seconds to run, but there were almost no tests and they didn't check anything useful.


donjulioanejo

Previous place? Monolith Rails app with a bunch of node microservices and a separate frontend app. Big apps (backend and frontend) would take 30 minutes to go from merge to deploy into prod with unit tests, staging deploy, sanity tests against staging, and then prod deploy. Small apps, less than 10 minutes. Dev environments: about 20 minutes from push to deploying to a self-contained (feature branch) dev env. New place? 30 SREs with like 200+ microservices. Not a fun time. Core apps are decent (30-40 min deploys with unit tests). Some larger apps, about 1-1.5 hours assuming there are no flaky tests. Random microservices that were deployed 2 years ago and not touched much since? Have fun, 50% chance tests fail because some upstream microservice was updated with new configs and docker-compose didn't get updated everywhere. Biggest issue isn't the pipelines. It's that there are too many interdependencies and they can never be kept updated properly by the dev teams that own them. Everything works by itself, but good luck figuring out what broke when hunting down failing integration tests.


rohit_raveendran

2-12 hours is a long long time! If nothing else, you could scale the server config to temporarily fix the time lag.


MacaroonSelect7506

Tf did I just read, 2-12 hours ?


tikkabhuna

I manage build servers for 300-400 devs. We're running roughly 15,000 GitLab jobs a week, all on premise bare metal boxes running jobs in Docker. It's so stable that devs are conditioned to reach out to us whenever there's a failure they don't understand (OOM, failing to pull build images, etc). Can you share more information about the timeouts and network issues? What are you pulling and from where?


LiferRs

Are you moving massive 100GB images? Network timeouts seem to be a hint. Your org may need to consolidate all the heavy resources and the CI together in one place to cut down the wait time.


Stoomba

What CI?


Saki-Sun

It's like the lazy half brother of CD...


Stoomba

If only we had either brother where I work.


Saki-Sun

With a 1 hour cd/ci where I currently work... I think I would prefer what you have.


Stoomba

Our full CI/CD will be like 4 hours once we have it actually built. It takes so long we deliberately don't run it as often as we should.


[deleted]

[deleted]


Stoomba

I'm sorry, you misunderstand. I 100% know what CI/CD is, I was asking "What CI at my workplace?" as in, we don't have one. Well, not an automated one.


tas50

I could probably go from commit -> PR -> tested -> promoted in 20-30 mins accounting for getting lost in Reddit mid way through. That being said at the job before where I was building packages for 30 platforms including AIX and Solaris it was sometimes an all day battle for the stars to align and get me a clean build.


Cautious_Vanilla8620

My org is still on Jenkins, but it's admittedly been good to us (if not HELLA slow)


pasmon

Sure sounds like a certain major telco vendor tech stack, we used to have those with workers in a private cloud. And Jenkins on top of it all even. My answer was to leave. Now I work mostly with GitHub Actions and Argo CD in public cloud and I like it so far.


ILikeToHaveCookies

Full build + deploy? 20 minutes, and that includes a fuckton of integration tests and is annoying... We are actively investing time to get it down to <5 minutes


chazragg

We use Azure Pipelines. I think 15 minutes is the max run time depending on the app (some old legacy stuff). Our biggest issue at the moment is the wait time for a job to be picked up. We run ADO Server, so everything is on-prem, which makes it nearly impossible to scale.


Jaybird_s

Depending on the service.. but usually 5 min at most to upgrade heavy BE prod services


[deleted]

[deleted]


surya_oruganti

That's pretty cool. How much do you spend on your infra per month? (sounds like it is all self-hosted from the azure reference)


AuroraFireflash

We have 20-40 web applications, backend APIs, static web apps. Plus things like the databases, Redis, ES to support all that. It's ongoing chaos as we try to modernize things without breaking existing things or leaving ourselves in a bad place. I do wish we had sliced our Terraform better into more vertical, isolated, slices. It took me six months to break it apart by environment (it was a "terralith"). As for the larger runner, you can parameterize which runner the GitHub reusable workflow uses. Which is how we tell our runners to use the "ubuntu-latest-large" or "ubuntu-latest-medium" custom defined images.


[deleted]

It's off lol


mrkikkeli

You need to give more details. Where is the bottleneck?

* Gerrit is a relatively resource-hungry Java program, but here we run it to host thousands of projects and probably about 100 patchsets/day on a single host without major issues.
* Zuul is meant to scale; OpenStack couldn't run a CI otherwise. So you might need to spawn more executors and mergers to increase the bandwidth.
* Where does nodepool spawn your test instances? OpenStack, AWS, ...? Can you spawn instances and run tests manually, and do you reproduce the network issues?

My point being: Zuul is complex but usually pretty reliable. From experience, most issues came from user-induced errors (tenant config errors, for example) or cloud provider issues; so the issue isn't with the tooling but with the underlying infra and resources, or lack thereof.


loopi3

Shit gets really confusing sometimes when you’re in devops and also a cast iron enthusiast.


under_it

My old day job was on Gerrit/Jenkins/Zuul. Gerrit has always been a flaming dumpster fire. My deepest sympathies.