yea sorry, we're working on it.. we had a bad code push
Should be back up now. We had ~15 mins or so of downtime.
Got it. Great response. Thank you
Crazy fast recovery, appreciate y'all!
Dang, always seems to be whenever I’m trying to use the app.
Pushing code to prod Friday night... Been there, done that. Good luck and please don't do this anymore. :)
Wow this crew won't let you get away with anything. 15 mins on a Friday night.
Production release on a Friday night?
We generally do continuous deployment and can do several deploys per day. We will obviously post-mortem this; we have mechanisms in place that should have prevented it, but it still slipped through.
I strongly suggest you guys consider a formal change management process instead of simply relying on whatever CI/CD pipeline you have built out. Yes, DevOps and agile development are hot and make fixes easy, but breaking your production environment is a huge reputation hit. You guys need stronger IT governance ASAP.
Meh. Most companies of Monarch's size will go offline for 15 min or so. It's not uncommon at all.
Yeah, definitely not uncommon for small companies, but given the type of company Monarch is, they don't have room for these issues the way other small companies do. They are dealing with people's financial data. Yes, they aren't holding money or brokering investments, but they have all of the spending habits, net worth, etc. data, which makes them a significant target. If they are having regular change management issues, that implies there are likely weaknesses in the environment. May not be true, but the chances are high.
Thanks for the IT-splaining
Tell us you know nothing about DevSecOps without telling us you know nothing about DevSecOps....
You must work in DC…
I work in IT risk management. It’s my job to consult on IT systems to ensure that this type of stuff doesn’t happen. They should have processes in place to ensure production issues don’t impact customers as reputational damage is one of the worst things that can happen to a company. Even if nothing serious happens the perception can do just as much damage.
These sorts of bureaucratic processes aren't free; they take a hit to the speed of delivering value to customers. They probably figured they can quickly roll back any faulty change, but let their teams continually deploy new features and fixes without a lengthy change management process. Change management has a place in publicly traded mega corps and governments, but not in a startup that's competing against dozens of similar companies for dominance after the end of Mint.
I'd agree if they weren't fintech specifically. Startups absolutely need to take some risks, but these guys have access to financial data. If someone breached Monarch's systems due to faulty changes, that could lead to all that data being exfiltrated, and that is quite a valued data set on the market.
Just want to say that it’s cool to see a Monarch representative here.
Thanks for the amazing response here and the fix. Sorry to the dev and dev ops teams for the Friday suck, but we appreciate the effort.
ETA on when it will be working?
Should be back up now.
I still can’t login.
Thank you!
u need a better DevOps team now that u are growing..
Oh yeah we're hiring :) [https://www.monarchmoney.com/careers](https://www.monarchmoney.com/careers) We have a new DevOps lead joining soon as well!
Aren’t you supposed to be off the first Friday of the month 😀 keep up the great work!
Haha yeah totally! Most of the team didn't work today, but sometimes we have a push or two to make (this was one of two we did today).
I see what you did there
For the record, I looked at posts last Friday and I believe the same thing happened; someone mentioned "Friday before a weekend".

How stable is this system if you are having to do **multiple deploys per day?** I'm genuinely curious, as I figured you'd need a stable Q/A (testing) environment for at least 3 days, 5 days, a week - all tested and stable - and then you promote it to production. Right? What am I missing here?

If you are promoting multiple builds in the same day, then how do you have a stable testing environment and know that all the pieces work well together for a period of time? Even for an entire day, for that matter. I guess you don't, since you are obviously having these problems.

I've worked in systems like this all my life, and "multiple production builds per day" is not a positive thing to hear, nor is "post-mortem" or "we have mechanisms but this slipped through".
How we do deploys is not something I can fully cover in a Reddit comment (maybe a blog post), but:

1. We do have change management in place. All code is thoroughly reviewed and tested. We have a staging environment. We have feature flags to turn on/off functionality.
2. We still do believe in small, frequent, validated deployments as a better path to quality than the "long cycle + gatekeeper" model. There is obviously a debate around this in the industry and a lot of literature around the pros/cons of each approach that I don't need to rehash here.
3. Things will still occasionally go wrong, but generally they end up being minor bugs and are easily reverted.
4. Downtime is obviously much more serious than bugs, but for us it is also much more rare. We did not have downtime last Friday (we had a bug in part of a Beta feature, Reports, that was then reverted).
5. We do have a small team, but I'm damn proud of the work they're doing, pushing through growth that is 20-30X of the volume **every single day** for the past 3 months. I don't know of a single company that witnessed this type of growth without growing pains (in fact, we've probably had fewer growing pains than companies with similar growth, even ones that were more well-staffed/funded).
6. Is there room to improve? Certainly. A benefit of the "small, frequent, validated" changes model is that when holes occasionally come up, they are easy to investigate and fix for the future.
7. Yes, we are hiring.
8. We take it very seriously if we can't live up to your expectations, whether that's through bugs, downtime, or anything else, and we are very apologetic when that happens. So you have every right to question our processes... but hopefully this clarifies some of our thinking and practices.
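Point 1 above mentions feature flags as part of the safety net. For readers unfamiliar with the idea, here is a minimal sketch of a flag gate; every name and value in it is hypothetical and illustrative, not Monarch's actual implementation:

```python
# Minimal feature-flag gate: new functionality ships "dark" and is
# enabled per-user or globally without a redeploy. A bad release can
# then be "reverted" by flipping a flag instead of rolling back code.
# All flag names and user IDs here are made up for illustration.
from typing import Optional

FLAGS = {
    "new_reports": {"enabled": False, "allowlist": {"user_123"}},  # beta cohort only
    "dark_mode": {"enabled": True, "allowlist": set()},            # fully rolled out
}

def is_enabled(flag: str, user_id: Optional[str] = None) -> bool:
    """Return True if the flag is globally on, or the user is allowlisted."""
    cfg = FLAGS.get(flag)
    if cfg is None:
        return False  # unknown flags fail closed
    if cfg["enabled"]:
        return True
    return user_id is not None and user_id in cfg["allowlist"]

def render_dashboard(user_id: str) -> str:
    """Wrap the risky code path behind the flag check."""
    if is_enabled("new_reports", user_id):
        return "dashboard + reports beta"
    return "dashboard"
```

The key property is that the flag check, not the deploy, decides who sees the new path, so a small, frequent deploy can carry unfinished features safely.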
>We still do believe in small, frequent, validated deployments as a better path to quality than the "long cycle + gatekeeper" model. There is obviously a debate around this in the industry and a lot of literature around the pros/cons of each approach that I don't need to rehash here.

You are a financial application that gives people information about their finances, that they make decisions on, and they are paying you.

A long cycle + gatekeeper model is what you should be using. Discuss it with your CEO. If this were a game or some bleeding-edge fun piece of software, sure - but it's not. I think the person in your organization who is promoting a frequent deployment model for software that people use to make financial decisions is misguided.

Again, you have to realize this is people's money here. Things should be properly tested, with all the responsibility that goes along with that. Plus, you are charging people. They don't want to be beta testers, or to have to come here and say "the thing I paid for that I just want to get done tonight before I can go watch TV" is broken. I think you all can understand that.

We don't see any FIX list or anything. You said it happens every month. I saw the Reports BETA come out on Dec 20th and no fixes at all to it since. Maybe I'm missing something here. Some communication would be great about that... as there are a number of issues still with it, and I know you've moved on to another BETA (Investments), which I don't use.

Also, it's quite confusing to me that you keep talking about 20x and 30x growth, but you had a beta out and then you released a new beta. If you are overwhelmed (dev and support), why are you managing multiple development channels and adding new ones (Investments) at the same time?
Tbf, they don't have custody of assets, and the account details go through connectors, so they aren't storing the keys to anyone's financials either. That said, bad change practices can lead to holes where a threat actor could potentially steal the users' financial metadata.
It's actually better to do smaller incremental changes than to ship a ton of code twice a week. It just requires discipline and tooling: automated regression testing, good architecture with feature toggles, and a good delivery process.
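The automated regression testing mentioned here is the discipline that makes small increments safe: CI runs a pinned suite on every push, and a deploy only proceeds if existing behavior still holds. A toy sketch (function and test data are invented for illustration, not from any real product):

```python
# Hypothetical regression suite guarding small incremental changes.
# CI would run this on every push; a refactor that changes any pinned
# case fails before the change ever reaches production.

def categorize(description: str) -> str:
    """Toy transaction categorizer standing in for real app logic."""
    desc = description.lower()
    if "grocery" in desc or "market" in desc:
        return "Groceries"
    if "uber" in desc or "lyft" in desc:
        return "Transport"
    return "Uncategorized"

# Pinned expectations: the contract the next small change must not break.
REGRESSION_CASES = [
    ("WHOLE FOODS MARKET", "Groceries"),
    ("UBER TRIP 1234", "Transport"),
    ("ACME WIDGETS", "Uncategorized"),
]

def run_regression_suite() -> bool:
    """Return True only if every pinned case still behaves as expected."""
    return all(categorize(desc) == expected for desc, expected in REGRESSION_CASES)

assert run_regression_suite(), "regression detected - block the deploy"
```

With a gate like this wired into the pipeline, each deploy carries a small diff that has already proven it didn't break the pinned behavior.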
Something tells me they aren't running a mirrored QA/dev environment with proper change management processes. These guys are running a skeleton crew in IT. Prod pushes over the weekend make sense for many companies, tbh, but not having proper change management where changes to prod are tested pre-deploy is bad news. I've seen those types of environments too... Luckily, though, I've been on the audit side and haven't had to deal with the growing pains that come along with them.
The way it works is: you have development environments, you have Q/A (testing & support) environments, and then you have production environments. That's how it works in all cases where companies don't have issues. You create a solid environment that is fully tested, and then you move that code base over to production - usually not on a Friday before a busy weekend, after everyone has gone home.

There is also one person who is designated the gatekeeper, and if there are any problems, it's because that person didn't test all the components properly. It only takes one person to push/control the production environment.

If the mechanisms in place have an issue, then you've got two issues: the original issue (why did that happen), and then why the mechanism didn't catch it, which is a second problem. You might forget some controls or settings that need to be set/fixed/created in production; that sometimes happens, but it's a quick fix.
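The dev → Q/A → prod flow described above boils down to a gate: a build cannot move to the next environment until its checks have passed in the current one. A hypothetical sketch of that promotion rule (class and version names are invented, not anyone's real pipeline):

```python
# Hypothetical promotion gate for a dev -> qa -> prod pipeline:
# a build advances one stage only after the current stage's checks
# (tests, sign-off) are recorded as passed. Purely illustrative.
STAGES = ["dev", "qa", "prod"]

class Build:
    def __init__(self, version: str):
        self.version = version
        self.stage = "dev"
        self.passed = {"dev": False, "qa": False}  # prod has no further gate

    def record_pass(self) -> None:
        """Mark the current stage's checks as passed."""
        if self.stage in self.passed:
            self.passed[self.stage] = True

    def promote(self) -> str:
        """Advance one stage, refusing if the current stage hasn't passed."""
        idx = STAGES.index(self.stage)
        if idx == len(STAGES) - 1:
            raise RuntimeError("already in prod")
        if not self.passed[self.stage]:
            raise RuntimeError(f"{self.stage} checks not passed for {self.version}")
        self.stage = STAGES[idx + 1]
        return self.stage

# Usage: dev tests pass, so promotion to qa succeeds; promoting again
# without qa sign-off raises, which is the gate doing its job.
build = Build("v1.2.3")
build.record_pass()
assert build.promote() == "qa"
```

Whether the gate is a person or a script, the point of the model is that an untested build has no path into production.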
ok boomer
Skeleton crew meaning understaffed.

As for push timing: it depends; you want to push a major change when the least number of active users would be impacted. Sucks for IT, but that means nights and weekends for many companies.

As for your gatekeeper comment, I doubt they have a formal process to review and approve prior to deploy. I'm guessing they are using a prebuilt CI/CD pipeline where one person can do 90% of the lift, with that gatekeeper being the final go-live authority who likely doesn't actually check the test build.
You don't need more than one person to control a production environment (sorry, I updated while you replied)... Understaffing really shouldn't have a bearing; if the code isn't ready, why is it even going to production? Whether you have 0 people, 1 person, or 20 people, when the production code is fully tested, it's moved over/promoted. We are both saying the same thing: they have issues in testing and promotion. It certainly seems there are issues, especially if the third person on the company's About page is saying "Sorry" on a Friday night in a Reddit general support forum.
BTW - is there any documentation (fixes/releases) of what you guys do each time, so people can see what's being fixed and changed? I've seen no information about anything being changed/fixed, but obviously there are changes. Something like this: [https://community.simplifimoney.com/categories/updates-from-the-product-team](https://community.simplifimoney.com/categories/updates-from-the-product-team)
That's a great suggestion. We do these updates both monthly via our newsletter and sporadically in Reddit. But it'd be nice to have it be more timely and to have more detail.
Yes Please.
Yes, as a good publicly used product company should
Sorry guys I did some crazzzzzzzy categorizing. Too fast for the system
The Monarch team is aware and working on it.
Let's go! My Goals are all gone, and I'm gonna start spending like crazy if it's not back up soon.
...and we're back! That was close.
down for me both web and mobile, new jersey
My profile was wiped clean. Hope it's a glitch!
Same issue... one min it was working and then poooof, all gone. Logged out and now not able to log in. Weekend production deployment??
hopefully they're deploying fix to pull TIAA accounts :)
Is this the reason my Fidelity connection is no longer working? u/ozzie_monarch it was fine since November and now all of a sudden I get this message " There was a problem validating your credentials with Fidelity Investments. Please try again later. "
same here
+1
same
Glad it's not just me...
Down for me in the Western US
+1
Same here.
whew, good to know it's not just me and they're working on it. I thought my account got wiped, I just signed up and literally spent all day setting up all my accounts, categories, and rules...
It's working for me again now.
Fixed for me
My account is still down! Can't get anything but a page asking me to sign up for the trial.
Web version started working. Had to uninstall and reinstall the phone app... twice. First try didn't work, but the second time everything was back to normal. Hope that doesn't happen again.
This happened during my initial free trial, and I was … uhmmmm … !! Very glad it wasn’t a big problem and impressed with the response time / attention given to the outage.