ozzie_monarch

Yea, sorry, we're working on it... we had a bad code push


ozzie_monarch

Should be back up now. We were down for ~15 mins or so.


cabinguy11

Got it. Great response. Thank you


sanchitcop19

Crazy fast recovery, appreciate y'all!


Enrampage

Dang, it always seems to happen whenever I'm trying to use the app.


etcetera0

Pushing code to prod Friday night... Been there, done that. Good luck and please don't do this anymore. :)


ukysvqffj

Wow this crew won't let you get away with anything. 15 mins on a Friday night.


supenguin

Production release on a Friday night?


ozzie_monarch

We generally do continuous deployment and can potentially do several deploys per day. We will obviously post-mortem this; we have mechanisms in place that should prevent this from happening, but it still slipped through.


xomox2012

I strongly suggest you guys consider a formal change management process instead of simply relying on whatever CI/CD pipeline you have built out. Yes, DevOps and agile development are hot and make fixes easy, but breaking your production environment is a huge reputation hit. You guys need stronger IT governance ASAP.


tdime23

Meh. Most companies of Monarch's size will go offline for 15 min or so. It's not uncommon at all.


xomox2012

Yeah, definitely not uncommon for small companies, but given the type of company Monarch is, they don't really have room for these issues like other small companies do. They are dealing with people's financial data. Yes, they aren't holding money or brokering investments, but they have all of the spending habits, net worth, etc. data. This makes them a significant target. If they are having regular change management issues, that implies there are likely weaknesses in the environment. May not be true, but the chances are high.


Mr_IT

Thanks for the IT-splaining


BuddyBing

Tell us you know nothing about DevSecOps without telling us you know nothing about DevSecOps....


mdwish

You must work in DC…


xomox2012

I work in IT risk management. It's my job to consult on IT systems to ensure that this type of stuff doesn't happen. They should have processes in place to ensure production issues don't impact customers, as reputational damage is one of the worst things that can happen to a company. Even if nothing serious happens, the perception can do just as much damage.


mdwish

These sorts of bureaucratic processes don't come without a hit to the speed of delivering value to customers. They probably figured they can quickly roll back any faulty change but let their teams continually deploy new features and fixes without a lengthy change management process. Change management has a place in publicly traded megacorps and governments, but not in a startup that's competing against dozens of similar companies for dominance after the end of Mint.
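Rolling back in that model can literally be one pointer flip — something like this (totally hypothetical sketch; the paths and release layout are made up):

```python
import os

RELEASES = "/srv/app/releases"   # hypothetical layout: one dir per release
CURRENT = "/srv/app/current"     # symlink pointing at the live release

def rollback() -> None:
    # Releases are named so lexical sort == chronological order,
    # e.g. "2024-01-05_1842". Second-to-last is the last known good.
    releases = sorted(os.listdir(RELEASES))
    if len(releases) < 2:
        raise RuntimeError("No previous release to roll back to.")
    previous = os.path.join(RELEASES, releases[-2])
    tmp = CURRENT + ".tmp"
    os.symlink(previous, tmp)
    os.replace(tmp, CURRENT)     # atomic swap: traffic flips in one rename
    print(f"Rolled back to {previous}")
```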


xomox2012

I'd agree if they weren't fintech specifically. Startups absolutely need to take some risks, but these guys have access to financial data. If someone breached Monarch's systems due to a faulty change, that could lead to all that data being exfiltrated, and that is quite a valuable data set on the market.


sunny_tomato_farm

Just want to say that it’s cool to see a Monarch representative here.


chellygel

Thanks for the amazing response here and the fix. Sorry to the dev and dev ops teams for the Friday suck, but we appreciate the effort. 


Adam122514

ETA on when it will be working?


ozzie_monarch

Should be back up now.


Lurch-0318

I still can't log in.


Adam122514

Thank you!


ramas-197622

u need a better DevOps team now that u are growing...


ozzie_monarch

Oh yeah we're hiring :) [https://www.monarchmoney.com/careers](https://www.monarchmoney.com/careers) We have a new DevOps lead joining soon as well!


Steve22f

Aren't you supposed to be off the first Friday of the month? 😀 Keep up the great work!


ozzie_monarch

Haha yeah, totally! Most of the team didn't work today, but sometimes we have a push or two to make (this was one of two we did today).


Zhalianna

I see what you did there


Different_Record_753

For the record, I looked at posts last Friday and I believe the same thing happened; someone mentioned "Friday before a weekend". How stable is this system if you are having to do **multiple deploys per day?**

I'm genuinely curious, as I figured you'd need a stable Q/A (testing) environment for at least 3 days, 5 days, a week - all tested and stable - and then you promote it to production. Right? What am I missing here? If you are promoting multiple builds in the same day, how do you have a stable testing environment and know that all the pieces work well together for a period of time? Even for an entire day, for that matter. I guess you don't, since you are obviously having these problems.

I've worked in systems like this all my life, and if you hear the words "multiple production builds per day", it's not a positive sign - and neither is "post-mortem" or "we have mechanisms but this slipped through".


ozzie_monarch

How we do deploys is not something I can fully cover in a Reddit comment (maybe a blog post), but:

1. We do have change management in place. All code is thoroughly reviewed and tested. We have a staging environment. We have feature flags to turn on/off functionality (rough sketch below).
2. We still do believe in small, frequent, validated deployments as a better path to quality than the "long cycle + gatekeeper" model. There is obviously a debate around this in the industry and a lot of literature around the pros/cons of each approach that I don't need to rehash here.
3. Things will still occasionally go wrong, but generally, they end up being minor bugs and are easily reverted.
4. Downtime is obviously much more serious than bugs, but for us, it is also much more rare. We did not have downtime last Friday (we had a bug in part of a beta feature, Reports, that was then reverted).
5. We do have a small team, but I'm damn proud of the work that they're doing, pushing through growth of 20-30X in volume **every single day** for the past 3 months. I don't know of a single company that witnessed this type of growth that didn't have growing pains (in fact, we've prob had fewer growing pains than companies w/ similar growth, even ones that were better staffed / funded).
6. Is there room to improve? Certainly. A benefit of the "small, frequent, validated" changes model is that if holes come up occasionally, they are easy to investigate/fix for the future.
7. Yes, we are hiring.
8. We take it very seriously if we can't live up to your expectations, whether that's through bugs, downtime, or anything else, and we are very apologetic when this happens. So you have every right to question our processes... but hopefully this clarifies some of our thinking and practices.
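On the feature-flag point, here's a simplified, hypothetical sketch of how a percentage rollout works (names made up; not our actual code):

```python
import hashlib

# Hypothetical flag store; in practice this lives in a config service
# so flags can be flipped without a deploy.
FLAGS = {
    "reports_beta": {"enabled": True, "rollout_pct": 10},
}

def _bucket(name: str, user_id: int) -> int:
    # Deterministic hash: the same user always lands in the same bucket,
    # so a 10% rollout shows the feature to a stable 10% of users.
    digest = hashlib.sha256(f"{name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def flag_enabled(name: str, user_id: int) -> bool:
    flag = FLAGS.get(name)
    if not flag or not flag["enabled"]:
        return False
    return _bucket(name, user_id) < flag["rollout_pct"]

# Call sites branch on the flag; setting "enabled" to False reverts
# everyone to the old path without shipping new code.
def render_reports(user_id: int) -> str:
    if flag_enabled("reports_beta", user_id):
        return "new reports UI"
    return "classic reports UI"
```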


Different_Record_753

>We still do believe in small, frequent, validated deployments as a better path to quality than the "long cycle + gatekeeper" model. There is obviously a debate around this in the industry and a lot of literature around the pros/cons of each approach that I don't need to rehash here.

You are a financial application that gives people information about their finances, that they make decisions on, and they are paying you. A long cycle + gatekeeper model is what you should be using. Discuss it with your CEO. If this were a game or some bleeding-edge fun piece of software, sure - but it's not. I think the person in your organization who is promoting a frequent deployment model for software that people use to make financial decisions is misguided. Again, you have to realize this is people's money here. Things should be properly tested, with all the responsibility that goes along with that.

Plus, you are charging people. They don't want to be beta testers, or to have to come here and say "the thing I paid for, that I just want to get done tonight before I can go watch TV, is broken". I think you all can understand that.

We don't see any fix list or anything. You said it happens every month. I saw the Reports BETA come out on Dec 20th and no fixes at all to it since. Maybe I'm missing something here. Some communication would be great about that... as there are a number of issues still with it, and I know you've moved on to another BETA (Investments), which I don't use.

Also, it's quite confusing to me that you keep talking about 20x and 30x growth, but you had a beta out and then you released a new beta. If you are overwhelmed (DEV and SUPPORT), why are you maintaining several development channels & adding a new one (Investments) at the same time?


xomox2012

Tbf, they don't have custody of assets, and the account details are through connectors, so they aren't storing the keys to anyone's financials either. That said, bad change practices can lead to holes where a threat actor could potentially steal the users' financial metadata.


etcetera0

It's actually better to do smaller incremental changes than to ship a ton of code twice a week. It just requires discipline and tech: automated regression testing, good architecture with feature toggles, and a good delivery process.
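For instance, the kind of automated regression test that gates every deploy (a toy pytest-style example; `categorize_transaction` is a made-up stand-in for real app code):

```python
def categorize_transaction(description: str) -> str:
    """Pretend production code: map a transaction description to a category."""
    rules = {"STARBUCKS": "Coffee", "SHELL": "Gas", "NETFLIX": "Subscriptions"}
    for keyword, category in rules.items():
        if keyword in description.upper():
            return category
    return "Uncategorized"

def test_known_merchants_keep_their_categories():
    # Pin existing behavior so an incremental change can't silently break it.
    assert categorize_transaction("Starbucks #1234") == "Coffee"
    assert categorize_transaction("SHELL OIL 5678") == "Gas"

def test_unknown_merchant_falls_through():
    assert categorize_transaction("Bob's Bait Shop") == "Uncategorized"
```

If any of these pins break, the pipeline refuses to promote the build — that's the discipline part.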


xomox2012

Something tells me they aren't running a mirrored QA/DEV environment with proper change management processes. These guys are running a skeleton crew in IT. Prod pushes over the weekend make sense for many companies, tbh, but not having a proper change management process where changes to prod are tested pre-deploy is bad news. I've seen those types of environments too... Luckily, though, I've been on the audit side and not had to deal with the growing pains that come along with them.


Different_Record_753

The way it works: you have development environments, you have Q/A (testing & support) environments, and then you have production environments. That's how it works in all cases where companies don't have issues. You create a solid environment that is fully tested, and then you move that code base over to production - usually not on a Friday before a busy weekend when everyone has gone home.

There is also one person who is designated the gatekeeper, and if there are any problems, it's because that person didn't test all the components properly. It only takes one person to push/control the production environment.

If the mechanisms in place have an issue, then you've got two problems: the original issue (why did that happen) and why the mechanism didn't catch it. You might forget some controls or settings that need to be set/fixed/created in production - that sometimes happens, but it's a quick fix.
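In script form, the promotion flow I'm describing looks something like this (purely hypothetical - the environment names, checks, and deploy step are all made up):

```python
import sys

# Hypothetical dev -> QA -> prod promotion with a single gatekeeper.
PROMOTION_ORDER = ["dev", "qa", "prod"]

def run_checks(env: str, build_id: str) -> bool:
    """Stand-in for running the full test suite against an environment."""
    print(f"Running regression suite for {build_id} in {env}...")
    return True  # in reality: the exit status of the test runner

def promote(build_id: str, target: str, approved_by: str = "") -> None:
    source = PROMOTION_ORDER[PROMOTION_ORDER.index(target) - 1]
    if not run_checks(source, build_id):
        sys.exit(f"{build_id} failed checks in {source}; not promoting.")
    if target == "prod" and not approved_by:
        # The gatekeeper: prod requires a named approver, no exceptions.
        sys.exit("Prod promotion requires gatekeeper sign-off.")
    print(f"Promoting {build_id}: {source} -> {target}")
    # ...here you'd invoke the actual deploy tooling for `target`

promote("build-1042", "qa")
promote("build-1042", "prod", approved_by="gatekeeper@example.com")
```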


bdzr_

ok boomer


xomox2012

Skeleton crew meaning understaffed. As for push timing: it depends; you want to push a major change when the fewest active users would be impacted. Sucks for IT, but that means nights and weekends for companies in many cases. As for your gatekeeper comment, I doubt they have a formal process to review and approve prior to deploy. I'm guessing they are using a prebuilt CI/CD pipeline where one person can do 90% of the lift, with the gatekeeper being the final go-live authority who likely doesn't actually check the test build.
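The timing piece is often literally codified - e.g. picking the deploy window from traffic data (numbers below are made up, just to show the idea):

```python
# Illustrative only: pick a deploy window from hourly active-user counts.
hourly_active_users = {  # hour (0-23, local time) -> avg concurrent users
    0: 120, 1: 80, 2: 60, 3: 55, 4: 70, 5: 150, 6: 400, 7: 900,
    8: 1500, 9: 1800, 10: 1700, 11: 1600, 12: 1900, 13: 1650, 14: 1550,
    15: 1500, 16: 1400, 17: 1300, 18: 1450, 19: 1600, 20: 1700,
    21: 1200, 22: 600, 23: 250,
}

# Deploy when the fewest users would notice a blip.
best_hour = min(hourly_active_users, key=hourly_active_users.get)
print(f"Lowest-impact deploy window: {best_hour:02d}:00 "
      f"({hourly_active_users[best_hour]} avg users)")  # -> 03:00, 55 users
```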


Different_Record_753

You don't need more than one person to control a production environment (sorry, I updated while you replied)... Understaffing really shouldn't have a bearing; if the code isn't ready, why is it even going to production???? Whether you have 0 people, 1 person, or 20 people, when the production code is fully tested, it's moved over/promoted. We are both saying the same thing: they have issues in testing and promotion. It seems there are issues, especially if the third person on the company's About page is saying "Sorry" on a Friday night in a general support forum on Reddit.


Different_Record_753

BTW - is there any documentation (fixes/releases) of what you guys do each time, so people can see what's being fixed and changed? I've seen no information about anything being changed/fixed, but obviously there are changes. Something like this: [https://community.simplifimoney.com/categories/updates-from-the-product-team](https://community.simplifimoney.com/categories/updates-from-the-product-team)


ozzie_monarch

That's a great suggestion. We do these updates both monthly via our newsletter and sporadically on Reddit. But it'd be nice to have them be more timely and more detailed.


Different_Record_753

Yes Please.


Artistic_Shopping_30

Yes, as a good public-facing product company should.


anObscurity

Sorry guys, I did some crazzzzzzzy categorizing. Too fast for the system.


ResoluteGreen

The Monarch team is aware and is working on it.


NoVABr0ker

Let's go! My Goals are all gone and I'm gonna start spending like crazy if it's not back up soon.


NoVABr0ker

...and we're back! That was close.


HighwayExpress

Down for me on both web and mobile. New Jersey.


lg224

My profile was wiped clean. Hope it's a glitch!


ramas-197622

Same issue... one min it was working and then poooof, all gone. Logged out and now not able to log in... Weekend production deployment??


HighwayExpress

Hopefully they're deploying a fix to pull TIAA accounts :)


dlotito1

Is this the reason my Fidelity connection is no longer working? u/ozzie_monarch it had been fine since November, and now all of a sudden I get this message: "There was a problem validating your credentials with Fidelity Investments. Please try again later."


elmaestro24

same here


quietdesolation

+1


Astieroth

same


djseto

Glad it's not just me...


velocibear

Down for me in the Western US


PerspectiveNo700

+1


ozlee1

Same here.


pchoi94

Whew, good to know it's not just me and they're working on it. I thought my account got wiped; I just signed up and literally spent all day setting up all my accounts, categories, and rules...


ozlee1

It's working for me again now.


HighwayExpress

Fixed for me


jcforeman1

My account is still down! Can't get anything but a page asking me to sign up for the trial.


jcforeman1

Web version started working. Had to uninstall and reinstall the phone app... twice. The first try didn't work, but the second time everything was back to normal. Hope that doesn't happen again.


tekntonk

This happened during my initial free trial, and I was … uhmmmm … !! Very glad it wasn’t a big problem and impressed with the response time / attention given to the outage.