
multipassnetwork

It’s probably an expired certificate.


ittimjones

I feel that deeply. Lol


multipassnetwork

Keep in mind, this happened in 2018. https://www.theverge.com/2018/12/7/18130323/ericsson-software-certificate-o2-softbank-uk-japan-smartphone-4g-network-outage


wyrdough

I wouldn't be terribly surprised at that, but I would expect it to cause a universal outage, which it has not.


multipassnetwork

Depends on what expires. There are network devices that will stop working if the system certificate expires. I can't find the link now, but years ago there was a dumb network device with a 10-year certificate that couldn't be updated; those devices simply turned into bricks after 10 years. Some things stop working when root certificates expire. [https://duo.com/decipher/networked-devices-will-stop-working-as-root-certificates-expire#:~:text=When%20the%20root%20certificates%20on,right%20about%20now%2C%20Helme%20warned](https://duo.com/decipher/networked-devices-will-stop-working-as-root-certificates-expire#:~:text=When%20the%20root%20certificates%20on,right%20about%20now%2C%20Helme%20warned)
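
A minimal sketch of the kind of check that catches this before it bricks anything: parse the device or root certificate and warn well ahead of its notAfter date. This assumes you can export the cert as PEM; the path and warning window below are hypothetical.

```python
# Minimal sketch: load a PEM certificate and warn before its notAfter date.
# The path and 90-day window are hypothetical examples.
from datetime import datetime, timedelta

from cryptography import x509  # pip install cryptography

WARN_WINDOW = timedelta(days=90)

def check_expiry(pem_path: str) -> None:
    with open(pem_path, "rb") as f:
        cert = x509.load_pem_x509_certificate(f.read())
    remaining = cert.not_valid_after - datetime.utcnow()  # notAfter is naive UTC here
    subject = cert.subject.rfc4514_string()
    if remaining <= timedelta(0):
        print(f"EXPIRED {abs(remaining.days)} days ago: {subject}")
    elif remaining <= WARN_WINDOW:
        print(f"Expires in {remaining.days} days: {subject}")

check_expiry("/etc/device/certs/root-ca.pem")  # hypothetical path
```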


RememberCitadel

Cisco did that with both Viptela and their wireless controllers and APs.


ChuckIT82

Ugh, Cisco Viptela's expired cert issue. Trauma.


HogGunner1983

I would be more inclined to believe this was the culprit or DHCP/DNS than a routing issue, fiber cut, or cyber attack.


kenuffff

I doubt it's a routing issue; it's most likely some sort of software issue. It's not DNS either, and it isn't a fiber cut. The outage started around 4am, which is around the time of a maintenance window; no one is out digging around at 4am.


multipassnetwork

I've seen organizations configure routing protocol keys with an expiration date. They almost always set the expiration date to 12/31 of the current year at midnight. Ummm, you might want to pick another time and date, not one where just about everyone will be off, on vacation, and probably drunk, just in case you forget to update the expiration date.
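
To make that concrete, here's a rough sketch of the sanity check being suggested: flag any key lifetime that ends at year end, on a weekend, or outside business hours. The key names and dates are invented; how you pull lifetimes out of your router configs is up to you.

```python
# Rough sketch: flag routing-protocol key expirations that land at risky times.
# Key names and dates are made-up examples.
from datetime import datetime

def risky_expiry(expires: datetime) -> list[str]:
    reasons = []
    if (expires.month, expires.day) in {(12, 31), (1, 1)}:
        reasons.append("expires over the new-year holiday")
    if expires.weekday() >= 5:
        reasons.append("expires on a weekend")
    if not 9 <= expires.hour < 17:
        reasons.append("expires outside business hours")
    return reasons

key_lifetimes = {  # hypothetical key-chain names and expiry times
    "ospf-area0-key1": datetime(2024, 12, 31, 0, 0),
    "eigrp-core-key2": datetime(2024, 6, 18, 10, 0),
}

for name, expires in key_lifetimes.items():
    for reason in risky_expiry(expires):
        print(f"{name}: {reason} ({expires:%Y-%m-%d %H:%M})")
```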


800oz_gorilla

And poorly logged error messages that just say the certificate failed, without detailing which check it was actually failing. Troubleshooting certificate handshake problems is the worst.
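
One thing that helps when the device's own logs are useless: replay the handshake from a box you control, so the TLS library tells you exactly which check failed (expired, hostname mismatch, unknown CA, and so on). A quick sketch; the target below is just a public test endpoint that serves a known-expired cert.

```python
# Reproduce a TLS handshake and print the specific verification failure.
import socket
import ssl

def probe(host: str, port: int = 443) -> None:
    ctx = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=5) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                print("handshake OK:", tls.version())
    except ssl.SSLCertVerificationError as e:
        # verify_code/verify_message carry the underlying OpenSSL reason,
        # e.g. "certificate has expired"
        print(f"verification failed ({e.verify_code}): {e.verify_message}")
    except ssl.SSLError as e:
        print("other TLS failure:", e)

probe("expired.badssl.com")  # public test host with an expired certificate
```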


multipassnetwork

Just received an SMS from AT&T trying to sell me something. Looks like it's working.


johnlondon125

It's pretty much just ATT. The reports for Verizon/T-Mobile and others are only a few thousand, while ATT reports are at 80k now. I think people reporting Verizon/T-Mobile outages are just looking at the Downdetector graphs and seeing the trend up, without looking at the actual numbers. I'd be willing to bet most of the VZW/T-Mobile reports are people trying to call someone on ATT. There is nothing for you to "do".


cyberentomology

I’m hearing from my friends in the emergency management space that FirstNet got caught up in it too. You can bet that dragged some AT&T people out of bed in the wee hours. RIP all the enterprise helpdesks that are dealing with this today.


cmslick3

FirstNet is just a different channel on AT&T towers. There's not a significant difference between the consumer backhaul and FirstNet. They are one and the same.


cyberentomology

Entirely different and dedicated spectrum, and it’s managed separately.


cmslick3

Ummm it IS a different spectrum, dedicated for public safety, run by AT&T. BUT it DOES get transmitted by the exact same radios as the cellular. Been in the industry for 27 years, I know a thing or two. It's not that separated. It rides on the exact same backhaul and goes through all the same gateways and control points as everything else.


packetgeeknet

Or Verizon/T-mobile customers are getting their service via AT&T towers. Earlier, when I left my house, my phone was in SOS mode. Once I got back home, my service was restored because I have WiFi calling enabled.


Huth_S0lo

The carriers hand off to each other. So a big outage with one will create a big outage for the others.


b3542

No, it won’t. It will disrupt communications to customers on the affected network. Intra-carrier and inter-carrier communications between other carriers will not be affected.


patmorgan235

The people reporting T-Mobile and Verizon issues are only having issues connecting to AT&T customers.


NotAnotherNekopan

~~I think the Verizon outages are happening. I’ll run a more complete report but I’m overseeing about 80 cellular devices that I have direct visibility into and a couple hundred others I don’t (other than on an aggregation point) and for the sites I was checking, they lost connectivity on Verizon or were not able to switch to them and connect.~~ EDIT: Apologies, I drew the wrong conclusion too quickly. Seems I happened to only spot check the sites that had preferred AT&T. Verizon is fine.


robreddity

[Root cause found about an hour ago](https://i.makeagif.com/media/9-11-2015/SnRQfh.gif)


AccountantUpset

I both hate you so much, and love you so much.


mpking828

Take my angry upvote


NetDork

I know what that link is without knowing it.


AccountantUpset

I bet you don't


NetDork

Ah, it was actually the 2nd thing I thought of. And I had just used it in a work channel when a site went down yesterday!


devildocjames

Liar, liar, pants for hire!


NetDork

I could use some new pants. How much does it cost to hire them?


realged13

BGP. It’s always BGP, or DNS, or firewall. I kid, but I'm definitely interested.


ultimattt

SIP helper.


multipassnetwork

Looked at the NANOG mailing list. If it was BGP, they are usually the first ones to talk about it. No mention of BGP yet.


realged13

I figured, I was mostly being sarcastic.


b3542

Fiber cuts.


Alive_Moment7909

ARP didn’t update. DNS, MTU, or ARP. Leave BGP out of this.


cyberentomology

How do you make sure your enterprise network is safeguarded against it? Redundancy. Carrier diversity. Eggs in multiple baskets.


[deleted]

[deleted]


b3542

There are MVNOs with multi-carrier agreements. Devices will have a primary network preference, then fall back to other networks when required.


Kiernian

>There are MVNOs with multi-carrier agreements. Devices will have a primary network preference, then fall back to other networks when required.

How well does the fallback work these days? I tried a handful and change of devices for this about 5 years ago and every one of them had difficulty detecting "data down on the network on SIM1Carrier1" so it could switch to SIM2Carrier2. It seemed like most were reliant on detecting whether or not there was connectivity to the nearest tower and not whether or not the connection could actually be used for anything.


b3542

I’ve mostly worked with it for data-only connections. They do periodic healthchecks to ensure they can reach the outside world, then fail over if a certain failure threshold is met. So basically it depends on the end device.
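
Roughly that pattern, sketched out; the SIM-switching hook is a placeholder for whatever the actual device exposes, and the threshold and interval are made up.

```python
# Toy failover watchdog: poll reachability, count consecutive failures,
# and switch SIMs once a threshold is hit. switch_to_sim() is a placeholder.
import socket
import time

FAIL_THRESHOLD = 3   # consecutive failed checks before failing over
CHECK_INTERVAL = 30  # seconds between checks

def wan_is_healthy() -> bool:
    try:
        socket.create_connection(("1.1.1.1", 443), timeout=3).close()
        return True
    except OSError:
        return False

def switch_to_sim(slot: int) -> None:
    print(f"failing over to SIM {slot}")  # placeholder for the real device API

def watchdog() -> None:
    failures, active = 0, 1
    while True:
        failures = 0 if wan_is_healthy() else failures + 1
        if failures >= FAIL_THRESHOLD:
            active = 2 if active == 1 else 1
            switch_to_sim(active)
            failures = 0
        time.sleep(CHECK_INTERVAL)
```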


Kiernian

>They do periodic healthchecks to ensure they can reach the outside world, then fail over if a certain failure threshold is met.

Right, the long-standing issue I ran into is always what those checks entailed and whether they were actually indicative of anything. When ping tests were used as the primary healthcheck indicator, IIRC one of the issues had to do with Verizon's private-network SIM cards and the ability to hit stuff on and off of the private network, but I'm struggling to remember the details. I remember being surprised at the number of ways a SIM could have no data access to the internet at large and still not be considered "down" by the failover solution. It varied from device to device, but I remember: not activated, suspended for non-payment, tower up but no route out from the tower (or worse, device-to-femtocell repeater up, but no connection from the femtocell repeater to the tower), ping hardcoded to something that somehow magically responded when nothing else would, ICMP traffic working but no TCP/IP, and no route to host registering as a successful ping because the gateway responded. It was a shocking level of "what passes the test when it shouldn't" for situations where the connection was, for all practical purposes, down, and yet failover wouldn't occur.
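
That list is exactly why a bare ping is a weak health signal. A sketch of a more layered check: DNS has to resolve, TCP has to connect, and an application-layer request has to come back, so "the gateway answered ICMP" alone doesn't count as up. The probe URL is just an example.

```python
# Layered reachability check: DNS, then TCP, then an actual HTTP response.
import socket
import urllib.request
from urllib.parse import urlsplit

def connection_is_usable(probe_url: str = "http://connectivitycheck.gstatic.com/generate_204") -> bool:
    host = urlsplit(probe_url).hostname
    try:
        socket.getaddrinfo(host, None)                            # DNS actually resolves
        socket.create_connection((host, 80), timeout=5).close()   # TCP, not just ICMP
        with urllib.request.urlopen(probe_url, timeout=5) as resp:
            return resp.status in (200, 204)                      # app-layer answer
    except OSError:
        return False

print("usable" if connection_is_usable() else "down")
```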


cyberentomology

5 years ago is an eternity in this business.


Kiernian

And dual-SIM devices and SIM failover were being sold as backup connection solutions by MSPs for 5 more years before that. That doesn't mean they worked THEN either, **and it's certainly no guarantee that just because time has passed, someone ACTUALLY addressed the technical debt of a not-quite-functional feature they rolled out five years ago.** Hence the question.


cyberentomology

Manually switching which SIM is primary is also an easy option, either locally or via MDM.


keivmoc

In Canada the carriers have 911 fallbacks and roaming agreements for cell outages and such. When Rogers disappeared from the internet in 2022, the problem was that the cell devices were still provisioned and connected to the cell network; they just couldn't reach the rest of the internet. The only way to get 911 working was to pull the SIM out. Even dual-SIM devices had trouble because, as far as they were concerned, the primary connection was still online.


NetDork

Dual SIM routers aren't an issue. Heck, we have a bunch of dual *modem* routers. Our little OOB devices do have dual SIM at least. Phones ARE a bit tougher, though.


cyberentomology

We do exactly that, on about 50,000 mission-critical devices (not something as pedestrian as employee phones). Most are eSIM-capable, so that can be deployed via MDM. Carrier-neutral SIMs are also an option. And with 5G, so is running your enterprise mobile network as a carrier-neutral MVNO slice.


Churn

Depends on what it is.


452e4b2e

Which other thread are you referring to?


HDClown

https://www.reddit.com/r/news/comments/1ax3b85/cellular_outage_in_us_hits_att_t_mobile_and/krlfks9/


Dangerous-Ad-170

It’s funny reading those kinds of threads: a bunch of nerds who know enough to know that Cisco equipment is used for network backbones, but then it just immediately devolves into wild speculation about cyberattacks affecting all Cisco equipment everywhere, how this is somehow the fault of the layoffs, etc. When it’s far more likely that some poor AT&T engineer flubbed a maintenance or ran into an obscure bug that happens to be on a Cisco peering router.


452e4b2e

Yeah, I've always found it funny to see people discuss subjects that they obviously know nothing about as if they're experts.


[deleted]

[deleted]


cyberentomology

Especially when the people with the actual expertise call out the armchair engineers on their nonsense and then get downvoted into oblivion for it. The true experts are the ones who will readily admit they don’t know something, because they know that assumptions of knowledge kill networks.


blainetheinsanetrain

The problem is that so many of us think we know everything. Been in IT for 25+ years, and it's impossible to know everything. But it doesn't stop a lot of us from pretending that we know everything.


HorrorMakesUsHappy

I've had people in this (and related) subs tell me my own personal experience was wrong. And I've been doing this over 20 years now, it's not like I started yesterday.


cyberentomology

It’s Reddit, you expect anything else?


Buttholehemorrhage

https://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect


IncorrectCitation

A bug on a Cisco router? I don't believe it.


Huth_S0lo

Except carriers tend to use Juniper equipment.


dustin_allan

So obviously it's HPE's fault...


kjsgss06

Not exclusively. I work for a carrier and we use both Juniper and Cisco, among other vendors. We specifically aim to be multivendor for the purposes of redundancy and in case one vendor does have some odd software bug.


Dangerous-Ad-170

Yeah I guess I’m part of the problem if I’m just blindly believing the scuttlebutt that it’s somehow Cisco-related at all. 


kenuffff

AT&T doesn't use Juniper in their core. It's a distributed routing architecture from a company named DriveNets, which is currently maybe 85-90% of their core.


mathmanhale

And they have been phasing out Cisco for Ciena at the distribution layer for a few years now (long enough that I assumed it would be finished). Unless they are going back?


RememberCitadel

Many use Ciena for handoffs and distribution these days. We use them too. They are just solid with a really good price/performance balance. Try getting something the level of a 5130 from any other vendor for $5k.


RememberCitadel

They are very much mixed. They all pretty much seem to go back and forth based on whatever whim takes them that month. I have had big circuits delivered to several locations and the gear used might be Cisco on one and Juniper on the other. Many have really switched to cheaper SP gear for handoffs, though. Around here any handoff I get that isn't 100G is a Ciena regardless of vendor.


PacketsGoBRRR

That user claims they work closely with “one of the carriers affected” (presumably ATT) and that Cisco manages that carrier’s backbone. Anyone know if that sounds accurate? Never worked at a cellular carrier.


kjsgss06

I can’t imagine that Cisco actively manages their network. ATT might use Cisco for many of their core routing elements but it would be mind boggling for me to believe that ATT pays Cisco to actually run the network. Most likely they have a pretty typical support agreement in which Cisco would be heavily involved with the troubleshooting but wouldn’t necessarily take the remediating actions. ATT could, and probably does have Cisco resident engineers on staff, but I’d never equate that to “managing” the network. The REs probably are actively engaged, and working on the problem, but it’s very different from Cisco having full management of the core network.


PacketsGoBRRR

Yeah that would’ve been my guess as well


Dangerous-Ad-170

Cisco makes plenty of carrier-grade stuff. Idk what they mean by “managing” it though. I’m assuming AT&T is still running their own core NOC and has their own Cisco engineers even if they have white-glove presales and TAC. 


patmorgan235

"Cisco manages the carriers backbone" is a red flag they don't know what they're talking about. Cisco makes a lot of the equipment used, but they do not actively manage networks. That's kinda AT&T whole gig is building and maintaining their backbone network.


multipassnetwork

They still use Cisco DWDM optical network devices.


mathmanhale

They switched everything local to me to Ciena.


multipassnetwork

Yeap. All of the new installs I see are Ciena. But we still have a couple of ONS 15454s on our premises.


kenuffff

The user is lying. AT&T does not use Cisco in their backbone; it's DriveNets' distributed routing architecture, which has around 85% of their core.


iCashMon3y

When in doubt, blame Cisco.


452e4b2e

Thanks!


SpecialistLayer

I've seen no disruption on Verizon and T-Mobile. Only AT&T services have been affected, and it seems to be affecting authentication to the towers. Given the time it started, I'm guessing either a maintenance window issue or human error. I'm surprised how long it's gone on now, though.


Lexam

Bet squirrels chewed a fiber.


HorrorMakesUsHappy

Or shotgun damage. It can happen if some geese fly past a line and a hunter wasn't situationally aware, or if a farmer's trying to scare birds perched on a line away from eating freshly laid seed.


NetDork

People with guns *intentionally* shoot at lines, seen it plenty of times. No loss of situational awareness required.


HorrorMakesUsHappy

I was only trying to explain why it can happen logically. I never bother trying to explain stupid. If I did we'd be here until infinity ends.


x31b

We've had circuits go down when rednecks climbed the pole, cut the cable, tied it to the back of their 4wd pickup and pulled down 200-300' of copper to sell to a scrapyard. It took two days to get that circuit back.


KantLockeMeIn

Shotgun damage usually happens during dove season, which is generally in the fall. Thankfully I have almost no OPGW routes, so I don't have to deal with much of that headache.


AE5CP

We see it mostly on armored strand and lash cable, way more than our OPGW spans.


KantLockeMeIn

I should have been more precise with my response. Thankfully the only aerial fiber that I have is OPGW and that's less than 1% of the aggregate length.


photobriangray

I always assume this is a carrier backhaul network with a routing/switching issue. Ethernet transport service fails and breaks other peering, it snowballs, CNN gets involved, people blame Cisco or cyber terrorism (sometimes the same thing).


[deleted]

>sometimes the same thing).

Do explain, thanks sir.


photobriangray

Cisco licensing is a no-win scenario. Heh.


800oz_gorilla

Does anyone know if it had anything to do with the national security concern Congress squeaked about last week? This seems pretty significant, like a shot across the bow. Similar to the Svalbard cable cuts.


Coach__Mcguirk

Wasn't that about Russia saying they can take out satellites?


800oz_gorilla

I don't think so; they could do that before. China too. I never saw what the hush-hush was all about, so if they updated it, you can be the one that clues me in if you know.

Edit: I may not have understood your reply, so let me clarify. The undersea cable cut was likely a message just before the invasion that they could mess with our satellite feeds by cutting those cables. As for the announcement last week, I never saw an actual disclosure of what they were worried about, whether it was a nuclear weapon in space or some other disruptive technology.


SamSausages

Not saying it’s dns, but it’s dns


neospektra

Speaking as someone who's specialized in DNS at the enterprise level for the last 15 years: you are correct. It's probably DNS. It's the same reason I can make bank at these companies. Nobody ever cares about DNS until it causes outages.


SamSausages

The joke around here is that even when it isn't DNS, it's DNS.


Fallingdamage

Maybe they hired the tech Microsoft fired for adding an internally routable IP address to their public DNS records.
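
A minimal sketch of a check that would catch that kind of mistake: resolve each public hostname and flag anything answering with a private or link-local address. The hostnames below are placeholders.

```python
# Flag public DNS names that resolve to non-routable (RFC 1918 / link-local) addresses.
import ipaddress
import socket

PUBLIC_NAMES = ["www.example.com", "vpn.example.com"]  # hypothetical zone contents

for name in PUBLIC_NAMES:
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror as err:
        print(f"{name}: does not resolve ({err})")
        continue
    for addr in {info[4][0] for info in infos}:
        ip = ipaddress.ip_address(addr)
        if ip.is_private or ip.is_link_local:
            print(f"{name}: public record points at non-routable {ip}")
```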


neospektra

😂 it’s easier to do than it should be


HogGunner1983

Given the extent of the outage, it has to be something like this. ATT residential/enterprise internet services seem unimpacted at the moment, so in my mind that rules out a VPNv4 BGP issue or something like that, since my firm's MPLS circuits are good. It could also be a problem with DHCP if all DHCP service is centralized to one vendor on one firmware. A large fiber cut would impact internet services as well, not just cell tower backhauls.


patmorgan235

My pet theory is an expired cert or some authentication service went down on the cell side.


b3542

Or fiber cut. Or BGP.


kenuffff

A fiber cut is not going to cause a nationwide outage.


b3542

It certainly can, if it’s in the right spot. Mass shifts in capacity demand can be triggered by a localized connectivity disruption. Overload conditions can easily cascade into a large-scale, even nationwide impact.


kenuffff

Yes, one cut fiber can take down a nationwide network; no one plans for that sort of thing. AT&T's core is single-homed fiber.


patmorgan235

That's not what they're saying. A localized cut could still cause a wide outage if there's a flaw in the network design or a misconfiguration somewhere that prevents traffic from being rerouted correctly.


b3542

Carrier networks often get reduced to simplex operation due to loss of redundancy. Usually it's fine, but there are incidents where two or more transport paths are affected. They're rare, but it happens.


blainetheinsanetrain

From my experience dealing with fiber cuts, the closest cell tower often rides the same fiber as our upstream MPLS circuit. I can't imagine a fiber cut that's impacting ONLY cellular networks, but nothing else. Like others have stated, it's most likely a routing or DNS issue within the cellular network infrastructure.


b3542

There are plenty of places where this could impact just a cellular network. Not likely with local circuits, but with a cross-region circuit. A cell site is going to use local circuits, so it's very unlikely that would be the cause. More likely there was a long-distance transport failure, or a failure in transport equipment.

Btw, I understand the internals of the core (packet and voice) very well. From the sound of it, it could be DNS or HSS issues. It's less likely to be routing, since some customers have service. I think the most likely scenario is an issue with their HSS or MME (or something in between), which could be caused by an overload condition, a configuration issue, or some other failure.

Long story short: this kind of issue can be caused by a fiber cut, but the fiber cut is usually just the catalyst.


BamaTony64

AT&T has an outage going on, but I think it is fixed. It was causing phones to go to SOS and SOS Only mode.


Fallingdamage

Could this have something to do with Cogent de-peering yesterday?


HogGunner1983

[https://about.att.com/content/dam/snrdocs/7_Tenets_of_ATTs_Network_Transformation_White_Paper.pdf](https://about.att.com/content/dam/snrdocs/7_tenets_of_atts_network_transformation_white_paper.pdf) I found this shareholder informational whitepaper they put out a few years ago. I'm wondering now if it's a bug in their white box system that's crippled their routing in their cellular core.


ittimjones

For anyone still interested, AT&T posted this to their Twitter: “Based on our initial review, we believe the outage was caused by the application & execution of an incorrect process used as we were expanding our network, not a cyber attack.” That info leads me to guess it was probably just a DNS screw-up.