Browsed by
Month: April 2018

Telegram’s infrastructure and outages. Some updates.

Telegram’s infrastructure and outages. Some updates.

This post is meant as an (ongoing) sequence of updates to the previous one about Telegram’s outages in March and April 2018. Please read it here first.

Last updated: April 30th, 6:00 AM UTC

UPDATE 1

With the help of a friend (and his own HowIsResolved), we managed to confirm that for most open resolvers worldwide (25k+ tested) api.telegram.org is showing up as 149.154.167.220. Only outliers seem to be China (resolving as of now as 174.37.154.236) and Russia (85.142.29.248).

UPDATE 2

During my analysis this morning I created a new Telegram App, and the (only) suggested MTProto (the Telegram protocol) server was 149.154.167.50. This falls into the IP range analysed above, and seems to be solely located in Amsterdam.

Kipters was so kind to review and help me notice that this server is not used for real “data” communication, but just for a “discovery” API call (help.getConfig method) which will return the list of servers that will have to be used for sending messages. We are currently still in process of comparing ranges received across the world, but in the best case scenario (ie: they are spread over multiple geographic locations) this would mean that there is still a single point of failure in the hardcoded “directory” server.

UPDATE 3

What I found in the previous note was “too weird to be true”, so I went ahead and kept digging into TDLib and the official Desktop and Android Apps, to confirm wether they were bootstrapping a session beginning from a single MTProto endpoint or not.

Fortunately, turns out this is not the case (relevant snippets for TDLib, DesktopAndroid Apps): both of them contain, hardcoded, in addition to endpoints in the range 149.154.167.0/24 (Amsterdam, AS62041), endpoints in 149.154.175.0/24 (Miami, AS59930) and 149.154.171.0/24 (Singapore, AS62014).

Sounds like we should look into different reasons why many users worldwide outside of EMEA had issues today (or wait for an official, detailed post mortem if it will ever come): there are many, from broken dependencies to weird cases of mis-routing.

Some areas are left to explore (feel free to share your ideas if you have any): why third party apps don’t have access to the whole list of “initial” MTProto endpoints, and are pushed to use only a single, non redundant one? Why the main website and api.telegram.org (mainly used for bots I think) are based off a single location?

UPDATE 4

Telegram Web (https://web.telegram.org/) seems to be single-homed in Amsterdam too. As I haven’t had the opportunity to test during the outage, I don’t know whether it has been failed over somewhere else or not.

UPDATE 5

According to the official documentation, users (registered by phone number) are located off a single datacenter, picked at signup time based on geographical proximity: “During the process of working with the API, user information is accumulated in the DC with which the user is associated. This is the reason a user cannot be associated with a different DC by means of the client.

They are only moved if they keep connecting from a remote location for a prolonged period of time (ie: you permanently relocate to another continent): this might explain why there seem to be no failover scenario and 12+ hours outages are happening.

(Thanks to adjustableneutralism from Reddit for flagging)

Telegram is down (again): a deep look at their infrastructure.

Telegram is down (again): a deep look at their infrastructure.

I’ve been a strong Telegram advocate since its launch in 2013, mainly because of the advanced features and technical state of the art compared to competitors – as a consequence, I’ve been looking very closely at their infrastructure for the last few years.

The two large scale outages that recently hit their users and the sequence of events following them made me ask some questions around their platform. For most of them I have only found additional question marks rather than answers, but here it is what I have so far.

Let’s start from the outages: in case you missed that, on March 29th and April 29th this year, Telegram went down in their Amsterdam datacenter due to a power failure, causing disruptions, according to their official communications, to users in EMEA, MENA, Russia and CIS.

Zooming in on the latter: it’s still ongoing at time of writing this article (8:30AM UTC), and is showing up with clients unable to connect to the platform and both https://www.telegram.org/ (website) and https://api.telegram.org/ (api endpoint) failing with an HTTP error code 500.

Let’s start with the items that, to me, don’t add up: first and foremost, the outage. In case of “massive power outage” in the Amsterdam area, I would expect to see a traffic drop in AMS-IX, the largest Internet Exchange in the region, but there is none (it should be showing around 01 AM):

There are indeed reports of an outage that affected Amsterdam (below the one from Schiphol Airport), but no (public) reports of consequent large datacenter failures.

Who’s involved in running large scale platforms will be surprised by at least two things here: the fact that they are serving an huge geographical area from a single datacenter and their inability to reactively reroute traffic to the other locations they are operating, even in case of extended outage (no DR plans?).

A quick search on Twitter shows that even if the official communication states the issue is only affecting the EMEA region, users from Canada, US, Australia, Japan and other countries are facing it as well.

I used Host-Tracker to have a deeper look into this: an HTTP check to Telegram’s API endpoint and their website fails with an HTTP 500 error from every location across the world:

I went ahead and began digging to find out more about their infrastructure, network and the other locations they are running from.

And here comes the second huge question mark: the infrastructure.

A bunch of DNS lookups across the main endpoints show they are always resolving to the same v4 and v6 IPs, in a way that doesn’t look related to the source location of my queries.

They look to be announced by AS62041 (owned by Telegram LLP): this kind of DNS scheme made me think they were running an anycast based network, so next logical step has been analysing latencies from multiple locations.

Turns out, latency is averaging 20/30ms from EMEA, 100/150ms from AMER, and 250/300ms from APAC: as if from all of those countries you were being routed to the Amsterdam datacenter.

What I’m seeing in terms of latency is confirmed by analysing reverse lookups of routers found in the different paths to Telegram: in my trace from Australia the last visible hop is et3-1-2.amster1.ams.seabone.net (notice that “ams”), most of the traces from US are landing on xcr1.att.cw.net (195.2.1.14) which 1 millisecond away from my lab in Amsterdam and a couple of samples from US and Canada are running all the way up to ae-2-3201.ear3.Amsterdam1.Level3.net, which is self-explaining.

Important to highlight, there are no outliers: I couldn’t find a single example of very low latency from APAC / AMER, that would have proved the existence of a local point of presence. A summary of my tests in the table below:

To get the full picture, I decided to dig into AS62041 main upstream carriers (CW AS1273, TI Sparkle AS6762, Level3 AS3356) and see how they were handing over internet traffic to Telegram.

Turns out, CW is always preferring the path to xcr1.att.cw.net/195.2.1.14 (tested from some locations across the world), our little router-friend in Amsterdam. TI Sparkle always lands on amster1.ams.seabone.net and Level3 only has paths to ear3.Amsterdam1 (tested from Asia and US). Level3’s BGP communities are interesting: routes are tagged as “Europe Backbone” and “Level3_Customer Netherlands Amsterdam”:

Telegram is also peering with Hurricane Electric (AS6939): their routers in US, JP, AU have a next hop of ams-ix-gw.telegram.org/80.249.209.69 for 149.154.164.0/22. That hop seems to be Telegram’s AMS-IX facing router, and the IP is definitely part of AMS-IX:

 

As said in the opening, there are definitely more questions than answers in the article. It’s as if there was no Telegram infrastructure outside Amsterdam, and over there it was running in a single datacenter. This would explain why users across the world are seeing an outage that should only affect EMEA and close areas, and why Telegram is not taking steps to reroute users to another datacenter/location during the failure in AMS.

Am I missing something very obvious? Please let me know!

UPDATE: With the help of some friends and random people, I found out more details. Find them (with -ongoing- updates) in the dedicated post.

This is your sysadmin speaking: please expect some turbulence.

This is your sysadmin speaking: please expect some turbulence.

A few months back I blogged about my HP DL320 Gen8’s (in)compatibility with the outside world, and someone suggested me to solve the problem by replacing the P420i RAID controller with an LSI-something which would ensure wider flexibility.

Others were suggesting to replace (again) the hard drives instead, and someone was even pushing to swap this “hobby” with something healthier and go cloud instead*.

For the first time in my life I decided to listen to friends, so I replaced the RAID controller with an LSI 9300i HBA (I’m using mdraid anyway)…

…well, not really: I also replaced the chassis, motherboard, CPU, RAM banks, fans, PSUs and drive caddies.

Meet “ZA Rev2″**:

This is how it evolved:

  • HP -> Supermicro (yay!)
  • Xeon E3-1240 v2 -> Xeon E3-1240 v6
  • 4×8 GB DDR3 RAM -> 2×16 GB DDR4 RAM (2 slots free for future upgrades)
  • HP P420i -> LSI-9300i
  • 2x SSD Samsung 850 EVO 250 GB -> no change
  • 2x HGST SATA 7.2k 1 TB -> no change

D-Day for replacement is April 18th (taking a day off from my job to go and do the same things, just for hobby, feels really weird, yes), with a 6 AM wake up call, flight to AMS, 8/10 hours to do everything and a flight back to LON (LTN to be precise, because I didn’t double check before hitting “Buy”).

Now to the sad part: there is no (easy) way to just move the drives to the new server and have everything working, so I have to reinstall it from the ground up. This means my stuff (including this blog, because loose-coupling is a thing but I decided to run its DB and NFS from another country… …for some reason) will be down (or badly broken) during that time window and possibly longer, depending how much I manage to do while I’m onsite.

The timing couldn’t be better for a clean start, as in the last few months I had been considering the option to move away (escape) from Proxmox (which, as an example, is so flexible that its management port number is hardcoded everywhere and can’t be changed) to something else, most likely oVirt or OpenNebula. Haven’t taken a decision yet, but I’ve really fallen in love with the latter: it’s perfect for the cloud-native minds and runs on Debian, whereas oVirt would force me to move to the RPM side of the world.

Deeply apologise in advance for my rants on Twitter while I try to accomplish this mission. Stay tuned.

Giorgio

 

* I.AM.100%.CLOUD. There are two things you can’t (yet) do in the cloud: physical backup of your assets that live in the cloud and testing stuff which requires VT extensions. This is what I’m doing here: ZA is my bare-metal lab.

** this is not ZA Rev2. It was supposed to be, but it came in with a faulty backplane so I pushed for it to be entirely replaced. I don’t have a picture of the new one with me at the time of writing but… yeah, it looks exactly the same (with better cable management).

%d bloggers like this: