My second year at AWS: down the rabbit hole

We all know, time flies. I posted about my first year at Amazon Web Services just 12 months ago and now I’m already celebrating my second AWS-birthday.

It’s been a year of both personal and professional growth: it began in December, when I became head of a project that has been helping some of our customers run their platforms at massive scale, and went all the way up to my promotion to Senior TAM a few months back. Needless to say, the customers I’m looking after have made giant leaps too, as part of transformation processes that start from the infrastructure and can reach up to their corporate culture.

My post about “Year One” was organised by subsequent evolutionary phases, but this won’t really make sense from the second year onward: I’ve been involved in a bunch of projects, every one of them with its own life. Why not try, then, to recap the last 12 months by picking the project (more details on it here, with a cameo appearance) that took most of my time and checking whether our slogan “Work Hard, Have Fun, Make History” has truly been met in it? Let’s start here.

Work Hard

This is how it begins. It might seem obvious, as no one will ever pay us to do anything other than work hard, but it’s not. Working hard in AWS means taking responsibility, being effective, facing challenges and turning every opportunity into a huge success.

“Hard” as in pushing our brains to 100%, not necessarily as in working 16 hours a day. True, we carry pagers, and might end up having late evening calls with the teams in Seattle or doing late night debugging sessions from our hotel room, but this only happens in exceptional situations.

I personally find this extremely rewarding: when you focus on a project with all your energy, then the sense of achievement when it’s done is super strong.

…check ✓

Thought I was joking? This is my hotel room that night: note the Snowball Edge.

Have Fun

“Having Fun” is something we keep reading in job offers: it’s a “new economy” concept, meant as enjoying what you do and finding personal motivation in addition to the obvious business one.

I find this kind of comes by itself: if you work effectively on something and achieve results, then customers will trust you, the relationship will become more friendly and relaxed and you will end up having a lot of fun with them, even in the day to day.

This is me looking at Chris @ PhotoBox doing some snowball-weightlifting.

Check ✓

Make History

Last one, and possibly just another consequence. Is there any other way a successful project can finish?

Sometimes we might not realise how big a given change can be. We might focus on some virtual machines becoming EC2 instances and some hard drives becoming S3 buckets, but there’s much more behind the curtains: you will see a quickly changing and moving world there.

Never underestimate the importance of small actions and small steps as they can quickly prove to be giant leaps.

(in the picture below you can see me helping with the final step of a cloud migration: loading a massive storage array on a truck after datacenter decommissioning and shutdown) …check ✓

I might have a future in this area.

So What?

Two years in, and for me it still feels like it’s Day One. Learning something new every day, consciously jumping into rabbit holes every other day just to re-emerge stronger and wiser later on. Being surrounded by the smartest people on earth makes you feel extremely small sometimes, but also guarantees you endless opportunities for growth.

This is what I’ll keep doing.

(want to join the band? just ping me!)

iliad: where is the revolution, really?

Two months after the Italian launch of iliad, the new low-cost mobile operator, the hashtag #Rivoluzioneiliad keeps pestering me from the top of my Twitter feed, with a communication strategy built entirely around the word revolution.

But is a revolution really under way? And if so, in what, exactly? Have you asked yourselves this question?

I have.

No, the revolution is not in the price. Sure, 6.99 €/month for 40 GB of data with unlimited minutes and SMS is not bad, but mobile telephony in Italy was already very cheap compared to neighbouring countries: try going to England with 30 €/month in hand and demanding an iPhone.

So, in what way is iliad radically different from Tre, Vodafone, Wind and Tim?

The time had come to update all the phone contracts in my family, and we are talking about an extended family with 10 SIMs across four countries, so I had to study the offers of every major carrier in the countries involved.

And I understood immediately: in Italy, iliad is different in its flexibility, its clarity, its honesty.

The “old” operators carry on undaunted, hiding charges in every corner and essentially making fools of their own customers: we saw a glaring example two years ago, when the marketing department of one of the big players decided that shortening the billing period from a month to 28 days (effectively introducing a thirteenth monthly payment) was more elegant than announcing a price increase, and all the others followed. Simple and diabolically clever: slipping an extra month in under the customers’ feet to force a price increase on them.

Let’s not even start on the unilateral “remodulations” (read: increases). Yes, true, you can withdraw at no cost when they happen, but if there is one every six months and all the operators roll them out at the same time, customers are in a cage.

Examples of hidden charges abound too: calls at zero € per minute, and in small print a connection fee that forces you to mortgage your house (the finesse of this offer is truly admirable). A very attractive monthly rate, which then turns out to last for only 6 of the 24 months of contractual lock-in. Or, again, a 24-month bundle with a phone at 20 €/month and zero € activation, but with small print underneath saying you will have to pay 150 as a final instalment.

That should already be enough, but there is one thing I find truly impossible to stomach: the penalisation of an operator’s existing customer base compared to new customers. You have had a tariff for 3 years and, now that the lock-in has expired, you want to update it because on average the market gives you more for the same money? Upgrading to a plan that your own operator offers new customers with free activation can cost you as much as 5 or 6 monthly payments: that is how it is, full stop.

Italy has a market that forces customers into continuous turnover: the requests and dissatisfaction of existing customers are ignored, and investment goes only into new ones. There will always be a potential “new” customer to bring on board who is unhappy with their previous operator, and so the wheel keeps turning.

To give an example of what lies beyond the Alps: in England prepaid plans do not exist (they do, actually, but they are so expensive that they only make sense if you use your phone three minutes a month), and even if you only want a SIM you need a 12 or 24 month contract. Once those 12/24 months have passed, however, you are completely free to choose any offer from the same operator, paying at most what a new customer would pay and with no switching costs. Yes, you read that right: at most, because very often your loyalty will be rewarded and you will get the same bundle at a lower price.

Here is the #Rivoluzioneiliad: the new operator sets out to turn all of this upside down. Their offer is clear, readable, easy to understand. The procedures are all automated and simple, the customer service fast and incredibly friendly. As if that were not enough, shortly after launch someone noticed that a clause in the contract gave iliad the ability to change the offer, casting doubt on the “forever” with which it was advertised.

Do you know how the company reacted? It revised the contract, removing the ambiguity.

The “if” is mandatory, as it has existed for only two months, but if iliad proves it can sustain a business on these premises and principles, then all the others will have no choice but to adapt.

Customer experience and discrimination on Gumtree

In a world where Customer Experience is key, Gumtree, the leading online classifieds website in England, is telling me to f*ck off. Something that I will do indeed as their TOS are crystal clear (they can restrict access with no explanation due), but not without telling the story first.

It all started this morning: after having signed up using my 10-year-old Google account and my real name (which, even if I’m not Donald Trump, should have a decent reputation and trust online) I posted an AD for my HP DL320e (and even paid to have it featured). Location was my real postcode (which you can verify in public records) and I paid with a UK credit card (just another way for them to verify my identity).

Good experience up to that point: nice and easy UI, clean workflow. The AD went into moderation queue, but after a while it moved straight into the “Removed” state:

Allegedly, I’ve broken some posting rules and should have expected an email with some explanations. Except the email never came in (no, it’s not my spam filter, I got other emails from them) and the link to the posting rules leads to a blank page (there is a menu on the left, but every single item leads to a blank page).

I asked for support on Twitter, genuinely thinking it must have been a mistake of some automated fraud prevention system (a very cheap one, probably):

The only thing I managed to get back was this response via DM, which qualifies as the worst and most unnecessarily rude answer I have ever got from someone’s customer care department:

I tried to appeal by sending an email to their support department, hoping for a deeper review and consideration.

And it happened: someone came back to me apologising and explaining that my AD and account were absolutely fine, that no rule had been broken and that the block was the result of a mistake. It would be lifted immediately.

The end of an odyssey, you would think. Well, no: two hours later my account got blocked again, and this email landed in my mailbox:

In short, I’m now permanently banned. They won’t tell me why and won’t answer any further query on the matter. I would love to dig and figure out what’s wrong, but I’m in front of a brick wall.

Let me make a couple of assumptions: I’ve been blocked before having published my first AD, so I can’t have been reported by other users. The only things Gumtree knows about me are:

  • My full name
  • My email address
  • Where I live
  • My Credit Card details

We already have a word for when you are denied something based on those four parameters: discrimination – and this is what’s happening here.

If anybody from Gumtree wants to get in touch and explain, feel free: you have my contact details.

Telegram’s infrastructure and outages. Some updates.

This post is meant as an (ongoing) sequence of updates to the previous one about Telegram’s outages in March and April 2018. Please read it here first.

Last updated: April 30th, 6:00 AM UTC

UPDATE 1

With the help of a friend (and his own HowIsResolved), we managed to confirm that for most open resolvers worldwide (25k+ tested) api.telegram.org is showing up as 149.154.167.220. The only outliers seem to be China (resolving as of now as 174.37.154.236) and Russia (85.142.29.248).
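The comparison described above boils down to grouping answers by resolver and flagging whatever disagrees with the majority. A minimal Python sketch of that step, assuming the answers have already been collected: the resolver labels are hypothetical placeholders, and only the three IP addresses come from the results above.

```python
from collections import Counter


def find_outliers(answers):
    """Return the most common answer and the resolvers that disagree with it.

    `answers` maps a resolver label to the IP it returned for the same name.
    """
    counts = Counter(answers.values())
    majority_ip, _ = counts.most_common(1)[0]
    outliers = {label: ip for label, ip in answers.items() if ip != majority_ip}
    return majority_ip, outliers


# Hypothetical resolver labels; the IPs are the ones reported in this update.
sample = {
    "resolver-nl": "149.154.167.220",
    "resolver-us": "149.154.167.220",
    "resolver-au": "149.154.167.220",
    "resolver-cn": "174.37.154.236",
    "resolver-ru": "85.142.29.248",
}

majority_ip, outliers = find_outliers(sample)
print(majority_ip)
print(outliers)
```

With tens of thousands of resolvers the same function still applies; the interesting part is then looking at which countries the outliers belong to.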

UPDATE 2

During my analysis this morning I created a new Telegram App, and the (only) suggested MTProto (the Telegram protocol) server was 149.154.167.50. This falls into the IP range analysed above, and seems to be solely located in Amsterdam.

Kipters was kind enough to review and help me notice that this server is not used for real “data” communication, but just for a “discovery” API call (help.getConfig method) which will return the list of servers that will have to be used for sending messages. We are currently still in the process of comparing ranges received across the world, but even in the best case scenario (i.e. they are spread over multiple geographic locations) this would mean that there is still a single point of failure in the hardcoded “directory” server.

UPDATE 3

What I found in the previous note was “too weird to be true”, so I went ahead and kept digging into TDLib and the official Desktop and Android Apps, to confirm whether they were bootstrapping a session beginning from a single MTProto endpoint or not.

Fortunately, it turns out this is not the case (relevant snippets for TDLib, Desktop, Android Apps): all of them contain, hardcoded, in addition to endpoints in the range 149.154.167.0/24 (Amsterdam, AS62041), endpoints in 149.154.175.0/24 (Miami, AS59930) and 149.154.171.0/24 (Singapore, AS62014).
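To check quickly which of these ranges a given endpoint belongs to, something like the following Python sketch can help. The ranges and locations are the ones listed above; the `locate` helper and the dictionary name are just illustrative.

```python
import ipaddress

# Hardcoded ranges found in the official clients, as listed above.
HARDCODED_RANGES = {
    "Amsterdam (AS62041)": ipaddress.ip_network("149.154.167.0/24"),
    "Miami (AS59930)": ipaddress.ip_network("149.154.175.0/24"),
    "Singapore (AS62014)": ipaddress.ip_network("149.154.171.0/24"),
}


def locate(ip):
    """Return the label of the range containing `ip`, or None if none match."""
    addr = ipaddress.ip_address(ip)
    for label, network in HARDCODED_RANGES.items():
        if addr in network:
            return label
    return None


# The MTProto server suggested to new apps and the api.telegram.org address
# both land in the Amsterdam range.
print(locate("149.154.167.50"))
print(locate("149.154.167.220"))
```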

Sounds like we should look into different reasons why many users worldwide outside of EMEA had issues today (or wait for an official, detailed post mortem, if it ever comes): there are many, from broken dependencies to weird cases of mis-routing.

Some areas are left to explore (feel free to share your ideas if you have any): why don’t third party apps have access to the whole list of “initial” MTProto endpoints, and why are they pushed to use only a single, non redundant one? Why are the main website and api.telegram.org (mainly used for bots, I think) based off a single location?

UPDATE 4

Telegram Web (https://web.telegram.org/) seems to be single-homed in Amsterdam too. As I haven’t had the opportunity to test during the outage, I don’t know whether it has been failed over somewhere else or not.

UPDATE 5

According to the official documentation, users (registered by phone number) are located in a single datacenter, picked at signup time based on geographical proximity: “During the process of working with the API, user information is accumulated in the DC with which the user is associated. This is the reason a user cannot be associated with a different DC by means of the client.”

They are only moved if they keep connecting from a remote location for a prolonged period of time (i.e. you permanently relocate to another continent): this might explain why there seems to be no failover scenario and 12+ hours outages are happening.

(Thanks to adjustableneutralism from Reddit for flagging)

Telegram is down (again): a deep look at their infrastructure

I’ve been a strong Telegram advocate since its launch in 2013, mainly because of the advanced features and technical state of the art compared to competitors – as a consequence, I’ve been looking very closely at their infrastructure for the last few years.

The two large scale outages that recently hit their users and the sequence of events following them made me ask some questions around their platform. For most of them I have only found additional question marks rather than answers, but here is what I have so far.

Let’s start from the outages: in case you missed it, on March 29th and April 29th this year, Telegram went down in their Amsterdam datacenter due to a power failure, causing disruptions, according to their official communications, to users in EMEA, MENA, Russia and CIS.

Zooming in on the latter: it’s still ongoing at time of writing this article (8:30AM UTC), and is showing up with clients unable to connect to the platform and both https://www.telegram.org/ (website) and https://api.telegram.org/ (api endpoint) failing with an HTTP error code 500.

Let’s start with the items that, to me, don’t add up: first and foremost, the outage. In case of a “massive power outage” in the Amsterdam area, I would expect to see a traffic drop in AMS-IX, the largest Internet Exchange in the region, but there is none (it should be showing around 01 AM):

There are indeed reports of an outage that affected Amsterdam (below the one from Schiphol Airport), but no (public) reports of consequent large datacenter failures.

Anyone involved in running large scale platforms will be surprised by at least two things here: the fact that they are serving a huge geographical area from a single datacenter, and their inability to reactively reroute traffic to the other locations they are operating, even in case of an extended outage (no DR plans?).

A quick search on Twitter shows that even if the official communication states the issue is only affecting the EMEA region, users from Canada, US, Australia, Japan and other countries are facing it as well.

I used Host-Tracker to have a deeper look into this: an HTTP check to Telegram’s API endpoint and their website fails with an HTTP 500 error from every location across the world:

I went ahead and began digging to find out more about their infrastructure, network and the other locations they are running from.

And here comes the second huge question mark: the infrastructure.

A bunch of DNS lookups across the main endpoints show they are always resolving to the same v4 and v6 IPs, in a way that doesn’t look related to the source location of my queries.

They look to be announced by AS62041 (owned by Telegram LLP): this kind of DNS scheme made me think they were running an anycast based network, so the next logical step was analysing latencies from multiple locations.

Turns out, latency is averaging 20/30ms from EMEA, 100/150ms from AMER, and 250/300ms from APAC: as if from all of those countries you were being routed to the Amsterdam datacenter.
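A rough way to sanity-check this reading is to compare the measured round-trip times with the physical floor for a path to Amsterdam. The Python sketch below assumes light travels through fibre at about 200 km per millisecond (roughly two thirds of its speed in a vacuum) and uses approximate city coordinates picked for illustration; none of these values come from my measurements, and real paths are longer than a great circle.

```python
import math

FIBRE_KM_PER_MS = 200.0  # assumption: ~2/3 of the speed of light in a vacuum
EARTH_RADIUS_KM = 6371.0

# Approximate (latitude, longitude) pairs, chosen for illustration only.
AMSTERDAM = (52.37, 4.90)
NEW_YORK = (40.71, -74.01)
SYDNEY = (-33.87, 151.21)


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def min_rtt_ms(distance_km):
    """Lower bound for the round-trip time over a fibre path of this length."""
    return 2 * distance_km / FIBRE_KM_PER_MS


for name, city in (("New York", NEW_YORK), ("Sydney", SYDNEY)):
    floor = min_rtt_ms(great_circle_km(*AMSTERDAM, *city))
    print(f"{name}: at least {floor:.0f} ms round trip to Amsterdam")
```

The measured 100/150ms from AMER and 250/300ms from APAC sit above these floors, which is compatible with every request travelling to Amsterdam. A local point of presence would instead show up as latencies well below the floor.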

What I’m seeing in terms of latency is confirmed by analysing reverse lookups of routers found in the different paths to Telegram: in my trace from Australia the last visible hop is et3-1-2.amster1.ams.seabone.net (notice that “ams”), most of the traces from US are landing on xcr1.att.cw.net (195.2.1.14), which is 1 millisecond away from my lab in Amsterdam, and a couple of samples from US and Canada are running all the way up to ae-2-3201.ear3.Amsterdam1.Level3.net, which is self-explanatory.
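Spotting those location hints by eye does not scale to many traces, so here is a crude Python sketch of the same idea. The "starts with ams" rule is an assumption of mine, not a carrier naming standard, and it will miss routers whose names carry no hint at all.

```python
import re


def amsterdam_hint(hostname):
    """Return True if any dot- or dash-separated label starts with 'ams'."""
    labels = re.split(r"[.\-]", hostname.lower())
    return any(label.startswith("ams") for label in labels)


# Last visible hops taken from the traces described above.
last_hops = [
    "et3-1-2.amster1.ams.seabone.net",
    "xcr1.att.cw.net",
    "ae-2-3201.ear3.Amsterdam1.Level3.net",
]

for hop in last_hops:
    print(hop, amsterdam_hint(hop))
```

Note that xcr1.att.cw.net is not flagged: its name gives nothing away, and it is only the 1 millisecond latency from Amsterdam that places it there.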

Important to highlight, there are no outliers: I couldn’t find a single example of very low latency from APAC / AMER, that would have proved the existence of a local point of presence. A summary of my tests in the table below:

To get the full picture, I decided to dig into the main upstream carriers of AS62041 (CW AS1273, TI Sparkle AS6762, Level3 AS3356) and see how they were handing over internet traffic to Telegram.

Turns out, CW is always preferring the path to xcr1.att.cw.net/195.2.1.14 (tested from some locations across the world), our little router-friend in Amsterdam. TI Sparkle always lands on amster1.ams.seabone.net and Level3 only has paths to ear3.Amsterdam1 (tested from Asia and US). Level3’s BGP communities are interesting: routes are tagged as “Europe Backbone” and “Level3_Customer Netherlands Amsterdam”:

Telegram is also peering with Hurricane Electric (AS6939): their routers in US, JP, AU have a next hop of ams-ix-gw.telegram.org/80.249.209.69 for 149.154.164.0/22. That hop seems to be Telegram’s AMS-IX facing router, and the IP is definitely part of AMS-IX:
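The two containment checks involved here can be reproduced with Python’s ipaddress module. In this sketch the 149.154.164.0/22 prefix and the next hop are the ones quoted above, while 80.249.208.0/21 as the AMS-IX peering LAN is my assumption and should be verified against AMS-IX’s own published data.

```python
import ipaddress

ANNOUNCED_PREFIX = ipaddress.ip_network("149.154.164.0/22")
# Assumption: the AMS-IX IPv4 peering LAN; check it before relying on it.
AMS_IX_PEERING_LAN = ipaddress.ip_network("80.249.208.0/21")

next_hop = ipaddress.ip_address("80.249.209.69")
amsterdam_range = ipaddress.ip_network("149.154.167.0/24")

# Is the next hop seen from US, JP and AU an address on the exchange LAN?
print(next_hop in AMS_IX_PEERING_LAN)

# Does the prefix learned via that next hop cover the Amsterdam endpoints?
print(amsterdam_range.subnet_of(ANNOUNCED_PREFIX))
```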

As said in the opening, there are definitely more questions than answers in the article. It’s as if there was no Telegram infrastructure outside Amsterdam, and over there it was running in a single datacenter. This would explain why users across the world are seeing an outage that should only affect EMEA and close areas, and why Telegram is not taking steps to reroute users to another datacenter/location during the failure in AMS.

Am I missing something very obvious? Please let me know!

UPDATE: With the help of some friends and random people, I found out more details. Find them (with -ongoing- updates) in the dedicated post.
