cloud – Page 2 – Giorgio's Dumpster

Extraordinary events and institutional websites: a (still) troubled relationship.

November 1, 2017 Giorgio Bonfiglio

Years ago I wrote this article (in a moment of frustration caused by the punctual unavailability of institutional websites exactly when they are most useful), hoping at least to open a line of dialogue. I was lucky and one did open, but the whole thing was packed up and sent back to the sender without much ceremony.

The problem in brief: many informational websites, especially in the Public Administration, are “useless” and barely visited 99.9% of the time, yet become critical at moments of particular interest. Take the population census: it happens every ten years and lasts two months. During this window every citizen will use the dedicated online service, obviously expecting everything to work properly.

Another example is the Ministry of Education portal: low load for most of the year, but when the commissions for the final high-school exam (the “maturità”) are announced, it has to be up, ready and responsive. Then think of the website where election results are published: used once every four or five years, it becomes the most visited site in Italy during the few hours of vote counting.

The Internet is today the primary source of information for many people: this is a fact that cannot be ignored, and the platforms that contribute to this information need to be given adequate importance.

I was talking about this in 2011 because that was the year in which the three services mentioned above missed their primary goal: when they were needed, they did not work. It was discussed, especially among practitioners: we got angry, but someone commented that the solutions to the problem (ranging from very technical matters such as database sharding and infrastructure elasticity to more common-sense ones, such as correct load forecasting) were very far from the world of “mere mortals”, and even further from the public sector.

A point of view I find debatable, but almost certainly with a grain of truth: at the time the concept of “cloud” had existed for only a few years, and some vendors still doubted its potential.

It feels like talking about prehistory.

(lest we forget: the manual load balancing of the 2011 elections)

Now it is 2017: six years have passed since my article and, as some keep repeating, “cloud is the new normal”. Everybody uses it, and the scepticism, if there ever was any, has disappeared: time has proven that it is a new and revolutionary technology, not just a temporary trend or the madness of a single vendor.

In these years, has anything changed in our Public Administration?

Some signs are encouraging: Eligendo, for example, the elections portal, is served through a CDN (but does not support HTTPS). Others make you lose the hope you have just gained: this month the referendum on the autonomy of Lombardy was held. Do I need to tell you what state the official website was in during the vote counting? Timeout.

The solutions to this kind of problem are by now well known and established: aggressive caching, use of a CDN, scalable infrastructure, and so on. Costs are very low and granular: with a well-designed architecture, you can serve every request without wasting a euro. It is, in a way, thought-provoking that certain environments still suffer from serious problems the industry solved long ago, such as load peaks.
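As a rough illustration of why aggressive caching goes such a long way for a page that everybody requests at the same time, here is a minimal sketch in Python. Everything in it is an assumption made for the example: the `TTLCache` class, the `render_results_page` stand-in for an expensive page render, the `/scrutini` path, the 30-second lifetime and the traffic figures are hypothetical and not taken from any real portal.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Tiny time-based cache: serve a stored copy until it is older than ttl seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.store: Dict[str, Tuple[float, str]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str, origin: Callable[[str], str]) -> str:
        now = self.clock()
        entry = self.store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            self.hits += 1
            return entry[1]
        # Missing or stale entry: ask the (slow, expensive) origin once.
        self.misses += 1
        value = origin(key)
        self.store[key] = (now, value)
        return value


def render_results_page(path: str) -> str:
    """Hypothetical stand-in for an expensive, database-backed page render."""
    return f"<html>results for {path}</html>"


def simulate(requests: int, requests_per_second: int, ttl_seconds: float) -> TTLCache:
    """Replay evenly spaced requests for a single page against a fake clock."""
    fake_now = [0.0]
    cache = TTLCache(ttl_seconds, clock=lambda: fake_now[0])
    for i in range(requests):
        fake_now[0] = i / requests_per_second
        cache.get("/scrutini", render_results_page)
    return cache


if __name__ == "__main__":
    cache = simulate(requests=100_000, requests_per_second=1_000, ttl_seconds=30)
    print(f"origin renders: {cache.misses}, served from cache: {cache.hits}")
```

With these made-up numbers, 100,000 requests spread over 100 seconds reach the origin only 4 times; the rest are served from the cached copy. A real setup would normally delegate this to a CDN or reverse proxy rather than application code, but the arithmetic is the same.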

So what are the limiting factors?

I have no trouble believing there is a poor understanding of the topic and of its importance on the “upper floors” of every institution: only recently have we managed to put together a community of developers and a “digital team” (made up of truly high-calibre professionals) aimed at modernising the “Italian system”.

The initiative is already bearing its first fruits, but it is for now a small team, strongly focused on development rather than on operations/maintenance: the road to changing the general mindset is still long. It is not hard to imagine how a poor understanding of the topic quickly leads to a lack of interest and of dedicated resources, with consequent frustration on the “lower floors”.

A second factor often brought (or rather, dragged) into play is the scarcity of infrastructure: this may have been true once, but today, with the rise of cloud technologies and of the “on demand” concept, it is no longer a blocker. The infrastructure is there; it just needs to be used.

Last but not least, the “skills” argument: I have no trouble believing, as many point out, that it is hard to recruit suitable staff and that those who look after systems in the Public Administration today have quite different responsibilities and therefore quite different backgrounds. However, I think we cannot ignore the fact that nowadays the “as a service” concept (managed services, if you prefer a perhaps more familiar name) removes a good part of this problem, and that the huge training offering and the relative ease of experimentation make it extremely easy to grow the missing skills.

It may take time, but we have to start somewhere. Many IT managers and sysadmins are out there, ready to take the step: they just need to be inspired.

Let’s inspire them, shall we?

Story of a journey: my first year at Amazon Web Services

August 22, 2017 Giorgio Bonfiglio

Exactly one year ago today I was sitting in a room in Amazon’s London Holborn office, attending the New Hire induction and waiting for my manager to pick me up and introduce me to the rest of the Technical Account Managers team.

It has been one year already – it’s about time to tell my story, and share my experience in this (amazing) company.

(this is me at this year’s London Summit, looking for something, somewhere)

Looking back at the first year (or, in Amazonian terms: “those first 365 Day Ones”), I can easily highlight a few different phases. Here they are, in more or less chronological order.

—

Phase 1: “lost” (in a hexagonal office)

Technical Account Managers (TAM) spend a lot of time with customers, and only drop into the AWS office when required. As a new starter this can be a little daunting, especially when trying to get set up – configuring your mobile, using the vast array of internal tools you have at your fingertips and the simple things, like finding the toilet.

The good news is: everybody is always happy to help you. Literally: everybody. In my first days I had phone calls with most of my teammates, shadowing sessions in front of customers, and even asked a mix of random people in the office for various kinds of help. They always guided me, as if we were a single, big family, and thanks to that I never really felt lost (yeah, I know, but it looked like a good title for this chapter…).

(about the toilet, if you’re wondering: I realised that as our office was hexagonal – or kind of – everything was “straight on and then on the left”)

I’ll skip phase 1.5, the official training: we spend about two to three weeks in classes with Support Engineers before getting hands on with the day to day job. The training is what you’d expect from training, but it provides a great opportunity to meet and learn from tenured colleagues. This is also when I personally went from getting lost in the London office to getting lost in the Seattle campus (every. single. time.).

Phase 2: the ramp up (aka: “OMG I don’t know anything”)

The ramp up that comes after the training is exciting: you’re back, you’ve had two to three weeks to learn as much as possible, and you think you know what you are doing – you’ve learnt the theory, you know how to use the tools, you think you know what to do when, and you’re ready to get on with it.

In theory.

What you realise at this point is that yes, it’s true: you’re working with Amazon Web Services. If you work with cloud, you hear this name daily, and becoming part of it simply doesn’t feel real for a while.

One of the first things I understood was that the only thing I was bringing with me into AWS was my brain: your past experience can definitely help, but Amazon is so different from other companies that you have to learn almost everything, literally from scratch. If you’ve been hired it’s because you share the mindset, so it’s not hard and it’s not an obstacle, it’s just something to keep in mind.

The main differences? First, and by far, is our “Customer Obsession”. We obsess over our customers, and not over our technology: every discussion we have ends up focusing on what’s best for our customers, and how we can improve their experience. We work every day making sure we help them do what’s best for their platforms – not for us – and we spend our time listening to them and trying to figure out how to make their lives easier.

The second one is definitely what’s summarised in our “Every day is Day One” motto, which is much more tangible than you would expect from something that is written on every wall in an HQ. Our customers and we are moving so quickly that you must always be ready to wake up and start as if you were in a completely new world. You learn new things daily, and the technology you were using / evangelising three months before might no longer be the best one for a given use case.

This is all about change and how it becomes part of your daily routine.

Phase 3: the First Customer

After a few months you’re ready to onboard your first customer. I had spent some time shadowing and helping a more tenured colleague, and in November I was ready to onboard my first “very own” account.

At that point in time I was confident in my daily tasks, had already had to deal with critical situations, and everything was looking good. But the first customer you onboard onto AWS Enterprise Support is just different: you’re starting a journey together, with some pre-defined goals and some others that will eventually show up.

It’s a journey of change, a journey toward continuous improvement and optimisation.

It’s just a matter of weeks before you start knowing your customer’s team members by first name, and recognising who’s logging a support case just by looking at their writing style.

Yes, that’s a very close relationship: some of my colleagues love to say that we work for Amazon, but on behalf of our customers.

Phase 4: the first event

You don’t really feel part of the customer’s team until you go through your first event. An event could be anything, from a planned traffic spike or feature launch, to, ehm, yes, an unplanned downtime.

Let’s pick a feature launch: it’s something big, the customer’s development teams have been working on it for months, the marketing team is pushing heavily and the operational teams have a single focus: making sure everything will work smoothly.

This is where our teams become glued together with the customer’s: we share a goal, we share a focus, we set up “war rooms” and make sure everything is in place and properly architected for when the big day arrives. The TAM acts here as a customer-facing frontman for an army of Support Engineers, Subject Matter Experts, Service Team Engineers, and many more – and during this kind of event, everyone comes together.

And then it happens – detailed and obsessive planning ensures everything works smoothly and meets expectations, leaving plenty of time to celebrate – and to realise that none of this would be possible without the super close relationship we develop with our customers.

Phase 5: personal development

This is not really a phase (mainly because it never ends), but after you’ve been in the company for 6/8 months you begin having really clear ideas on how things work, where you want to go and what you want to do.

AWS is a world of opportunities, for any kind of person: in this first year I joined a team which is helping our customers with the migration of strategic workloads and presented at the AWS Summit in London.

I’m currently trying to decide what to target next.

Phase 6: retrospective

As said, technology is evolving quickly, and so are we and our customers. When you reach the one-year mark, you try to look back and this is when you really understand where you used to be, and where you are now.

Where your customers were, and where they are now: the distance they have most likely covered in a single year looks unbelievable.

Phase 7: writing a blog post about your first year

Come on, I’m just joking.

—

Time to wrap up: I’m enjoying my new working life, my team, my mentor(s), my manager(s) and the extended Enterprise Support team. I have the opportunity every day to work with exciting customers, to actually be part of my customer’s teams and to experience the latest innovations first hand.

There is a question I get asked a lot, especially by people who know my background: do I miss being hands-on and dealing with operations? Not really. First, we have time and business needs for testing and using any new product we launch, so I still spend some time actually “playing” with stuff. Second, despite the name, this role is super-technical – we get to see a lot of operations, development and devops.

If you are reading this and looking for a new and interesting challenge, or would like to consider joining the AWS team, then get in touch.

Giorgio

Don’t buy servers.

July 19, 2017 Giorgio Bonfiglio

No, please don’t. Not even for personal use.

Let me start from the beginning: during my relocation last year, I left my desktop computer behind. It hadn’t been my primary machine for a while and I was probably powering it on only once a month, but it was still my core repository for backups and long term storage.

As I went 100% cloud years ago (no USB drives, no external HDDs, etc) my “current” dataset is now online, synchronised with my laptop(s). Still, there are some hundreds of GBs of “cold” (as in: I will probably never need them again) pictures/docs/archives that I want to be able to access, even remotely, at any time. After exploring some mid-range NAS solutions, I ended up realising that despite having a reliable internet connection, my flat was not the best location for hosting one, so I started looking around for a decent colocation space.

It didn’t take much time to figure out that space and power in a datacenter are so expensive that a NAS is neither suitable nor effective for this purpose.

As a consequence…

…meet MY-ZA*.

MY-ZA is an HP DL320e Gen8 server, equipped with an E3-1240 v2 CPU, 32 GB of RAM, and 2×250 GB SSD + 4×1 TB SATA drives. Dual PSU, P420i hardware RAID controller, iLO4, etc… yes: a real server.

I’m sure you’re now wondering what the hell I am doing here. The answer is easy, and anybody with an engineering mindset can probably confirm: sometimes we need to spend time and energy in experiments even if we know they will fail, because what we want to figure out is how exactly they will fail.

To be honest, even if I knew this choice was sub-optimal at the very least, I was like: “Hey, what could go wrong? It’s just a server”.

Well, now I know the answer: anything – (and if you cross this with Murphy’s law…).

My background is in traditional IT, but it looks like I quickly forgot about the pain of having to deal with bare metal. To make sure this doesn’t happen again, here’s a quick reminder that might also help you all:

Servers are expensive: this is a $2800 machine (I’ve paid roughly 50% of that) that will cost around $70-80 per month just for colocation and bandwidth. Moreover, in 2 years’ time it will be obsolete.
Bare metal servers are… …heavy: arranging shipping back and forth costs time. And money, of course.
They’re slow, reaaaaaaalllllly slooooooww. This thing wastes 10+ minutes just to get to the operating system boot. Don’t forget this if you’re doing something that requires a lot of reboots (like trying different RAID configurations, updating a newly installed Windows, etc). We’re now in an era where the boot time of an instance is shorter than the time it takes you, slow and inefficient human, to copy and paste connection details into your SSH client.
What about the risk? Well, it’s huge. I have onsite support, but no spare parts. So, should something bad happen, the downtime will be counted in hours, at least.
They don’t scale. This “thing” has already reached the maximum amount of RAM it can hold. What if I need more? I have two options, double the colocation space (and thus cost) and buy a similar second server, or buy a larger one to replace it and begin a slow, complex and painful migration.
Agility? What? – You must manage it as you would do with a pet. If something breaks, repair it, if the OS is out of date, upgrade it. Well, in a world where if an instance is broken you immediately spin up a new one, having to fix an OS doesn’t seem appropriate.
SSDs do have a well defined lifespan. This is not something you care about if you’re using a cloud hosting service, but here you should keep it in mind, as they will eventually die. Both at the same time, as their load will be similar.
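To put the first point into numbers, here is a back-of-the-envelope sketch in Python. It uses the figures quoted above ($2800 list price, roughly 50% actually paid, $70-80 per month, obsolete in 2 years) and adds a few assumptions of its own: the monthly fee is taken as the $75 midpoint, the server is worth nothing at the end, and shipping, spare parts and my own time are ignored.

```python
def bare_metal_cost(purchase_price: float, monthly_colocation: float, months: int) -> float:
    """Total spend for owning and colocating one server over its useful life."""
    return purchase_price + monthly_colocation * months


LIST_PRICE = 2800.0          # list price quoted in the post
PAID = LIST_PRICE * 0.5      # "roughly 50% of that"
MONTHLY_COLOCATION = 75.0    # assumption: midpoint of the $70-80 range
LIFETIME_MONTHS = 24         # "in 2 years' time it will be obsolete"

total = bare_metal_cost(PAID, MONTHLY_COLOCATION, LIFETIME_MONTHS)
per_month = total / LIFETIME_MONTHS

if __name__ == "__main__":
    print(f"total over {LIFETIME_MONTHS} months: ${total:.2f}")
    print(f"effective cost per month: ${per_month:.2f}")
```

Under these assumptions the box costs $3200 over two years, or about $133 per month, before a single reboot has been waited for.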

After having spent the last 7 days (evenings to be fair, as I have a job during the day) on this project, I think I have definitely debunked the theory about cloud not being effective for personal workloads.

Project failed, time to terminat…

…no, wait, you can’t terminate a bare metal server: it’s an investment, it’s a long term decision, you can’t just roll back as you would do with a cloud instance.

Oh, God.

* don’t even try to understand my host naming convention. There are no standards, names are just random letters. Servers are cattle, not pets, right?

Going Cloud: the 8 don’ts

September 25, 2016 Giorgio Bonfiglio

Okay, let’s face it: the world is finally figuring out that cloud is for everyone, and not just for large-scale enterprises. This is a big step ahead, but when it comes to new adopters there are still many misconceptions and wrong expectations.

Wrong expectations are probably the most common reason for failure, because they usually lead to disasters that leave moving back to a legacy infrastructure design as the only option left.

(Image Source: XKCD)

But, it turns out, it’s easier than you would expect. There is a basic set of rules and guidelines, and if you follow them you can easily be successful.

Let’s begin in this article with the 8 don’ts:

Never, ever trust a single instance of a given service. Don’t rely on redundant database platforms, replicated block devices, and so forth. They can still fail: accounting for this kind of failure at the application layer is the way to go.
Don’t put all your eggs in a bucket: cloud platforms are available in different geographical locations by nature, so you should really leverage this. True geographical redundancy can be hard to achieve at the beginning, but try at least to have read replicas spread over the world, so that in case of downtime in the main region you’re using, your service would just be degraded and not completely unreachable.
Never think small. Some design patterns may seem like overkill at first sight, but believe me, they are not. If you focus on designing your service so that it is ready for scaling up when needed, you won’t have to worry about it later.
Don’t design complex software platforms: micro services are the way to go. Keep them simple and easy to maintain. It will be easier to scale them, and not only from a technical point of view: imagine how easy it could be to hand over not a part of a complex piece of software, but a micro service, to a new dedicated development team.
Never forget that performance is the key: a killer SQL query could still be affordable if you have a small number of users, but is going to be an issue when your platform grows. Make your application as efficient as possible, even when it doesn’t seem needed.
Don’t forget that everything could break, at any time. Keep your instances as simple as possible, so that they are easy to operate. If one fails or starts misbehaving, just respawn it, don’t waste your time trying to fix or debug it. In an ideal world, they should all be stateless.
Vertical scaling is a no go. Choose the size of your instances based on the performance you want a single request to have, but always spread multiple requests horizontally. This pattern will help a lot with availability as well.
Don’t be ‘legacy’: the world around you is moving very fast, and just looking at it makes no sense. New releases of software packages usually improve their performance and efficiency, and new versions of the services your cloud provider is offering you usually improve a number of items, cost being usually the main one. Running a legacy instance type just because your platform is too hard to upgrade to a newer operating system makes no sense and will kill your business in the long term.
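The first two points (no single instance to trust, replicas spread across regions) can be sketched as application-level failover. This is only an illustration in Python: the `call_with_failover` helper, the `AllReplicasFailed` exception, the `fake_read` function and the `*.example.internal` endpoint names are all hypothetical, and a real client would also need timeouts, backoff and health checks.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllReplicasFailed(Exception):
    """Raised when every known replica refused or dropped the request."""


def call_with_failover(replicas: Sequence[str], operation: Callable[[str], T]) -> T:
    """Try the same read on each replica in order and return the first success."""
    errors: List[Tuple[str, Exception]] = []
    for endpoint in replicas:
        try:
            return operation(endpoint)
        except ConnectionError as exc:
            # This replica is unavailable: remember why and move on to the next one.
            errors.append((endpoint, exc))
    raise AllReplicasFailed(errors)


def fake_read(endpoint: str) -> str:
    """Hypothetical read that pretends the main region is down."""
    if endpoint == "db-eu-west.example.internal":
        raise ConnectionError("main region unreachable")
    return f"row served by {endpoint}"


if __name__ == "__main__":
    replicas = [
        "db-eu-west.example.internal",   # main region (simulated outage)
        "db-us-east.example.internal",   # read replica elsewhere
        "db-ap-south.example.internal",  # another read replica
    ]
    print(call_with_failover(replicas, fake_read))
```

With the main region simulated as down, the read is answered by the next replica in the list: the service is degraded (possibly slower, possibly read-only) but not unreachable.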

Here we are. Now go and build!

What an IaaS service is. And what it is not.

September 12, 2016 Giorgio Bonfiglio

The term “Cloud Computing” has been openly used for almost ten years now, but there are still some misconceptions around the concept itself and around some more specific words like “IaaS” (Infrastructure as a Service).

Sometimes I have to face pointless discussions with people who have completely wrong ideas and expectations: this can be annoying from my point of view, but can be catastrophic for organisations deciding to make “the big move” without having completely understood what the cloud is all about.

If you have come across this post as you’re still trying to figure out what “Cloud Computing” and “IaaS” mean, then let me save your life and probably your job with some clarifications.

The market offering isn’t helping us, as service providers are confused as well and use “IaaS” to describe completely unrelated products. The US NIST has released a document containing a list of 5 “Essential Characteristics” of cloud services, but they are not very specific and won’t help you make any choice.

When words are being used in such a confused way, you have to decide which of the many interpretations is the “authoritative” one: my authorities for this article are Amazon Web Services (and not because I work for them, but because ten years ago they were the first to offer an IaaS platform) and OpenStack (that is, AWS concepts and terms reviewed by the biggest open source community in the cloud computing world).

So, what should you expect, or not expect, from an IaaS offering?

You should expect to be billed based on a pay-as-you-go model. Let’s be serious, if you have to pay a one-time or monthly fee for your account and/or the services you are using, then this is not really cloud. Offering pay-as-you-go services is a real technical challenge for the service provider, and if they aren’t giving you this option then you should have some doubts about them being up to date with the technology. Some providers will offer you discounts on long term commitments and this is fine, but always look for the PayG option, please.
You should expect to have full access to API and CLI tools and not just to a GUI. This is critical even if you are not planning to use them from the beginning. Cloud is all about automation, and if you stick with a service that only offers a GUI, then you will be forever bound to your mouse (and hands): if you come from an on-premise physical server environment you might not see my point right now, but in the cloud you will start using automation soon, at least in its basic form. Because it’s easy and useful.
You should not expect your instances (virtual machines) to be always available. This is something I already blogged about a few years ago (in Italian, I’m sorry) but it’s still one of the biggest, most widespread and most dangerous misconceptions. Cloud services are based on commodity hardware, and thus the instances on top of it should be considered in the same way, as a commodity. A single instance may or may not be there, and your customers shouldn’t notice: you have to plan for high availability at the application level, taking into account the various kinds of failure. Some additional services like Block Storage, Object Storage and Load Balancing as a Service will help you achieve the high levels of availability you need. If your service provider is offering you an extreme level of HA, then you’re probably paying for something you don’t need (if you’re using 5 web nodes, then what’s the matter if one of them goes down for a while?).
You should expect instant provisioning: seriously, provisioning has to happen in seconds. Be careful not to underestimate this: you could be happy with a 24-hour delivery time for your first bunch of servers, but believe me, you won’t be when you need to scale rapidly because of a traffic peak. Maybe I’m being too picky here, but I expect the provisioning of my account to happen in real time as well: I’m not happy with providers asking me to send a physical signed contract or my IDs before using their service.
You should expect the service you choose not to have limits that could (and will) impact you. Okay, not all of us need the scale of AWS, but make sure your provider won’t run out of capacity when you need it: planning for infrastructure is their job, and from your point of view you must always be able to use the resources you need, when you need them, with no previous commitment.
You should (probably) expect to have access to multiple autonomous regions: be it for active-active HA or just for backup and disaster recovery purposes, it doesn’t make much sense to choose a provider that hosts its entire platform in a single datacenter. Yes, you could choose to use 2 different service providers hosting services in different locations, but this is not going to be easy to deal with.
You should (probably) expect not to be locked in by small-scale service providers: always look for open standards, especially if the company you’re buying resources from is still at a scale where going out of business from one day to another is a (remote) possibility.
You should not expect to be able to easily scale vertically (increase instance size, or a single resource inside the instance): cloud computing is based on horizontal scalability (that means adding building blocks, not making the existing ones bigger), and this is why service providers don’t focus so much on hot resize of instances or on the ability to add RAM if you need RAM without modifying anything else. This is related to availability as well: if you can’t afford a planned downtime on a single instance in your infrastructure, then you’re doing something wrong.
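The horizontal scaling and availability points above boil down to simple arithmetic, sketched here in Python. The `instances_needed` function, the per-instance capacity of 100 requests per second and the traffic figures are assumptions made up for the example; real capacity has to be measured on your own application.

```python
import math


def instances_needed(peak_rps: float, rps_per_instance: float, spare: int = 1) -> int:
    """Identical instances required to absorb a load, plus spares that may fail or be rebooted."""
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    return math.ceil(peak_rps / rps_per_instance) + spare


if __name__ == "__main__":
    # Hypothetical figures: one instance comfortably serves 100 requests per second.
    for label, rps in [("quiet day", 40), ("traffic peak", 450)]:
        count = instances_needed(peak_rps=rps, rps_per_instance=100)
        print(f"{label}: {rps} rps -> {count} instances")
```

With these numbers a quiet day needs 2 instances and a peak needs 6. Because every instance is the same size, losing one or taking it down for maintenance only removes a slice of capacity, which is why adding RAM to a single running box matters much less than being able to add or remove boxes in seconds and being billed only while they run.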

That’s it, at least for now. I’m sure moving to the cloud is the right choice for almost every company in the world, but please make sure you fully understand it before making any choice. Really.

Giorgio
