Welcome to The Cloud Pod – where the forecast is always cloudy! This week your hosts, Jonathan and Ryan, are talking all about EC2 instances, including changes to AWS Systems Manager and Elastic Disaster Recovery. And speaking of disasters, we’re also taking a dive into the ongoing Google DDoS attacks. Plus, we’ve even thrown a little earthquake warning into the podcast, just for effect.
Titles we almost went with this week:
A big thanks to this week’s sponsor:
Foghorn Consulting provides top-notch cloud and DevOps engineers to the world’s most innovative companies. Initiatives stalled because you have trouble hiring? Foghorn can be burning down your DevOps and Cloud backlogs as soon as next week.
📰General News this Week:📰
- A few weeks ago many got excited about the new AMD chips coming to help with AI workloads.
- The Instinct MI300A has often been touted as an alternative to Nvidia’s H100.
- But… it’s not as easy to use those chips.
- The startup that tweeted about using the new AMD chips has been working on it for multiple years, and most startups who would want to switch would have to throw out their code and start from scratch. We’re not super sure about that claim, but we shall see…
- Plus, Nvidia has a 20-year head start when it comes to CUDA and other development tools for AI.
- It’s not all bad news though – AMD does have some advantages that may make it worth it, including a chip that combines a GPU, which performs multiple computations simultaneously, and a CPU, which executes more general instructions and manages the system’s broader operations.
- (Nvidia plans to do the same with the Grace Hopper Superchip).
- The AMD chips also have more memory than the H100, at 128 GB vs. 80 GB.
02:20 📢 Ryan – “Yeah. I mean, it’s interesting how complex these have become, right? When it used to just be – sort of – you had optimized at the computer level and maybe at the OS level, but now the workloads are so specific because they’re so demanding, and then power is also very challenging. So that’s kind of neat. I’m kind of glad I don’t have to deal with it much.”
- Amazon has reportedly committed $1B to license Microsoft 365 cloud productivity software for 1 million of its corporate and frontline workers in a surprise megadeal.
- Amazon will upgrade from traditional Microsoft Office software to the cloud productivity suite, according to the report, which notes that Amazon had been reluctant to do so previously. (Probably because Microsoft stopped supporting it? But we digress.)
04:40 📢 Jonathan – “I’m surprised they haven’t worked on their own office suite. They could have taken some open-source thing and made it their own.”
05:44 📢 Ryan – “if you think about all those documents, all those emails is now going to be residing on essentially Azure systems, right? And so it’s like, are they worried about corporate espionage? They worried about data privacy? And I get the concern. It would be very interesting to see if something came out of that, because it would be hard to detect and hard to enforce.”
- You can now enable Systems Manager and configure permissions for all EC2 instances in an organization set up with AWS Organizations, in a single action, using Default Host Management Configuration (DHMC).
- This feature helps customers ensure that core Systems Manager capabilities such as Patch Manager, Session Manager and Inventory are available for all new and existing instances.
- DHMC is recommended for all EC2 customers and offers a simple, scalable process to standardize the availability of Systems Manager tools.
- The new feature is available in nearly all commercial Regions where Quick Setup is available, with the exception of China.
07:09 📢 Ryan – “This is one of those things if you’re offering the cloud service to the rest of your business, you want this to be a checkbox instead of trying to do organization cloud stacks to make sure this is enabled in every… I’m a big proponent of having these things turned on by default.”
- CloudWatch is announcing out-of-the-box, best practice alarm recommendations for AWS service-vended metrics. It provides alarm recommendations and alarm configurations for key vended metrics, along with the ability to download pre-filled infrastructure-as-code templates for these alarms.
- It initially supports 19 services and will expand from there.
09:22 📢 Jonathan – “This is cool! Not a single mention of AI either, which you know is probably driving this on them.”
- You can now recover into existing instances instead of spinning up new EC2 instances with AWS Elastic Disaster Recovery (DRS).
- DRS minimizes downtime and data loss with fast, reliable recovery of on-premises and cloud-based applications using AWS services.
- Recovering into an existing instance allows you to retain metadata and security parameters.
09:51 📢 Ryan- “Just the IP reuse alone is a huge advantage for this, right? Like you really had to, you know, if you automate it, right, it’s not that big of a deal to swap out the things, but not everything’s easily automatable into an auto scaling group or something that’s more elastic.”
- CodeWhisperer is a coding companion similar to GitHub Copilot or Google Duet.
- While these tools can help, they lack context of your private code repositories.
- This limitation presents challenges for developers learning to use internal libraries and avoiding security problems.
- To address this issue, CodeWhisperer customization capability enables organizations to customize CodeWhisperer to generate specific code recommendations from private code repositories.
- With this feature, developers who are part of the CodeWhisperer Professional tier can now receive real-time code recommendations that include their internal libraries, APIs, packages, classes and methods.
14:58 📢 Ryan- “You don’t have to retrain the entire model using your internal data in order to get the proper responses, right? That’s a pain, that’s not gonna scale. And so having the AI be able to make recommendations, but then feeding it this customization capabilities on top of that is pretty fantastic.”
15:16 📢 Jonathan – “Yeah, I’m waiting for the day when it doesn’t just generate code for you, but it tells you what you could be doing better.”
**pause for earthquake warning system – insert “we move the earth to bring you the best in cloud news” jokes here**
- This year’s report has some really fascinating insights. Justin remembers being a little underwhelmed with 2022’s report; he’s not sure if that was just the place he was at work-wise, or if the report was just lackluster.
- They did talk a lot about Westrum Organizational Culture last year… and that might have been part of it.
- This year the team researched and explored key outcomes and capabilities that contribute to achieving:
  - Organization Performance – The organization should produce not only revenue, but value for customers, as well as for the extended community.
  - Team Performance – The ability for an application or service team to create value, innovate, and collaborate.
  - Employee Well-being – The strategies an organization or team adopts should benefit the employees: reduce burnout, foster a satisfying job experience, and increase people’s ability to produce valuable outputs (that is, productivity).
- Two outcomes result from the above:
  - Software Delivery Performance – Teams can safely, quickly, and efficiently change their technology systems.
  - Operational Performance – The service provides a reliable experience for its users.
- One of the more interesting additions this year is a focus on performance outcomes based on team type. The report breaks teams down into 4 types:
  - User-Centric – This type of team focuses the most on user needs.
  - Feature-Driven – Prioritizes shipping features; a relentless focus on shipping may distract from delivering on user needs.
  - Developing – Focuses on the needs of app users, but is still working on product-market fit or technical capabilities.
  - Balanced – A balanced, sustainable approach across organization performance, good team performance and good job satisfaction.
- The net of the report is that culture and user focus are the keys to success for high-performing organizations.
24:44📢 Jonathan – “I write almost the same thing every year on my personal self-review. What motivates you? What can we do? What do we need to do to keep you working hard? And my answer is almost invariably… As long as you give me the tools I need to do the job you’re asking, I will happily crunch through work 40 hours a week or more, as the case may be. But if you don’t give me the tools to be successful, then I’ll be out.”
- Google is back with another massive DDoS attack blocked by Google’s cybersecurity teams. This attack was 7.5 times larger than the “largest in history” attack the year before.
- This new DDoS attack reached a peak of 398 million RPS, and relied on a novel HTTP/2 “Rapid Reset” technique based on stream multiplexing that has affected multiple internet infrastructure companies. By contrast, last year’s largest DDoS attack was 46 million RPS.
- These attacks started in August, and are still continuing as of this publication – targeting large infrastructure providers including Google.
- Google was able to mitigate the attack at the edge of its network, leveraging its investment in edge capacity to ensure its own services and its customers’ services remained largely unaffected.
- Google wants you to know that any enterprise or individual serving an HTTP based workload may be at risk from this attack.
- Web apps, services and APIs on a server or proxy able to communicate using the HTTP/2 protocol could also be vulnerable.
29:42📢 Jonathan – “I feel like we’re kind of entering into the asymmetric warfare phase of DDoS now because this HTTP2 rapid reset exploit is really asymmetric in that to attack a server requires very little resources on the client side anymore using this.”
- Jonathan can summarize it the best…
- The attacks included a technique called the “HTTP/2 Rapid Reset attack,” where the client cancels each request immediately after sending it, keeping the connection open. This approach created an advantage for the attacker, as they incurred minimal costs compared to the server.
- Several variants of the Rapid Reset attack were observed, some of which did not immediately cancel streams, but rather opened and canceled batches of streams in succession.
- Mitigating these attacks is challenging, as simply blocking individual requests is not effective. Instead, it’s necessary to close the entire TCP connection when abuse is detected, using mechanisms like the GOAWAY frame. However, the standard GOAWAY process may not be robust against malicious clients and needs adjustment.
- Mitigations involve tracking connection statistics and using business logic to determine how useful each connection is. Recommendations include closing connections that exceed the concurrent stream limit and applying similar mitigations for HTTP/3 (QUIC).
- Google coordinated with industry partners to address this new attack vector, and a coordinated vulnerability disclosure process was initiated to notify large-scale implementers of HTTP/2, enabling widespread protections and fixes.
- Providers with HTTP/2 services should assess their exposure to these attacks, and software patches and updates are recommended to address the vulnerabilities.
- Google recommends its customers patch their software and enable security features like the Application Load Balancer and Google Cloud Armor to protect against these types of attacks.
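To make the "track connection statistics and close abusive connections" idea a bit more concrete, here is a minimal Python sketch. It is not Google's implementation or any real server's API: the `ConnectionStats` class, the function names, and the thresholds (`min_streams`, `max_reset_ratio`) are all hypothetical, chosen only to illustrate the logic. A real server would hook this into its HTTP/2 stack and send a GOAWAY frame before closing the TCP connection.

```python
from dataclasses import dataclass


@dataclass
class ConnectionStats:
    """Per-connection counters a server might keep (hypothetical structure)."""
    streams_opened: int = 0
    streams_reset_by_client: int = 0


def record_stream(stats: ConnectionStats, reset_by_client: bool) -> None:
    """Update the counters when a stream completes or the client cancels it."""
    stats.streams_opened += 1
    if reset_by_client:
        stats.streams_reset_by_client += 1


def should_close_connection(
    stats: ConnectionStats,
    min_streams: int = 100,        # assumed: do not judge connections with little traffic
    max_reset_ratio: float = 0.5,  # assumed: tolerated share of client-canceled streams
) -> bool:
    """Return True when the connection looks abusive and should be closed."""
    if stats.streams_opened < min_streams:
        return False
    reset_ratio = stats.streams_reset_by_client / stats.streams_opened
    return reset_ratio > max_reset_ratio


if __name__ == "__main__":
    # A Rapid Reset style client: every request is canceled right after it is sent.
    attacker = ConnectionStats()
    for _ in range(1000):
        record_stream(attacker, reset_by_client=True)

    # An ordinary client: only an occasional cancellation.
    normal = ConnectionStats()
    for i in range(1000):
        record_stream(normal, reset_by_client=(i % 100 == 0))

    print("close attacker connection:", should_close_connection(attacker))
    print("close normal connection:", should_close_connection(normal))
```

The point of judging the whole connection rather than individual requests is the one made above: each canceled request looks harmless on its own, so the signal only shows up in the per-connection totals.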
- Slow-running databases are hard to diagnose, per Google Product Designers Mani HK and Kaushal Agrawal.
- Is our SQL saturated? What is consuming resources? What changed in the DB? Are there background tasks like vacuum and backup operations? Answering these questions can take real effort.
- That is why they built System Insights, a database system monitoring tool that brings together critical metrics, events and logs to provide a comprehensive view of both external database performance and internal system resources. Bringing all the signals into a single dashboard lets you quickly identify potential sources of problems without having to switch between tools.
- It is now GA for Postgres and Spanner, and in Preview for MySQL.
- They built this because of the friction caused by having to look at metrics on the instance overview page as well as in custom dashboards. The intent is to give you a quick snapshot of system status through pre-built dashboards with actionable metrics.
35:12📢 Ryan – “What are the biggest challenges for a lot of teams coming from Microsoft SQL Server using the Postgres is vacuum, right? I don’t know how it’s handled at Microsoft SQL Server. I just know that this is a common complaint from teams that are making that transition and they’re like, this isn’t performing, why not? And so having the insight into that to have that understanding of, you know, it’s an action you’re not triggering, sort of maintaining its indexes, which it needs to do, or it would slow to a crawl. So it’s great to just have that visibility, because once you know about it, you can tune it.”
- As of October 10th, 2023 Windows Server 2012 R2 has reached end of support.
- You can keep receiving security updates by purchasing Extended Security Updates enabled by Azure Arc,
- by migrating to Azure for free Extended Security Updates,
- or by modernizing to one of Microsoft’s PaaS offerings, including Azure SQL Managed Instance or Azure App Service.
38:11 📢 Ryan – “Microsoft can’t fund the investment of engineering this for, to the end of time. And they do a pretty good job, I think, with the length of life and the amount of options for extension, because there are extensions you can do, not being on the Azure platform. But I do think it is kind of clever for them to make that a feature of the Azure platform, you know, as far as being a differentiator.”
And that is the week in the cloud! We would like to thank our sponsor, Foghorn Consulting. Check out our website, the home of The Cloud Pod, where you can join our newsletter or Slack team, send feedback or ask questions at theCloudPod.net, or tweet at us with the hashtag #theCloudPod.