Welcome to episode 245 of The CloudPod podcast, where the forecast is always cloudy! This week is a real SBOM of an episode. (See what I did there?) Justin and Matthew have braved Teams outages, floods, cold, and funny business names to bring you the latest in Cloud and AI news. This week, we’re talking about Roomba, OpenTofu, and Oracle deciding AI makes money, along with a host of other stories. Join us!
Titles we almost went with this week:
- 🧹Amazon Decides Roomba Sucks
- ⚔️AI Weapons: Will They Shift Cloud Supremacy
- 🤑Oracle Realizes There is Money in Gen AI
A big thanks to this week’s sponsor:
We’re sponsorless this week! Interested in sponsoring us and having access to a very specialized and targeted market? We’d love to talk to you. Send us an email or hit us up on our Slack Channel.
General News
REMINDER: 2gather Sunnyvale: Cloud Optimization Summit
On February 15, Justin will be onsite in Google’s #Sunnyvale office for the @C2C #2Gather Sunnyvale: #CloudOptimization Summit! Come heckle him, we mean JOIN him, to talk about all things #GenAI and #CloudOps. Consider this your invitation – he’d love to see you there! Sign up → https://events.c2cglobal.com/e/m9pvbq/?utm_campaign=speaker-Justin-B&utm_source=SOCIAL_MEDIA&utm_medium=LinkedIn
02:23 Amazon abandons $1.4 billion deal to buy Roomba maker iRobot
- Amazon is no longer buying iRobot for $1.4 billion, as there is no path to regulatory approval in the European Union.
- We’re not surprised this is the end result.
- Of course, iRobot proceeded to lay off 350 employees, or around 31 percent of its workforce.
- In addition, Colin Angle, who co-founded the company, stepped down from both his CEO position and his chair position.
- Amazon gets to pay a $94 million termination fee to iRobot, which will help pay off a loan iRobot took the year prior.
04:02 Terraform fork OpenTofu launches into general availability
- OpenTofu has moved into General Availability.
- The milestone comes after a four-month development effort, with hundreds of contributors and over five dozen developers.
- Now that they have a stable version separated from the main Terraform product, they are promising a steady set of new features and enhancements.
- The GA version is OpenTofu 1.6, which includes hundreds of enhancements, including bug fixes and performance improvements.
- One of the big features is a replacement for the Terraform Registry, which is cheaper to run and faster to develop.
- The new OpenTofu registry came from multiple RFCs that were submitted and is 10x faster and 10x cheaper.
- A community member submitted an RFC for client-side state encryption, a feature they had been trying to get into Terraform since 2016.
- The next version of OpenTofu is set to introduce even more significant upgrades. The project’s developers are working on a plugin system that will make it easier for users to extend the core open-source with custom features.
- For more info, check out OT’s migration guide, or the OpenTofu Slack community.
07:12 📢 Justin- “I think HashiCorp has been kind of closed minded in what they could do in many ways. And so I am kind of curious to see where the community takes it, uh, which is the blessing and the curse of open source, right.”
AWS
09:55 Amazon VPC now supports idempotency for route table and network ACL creation
- Amazon VPC now supports idempotent creation of route tables and network ACLs, allowing you to safely retry creation without additional side effects.
- Idempotent creation of route tables and network ACLs is intended for customers that use network orchestration systems or automation scripts that create route tables or network ACLs as part of the workflow.
10:18 📢 Matthew- “10 years ago called and it really wanted this feature.”
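To make the "safely retry" part concrete, here is a minimal Python sketch of how an idempotency token behaves. It does not call AWS: `FakeVpcApi`, `FlakyApi`, the `client_token` parameter name, and the `rtb-0001` style IDs are all hypothetical stand-ins, and the real service's semantics may differ in details. The idea is simply that the caller generates one token per logical operation and reuses it on every retry, so a lost response does not lead to a second route table.

```python
import uuid


class FakeVpcApi:
    """In-memory stand-in for a create API that honors an idempotency token."""

    def __init__(self):
        self._by_token = {}  # token -> (vpc_id, table_id)
        self._counter = 0

    @property
    def table_count(self):
        return self._counter

    def create_route_table(self, vpc_id, client_token):
        if client_token in self._by_token:
            prior_vpc, table_id = self._by_token[client_token]
            if prior_vpc != vpc_id:
                # Same token, different request: refuse instead of guessing.
                raise ValueError("token reused with different parameters")
            return table_id  # repeat call: return the original resource
        self._counter += 1
        table_id = f"rtb-{self._counter:04d}"
        self._by_token[client_token] = (vpc_id, table_id)
        return table_id


class FlakyApi(FakeVpcApi):
    """Creates the resource but 'loses' the first N responses."""

    def __init__(self, drop_first_n):
        super().__init__()
        self._drops = drop_first_n

    def create_route_table(self, vpc_id, client_token):
        table_id = super().create_route_table(vpc_id, client_token)
        if self._drops > 0:
            self._drops -= 1
            raise TimeoutError("response lost after the resource was created")
        return table_id


def create_with_retries(api, vpc_id, attempts=3):
    token = str(uuid.uuid4())  # one token for the whole logical operation
    for attempt in range(attempts):
        try:
            return api.create_route_table(vpc_id, client_token=token)
        except TimeoutError:
            if attempt == attempts - 1:
                raise


if __name__ == "__main__":
    api = FlakyApi(drop_first_n=2)
    print(create_with_retries(api, "vpc-123"), api.table_count)
```

Without the token, each of those retries would have created another route table, which is exactly the cleanup job automation scripts have been stuck with.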
13:04 Integrating the AWS Lambda Telemetry API with Prometheus and OpenSearch
- Last week Google announced that you can integrate Prometheus with Cloud Run, and AWS said “hold my beer.”
- You can now take AWS Lambda telemetry (metrics, logs, traces) and integrate it into open source observability and telemetry solutions.
- This Lambda Telemetry API was announced in 2022… but we somehow missed it.
- The Telemetry API replaced the Lambda Logs API, which was always limited, so no great loss.
- Extensions subscribed to the API can send this data directly from AWS to Prometheus or OpenSearch, with support for building your own extensions and delivery points available as well.
13:43 📢 Matthew- “I love the direct integration. I don’t need to put lambdas back in the middle. Just immediately take my stuff and shove it into OpenSearch or shove it into Prometheus. Like, I don’t want to deal with it. I don’t want to deal with the toil. Just point A to point B and I’m done. Take care of it for me. I’m a lazy person. There’s a reason why I like the cloud. I don’t want to deal with this.”
15:41 Export a Software Bill of Materials using Amazon Inspector
- For those of you in regulated environments, you may be familiar with the SBOM, or Software Bill of Materials. This was one of the many recommendations after the supply chain attacks on SolarWinds a few years ago.
- Now Amazon Inspector has the ability to export a consolidated SBOM for supported Amazon Inspector monitored resources, excluding Windows EC2 instances.
- The SBOM will be in one of two industry-standard formats: CycloneDX or SPDX.
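If you have never looked inside an SBOM, here is a small Python sketch that summarizes a CycloneDX JSON document. The sample data is made up for illustration, and it assumes only the basic CycloneDX fields (`bomFormat`, `specVersion`, and a `components` list with `type`, `name`, and `version`); an actual Inspector export will likely carry more fields than this.

```python
import json
from collections import Counter

# Hypothetical, hand-written sample; not real Inspector output.
SAMPLE_SBOM = """
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.4",
  "components": [
    {"type": "library", "name": "openssl", "version": "3.0.8"},
    {"type": "library", "name": "requests", "version": "2.31.0"},
    {"type": "application", "name": "nginx", "version": "1.25.3"}
  ]
}
"""


def summarize_sbom(text):
    """Return component counts and a sorted name@version list."""
    doc = json.loads(text)
    if doc.get("bomFormat") != "CycloneDX":
        raise ValueError("not a CycloneDX document")
    components = doc.get("components", [])
    by_type = Counter(c.get("type", "unknown") for c in components)
    packages = sorted(f'{c["name"]}@{c.get("version", "?")}' for c in components)
    return {"total": len(components), "by_type": dict(by_type), "packages": packages}


if __name__ == "__main__":
    print(summarize_sbom(SAMPLE_SBOM))
```

A consolidated export makes this kind of inventory question ("which resources ship this package version?") a script instead of a spreadsheet exercise.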
17:32 AWS will invest $15B+ in Japan to expand its local data center footprint
- Google announced a few-billion-dollar expansion of datacenters in the UK, so Amazon responded by announcing that it will expand its Japanese datacenter footprint with a $15B+ investment through 2027.
- The expansion will be in Tokyo and Osaka, which covers both of their Japanese regions.
- This will help with mounting pressure from Microsoft and Google.
- Google opened a cloud datacenter about an hour outside of Tokyo, and Microsoft has operated Azure datacenters in Tokyo and Osaka as well.
- If you were struggling with the complexities of ETL and AWS Glue, you can now make that experience even worse with the new Amazon Q data integration in AWS Glue.
- The new chatbot is powered by Amazon Bedrock and understands natural language to author and troubleshoot data integration jobs.
- You can describe your data integration workload and Amazon Q will generate a complete ETL script.
- You can troubleshoot your jobs by asking Amazon Q to explain errors and propose solutions.
- Q will provide detailed guidance and will help you learn and build data integration jobs.
21:08 📢 Justin – “I am sort of curious how it’s going to work out. You know, like, oh, uh, you know, Amazon Q, write me, uh, a data ingestion job for this bucket to Redshift, right… but then it has to understand something about your data model, doesn’t it? To be able to do that, or is this going to create you a little piece of scaffolding and be like, here, this will do it, and it’s just a select star from S3 and just dump it in Redshift raw. It might be quick, it might be easy. It might also cost you a hundred million dollars. So just be careful.”
GCP
22:54 4 ways to reduce cold start latency on Google Kubernetes Engine
- First there was Lambda cold start, and now Google is blogging about GKE cold start and how to reduce your latency.
- While we definitely appreciate these approaches… shouldn’t ML/AI help us here on both AWS and Google to help design capacity based on standard patterns? Why don’t you all build that?
- Techniques to overcome the cold start challenge:
- Use ephemeral storage with local SSDs or larger boot disks.
- These provide higher read/write throughput compared to PD balanced disks.
- Enable container image streaming, which allows your container to start without waiting for the entire image to be downloaded.
- With GKE image streaming, the end-to-end startup for an NVIDIA Triton Server (5.4GB container image) is reduced from 191s to 30s.
- Use Zstandard-compressed container images, which are natively supported in containerd.
- Use a preloader DaemonSet to preload the base container on nodes.
24:07 📢 Matthew – “So move data closer. Make your data be smaller so it’s faster to load. Compress your data. And pre launch it so it’s there so you know. All very logical things. But – It’s okay to have a few seconds of cold start on stuff. Like, do you really need your model to load – or anything to load – at that exact second? And is it okay if it takes a second? So make sure you’re actually solving a real problem here that’s actually affecting your business, not just, you know, something that you think is a problem.”
- Google Organization Policy Service can help you control resource configurations and establish guard rails in your cloud environment.
- With custom organization policies, you can now create granular resource policies to help address your cloud governance requirements.
- The new capability comes with a dry run mode that lets you safely roll out new policies without impacting your production environments.
- Custom org policies add the ability to create and manage your own security and compliance policies that meet and address changes to your organization’s business requirements or policies.
- Prior to this feature you could only select from a library of more than 100 predefined policies.
- Custom org policies can be applied at the organization, folder or project level, and security admins can craft custom constraints tailored to their specific use case through Console, CLI, or API in a matter of minutes.
- Custom org policies can help you meet regulatory requirements including HIPAA, PCI-DSS and GDPR, or your own organization’s compliance standards.
- “Staying true to our mission of safeguarding Snap’s production infrastructure, we are continuously evolving and looking for new opportunities to establish access and policy guardrails. We’re excited to see custom organization policies go GA as we plan to adopt this product to help us enforce, amongst other things, GKE constraints associated with CIS benchmarks,” said Babak Bahamin, production security manager, Snap.
- A couple of example use cases: first, enforcing GKE auto-upgrade ensures that the nodes in your GKE clusters have the latest security fixes and reduces the overhead of manually updating nodes. The admin would set a custom constraint with a condition like “resource.management.autoUpgrade == true” and enforce it against your hierarchy.
- Another use case is restricting virtual machines, for example limiting them to a specific machine type like N2D for cost or compliance reasons. The policy can then be enforced centrally, and exceptions can be granted for approved use cases.
27:19 📢 Ryan – “Sounds great. But you’ll never get the CEL to do what you actually want.”
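For a feel of what a constraint like the auto-upgrade example checks, here is a rough Python stand-in. To be clear, this is not CEL and not how the Organization Policy Service evaluates anything; `NODE_POOLS`, `get_path`, and `evaluate` are hypothetical names for illustration, and the `management.autoUpgrade` field layout is an assumption about the resource shape. It also mimics the dry run idea: report what would be blocked without blocking it.

```python
# Hypothetical node pool resources, shaped as plain dictionaries.
NODE_POOLS = [
    {"name": "pool-a", "management": {"autoUpgrade": True}},
    {"name": "pool-b", "management": {"autoUpgrade": False}},
    {"name": "pool-c", "management": {}},  # field missing entirely
]


def get_path(resource, dotted_path):
    """Walk a dotted path like 'management.autoUpgrade'; None if absent."""
    current = resource
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def evaluate(resources, field, required, dry_run=True):
    """Split resources into allowed and violating names.

    In dry run mode violations are only reported ('would_block');
    otherwise they are listed as 'blocked'.
    """
    violating = [r["name"] for r in resources if get_path(r, field) != required]
    allowed = [r["name"] for r in resources if r["name"] not in violating]
    key = "would_block" if dry_run else "blocked"
    return {"allowed": allowed, key: violating}


if __name__ == "__main__":
    print(evaluate(NODE_POOLS, "management.autoUpgrade", True))
    print(evaluate(NODE_POOLS, "management.autoUpgrade", True, dry_run=False))
```

Note that the missing field counts as a violation here; deciding how absent fields should be treated is one of those things worth testing in dry run before you enforce anything for real.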
Azure
28:45 Microsoft’s AI Coding Product Becomes Weapon in Battle with AWS
- Listener note: paywall article
- The Information states the obvious: Microsoft’s AI coding product has become a weapon in its fight with AWS for cloud customers.
- Microsoft has continued to invest heavily in AI in the hope of encouraging customers to try its Azure service, with everything from GitHub Copilot (powered by OpenAI) to Office Copilot and more.
- The Information points to Goldman Sachs, which has long used a mix of GitHub, GitLab and other code repos but has increasingly spent more on GitHub as it buys Copilot seats for its 10,000 software developers. This has also resulted in a 20% increase in Azure spend in the second half of the year, putting it on pace to spend more than $10 million annually across Azure.
30:20 Improved exports experience
- Azure is introducing a new, improved experience for exporting your FinOps data. With automatic exports of additional cost-impacting datasets, the updated exports are optimized to handle large datasets while enhancing the user experience.
- The enhanced user interface now allows you to create multiple exports for various datasets, including the new FOCUS format, and manage them all in one place.
- You can check out some of the preview features here.
Oracle
32:26 The Future of Generative AI: What Enterprises Need to Know
- Did you know you could make money with AI? Oracle just figured it out! Generative AI can make money. Who knew?
- Oracle posted a blog post about what enterprises need to know about the future of generative AI. And we hadn’t really seen much from Oracle on this topic… so color us intrigued.
- Oracle acknowledges that AI has captured the imagination of enterprise executives.
- Oracle states that enterprises need AI that can impact business outcomes, and that models need to be fine-tuned or augmented by an organization’s data and intellectual property, designed to deliver outputs only a model familiar with that organization can deliver.
- Oracle contends that you need AI at every layer of the stack, from SaaS apps and AI services to data and infrastructure.
- Oracle set out to carefully think through the enterprise’s business processes and how they could be enhanced with generative AI, creating an end-to-end generative AI experience that encompasses their entire stack.
- Oracle contends that AI at Oracle is designed to be seamless, not piecemeal parts or tools that you have to assemble into a do-it-yourself project.
- As Dave Vellante, Chief Research Officer at Wikibon recently said, “Oracle is taking a full stack approach to enterprise generative AI. Oracle’s value starts at the top of the stack, not in silicon. By offering integrated generative AI across its Fusion SaaS applications, Oracle directly connects to customer business value. These apps are supported by autonomous databases with vector embeddings and run on high-performance infrastructure across OCI or on-prem with Dedicated Region. Together these offerings comprise a highly differentiated enterprise AI strategy, covering everything from out-of-the-box RAG to a broad range of fine-tuned models and AI infused throughout an integrated stack. Our research shows that 2023 was the year of AI experimentation. With capabilities such as this, our expectation is that 2024 will be the year of showing ROI in AI.”
- To power all of this, Oracle is announcing several new things:
- First, GA of OCI Generative AI service.
- The AI service supports Llama2 and Cohere’s models, with a multilingual embedding capability for over 100 languages.
- They have also added improvements to make it easier to work with LLMs with functionalities such as LangChain integration, endpoint management and content moderation.
- OCI Gen AI also includes an improved GPU cluster management experience with multi-endpoint support to host clusters.
- OCI Generative AI Agents (Beta). Agents translate user queries into tasks that Gen AI components perform to answer the queries.
- The first is a retrieval augmented generation (RAG) agent that complements the LLMs with internal data using OCI OpenSearch to provide contextually relevant answers.
- OCI Data Science Quick Actions: a no-code feature of the OCI Data Science service that enables access to a wide range of open source LLMs, including options from Meta, Mistral AI and more. The AI quick actions will provide verification and environment checks, models, curated deployment models, few-click fine-tuning tasks, monitoring of fine-tuning, and playground features.
- Oracle Fusion Cloud Apps and Oracle Database will be getting AI capabilities. The initial use cases are focused on summarization and assisted authoring, such as summarizing performance reviews, assisted authoring for job descriptions, etc.
- Oracle Database 23c with AI Vector Search and MySQL HeatWave with Vector Store provide RAG capabilities to prompts.
- With Autonomous Database Select AI, customers can leverage an LLM to use natural language queries rather than writing SQL when interacting with Autonomous Database.
- Oracle isn’t done in this space with promises for several enhancements including:
- Oracle Digital Assistant
- OCI Language
- Document Translation Experience
- OCI Vision (facial detection)
- OCI Speech
- OCI Document Understanding
- OCI Data Science
- WHEW. We thought Amazon was behind, but Oracle might be even further in the rear view mirror.
- And yes, Justin did read the whole article, so you don’t have to. Just one of the many services brought to you by The CloudPod.
After Show
37:20 Snow day in corporate world thanks to another frustrating Microsoft Teams outage
- #hugops for the Microsoft Teams Team on Friday and then again on Monday as they suffered a major Teams outage.
- Teams died on Friday, and we wish it was permanent. But at least I missed out on a ton of meetings, chats and recordings. Woohoo!
- Microsoft blamed the issue on a network outage that broke Teams.
- Whatever the issue was, it was limited to Teams, and Microsoft shared that they were failing over in the Europe, Middle East and Africa regions.
- In North America, though, the failover didn’t fix the issue, and end users were impacted from 14:55 UTC to 22:12 UTC.
- We’re really looking forward to reading leaked info about what happened here.
39:47 📢 Matthew- “I’m still curious why. In network issues, you know they said, they failed over in some countries, which if it was networking, unless they’re running all their own infrastructure, I just assumed that they would just be running on Azure, but that made too much sense, I guess. Maybe they would break Azure if they ran all their workloads there.”
40:09 📢 Justin – “I wonder if, it’s probably DNS. I mean, like, there’s a network issue. It’s always DNS. So it’s probably gonna be BGP. But I mean, if it was BGP, I think again, it would be more than just Teams. Unless the Teams team doesn’t understand latency and failovers and BGP routing, you have to reconnect things. But then, like, why wouldn’t the failover work?”
41:05 Quantifying the impact of developer experience
- Nicole Forsgren has a great thought leadership blog post on quantifying the impact of the developer experience.
- The big focus has been on how to make developers achieve more, quicker.
- What was once called developer productivity, then developer velocity, is now mostly called developer experience, or DevEx.
- DevEx is not just about individual developer satisfaction; it directly influences the quality, reliability, maintainability and security of software systems.
- The recently published study DevEx in Action: A Study of Its Tangible Impacts seeks to quantify the impact of improved DevEx at three levels: individual, team and organization.
- Overall the research is promising.
- A few teasers:
- Flow State
- Developers who had a significant amount of time carved out for deep work felt 50% more productive, compared to those without dedicated time.
- Developers who find their work engaging feel they are 30% more productive, compared to those who found their work boring.
- Cognitive Load
- Developers who report a high degree of understanding of the code they work with feel 42% more productive than those who report low to no understanding.
- Developers who find their tools and work processes intuitive and easy to use feel they are 50% more innovative, compared to those with opaque or hard-to-understand processes.
- Feedback Loops
- Developers who report fast code review turnaround times feel 20% more innovative compared to developers who report slow turnaround times.
- Teams that provide fast responses to developers’ questions report 50% less tech debt than teams where responses are slow.
Closing
And that is the week in the cloud! Just a reminder – if you’re interested in joining us as a sponsor, let us know! Check out our website, the home of The CloudPod, where you can join our newsletter or Slack team, send feedback, or ask questions at theCloudPod.net, or tweet at us with hashtag #theCloudPod.