326: Oracle Discovers the Dark Side (And Finally Has Cookies)

Welcome to episode 326 of The Cloud Pod, where the forecast is always cloudy! Justin and Ryan are your guides to all things cloud and AI this week! We’ve got news from SonicWall (and it’s not great), a host of goodbyes to say over at AWS, Oracle (finally) joins the dark side, and even Slurm – and you don’t even need to ride on a creepy river to experience it. Let’s get started! 

Titles we almost went with this week

  • 🧱 SonicWall’s Cloud Backup Service: From 5% to Oh No, That’s Everyone
  • 🧼 AWS Spring Cleaning: 19 Services Get the Boot
  • 🗑️ The Great AWS Service Purge of 2025
  • ☠️ Maintenance Mode: Where Good Services Go to Die
  • ✈️ GitHub Gets Assimilated: Resistance to Azure Migration is Futile
  • 🎁 Salesforce to Ransomware Gang: You Can’t Always Get What You Want
  • 🏙️ Kansas City Gets the Need for Speed with 100G Direct Connect. Peter, what are you up to?
  • 🛞 Gemini Takes the Wheel: Google’s AI Learns to Click and Type 
  • 🌑 Oracle Discovers the Dark Side (Finally Has Cookies)
  • 💾 Azure Goes Full Blackwell: 4,600 Reasons to Upgrade Your GPU Game
  • 👮 DataStax to the Future: AWS Hires Database CEO for Security Role
  • 🪖 The Clone Wars: EBS Strikes Back with Instant Volume Copies
  • 🥤 Slurm Dunk: AWS Brings HPC Scheduling to Kubernetes
  • 🤝 The Great Cluster Convergence: When Slurm Met EKS
  • 💬 Codex sent me a DM that I’ll ignore too on Slack

General News 

01:24 SonicWall: Firewall configs stolen for all cloud backup customers

  • SonicWall confirmed that all customers using their cloud backup service had firewall configuration files exposed in a breach, expanding from their initial estimate of 5% to 100% of cloud backup users. That’s a big difference…
  • The exposed backup files contain AES-256-encrypted credentials and configuration data, which could include MFA seeds for TOTP authentication, potentially explaining recent Akira ransomware attacks that bypassed MFA (see the sketch after this list).
  • SonicWall requires affected customers to reset all credentials, including local user passwords, TOTP codes, VPN shared secrets, API keys, and authentication tokens across their entire infrastructure.
  • This incident highlights a fundamental security risk of cloud-based configuration backups where sensitive credentials are stored centrally, making them attractive targets for attackers.
  • The breach demonstrates why WebAuthn/passkeys offer superior security architecture since they don’t rely on shared secrets that can be stolen from backups or servers.
  • Interested in checking out their detailed remediation guidance? Find that here.
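
To make the MFA angle concrete: a TOTP seed is just a shared secret, so anyone holding a copy can mint valid codes. A minimal sketch using the pyotp library (the seed below is invented for illustration):

```python
import pyotp

# A TOTP seed is a base32-encoded shared secret. This one is made up for
# illustration; an attacker with a seed recovered from a stolen config
# backup would use it identically.
stolen_seed = "JBSWY3DPEHPK3PXP"

totp = pyotp.TOTP(stolen_seed)
print(totp.now())  # a currently valid 6-digit code, no phishing required

# The server verifies against the same secret, so the code is accepted:
print(totp.verify(totp.now()))  # True
```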

02:36 📢 Justin – “You know, providing your own encryption keys is also good; not allowing your SaaS vendor to have the encryption key is a positive thing to do. There’s all kinds of ways to protect your data in the cloud when you’re leveraging a SaaS service.”
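
In the spirit of Justin’s point, here’s a minimal sketch of encrypting a backup client-side before it ever reaches a SaaS vendor, using the Python cryptography library’s Fernet recipe. File names are placeholders, and in practice the key would live in your own KMS or HSM rather than beside the backup:

```python
from cryptography.fernet import Fernet

# Generate and keep the key yourself; the vendor only ever sees ciphertext.
# In practice, store the key in your own KMS/HSM, not on disk next to the backup.
key = Fernet.generate_key()
f = Fernet(key)

with open("firewall-config.bak", "rb") as src:  # placeholder path
    ciphertext = f.encrypt(src.read())

with open("firewall-config.bak.enc", "wb") as dst:
    dst.write(ciphertext)  # upload this; a vendor breach now leaks nothing usable
```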

04:43 Take this rob and shove it! Salesforce issues stern retort to ransomware extort

  • Salesforce is refusing to pay ransomware demands from criminals claiming to have stolen nearly 1 billion customer records, stating they will not engage, negotiate with, or pay any extortion demand. 
  • This firm stance sets a precedent for how major cloud providers handle ransomware attacks.
  • The stolen data appears to be from previous breaches rather than new intrusions, specifically from when ShinyHunters compromised Salesloft’s Drift application earlier this year. 
  • The attackers used stolen OAuth tokens to access multiple companies’ Salesforce instances.
  • The incident highlights the security risks of third-party integrations in cloud environments, as the breach originated through a compromised integration app rather than Salesforce’s core platform. 
  • This demonstrates how supply chain vulnerabilities can expose customer data across multiple organizations.
  • Scattered LAPSUS$ Hunters set an October 10 deadline for payment and offered $10 in Bitcoin to anyone willing to harass executives of affected companies. This unusual tactic shows evolving extortion methods beyond traditional ransomware encryption.
  • Salesforce maintains there’s no indication their platform has been compromised, and no known vulnerabilities in their technology were exploited. The company is working with external experts and authorities while supporting affected customers through the incident.

06:31 📢 Ryan – “I do also really like Salesforce’s response, just because I feel like the ransomware has gotten a little out of hand, and I think a lot of companies are quietly paying these ransoms, which has only made the attacks just skyrocket. So making a big public show of saying we’re not going to pay for this is a good idea.”

AI is Going Great – Or How ML Makes Money 

07:06 Introducing AgentKit

  • OpenAI’s AgentKit provides a framework for building and managing AI agents with simplified deployment and customization options, addressing the growing need for autonomous AI systems in cloud environments.
  • The tool integrates with existing OpenAI technologies and supports multiple programming languages, enabling developers to create agents that can interact with various cloud services and APIs without extensive infrastructure setup.
  • AgentKit’s architecture allows for efficient agent lifecycle management, including deployment, monitoring, and behavior customization, which could reduce operational overhead for businesses running AI workloads at scale.
  • Key use cases include automated customer service agents, data processing pipelines, and intelligent workflow automation that can adapt to changing conditions in cloud-native applications.
  • This development matters for cloud practitioners as it potentially lowers the barrier to entry for implementing sophisticated AI agents while providing the scalability and reliability expected in enterprise cloud deployments.
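
For a rough feel of the code side, here’s a minimal sketch using OpenAI’s Agents SDK for Python, the code-level counterpart to AgentKit’s visual Agent Builder. Treat the package and class names as our reading of OpenAI’s docs; it assumes OPENAI_API_KEY is set and `pip install openai-agents`:

```python
from agents import Agent, Runner  # openai-agents package

# Define a small agent in code; AgentKit's Agent Builder produces
# equivalent configurations visually instead.
support_agent = Agent(
    name="support-triage",
    instructions="Classify the customer message and draft a short reply.",
)

result = Runner.run_sync(support_agent, "My invoice total looks wrong this month.")
print(result.final_output)
```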

09:03 Codex Now Generally Available

  • OpenAI’s Codex is now generally available, offering an AI coding agent, now powered by GPT-5-Codex, that’s fine-tuned specifically for code generation and understanding across multiple programming languages. This represents a significant advancement in AI-assisted development tools becoming mainstream.
  • Several new features ship with GA. A new Slack integration lets you delegate tasks or ask questions to Codex directly from a team channel or thread, just like a coworker.
  • The Codex SDK lets you embed the same agent that powers the Codex CLI into your own workflows, tools, and apps, with state-of-the-art performance on GPT-5-Codex and no additional tuning.
  • New admin tools add environment controls, monitoring, and analytics dashboards, giving ChatGPT workspace admins more control.

09:48 📢 Ryan – “I don’t know why, but something about having it available in Slack to boss it around sort of rubs me the wrong way. I feel like it’s the poor new college grad joining the team – it’s just delegated all the crap jobs.”

10:14 Introducing the Gemini 2.5 Computer Use model

  • Google released Gemini 2.5 Computer Use model via Gemini API, enabling AI agents to interact with graphical user interfaces through clicking, typing, and scrolling actions – available in Google AI Studio and Vertex AI for developers to build automation agents.
  • The model operates in a loop using screenshots and action history to navigate web pages and applications, outperforming competitors on web and mobile control benchmarks while maintaining the lowest latency among tested solutions (see the loop sketch after this list).
  • Built-in safety features include per-step safety service validation and system instructions to prevent high-risk actions like bypassing CAPTCHA or compromising security, with developers able to require user confirmation for sensitive operations.
  • Early adopters, including Google teams, use it for UI testing and workflow automation, with the model already powering Project Mariner, Firebase Testing Agent, and AI Mode in Search – demonstrating practical enterprise applications.
  • This represents a shift from API-only interactions to visual UI control, enabling automation of tasks that previously required human interaction like form filling, dropdown navigation, and operating behind login screens.
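
The screenshot-act-observe loop is easy to picture in code. This is a schematic sketch only: all three helper functions are hypothetical stand-ins, not the actual Gemini API surface:

```python
# Schematic agent loop for a computer-use model. The helpers below are
# hypothetical placeholders; the real Gemini 2.5 Computer Use model is
# called through the Gemini API in Google AI Studio or Vertex AI.

def take_screenshot() -> bytes:
    return b""  # placeholder: capture the current UI state

def ask_model(shot: bytes, history: list) -> dict:
    return {"type": "done"}  # placeholder: model returns e.g. {"type": "click", "x": 120, "y": 88}

def execute_action(action: dict) -> None:
    pass  # placeholder: click/type/scroll via an automation driver

def run(goal: str, max_steps: int = 50) -> None:
    history: list[dict] = []
    for _ in range(max_steps):
        shot = take_screenshot()
        action = ask_model(shot, history)  # model sees screenshot + action history
        if action["type"] == "done":
            break
        # Per the announcement, a per-step safety check sits here in the real
        # service; high-risk actions can require human confirmation.
        execute_action(action)
        history.append(action)

run("Fill out the signup form")
```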

11:48 📢 Ryan – “I think this is the type of thing that really is going to get AI to be as big as the Agentic model in general; having it be able to understand clicks and UIs and operate on people’s behalf. It’s going to open up just a ton of use cases for it.”

AWS

12:35 AWS Service Availability Change Announcement

  • AWS is moving 19 services to maintenance mode starting November 7, 2025, including Amazon Glacier, AWS CodeCatalyst, and Amazon Fraud Detector – existing customers can continue using these services but new customers will be blocked from adoption.
  • Several migration-focused services are being deprecated, including AWS Migration Hub, AWS Application Discovery Service, and AWS Mainframe Modernization Service, signaling AWS may be consolidating or rethinking its migration tooling strategy.
  • The deprecation of Amazon S3 Object Lambda and Amazon Cloud Directory suggests AWS is streamlining overlapping functionality – customers will need to evaluate alternatives like Lambda@Edge or AWS Directory Service for similar capabilities.
  • AWS Snowball Edge Compute Optimized and Storage Optimized entering maintenance indicates AWS is likely pushing customers toward newer edge computing solutions like AWS Outposts or Local Zones for hybrid deployments.
  • The sunset of specialized services like AWS HealthOmics Variant Store and AWS IoT SiteWise Monitor shows AWS pruning niche offerings that may have had limited adoption or overlapping functionality with other services.

13:53 📢 Ryan – “It’s interesting, because I was a heavy user of CodeGuru and CodeCatalyst for a while, so the announcement I got as a customer was a lot less friendly than maintenance mode. It was like, your stuff’s going to end. So I don’t know if it’s true across all these services, but I know with at least those two. I did not get one for Glacier – because I also have a ton of stuff in Glacier, because I’m cheap.” 

17:01 AWS Direct Connect announces 100G expansion in Kansas City, MO

  • AWS Direct Connect now offers 100 Gbps dedicated connections with MACsec encryption at the Netrality KC1 data center in Kansas City, expanding high-bandwidth private connectivity options in the central US region.
  • The Kansas City location provides direct network access to all public AWS Regions (except China), AWS GovCloud Regions, and AWS Local Zones, making it a strategic connectivity hub for enterprises in the Midwest.
  • With 100G connections and MACsec encryption, organizations can achieve lower latency and enhanced security for workloads requiring high throughput, such as data analytics, media processing, or hybrid cloud architectures.
  • This expansion brings AWS Direct Connect to over 146 locations worldwide, reinforcing AWS’s commitment to providing enterprises with reliable alternatives to internet-based connectivity for mission-critical applications.
  • For businesses evaluating Direct Connect, the 100G option typically suits large-scale data transfers and enterprises with substantial bandwidth requirements, while the 10G option remains available for more moderate connectivity needs.

18:07 AWS IAM Identity Center now supports customer-managed KMS keys for encryption at rest | AWS News Blog

  • AWS IAM Identity Center now supports customer-managed KMS keys for encrypting identity data at rest, giving organizations in regulated industries full control over encryption key lifecycle, including creation, rotation, and deletion. This addresses compliance requirements for customers who previously could only use AWS-owned keys.
  • The feature requires symmetric KMS keys in the same AWS account and region as the Identity Center instance, with multi-region keys recommended for future flexibility. Implementation involves creating the key, configuring detailed permissions for Identity Center services and administrators, and updating IAM policies for cross-account access.
  • Not all AWS managed applications currently support Identity Center with customer-managed keys – administrators must verify compatibility before enabling to avoid service disruptions. The documentation provides specific policy templates for common use cases, including delegated administrators and application administrators.
  • Standard AWS KMS pricing applies for key storage and API usage while Identity Center remains free. The feature is available in all AWS commercial regions, GovCloud, and China regions.
  • Key considerations include the critical nature of proper permission configuration – incorrect setup can disrupt Identity Center operations and access to AWS accounts. Organizations should implement encryption context conditions to restrict key usage to specific Identity Center instances for enhanced security.
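
Here’s a hedged sketch of the key-creation step with boto3. The account ID is a placeholder, and the exact Identity Center statements and encryption-context condition keys should come from the policy templates in AWS’s documentation:

```python
import json
import boto3

kms = boto3.client("kms")

# Skeleton key policy: account root retains control, and IAM Identity Center
# (service principal sso.amazonaws.com) gets decrypt/describe access. The
# documented templates add encryption-context conditions pinning usage to a
# specific Identity Center instance ARN - include those in production.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AccountAdmin",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:root"},  # placeholder account
            "Action": "kms:*",
            "Resource": "*",
        },
        {
            "Sid": "AllowIdentityCenter",
            "Effect": "Allow",
            "Principal": {"Service": "sso.amazonaws.com"},
            "Action": ["kms:Decrypt", "kms:DescribeKey", "kms:GenerateDataKey*"],
            "Resource": "*",
        },
    ],
}

resp = kms.create_key(
    Description="CMK for IAM Identity Center data at rest",
    KeySpec="SYMMETRIC_DEFAULT",  # the feature requires a symmetric key
    KeyUsage="ENCRYPT_DECRYPT",
    Policy=json.dumps(policy),
)
print(resp["KeyMetadata"]["Arn"])
```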

18:52 📢 Justin – “Incorrect setup can disrupt Identity Center operations. Like, revoking your encryption key might be bad for your access to your cloud. So be careful with this one.”

19:28 New general-purpose Amazon EC2 M8a instances are now available | AWS News Blog

  • AWS launches M8a instances powered by 5th Gen AMD EPYC Turin processors, delivering up to 30% better performance and 19% better price-performance than M7a instances for general-purpose workloads.
  • The new instances feature 45% more memory bandwidth and 50% improvements in networking (75 Gbps) and EBS bandwidth (60 Gbps), making them suitable for financial applications, gaming, databases, and SAP-certified enterprise workloads.
  • M8a introduces instance bandwidth configuration (IBC), allowing customers to flexibly allocate resources between networking and EBS bandwidth by up to 25%, optimizing for specific workload requirements.
  • Each vCPU maps to a physical CPU core without SMT, resulting in up to 60% faster GroovyJVM performance and 39% faster Cassandra performance compared to M7a instances.
  • Available in 12 sizes from small to metal-48xl (192 vCPU, 768GiB RAM) across three regions initially, with standard pricing options including On-Demand, Savings Plans, and Spot instances.

20:01 📢 Ryan – “That’s a big one! I still don’t have a use case for it.” 

 

20:09 Announcing Amazon Quick Suite: your agentic teammate for answering questions and taking action | AWS News Blog

  • Amazon Quick Suite combines AI-powered research, business intelligence, and automation into a single workspace, eliminating the need to switch between multiple applications for data gathering and analysis. 
  • The service includes Quick Research for comprehensive analysis across enterprise and external sources, Quick Sight for natural language BI queries, and Quick Flows/Automate for process automation.
  • Quick Index serves as the foundational knowledge layer, creating a unified searchable repository across databases, documents, and applications that powers AI responses throughout the suite. This addresses the common enterprise challenge of fragmented data sources by consolidating everything from S3, Snowflake, Google Drive, and SharePoint into one intelligent knowledge base.
  • The automation capabilities are split between Quick Flows for business users (natural language workflow creation) and Quick Automate for technical teams (complex multi-department processes with approval routing and system integrations). 
  • Both tools generate workflows from simple descriptions, but Quick Automate handles enterprise-scale processes like customer onboarding with advanced orchestration and monitoring.
  • Existing Amazon QuickSight customers will be automatically upgraded to Quick Suite with all current BI capabilities preserved under the “Quick Sight” branding, maintaining the same data connectivity, security controls, and user permissions. Pricing follows a per-user subscription model with consumption-based charges for Quick Index and optional features.
  • The service introduces “Spaces” for contextual data organization and custom chat agents that can be configured for specific departments or use cases, enabling teams to create tailored AI assistants connected to relevant datasets and workflows. This allows organizations to scale from personal productivity tools to enterprise-wide deployment while maintaining access controls.

22:13 📢 Justin – “This is a confusing product. It’s doing a lot of things, probably kind of poorly.” 

23:13 AWS Strengthens AI Security by Hiring Ex-DataStax CEO As New VP – Business Insider

  • AWS hired Chet Kapoor, former DataStax CEO, as VP of Security Services and Observability, reporting directly to CEO Matt Garman, to strengthen security offerings as AWS expands its AI business.
  • Kapoor brings experience from DataStax, where he led Astra DB development and integrated real-time AI capabilities, positioning him to address the security challenges of increasingly complex cloud deployments.
  • The role consolidates leadership of security services, governance, and operations portfolios under one executive, with teams from Gee Rittenhouse, Nandini Ramani, Georgia Sitaras, and Brad Marshall now reporting to Kapoor.
  • This hire follows recent AWS leadership changes, including the departures of VP of AI Matt Wood and VP of generative AI Vasi Philomin, signaling AWS’s focus on strengthening AI security expertise.
  • Kapoor will work alongside AWS CISO Amy Herzog to develop security and observability services that address what Garman describes as changing requirements driven by AI adoption.

26:03 📢 Justin – “Also, DataStax was bought by IBM – and everyone knows that anything bought by IBM will be killed mercilessly.” 

26:50 Amazon Bedrock AgentCore is now generally available

  • Amazon Bedrock AgentCore provides a managed platform for building and deploying AI agents that can execute for up to 8 hours with complete session isolation, supporting any framework like CrewAI, LangGraph, or LlamaIndex, and any model inside or outside Amazon Bedrock.
  • The service includes five core components: Runtime for execution, Memory for state management, Gateway for tool integration via Model Context Protocol, Identity for OAuth and IAM authorization, and Observability with CloudWatch dashboards and OTEL compatibility for monitoring agents in production.
  • AgentCore enables agents to communicate with each other through Agent-to-Agent protocol support and securely act on behalf of users with identity-aware authorization, making it suitable for enterprise automation scenarios that require extended execution times and complex tool interactions.
  • The platform eliminates infrastructure management while providing enterprise features like VPC support, AWS PrivateLink, and CloudFormation templates, with consumption-based pricing and no upfront costs across nine AWS regions.
  • Integration with existing observability tools like Datadog, Dynatrace, and LangSmith allows teams to monitor agent performance using their current toolchain, while the self-managed memory strategy gives developers control over how agents store and process information.
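
For flavor, here’s a hello-world of an AgentCore-hosted agent in the Python starter pattern from AWS’s samples. Treat the package, module, and method names as approximate and verify against the current docs:

```python
# pip install bedrock-agentcore  (package/module names per AWS's starter
# samples; verify against current docs before relying on them)
from bedrock_agentcore.runtime import BedrockAgentCoreApp

app = BedrockAgentCoreApp()

@app.entrypoint
def handler(payload: dict) -> dict:
    # Agent logic goes here - any framework (CrewAI, LangGraph, LlamaIndex)
    # and any model, per the announcement. Runtime wraps this handler with
    # session isolation and up to 8 hours of execution.
    prompt = payload.get("prompt", "")
    return {"result": f"echo: {prompt}"}

if __name__ == "__main__":
    app.run()  # local dev server; deploy to AgentCore Runtime for production
```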

28:17 📢 Ryan – “This really, to me, seems like a full app, you know, like this is a core component; instead of doing development, you’re just taking AI agents, putting them together, and giving them tasks. Then, the eight-hour runtime is crazy. It feels like it’s getting warmer in here just reading that.”

28:49 AWS’ Custom Chip Now Powers Most of Its Key AI Cloud Service — The Information

  • AWS has transitioned the majority of its AI inference workloads to its custom Inferentia chips, marking a significant shift away from Nvidia GPUs for production AI services. 
  • The move demonstrates AWS’s commitment to vertical integration and cost optimization in the AI infrastructure space.
  • Inferentia chips now handle most inference tasks for services like Amazon Bedrock, SageMaker, and internal AI features across AWS products. 
  • This custom silicon strategy allows AWS to reduce dependency on expensive third-party GPUs while potentially offering customers lower-cost AI inference options.
  • The shift to Inferentia represents a broader industry trend where cloud providers develop custom chips to differentiate their services and control costs. AWS can now optimize the entire stack from silicon to software for specific AI workloads, similar to Apple’s approach with its M-series chips.
  • For AWS customers, this transition could mean more predictable pricing and better performance-per-dollar for inference workloads. The custom chips are specifically designed for inference rather than training, making them more efficient for production AI applications.
  • This development positions AWS to compete more effectively with other cloud providers on AI pricing while maintaining control over its technology roadmap. 
  • Customers running inference-heavy workloads may see cost benefits as AWS passes along savings from reduced reliance on Nvidia hardware.

29:39 📢 Ryan – “Explains all the Oracle and Azure Nvidia announcements.” 

30:16 Introducing Amazon EBS Volume Clones: Create instant copies of your EBS volumes | AWS News Blog

  • Amazon EBS Volume Clones enables instant point-in-time copies of encrypted EBS volumes within the same Availability Zone through a single API call, eliminating the previous multi-step process of creating snapshots in S3 and then new volumes (that old flow is sketched below).
  • Cloned volumes are available within seconds with single-digit millisecond latency, though performance during initialization is limited to the lowest of: 3,000 IOPS/125 MiB/s baseline, source volume performance, or target volume performance.
  • This feature targets development and testing workflows where teams need quick access to production data copies, but it complements rather than replaces EBS snapshots, which remain the recommended backup solution with 11 nines durability in S3.
  • Pricing includes a one-time fee per GiB of source volume data at initiation, plus standard EBS charges for the new volume, making cost governance important since cloned volumes persist independently until manually deleted.
  • The feature currently requires encrypted volumes and operates only within the same Availability Zone, supporting all EBS volume types across AWS commercial regions and select Local Zones.
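
For contrast, here’s the old snapshot round-trip in boto3 that a single clone call now replaces (IDs and the AZ are placeholders; check your SDK version and the EBS docs for the new clone operation’s exact name):

```python
import boto3

ec2 = boto3.client("ec2")

# The pre-Volume Clones flow: snapshot to S3, wait, then restore - minutes
# to hours depending on volume size, versus seconds for a clone.
snap = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",  # placeholder source volume
    Description="copy of prod data for testing",
)
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

vol = ec2.create_volume(
    SnapshotId=snap["SnapshotId"],
    AvailabilityZone="us-east-1a",  # clones must stay in the same AZ
)
print(vol["VolumeId"])
```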

32:06 Running Slurm on Amazon EKS with Slinky | Containers

  • AWS introduces Slinky, an open source project that lets you run Slurm workload manager inside Amazon EKS, enabling organizations to manage both traditional HPC batch jobs and modern Kubernetes workloads on the same infrastructure without maintaining separate clusters.
  • The solution deploys Slurm components as Kubernetes pods with slurmctld on general-purpose nodes and slurmd on GPU/accelerated nodes, supporting features like auto-scaling worker pods based on job queues and integration with Karpenter for dynamic EC2 provisioning.
  • Key benefit is resource optimization – AI inference workloads can scale during business hours while training jobs scale overnight using the same compute pool, with teams able to use familiar Slurm commands (sbatch, srun) alongside Kubernetes APIs (see the submission sketch after this list).
  • Slinky provides an alternative to AWS ParallelCluster (self-managed), AWS PCS (managed Slurm), and SageMaker HyperPod (ML-optimized) for organizations already standardized on EKS who need deterministic scheduling for long-running jobs.
  • The architecture supports custom container images, allowing teams to package specific ML dependencies (CUDA, PyTorch versions) directly into worker pods, eliminating manual environment management while maintaining reproducibility across environments.
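
If you haven’t met Slurm, a job is just a shell script with #SBATCH directives. A minimal sketch, driven from Python; the partition name and training script are placeholders, and on a Slinky cluster you’d run this from a pod that has the Slurm client tools installed:

```python
import subprocess
from textwrap import dedent

# A classic Slurm batch job: directives up top, then the work. sbatch reads
# the job script from stdin when no filename is given.
job_script = dedent("""\
    #!/bin/bash
    #SBATCH --job-name=train-demo
    #SBATCH --partition=gpu          # placeholder partition name
    #SBATCH --gres=gpu:1             # one GPU from the accelerated node pool
    #SBATCH --time=01:00:00
    srun python train.py             # placeholder workload
""")

result = subprocess.run(
    ["sbatch"], input=job_script, text=True, capture_output=True, check=True
)
print(result.stdout)  # e.g. "Submitted batch job 42"
```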

GCP

33:09 Introducing Gemini Enterprise | Google Cloud Blog

  • Google launches Gemini Enterprise as a unified AI platform that combines Gemini models, no-code agent building, pre-built agents, data connectors for Google Workspace and Microsoft 365, and centralized governance through a single chat interface. 
  • This positions Google as offering a complete AI stack, rather than just models or toolkits like competitors.
  • The platform includes notable integrations with Microsoft 365 and SharePoint environments while offering enhanced features when paired with Google Workspace, including new multimodal agents for video creation (Google Vids with 2.5M monthly users) and real-time speech translation in Google Meet. This cross-platform approach differentiates it from more siloed offerings.
  • Google introduces next-generation conversational agents with a low-code visual builder supporting 40+ languages, powered by the latest Gemini models for natural voice interactions and deep enterprise integration. 
  • Early adopters like Commerzbank report 70% inquiry resolution rates, and Mercari projects 500% ROI through 20% workload reduction.
  • The announcement includes new developer tools like Gemini CLI (1M+ developers in 3 months) with extensions from Atlassian, GitLab, MongoDB, and others, plus industry protocols for agent interoperability (A2A), payments (AP2), and model context (MCP). 
  • This creates a foundation infrastructure for an agent economy where developers can monetize specialized agents.
  • Google’s partner ecosystem includes 100,000+ partners with expanded integrations for Box, Salesforce, ServiceNow, and deployment support from Accenture, Deloitte, and others. 
  • The company also launches Google Skills training platform and GEAR program to train 1 million developers, addressing the critical skills gap in enterprise AI adoption.

35:01 📢 Justin – “I think both Azure and Amazon have similar problems; they are rushing so fast to make products, that they’re creating the same products over and over again, just with slightly different limitations or use cases.” 

36:05 Introducing LLM-Evalkit | Google Cloud Blog

  • Google releases LLM-Evalkit, an open-source framework that centralizes prompt engineering workflows on Vertex AI, replacing the current fragmented approach of managing prompts across multiple documents and consoles.
  • The tool shifts prompt development from subjective testing to data-driven iteration by requiring teams to define specific problems, create test datasets, and establish concrete metrics for measuring LLM performance (a toy version of that loop follows below).
  • LLM-Evalkit features a no-code interface designed to democratize prompt engineering, allowing non-technical team members like product managers and UX writers to contribute to the development process.
  • The framework integrates directly with Vertex AI SDKs and provides versioning, benchmarking, and performance tracking capabilities in a single application, addressing the lack of standardized evaluation processes in current workflows.
  • Available now on GitHub as an open-source project, with additional evaluation features accessible through the Google Cloud console, though specific pricing details are not mentioned in the announcement.
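
The workflow it encourages is easy to sketch: fixed test set, concrete metric, compare prompt versions. A toy stand-in follows; call_llm is a placeholder you’d swap for a real Vertex AI call, and LLM-Evalkit layers versioning and dashboards on top of this idea:

```python
# Toy data-driven prompt comparison. call_llm is a placeholder stand-in;
# swap in a real Vertex AI / Gemini call to use this in practice.
def call_llm(prompt: str) -> str:
    return "positive"  # placeholder response

test_set = [  # small labeled dataset: (input, expected)
    ("The checkout flow was painless.", "positive"),
    ("Support never answered my ticket.", "negative"),
]

prompts = {
    "v1": "Classify the sentiment of: {text}",
    "v2": "Reply with exactly 'positive' or 'negative'. Text: {text}",
}

for name, template in prompts.items():
    hits = sum(
        call_llm(template.format(text=text)).strip() == expected
        for text, expected in test_set
    )
    print(f"{name}: {hits}/{len(test_set)} exact matches")
```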

37:09 📢 Ryan – “Reading through this announcement, it’s solving a problem I had – but I didn’t know I had.” 

38:17 Announcing enhancements to Google Cloud NetApp Volumes | Google Cloud Blog

  • Google Cloud NetApp Volumes now supports iSCSI block storage alongside file storage, enabling enterprises to migrate SAN workloads to GCP without architectural changes. 
  • The service delivers up to 5 GiB/s throughput and 160K IOPS per volume with independent scaling of capacity, throughput, and IOPS.
  • NetApp FlexCache provides local read caches of remote volumes for distributed teams and hybrid cloud deployments. 
  • This allows organizations to access shared datasets with local-like performance across regions, supporting compute bursting scenarios that require low-latency data access.
  • The service now integrates with Gemini Enterprise as a data store for RAG applications, allowing organizations to ground AI models on their secure enterprise data without complex ETL processes. 
  • Data remains governed within NetApp Volumes while being accessible for search and inference workflows.
  • Auto-tiering automatically moves cold data to lower-cost storage at $0.03/GiB for the Flex service level, with configurable thresholds from 2-183 days. Large-capacity volumes now scale from 15TiB to 3PiB with over 21GiB/s throughput per volume for HPC and AI workloads.
  • NetApp SnapMirror enables replication between on-premises NetApp systems and Google Cloud with zero RPO and near-zero RTO. 
  • This positions GCP competitively against AWS FSx for NetApp ONTAP and Azure NetApp Files for enterprise storage migrations.

40:30 📢 Justin – “I have a specific workload that needs storage, that’s shared across boxes, and iSCSI is a great option for that, in addition to other methods you could use that I’m currently using, which have some sharp edges. So I’m definitely going to do some price calculation models. This might be good, because Google has multi-writer files, like EBS-type solutions, but does not have the performance that I need quite yet.”
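
Since Justin mentioned price modeling, here’s the shape of the auto-tiering math. Only the $0.03/GiB cold price comes from the announcement; the hot-tier price below is a made-up placeholder to substitute with current list pricing:

```python
# Back-of-envelope auto-tiering model. Cold price is from the announcement;
# the hot price is a HYPOTHETICAL placeholder - substitute real list pricing.
capacity_gib = 100 * 1024  # 100 TiB dataset
cold_fraction = 0.8        # share of data idle past the tiering threshold
hot_price = 0.20           # $/GiB-month, placeholder
cold_price = 0.03          # $/GiB-month, Flex service level per the post

all_hot = capacity_gib * hot_price
tiered = capacity_gib * ((1 - cold_fraction) * hot_price + cold_fraction * cold_price)
print(f"all-hot: ${all_hot:,.0f}/mo, tiered: ${tiered:,.0f}/mo "
      f"({100 * (1 - tiered / all_hot):.0f}% saved)")
```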

Azure

41:08 GitHub Will Prioritize Migrating to Azure Over Feature Development – The New Stack

  • GitHub is migrating its entire infrastructure from its Virginia data center to Azure within 24 months, with teams being asked to delay feature development to focus on this migration due to capacity constraints from AI and Copilot workloads.
  • The migration represents a significant shift from GitHub’s previous autonomy since Microsoft’s 2018 acquisition, with GitHub losing independence after CEO Thomas Dohmke’s departure and being folded deeper into Microsoft’s organizational structure.
  • Technical challenges include migrating GitHub’s MySQL clusters that run on bare metal servers to Azure, which some employees worry could lead to more outages during the transition period, given recent service disruptions.
  • This positions Azure to capture one of the world’s largest developer platforms as a flagship customer, demonstrating Azure’s ability to handle massive scale workloads while potentially raising concerns among open source developers about tighter Microsoft integration.
  • The move highlights how AI workloads are straining traditional infrastructure, with GitHub citing “existential” needs to scale for AI and Copilot demands, showing how generative AI is forcing major architectural decisions across the industry.

43:17 📢 Ryan – “I just hope the service stays up; it’s so disruptive to my day job when GitHub has issues.” 

43:33 Microsoft 365 services fall over in North America • The Register

  • Microsoft 365 experienced a North American outage on October 9, lasting just over an hour, caused by misconfigured network infrastructure that affected all services, including Teams, highlighting the fragility of centralized cloud services when configuration errors occur.
  • This incident followed another Azure outage where Kubernetes crashes took down Azure Front Door instances, suggesting potential systemic issues with Microsoft’s infrastructure management and configuration processes that enterprise customers should factor into their reliability planning.
  • Users reported that switching to backup circuits restored services, and some attributed issues to AT&T’s network, demonstrating the importance of multi-path connectivity and diverse network providers for mission-critical cloud services.
  • Microsoft’s response involved rerouting traffic to healthy infrastructure and analyzing configuration policies to prevent future incidents, though the lack of detailed root cause information raises questions about transparency and whether customers have sufficient visibility into infrastructure dependencies.
  • The back-to-back outages underscore why organizations need robust disaster recovery plans beyond single cloud providers, as even brief disruptions to productivity tools like Teams can significantly impact business operations across entire regions.

44:17 Introducing Microsoft Agent Framework | Microsoft Azure Blog

  • Microsoft Agent Framework converges the AutoGen research project with Semantic Kernel into a unified open-source SDK for orchestrating multi-agent AI systems, addressing the fragmentation challenge as 80% of enterprises now use agent-based AI, according to PwC.
  • The framework enables developers to build locally and then deploy to Azure AI Foundry with built-in observability, durability, and compliance, while supporting integration with any API via OpenAPI and cross-runtime collaboration through Agent2Agent protocol.
  • Azure AI Foundry now provides unified observability across multiple agent frameworks, including LangChain, LangGraph, and OpenAI Agents SDK, through OpenTelemetry contributions, positioning it as a comprehensive platform compared to AWS Bedrock or GCP Vertex AI’s more limited agent support.
  • Voice Live API reaches general availability, offering a unified real-time speech-to-speech interface that integrates STT, generative AI, TTS, and avatar capabilities in a single low-latency pipeline for building voice-enabled agents.
  • New responsible AI capabilities in public preview include task adherence, prompt shields with spotlighting, and PII detection, addressing McKinsey’s finding that the lack of governance tools is the top barrier to AI adoption.

44:48 📢 Justin – “We continue to be in a world of confusion around agentic things, and out-of-control agentic things.”

45:54 NVIDIA GB300 NVL72: Next-generation AI infrastructure at scale | Microsoft Azure Blog

  • Microsoft deployed the first production cluster with over 4,600 NVIDIA Blackwell Ultra GPUs in GB300 NVL72 rack systems (at 72 GPUs per rack, roughly 64 racks), enabling AI model training in weeks instead of months and supporting models with hundreds of trillions of parameters. 
  • This positions Azure as the first cloud provider to deliver Blackwell Ultra at scale for production workloads.
  • Each ND GB300 v6 VM rack contains 72 GPUs with 130TB/second of NVLink bandwidth and 37TB of fast memory, delivering up to 1,440 PFLOPS of FP4 performance. 
  • The system uses 800 Gbps NVIDIA Quantum-X800 InfiniBand for cross-rack connectivity, doubling the bandwidth of previous GB200 systems.
  • The infrastructure targets frontier AI workloads, including reasoning models, agentic AI systems, and multimodal generative AI, with OpenAI already using these clusters for training and deploying their largest models. 
  • This gives Azure a competitive edge over AWS and GCP in supporting next-generation AI workloads.
  • Azure implemented custom cooling systems using standalone heat exchangers and new power distribution models to handle the high energy density requirements of these dense GPU clusters. 
  • The co-engineered software stack optimizes storage, orchestration, and scheduling for supercomputing scale.
  • While pricing wasn’t disclosed, the scale and specialized nature of these VMs suggest they’ll target enterprise customers and AI research organizations requiring cutting-edge performance for training trillion-parameter models. Azure plans to deploy hundreds of thousands of Blackwell Ultra GPUs globally.

47:24 📢 Ryan – “Pricing isn’t disclosed because it’s the GDP of a small country.” 

48:05 Generally Available: CLI command for migration from Availability Sets and basic load balancer on AKS 

  • Thanks for the timely heads up on this one… 
  • Azure introduces a single CLI command to migrate AKS clusters from deprecated Availability Sets to Virtual Machine Scale Sets before the September 2025 deadline, simplifying what would otherwise be a complex manual migration process.
  • The automated migration upgrades clusters from basic load balancers to standard load balancers, providing improved reliability, zone redundancy, and support for up to 1000 nodes compared to the basic tier’s 100-node limit.
  • This positions Azure competitively with AWS EKS and GCP GKE, which already use more modern infrastructure patterns by default, though Azure’s migration tool reduces the operational burden for existing customers.
  • Organizations running production AKS workloads on Availability Sets should prioritize testing this migration in non-production environments first, as the process involves recreating node pools, which could impact running applications.
  • While the migration itself has no direct cost, customers will see increased charges from standard load balancers (approximately $0.025/hour plus data processing fees) compared to free basic load balancers.

49:01 📢 Ryan – “This is why you drag your feet on getting off of everything.” 

Oracle

49:12 Announcing Dark Mode For The OCI Console

  • Oracle finally joins the dark mode club with OCI Console, following years behind AWS (2017), Azure (2019), and GCP (2020) – a basic UI feature that took surprisingly long for a major cloud provider to implement.
  • The feature allows users to toggle between light and dark themes in the console settings, with Oracle claiming it reduces eye strain and improves battery life on devices – standard benefits that every other cloud provider has been touting for years.
  • Dark mode persists across browser sessions and devices when logged into the same OCI account, though Oracle hasn’t specified if this preference syncs across different OCI regions or tenancies.
  • While this is a welcome quality-of-life improvement for developers working late hours, it highlights Oracle’s ongoing challenge of playing catch-up on basic console features that competitors have long considered table stakes.
  • The rollout appears to be gradual with no specific timeline mentioned, and Oracle provides no details about API or CLI theme preferences, suggesting this is purely a web console enhancement.

Closing

And that is the week in the cloud! Visit our website, the home of the Cloud Pod, where you can join our newsletter, Slack team, send feedback, or ask questions at theCloudPod.net or tweet at us with the hashtag #theCloudPod
