Welcome to episode 328 of The Cloud Pod, where the forecast is always cloudy! Justin, Ryan, and Matt are on board today to bring you all the latest news in cloud and AI, including secret regions (this one has the aliens), ongoing discussions between Microsoft and OpenAI, and updates to Nova, SQL, and OneLake – and even the latest installment of Cloud Journey. Let’s get started!
Titles we almost went with this week
- ✍️ CloudWatch’s New Feature: Because Nobody Likes Writing Incident Reports at 3 AM
- ⚰️ DNS: Did Not Survive – The Great US-EAST-1 Outage of 2025
- 🔍 404 DevOps Not Found: The AWS Automation Adventure
- 🤖 When Your DevOps Team Gets Replaced by AI and Then Everything Crashes
- 🚛 Database Migrations Get the ChatGPT Treatment: Just Vibe Your Schema Changes
- 🧑💻 AWS DevOps Team Gets the AI Treatment: 40% Fewer Humans, 100% More Questions
- 💔 Breaking Up is Hard to Compute: Microsoft and OpenAI Redefine Their Relationship
- 🪦 AWS Goes Full Scope: Now Tracking Your Cloud’s Carbon from Cradle to Gate
- 🧑🔬 Platform Engineering: When Your Golden Path Leads to a Dead End
- 🏁 DynamoDB’s DNS Disaster: How a Race Condition Raced Through AWS
- 🖥️ AI Takes Over AWS DevOps Jobs, Servers Take Unscheduled Vacation
- ☕ PostgreSQL Scaling Gets a 30-Second Makeover While AWS Takes a Coffee Break
- 🫳 The Domino Effect: When DynamoDB Drops, Everything Drops
- 📑 RAG to Riches: Amazon Nova Learns to Cite Its Sources
- 🏺 AWS Finally Tells You When Your EC2 Instance Can’t Keep Up With Your Storage Ambitions
- 💭 AWS Nova Gets Grounded: No More Hallucinating About Reality
- 🚣 One API to Rule Them All: OneLake’s Storage Compatibility Play
- 💸 OpenAI gets to pay Alimony
- 📊 Database schema deployments are totally a vibe
- 🌳 AWS will tell you how not green you are today, now in 3 scopes
General News
02:00 DDoS in September | Fastly
- Fastly‘s September DDoS report reveals a notable 15.5 million requests per second attack that lasted over an hour, demonstrating how modern application-layer attacks can sustain extreme throughput with real HTTP requests rather than simple pings or amplification techniques.
- Attack volume in September dropped to 61% of August levels, with data suggesting a correlation between school schedules and attack frequency: lower volumes coincide with school breaks, while higher volumes occur when schools are in session.
- Media & Entertainment companies faced the highest median attack sizes, followed by Education and High Technology sectors, with 71% of September’s peak attack day attributed to a single enterprise media company.
- The sustained 15 million RPS attack originated from a single cloud-provider ASN, using sophisticated daemons that mimicked browser behavior, making detection more challenging than typical DDoS patterns.
- Organizations should evaluate whether their incident response runbooks can handle hour-long attacks at 15+ million RPS, as these sustained high-throughput attacks require automated mitigation rather than manual intervention.
- Listen, we’re not inviting a DDoS attack, but also…we’ll just turn off the website, so there’s that.
AI Is Going Great – Or How ML Makes Money
04:41 Google AI Studio updates: More control, less friction
- Google AI Studio introduces “vibe coding” – a new AI-powered development experience that generates working multi-modal apps from natural language prompts without requiring API key management or manual service integration.
- The platform now automatically connects appropriate models and APIs based on app descriptions, supporting capabilities like Veo for video generation, Nano Banana for image editing, and Google Search for source verification.
- New Annotation Mode enables visual app modifications by highlighting UI elements and describing changes in plain language rather than editing code directly
- The updated App Gallery provides visual examples of Gemini-powered applications with instant preview, starter code access, and remix capabilities for rapid prototyping
- Users can add personal API keys to continue development when free-tier quotas are exhausted, with automatic switching back to the free tier upon renewal.
- Are you a visual learner? You can check out their YouTube tutorial playlist here.
05:39 📢 Justin – “So, there are still API keys – they made it sound like there wasn’t, but there is. You just don’t have to manage them until you’ve consumed your free tier.”
09:35 OpenAI takes aim at Microsoft 365 Copilot • The Register
- OpenAI launched “company knowledge” for ChatGPT Business, Enterprise, and Edu plans, enabling direct integration with corporate data sources, including Slack, SharePoint, Google Drive, Teams, and Outlook; notably excluding OneDrive, which could impact Microsoft-heavy organizations.
- The feature requires manual activation for each conversation and lacks capabilities like web search, image generation, or graph creation when enabled, unlike Microsoft 365 Copilot‘s deeper integration across Office applications.
- ChatGPT Business pricing at $25/user/month undercuts Microsoft 365 Copilot’s $30/month fee, potentially offering a more cost-effective enterprise AI assistant option with stronger brand recognition. (5 bucks is 5 bucks, right?)
- Security implementation includes individual authentication per connector, encryption of all data, no training on corporate data, and an Enterprise Compliance API for conversation log review and regulatory reporting.
- Data residency and processing locations vary by connector, with no clear documentation from OpenAI, requiring organizations to verify compliance requirements before deployment.
- We kind of think we’ve heard of this before…
11:05 📢 Ryan – “And it’s a huge problem. It’s been a huge problem that people have been trying to solve for a long time.”
14:23 The next chapter of the Microsoft–OpenAI partnership – The Official Microsoft Blog
- Welp, the divorce has reached a (sort of) amicable alimony agreement.
- Microsoft and OpenAI have restructured their partnership, with Microsoft now holding an approximately 27% stake – valued at roughly $135 billion – in OpenAI’s new public benefit corporation, while maintaining exclusive Azure API access and IP rights until AGI is achieved.
- The agreement introduces an independent expert panel to verify AGI declarations and extends Microsoft’s IP rights for models and products through 2032, including post-AGI models with safety guardrails, though research IP rights expire in 2030 or upon AGI verification, whichever comes first.
- OpenAI gains significant operational flexibility, including the ability to develop non-API products with third parties on any cloud provider, release open weight models meeting capability criteria, and serve US government national security customers on any cloud infrastructure.
- Microsoft can now pursue AGI development on its own or with partners; if it uses OpenAI’s IP to do so before AGI is declared, it must adhere to compute thresholds significantly larger than those used to train today’s leading models.
- OpenAI has committed to purchasing $250 billion in Azure services while Microsoft loses its right of first refusal as OpenAI’s compute provider, signaling a shift toward more independent operations for both companies.
Cont’d: The next chapter of the Microsoft–OpenAI partnership | OpenAI
- Microsoft’s investment in OpenAI is now valued at approximately $135 billion, representing roughly 27% ownership on a diluted basis, while OpenAI transitions to a public benefit corporation structure.
- The partnership introduces an independent expert panel to verify when OpenAI achieves AGI, with Microsoft’s IP rights for models and products extended through 2032, including post-AGI models with safety guardrails.
- OpenAI gains significant flexibility, including the ability to develop non-API products with third parties on any cloud provider, release open weight models meeting capability criteria, and provide API access to US government national security customers on any cloud.
- Microsoft can now independently pursue AGI development alone or with partners, while OpenAI has committed to purchasing an additional $250 billion in Azure services, but Microsoft no longer has the right of first refusal as a compute provider.
- The revenue-sharing agreement continues until AGI verification, but payments will be distributed over a longer timeframe, while Microsoft retains exclusive rights to OpenAI’s frontier models and Azure API exclusivity until AGI is achieved.
15:59 📢 Justin – “‘Once AGI is achieved’ is an interesting choice… I wonder if Microsoft believes that’s gonna happen very soon and OpenAI doesn’t – that’s why they’re willing to agree on that term; it’s interesting. Again, it has to be independently verified by a panel, so OpenAI can’t just come out and say, ‘we’ve created AGI,’ and then end up in a legal dispute – it has to be agreed upon by others. So that’s all very interesting.”
17:45 Build more accurate AI applications with Amazon Nova Web Grounding | AWS News Blog
- AWS announces general availability of Web Grounding for Amazon Nova Premier, a built-in RAG tool that automatically retrieves and cites current web information during inference.
- The feature eliminates the need to build custom RAG pipelines while reducing hallucinations through automatic source attribution and verification.
- Web Grounding operates as a system tool within the Bedrock Converse API, allowing Nova models to intelligently determine when to query external sources based on prompt context.
- Developers simply add nova_grounding to the toolConfig parameter, and the model handles retrieval, integration, and citation of public web sources automatically (see the sketch after this list).
- The feature is currently available only in US East (N. Virginia) for Nova Premier, with the Ohio and Oregon Regions coming soon and support for other Nova models planned.
- Additional costs apply beyond standard model inference pricing, detailed on the Amazon Bedrock pricing page.
- Primary use cases include knowledge-based chat assistants requiring current information, content generation tools needing fact-checking, research applications synthesizing multiple sources, and customer support where accuracy and verifiable citations are essential.
- The reasoning traces in responses allow developers to follow the model’s decision-making process.
- The implementation provides a turnkey alternative to custom RAG architectures, particularly valuable for developers who want to focus on application logic rather than managing complex information retrieval systems while maintaining transparency through automatic source attribution.
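For the curious, here’s a minimal sketch of what that call looks like through boto3’s Converse API. The model ID and the exact system-tool shape for nova_grounding are our best guesses from the announcement, so verify them against the Bedrock documentation before relying on this.

```python
# Hedged sketch: calling Nova Premier with Web Grounding via the Bedrock Converse API.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")  # US East (N. Virginia) only, for now

response = bedrock.converse(
    modelId="us.amazon.nova-premier-v1:0",  # assumed Nova Premier inference profile ID
    messages=[{
        "role": "user",
        "content": [{"text": "What did AWS say caused the recent DynamoDB outage? Cite sources."}],
    }],
    toolConfig={
        # nova_grounding is exposed as a built-in system tool; the model decides when
        # to pull in current web content and returns citations with the generated text.
        "tools": [{"systemTool": {"name": "nova_grounding"}}],
    },
)

# Standard Converse output shape; citations and reasoning traces arrive as additional
# content blocks in the assistant message.
for block in response["output"]["message"]["content"]:
    print(block)
```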
18:36 📢 Justin – “This is the first time I’ve heard anything about Nova in months, so, good to know?”
Cloud Tools
19:34 Introducing AI-Powered Database Migration Authoring | Harness
- Harness introduces AI-powered database migration authoring that lets developers describe schema changes in plain English, like “create a table named animals with columns for genus_species,” and automatically generates production-ready SQL migrations with rollback scripts and Git integration.
- The tool addresses the “AI Velocity Paradox” – 63% of organizations ship code faster with AI, but 72% have suffered production incidents from AI-generated code – by extending AI automation to database changes, which remain a manual bottleneck in most CI/CD pipelines.
- Built on Harness’s Software Delivery Knowledge Graph and MCP Server, it analyzes current schemas, generates backward-compatible migrations, validates for compliance, and integrates with existing policy-as-code governance – making it more than just a generic SQL generator.
- Database DevOps is one of Harness’s fastest-growing modules, with customers like Athenahealth reporting they saved months of engineering effort compared to Liquibase Pro or homegrown solutions while getting better governance and visibility.
- This positions databases as first-class citizens in CI/CD pipelines rather than the traditional midnight deployment bottleneck, allowing DBAs to maintain oversight through automated approvals while developers can finally move database changes at DevOps speed.
20:44 📢 Ryan – “Given how hard this is for humans to do, I look forward to AI doing this better.”
AWS
21:38 Amazon Allegedly Replaced 40% of AWS DevOps With AI Days Before Crash
- An unverified report claims Amazon replaced 40% of AWS DevOps staff with AI systems capable of automatically fixing IAM permissions, rebuilding VPC configurations, and rolling back failed Lambda deployments, just days before the widely reported US-EAST-1 outage.
- AWS has not confirmed this, however, and skepticism remains high.
- The timing coincides with a recent AWS outage that impacted major services, including Snapchat, McDonald’s app, Roblox, and Fortnite, raising questions about automation’s role in system reliability and incident response.
- AWS officially laid off hundreds of employees in July 2025 (and more just recently), but the alleged 40% DevOps reduction would represent a significant shift toward AI-driven infrastructure management if true.
- The incident highlights growing concerns about cloud service concentration risk, as both this AWS outage and the 2024 CrowdStrike incident demonstrate how single points of failure can impact thousands of businesses globally.
- For AWS customers, this raises practical questions about the balance between automation efficiency and human oversight in critical infrastructure operations, particularly for disaster recovery and complex troubleshooting scenarios.
22:19 📢 Justin – “In general, Amazon has been doing a lot of layoffs. There’s been a lot of brain drain. I don’t know that they’ve automated 40% of the DevOps staff with AI systems…so this one seems a little rumor-y and speculative, but I did find it fun that people were trying to blame AI for Amazon’s woes last week.”
24:41 Summary of the Amazon DynamoDB Service Disruption in Northern Virginia (US-EAST-1) Region
- DynamoDB experienced a 2.5-hour outage in US-EAST-1 due to a race condition in its DNS management system that resulted in empty DNS records, affecting all services dependent on DynamoDB, including EC2, Lambda, and Redshift.
- The cascading failure pattern showed how tightly coupled AWS services are – EC2 instance launches failed for 14 hours because DynamoDB’s outage prevented lease renewals between EC2’s DropletWorkflow Manager and physical servers.
- Network Load Balancers experienced connection errors from 5:30 AM to 2:09 PM due to health check failures caused by EC2’s network state propagation delays, demonstrating how infrastructure dependencies can create extended recovery times.
- AWS has disabled the automated DNS management system globally and will implement velocity controls and improved throttling mechanisms before re-enabling, highlighting the challenge of balancing automation with resilience.
- The incident reveals architectural vulnerabilities in multi-service dependencies – services like Redshift in all regions failed IAM authentication due to hardcoded dependencies on US-EAST-1, suggesting the need for better regional isolation.
26:31 📢 Matt – “It’s a good write-up to show that look, even these large cloud providers that have these massive systems and have redundancy upon redundancy upon redundancy – it’s all software under the hood. Software will eventually have a bug in it. And this just happens to be a really bad bug that took down half the internet.”
28:30 Amazon CloudWatch introduces interactive incident reporting
- CloudWatch now automatically generates post-incident analysis reports by correlating telemetry data, investigation inputs, and actions taken during an investigation, reducing report creation time from hours to minutes.
- Reports include executive summaries, event timelines, impact assessments, and actionable recommendations, helping teams identify patterns and implement preventive measures for better operational resilience.
- The feature integrates directly with CloudWatch investigations, capturing operational telemetry and service configurations automatically without manual data collection or correlation.
- Currently available in 12 AWS regions, including US East, Europe, and Asia Pacific, with no specific pricing mentioned – likely included in existing CloudWatch investigation costs.
- This addresses a common pain point where teams spend significant time manually creating incident reports instead of focusing on root cause analysis and prevention strategies.
31:00 Customer Carbon Footprint Tool Expands: Additional emissions categories including Scope 3 are now available | AWS News Blog
- AWS Customer Carbon Footprint Tool now includes Scope 3 emissions data covering fuel/energy-related activities, IT hardware lifecycle emissions, and building/equipment impacts, giving customers a complete view of their carbon footprint beyond just direct operational emissions.
- The tool provides both location-based and market-based emission calculations with 38 months of historical data recalculated using the new methodology, accessible through the AWS Billing console with CSV export and integration options for QuickSight visualization.
- Scope 3 emissions are amortized over asset lifecycles (6 years for IT hardware, 50 years for buildings) to fairly distribute embodied carbon across operational lifetime, with all calculations independently verified following GHG Protocol standards.
- Early access customers like Salesforce, SAP, and Pinterest report that the granular regional data and Scope 3 visibility help them move beyond industry averages to make targeted carbon reduction decisions based on actual infrastructure emissions.
- The tool remains free to use within the AWS Billing and Cost Management console, providing emissions data in metric tons of CO2 equivalent (MTCO2e) to help organizations track progress toward sustainability goals and compliance reporting requirements.
32:45 📢 Matt – “This is a difficult problem to solve. Once you have scope three, it’s all your indirect costs. So, I think if I remember correctly, scope one is your actual server, scope two is power, and then scope three is all the things that have to get included to generate your power and your servers, which includes shipping, et cetera. So getting all that, it’s not an easy task to do. Even when I look at the numbers, I don’t know what these mean half the time when I have to look at them. I’m like, we’re going down. That seems positive.”
33:59 AWS Secret-West Region is now available
- AWS launches Secret-West, its second region capable of handling Secret-level U.S. classified workloads, expanding beyond the existing Secret-East region to provide geographic redundancy for intelligence and defense agencies operating in the western United States.
- The region meets stringent Intelligence Community Directive (ICD) 503 and DoD Security Requirements Guide Impact Level 6 requirements, enabling government agencies to process and analyze classified data with multiple Availability Zones for high availability and disaster recovery.
- This expansion allows agencies to deploy latency-sensitive classified workloads closer to western U.S. operations while maintaining multi-region resiliency, addressing a critical gap in classified cloud infrastructure outside the eastern United States.
- AWS continues to operate in a specialized market segment with limited competition, as few cloud providers can meet the security clearance and infrastructure requirements necessary for Secret-level classification hosting.
- Pricing information is not publicly available due to the classified nature of the service; interested government agencies must contact AWS directly through their secure channels to discuss access and costs.
📢 Agent Coulson – “Welcome to level 7.”
38:24 AWS Transfer Family now supports changing identity provider type on a server
- AWS Transfer Family now allows changing identity provider types (service managed, Active Directory, or custom IdP) on existing SFTP, FTPS, and FTP servers without service interruption, eliminating the need to recreate servers during authentication migrations (a rough sketch of the API call follows this list).
- This feature enables zero-downtime authentication migrations for organizations transitioning between identity providers or consolidating authentication systems, particularly useful for companies undergoing mergers or updating compliance requirements.
- The capability is available across all AWS regions where Transfer Family operates, with no additional pricing beyond standard Transfer Family costs, which start at $0.30 per protocol per hour.
- Organizations can now adapt their file transfer authentication methods dynamically as business needs evolve, such as switching from basic service-managed users to enterprise Active Directory integration without disrupting ongoing file transfers.
- Implementation details and migration procedures are documented in the Transfer Family User Guide here.
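If the UpdateServer API accepts the identity provider type the way the announcement implies, the switch is a single call. A hedged boto3 sketch, with hypothetical resource IDs, moving a server from service-managed users to a Lambda-backed custom IdP:

```python
# Sketch only: we assume UpdateServer now accepts IdentityProviderType alongside
# IdentityProviderDetails, per the announcement. Verify against the current API
# reference and the Transfer Family User Guide before using in production.
import boto3

transfer = boto3.client("transfer")

transfer.update_server(
    ServerId="s-1234567890abcdef0",     # existing SFTP/FTPS/FTP server (hypothetical ID)
    IdentityProviderType="AWS_LAMBDA",  # e.g., moving off SERVICE_MANAGED authentication
    IdentityProviderDetails={
        # Hypothetical authorizer function that validates users against your IdP.
        "Function": "arn:aws:lambda:us-east-1:123456789012:function:sftp-authorizer",
    },
)
# Existing endpoints and ongoing transfers stay up during the change; still worth
# rehearsing the swap on a non-production server first.
```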
39:26 📢 Ryan – “Any kind of configuration change that requires you to destroy and recreate isn’t fun. I do believe that we should architect for such things and be able to redirect traffic with DNS (which never goes wrong and never causes anyone any problems). But it is terrible when that happens, because even when it works, you’re sort of nervously doing it the entire time.”
40:24 New Amazon CloudWatch metrics to monitor EC2 instances exceeding I/O performance
- AWS introduces Instance EBS IOPS Exceeded Check and Instance EBS Throughput Exceeded Check metrics that return binary values (0 or 1) to indicate when EC2 instances exceed their EBS-optimized performance limits, helping identify bottlenecks without manual calculation.
- These metrics enable automated responses through CloudWatch alarms, such as triggering instance resizing or type changes when I/O limits are exceeded, reducing manual intervention for performance optimization (a rough alarm sketch follows this list).
- Available at no additional cost with 1-minute granularity for all Nitro-based EC2 instances with attached EBS volumes across all commercial AWS Regions, plus the AWS GovCloud (US) and China Regions.
- Addresses a common blind spot where applications experience degraded performance due to exceeding instance-level I/O limits rather than volume-level limits, which many users overlook when troubleshooting. (Yes, we’re all guilty of this.)
- Particularly useful for database workloads and high-throughput applications where understanding whether the bottleneck is at the instance or volume level is critical for right-sizing decisions.
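Since the new checks are plain CloudWatch metrics that flip between 0 and 1, wiring them into an alarm is straightforward. A hedged boto3 sketch follows; the namespace and metric name are assumptions based on the announcement wording, so confirm the exact names in the CloudWatch console first.

```python
# Sketch: alarm when an instance repeatedly hits its EBS-optimized IOPS ceiling.
# The metric is binary (0 or 1) each minute, so "Maximum > 0 for 5 consecutive
# periods" means the instance exceeded its limit for five straight minutes.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="orders-db-host-ebs-iops-exceeded",
    Namespace="AWS/EC2",                # assumed namespace for the new checks
    MetricName="EBSIOPSExceededCheck",  # assumed name for "Instance EBS IOPS Exceeded Check"
    Dimensions=[{"Name": "InstanceId", "Value": "i-0abc1234567890def"}],  # hypothetical instance
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:rightsizing-alerts"],  # hypothetical topic
)
```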
41:20 📢 Matt – “This would have solved a lot of headaches when GP3 came out…”
GCP
43:53 A practical guide to Google Cloud’s Parameter Manager | Google Cloud Blog
- Google Cloud Parameter Manager provides centralized configuration management that separates application settings from code, supporting JSON, YAML, and unformatted data with built-in format validation for JSON and YAML types
- The service integrates with Secret Manager through a __REF__ syntax that allows parameters to securely reference secrets like API keys and passwords, with regional compliance enforcement ensuring secrets can only be referenced by parameters in the same region (see the sketch after this list).
- Parameter Manager uses versioning for configuration snapshots, enabling safe rollbacks and preventing unintended breaking changes to deployed applications while supporting use cases like A/B testing, feature flags, and regional configurations
- Both Parameter Manager and Secret Manager offer monthly free tiers, though specific pricing details aren’t provided in the announcement; the service requires granting IAM permissions for parameters to access referenced secrets
- Key benefits include eliminating hard-coded configurations, supporting multi-region deployments with region-specific settings, and enabling dynamic configuration updates without code changes for applications across various industries
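Here’s a rough illustration of the __REF__ pattern from the post. The reference-string quoting and resource paths are from memory of the docs (treat them as assumptions), and remember the parameter’s service identity needs Secret Manager accessor permission on the referenced secret.

```python
# Sketch: a JSON parameter whose sensitive field is a reference into Secret Manager,
# so the plaintext API key never lives in the parameter itself. The __REF__ format
# shown here is an assumption; check the Parameter Manager docs for the canonical syntax.
import json

app_config = {
    "log_level": "INFO",
    "feature_flags": {"new_checkout": True},
    # Resolved when you request the *rendered* parameter version; requires the parameter
    # and the secret to be in the same region.
    "payments_api_key": '__REF__("//secretmanager.googleapis.com/projects/my-project/secrets/payments-key/versions/latest")',
}

# Store this as a new version of a JSON-format parameter (console, gcloud, or API),
# then have the app fetch the rendered version at startup instead of baking config
# into the container image.
print(json.dumps(app_config, indent=2))
```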
44:22 📢 Justin – “I’m a very heavy user of Parameter Store on AWS. I love it, and you should all use it for any of your dynamic configuration, especially if you’re moving containers between environments. This is the bee’s knees, in my opinion.”
49:39 Cross-Site Interconnect, now GA, simplifies L2 connectivity | Google Cloud Blog
- Cross-Site Interconnect is now GA, providing managed Layer 2 connectivity between data centers using Google’s global network infrastructure, eliminating the need for complex multi-vendor setups and reducing capital expenditures for WAN connectivity.
- The service offers consumption-based pricing with no setup fees or long-term commitments, allowing customers to scale bandwidth dynamically and pay only for what they use, though specific pricing details weren’t provided in the announcement.
- Built on Google’s 3.2 million kilometers of fiber and 34 subsea cables (and you know how much we love a good undersea cable), Cross-Site Interconnect provides a 99.95% SLA that includes protection against cable cuts and maintenance windows, with automatic failover and proactive monitoring across hundreds of Cloud Interconnect PoPs.
- Financial services and telecommunications providers are early adopters, with Citadel reporting stable performance during their pilot program, highlighting use cases for low-latency trading, disaster recovery, and dynamic bandwidth augmentation for AI/ML workloads.
- As a transparent Layer 2 service, it enables MACsec encryption between remote routers with customer-controlled keys, while providing programmable APIs for infrastructure-as-code workflows and real-time monitoring of latency, packet loss, and bandwidth utilization.
50:57 📢 Ryan – “I mean, I like this just because of the heavy use of infrastructure as code availability. Some of these deep-down network services across the clouds don’t really provide that; it’s all just sort of click ops or a support case. So this is kind of neat. And I do like that you can dynamically configure this and stand it up / turn it down pretty quickly.”
53:12 Introducing Bigtable tiered storage | Google Cloud Blog
- Bigtable introduces tiered storage that automatically moves data older than a configurable threshold from SSD to infrequent access storage, reducing storage costs by up to 85% while maintaining API compatibility and data accessibility through the same interface.
- The infrequent access tier provides 540% more storage capacity per node compared to SSD-only nodes, enabling customers to retain historical data for compliance and analytics without manual archiving or separate systems.
- Time-series workloads from manufacturing, automotive, and IoT benefit most – sensor data, EV battery telemetry, and factory equipment logs can keep recent data on SSD for real-time operations while moving older data to cheaper storage automatically based on age policies.
- Integration with Bigtable SQL allows querying across both tiers, and logical views enable controlled access to historical data for reporting without full table permissions, simplifying data governance for large datasets.
- Currently in preview with pricing at approximately $0.026/GB/month for infrequent access storage compared to $0.17/GB/month for SSD storage, representing significant savings for organizations storing hundreds of terabytes of historical operational data.
54:31 📢 Ryan – “To illustrate that I’m still a cloud guy at heart, whenever I’m in an application and I’m loading data and I go back – like I want to see a year’s data – and it takes that extra 30 seconds to load, I actually get happy, because I know what they’re doing on the backend.”
56:05 Now Shipping A4X Max, Vertex AI Training and more | Google Cloud Blog
- Google launches A4X Max instances powered by NVIDIA GB300 NVL72 with 72 Blackwell Ultra GPUs and 36 Grace CPUs, delivering 2x network bandwidth compared to A4X and 4x better LLM training performance versus A3 H100-based VMs. The system features 1.4 exaflops per NVL72 system and can scale to clusters twice as large as A4X deployments.
- GKE now supports DRANET (Dynamic Resource Allocation Kubernetes Network Driver) in production, starting with A4X Max, providing topology-aware scheduling of GPUs and RDMA network cards to boost bus bandwidth for distributed AI workloads.
- This improves cost efficiency through better VM utilization by optimizing connectivity between RDMA devices and GPUs.
- GKE Inference Gateway integrates with NVIDIA NeMo Guardrails to add safety controls for production AI deployments, preventing models from engaging in undesirable topics or responding to malicious prompts.
- The integration combines model-aware routing and autoscaling with enterprise-grade security features.
- Vertex AI Model Garden will support NVIDIA Nemotron models as NIM microservices, starting with Llama Nemotron Super v1.5, allowing developers to deploy open-weight models with granular control over machine types, regions, and VPC security policies.
- Vertex AI Training now includes curated recipes built on NVIDIA NeMo Framework and NeMo-RL with managed Slurm environments and automated resiliency features for large-scale model development.
- A4X Max is available in preview through Google Cloud sales representatives and leverages Cluster Director for lifecycle management, topology-aware placement, and integration with Managed Lustre storage.
- Pricing details were not disclosed in the announcement.
57:41 📢 Justin – “That’s a lot of cool hardware stuff that I do not understand.”
Azure
58:38 NVIDIA GB300 NVL72: Next-generation AI infrastructure at scale | Microsoft Azure Blog
- Microsoft deployed the first at-scale production cluster with more than 4,600 NVIDIA Blackwell Ultra GPUs in GB300 NVL72 systems, enabling AI model training in weeks instead of months and supporting models with hundreds of trillions of parameters
- The ND GB300 v6 VMs deliver 1,440 petaflops of FP4 performance per rack with 72 GPUs, 37TB of fast memory, and 130TB/second NVLink bandwidth, specifically optimized for reasoning models, agentic AI, and multimodal generative AI workloads
- Azure implemented 800 Gbps NVIDIA Quantum-X800 InfiniBand networking with full fat-tree architecture and SHARP acceleration, doubling effective bandwidth by performing computations in-switch for improved large-scale training efficiency
- The infrastructure uses standalone heat exchanger units and new power distribution models to handle high-density GPU clusters, with Microsoft planning to scale to hundreds of thousands of Blackwell Ultra GPUs across global datacenters
- OpenAI and Microsoft are already using these clusters for frontier model development, with the platform becoming the standard for organizations requiring supercomputing-scale AI infrastructure (pricing is not specified in the announcement).
59:55 📢 Ryan – “Companies looking for scale – companies with a boatload of money.”
1:00:23 Generally Available: Near-zero downtime scaling for HA-enabled Azure Database for PostgreSQL servers
- Azure Database for PostgreSQL servers with high availability can now scale in under 30 seconds compared to the previous 2-10 minute window, reducing downtime by over 90% for database scaling operations.
- This feature targets production workloads that require continuous availability during infrastructure changes, particularly benefiting e-commerce platforms, financial services, and SaaS applications that cannot afford extended maintenance windows.
- The near-zero downtime scaling works specifically with HA-enabled PostgreSQL instances, leveraging Azure’s high availability architecture to perform seamless compute and storage scaling without disrupting active connections (a rough sketch follows this list).
- While pricing remains unchanged from standard PostgreSQL rates, the reduced downtime translates to lower operational costs by minimizing revenue loss during scaling events and reducing the need for complex maintenance scheduling.
- This enhancement positions Azure PostgreSQL competitively against AWS RDS and Google Cloud SQL, which still require longer downtime windows for similar scaling operations on their managed PostgreSQL offerings.
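For context, the scale operation itself doesn’t change; it’s the same update call you’d make today, just much faster on HA servers. A hedged sketch with the Python management SDK, using hypothetical resource names (the package and model names are worth verifying against the current azure-mgmt-rdbms release):

```python
# Sketch: bump compute on an HA-enabled Azure Database for PostgreSQL flexible server.
# With near-zero downtime scaling, the switch should finish in well under a minute
# instead of the old 2-10 minute window.
from azure.identity import DefaultAzureCredential
from azure.mgmt.rdbms.postgresql_flexibleservers import PostgreSQLManagementClient
from azure.mgmt.rdbms.postgresql_flexibleservers.models import ServerForUpdate, Sku

client = PostgreSQLManagementClient(DefaultAzureCredential(), subscription_id="<subscription-id>")

poller = client.servers.begin_update(
    resource_group_name="prod-rg",   # hypothetical
    server_name="orders-db",         # hypothetical HA-enabled flexible server
    parameters=ServerForUpdate(sku=Sku(name="Standard_D8ds_v4", tier="GeneralPurpose")),
)
poller.result()  # returns once the scale completes; active connections should barely notice
```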
1:01:16 📢 Matt – “They’ve had this for forever on Azure SQL, which is their Microsoft SQL platform, so it doesn’t surprise me. It surprised me more that this was already a two-to-10-minute window to scale. Seems crazy for a production HA service.”
1:02:10 OneLake APIs: Bring your apps and build new ones with familiar Blob and ADLS APIs | Microsoft Fabric Blog | Microsoft Fabric
- OneLake now supports Azure Blob Storage and ADLS APIs, allowing existing applications to connect to Microsoft Fabric’s unified data lake without code changes – just swap endpoints to onelake.dfs.fabric.microsoft.com or onelake.blob.fabric.microsoft.com (see the sketch after this list). What could go wrong?
- This API compatibility eliminates migration barriers for organizations with existing Azure Storage investments, enabling immediate use of tools like Azure Storage Explorer with OneLake while preserving existing scripts and workflows
- The feature targets enterprises looking to consolidate data lakes without rewriting applications, particularly those using C# SDKs or requiring DFS operations for hierarchical data management
- Microsoft provides an end-to-end guide demonstrating open mirroring to replicate on-premises data to OneLake Delta tables, positioning this as a bridge between traditional storage and Fabric’s analytics ecosystem
- No specific pricing mentioned for OneLake API access – costs likely follow standard Fabric capacity pricing model based on compute and storage consumption
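A quick sense of how small the change is in practice: a hedged Python sketch using the existing azure-storage-blob SDK, where only the endpoint moves. In OneLake the Fabric workspace plays the role of the container and items (like a lakehouse) are top-level folders; the workspace and item names below are hypothetical.

```python
# Sketch: point an existing Blob-API client at OneLake by swapping only the account URL.
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://onelake.blob.fabric.microsoft.com",  # was https://<account>.blob.core.windows.net
    credential=DefaultAzureCredential(),  # Entra ID auth; storage account keys/SAS don't apply here
)

# Workspace ~ container, item ~ top-level folder (names are hypothetical).
workspace = service.get_container_client("MyWorkspace")
for blob in workspace.list_blobs(name_starts_with="SalesLakehouse.Lakehouse/Files/"):
    print(blob.name)
```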
Cloud Journey
1:03:47 8 platform engineering anti-patterns | InfoWorld
- Platform engineering initiatives are failing at an alarming rate because teams treat the visual portal as the entire platform rather than building solid backend APIs and orchestration first. The 2024 DORA Report found that dedicated platform engineering teams actually decreased throughput by 8% and change stability by 14%, showing that implementation mistakes have serious downstream consequences.
- The biggest mistake organizations make is copying approaches from large companies like Spotify without considering ROI for their scale. Mid-size companies invest the same effort as enterprises with thousands of developers but see minimal returns, making reference architectures often impractical for solving real infrastructure abstraction challenges.
- Successful platform adoption requires shared ownership where developers can contribute plugins and customizations rather than top-down mandates. Spotify achieves 100% employee adoption of their internal Backstage by allowing engineers to build their own plugins like Soundcheck, proving that developer autonomy drives platform usage.
- Organizations must survey specific user subsets because Java developers, QA testers, and SREs have completely different requirements from an internal developer platform. Tracking surface metrics like onboarded users misses the point when platforms should measurably improve time to market, reduce costs, and increase innovation rather than just showing DORA metrics.
- Simply rebranding operations teams as platform engineering without a cultural shift and product mindset creates more toil than it reduces. Platforms need to be treated as products requiring continuous improvement, user research, internal marketing, and incremental development, starting with basic CI/CD touchpoints rather than attempting to solve every problem on day one.
Closing
And that is the week in the cloud! Visit our website, the home of The Cloud Pod, where you can join our newsletter or Slack team, send feedback, or ask questions at theCloudPod.net – or tweet at us with the hashtag #theCloudPod.