Using Event Grid and Function App to Change VM Size when Capacity or Zone Allocation Issues Occur with Azure Virtual Machines or AVD.

In today’s fast-paced business environment, efficiency and responsiveness are crucial for productivity and user satisfaction. Proactively managing virtual machines (VMs) helps prevent issues before they reach end users, so they have the resources they need to stay productive.

By combining Azure Event Grid with a Function App, we can automatically resize VMs when a specific SKU or availability zone runs out of capacity. This proactive approach can reduce, or possibly eliminate, the need for users to report virtual machines that fail to start because of a zone or SKU capacity issue, which minimizes downtime and maximizes productivity.

Below is how you can use Event Grid and a Function App to watch for ZonalAllocationFailed or insufficient-capacity errors and automatically resize and start the affected virtual machine or Azure Virtual Desktop session host.

First, you will need to create a Function App and a Function. The script below is PowerShell, so the Function App needs to use the PowerShell runtime.

Create a Function and select the Azure Event Grid trigger.

Specify a Function name. I try to keep the Function name and the Event Grid subscription name the same so it’s easier for everyone on your team to follow.

After the Function is created, copy and paste the script below into the Function. Check for formatting issues and adjust the tags, SKUs, and -WhatIf switches as needed. The script won’t make any changes until the -WhatIf switches are removed.

param(
    [Parameter(Mandatory=$true)]
    [object]$eventGridEvent,

    [Parameter(Mandatory=$true)]
    [object]$TriggerMetadata
)

# Import necessary modules
Import-Module Az.Accounts -ErrorAction Stop
Import-Module Az.Compute -ErrorAction Stop
Import-Module Az.Resources -ErrorAction Stop

# Log the incoming event
Write-Host "Event received:"
$eventGridEvent | ConvertTo-Json -Depth 3 | Write-Host

function Get-ResourceGroupName {
    param (
        [string]$resourceId
    )
    return ($resourceId -split "/")[-5]
}

function Get-VMName {
    param (
        [string]$resourceId
    )
    return ($resourceId -split "/")[-1]
}

function Handle-VMStartFailure {
    param (
        [string]$resourceId
    )

    Write-Host ("Handling VM start failure for resource: " + $resourceId)

    # Retrieve existing tags
    $resource = Get-AzResource -ResourceId $resourceId -ErrorAction Stop
    $existingTags = $resource.Tags
    if (-not $existingTags) {
        $existingTags = @{}
    }

    Write-Host ("Existing tags: " + ($existingTags | ConvertTo-Json))
    # Tag the VM so a later script can change it back to its standard size once the capacity issue clears.
    $existingTags['VMSizeAllocationFailure'] = 'True'
    
    # Update tags
    Try {
        Write-Host ("Updating tags for resource: " + $resourceId)
        Update-AzTag -ResourceId $resourceId -Tag $existingTags -Operation Merge -WhatIf -ErrorAction Stop
        Write-Host ("WhatIf: Successfully updated tags for resource: " + $resourceId)
    }
    Catch {
        $ErrorMessage = $_.Exception.Message
        Write-Host ('Error assigning failure tag: ' + $ErrorMessage)
    }

    $resourceGroupName = Get-ResourceGroupName -resourceId $resourceId
    $vmName = Get-VMName -resourceId $resourceId

    $vm = Get-AzVM -ResourceGroupName $resourceGroupName -Name $vmName -ErrorAction Stop
    $currentSku = $vm.HardwareProfile.VmSize
    $skuList = @()

    if ($currentSku -eq "Standard_D16ads_v5" -or $currentSku -eq "Standard_D16as_v4") {
        $skuList = @("Standard_D16ds_v5")
    } else {
        $skuList = @("Standard_E4s_v5", "Standard_D4s_v5", "Standard_D4s_v4")
        # Add more SKUs here if needed, for example:
        # "Standard_E4s_v5", "Standard_D4s_v5", "Standard_E4s_v4", "Standard_D4s_v4", "Standard_D4ads_v5"
    }

    Write-Host ("Current SKU: " + $currentSku)
    Write-Host ("Attempting to change SKU to one of: " + ($skuList -join ", "))

    foreach ($sku in $skuList) {
        # Change VM size and start the VM
        Try {
            Write-Host ("Deallocating VM: " + $vm.Name)
            Stop-AzVM -ResourceGroupName $resourceGroupName -Name $vmName -Force -WhatIf -ErrorAction Stop
            Write-Host ("WhatIf: Deallocated VM: " + $vm.Name)
            
            Write-Host ("Changing VM size to: " + $sku)
            $vm.HardwareProfile.VmSize = $sku
            Update-AzVM -ResourceGroupName $resourceGroupName -VM $vm -WhatIf -ErrorAction Stop
            Write-Host ("WhatIf: Changed VM size to: " + $sku)
            
            Write-Host ("Starting VM: " + $vm.Name)
            Start-AzVM -ResourceGroupName $resourceGroupName -Name $vm.Name -WhatIf -ErrorAction Stop
            Write-Host ("WhatIf: Successfully started VM with SKU: " + $sku)
            break
        }
        Catch {
            $ErrorMessage = $_.Exception.Message
            Write-Host ('Error changing VM size and starting the VM with SKU ' + $sku + ': ' + $ErrorMessage)
            if ($sku -eq $skuList[-1]) {
                Write-Host ('Failed to start VM with all SKUs attempted.')
            }
        }
    }
}

# Check if VM start action failed
if ($eventGridEvent.data.status -eq "Failed" -and $eventGridEvent.data.authorization.action -eq "Microsoft.Compute/virtualMachines/start/action") {
    $resourceId = $eventGridEvent.data.resourceUri
    Write-Host ("Event indicates a VM start failure for resource: " + $resourceId)

    # Check if properties exist
    if ($null -ne $eventGridEvent.data.properties) {
        # Check if statusMessage exists
        if ($null -ne $eventGridEvent.data.properties.statusMessage) {
            # Parse the status message
            $statusMessage = $eventGridEvent.data.properties.statusMessage | ConvertFrom-Json

            # Check for specific failure reasons
            $errorDetails = $statusMessage.error.details
            $allocationFailed = $false
            foreach ($detail in $errorDetails) {
                if ($detail.code -eq "ZonalAllocationFailed" -or $detail.message -like "*insufficient capacity*") {
                    $allocationFailed = $true
                    break
                }
            }

            if ($allocationFailed) {
                Write-Host ("Failure due to ZonalAllocationFailed or insufficient capacity.")
                Handle-VMStartFailure -resourceId $resourceId
            } else {
                Write-Host ("Failure due to reasons other than ZonalAllocationFailed or insufficient capacity.")
            }
        } else {
            Write-Host ("statusMessage is null or not present in the event data.")
        }
    } else {
        Write-Host ("properties field is null or not present in the event data.")
        # Handle VM start failure without relying on properties field
        Handle-VMStartFailure -resourceId $resourceId
    }
} else {
    Write-Host ("Event does not match VM start failure criteria.")
}
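
Before wiring up Event Grid, it can be useful to exercise the matching logic locally against a hand-built event. The sketch below is only an illustration under stated assumptions: Test-AllocationFailure is a hypothetical helper (not part of the script above) that mirrors the script's checks, and the sample payload contains only the fields the script reads, with made-up values. A real event carries more fields. Unlike the script, which still attempts a resize when the properties field is missing, this helper only reports whether an allocation failure was explicitly found.

```powershell
# Hypothetical helper that mirrors the checks at the bottom of the script.
function Test-AllocationFailure {
    param (
        [Parameter(Mandatory = $true)]
        [object]$EventGridEvent
    )

    $data = $EventGridEvent.data

    # Only failed VM start actions are of interest.
    if ($data.status -ne 'Failed' -or
        $data.authorization.action -ne 'Microsoft.Compute/virtualMachines/start/action') {
        return $false
    }

    # Nothing to inspect if the status message is missing.
    if ($null -eq $data.properties -or $null -eq $data.properties.statusMessage) {
        return $false
    }

    # statusMessage is a JSON string, so parse it before reading error details.
    $statusMessage = $data.properties.statusMessage | ConvertFrom-Json
    foreach ($detail in $statusMessage.error.details) {
        if ($detail.code -eq 'ZonalAllocationFailed' -or
            $detail.message -like '*insufficient capacity*') {
            return $true
        }
    }
    return $false
}

# Made-up status message containing one allocation failure detail.
$statusJson = @{
    error = @{
        code    = 'ResourceOperationFailure'
        details = @(
            @{ code = 'ZonalAllocationFailed'; message = 'Allocation failed (illustrative text).' }
        )
    }
} | ConvertTo-Json -Depth 5

# Made-up event with only the fields the script reads.
$sampleEvent = [pscustomobject]@{
    data = [pscustomobject]@{
        status        = 'Failed'
        resourceUri   = '/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-demo/providers/Microsoft.Compute/virtualMachines/vm-demo'
        authorization = [pscustomobject]@{ action = 'Microsoft.Compute/virtualMachines/start/action' }
        properties    = [pscustomobject]@{ statusMessage = $statusJson }
    }
}

Test-AllocationFailure -EventGridEvent $sampleEvent

# The script's helper functions index the split resource ID from the end.
$parts = $sampleEvent.data.resourceUri -split '/'
$parts[-5]   # resource group segment
$parts[-1]   # VM name segment
```

Changing the sample's status, action, or detail code is a quick way to confirm which events the script would act on and which it would skip.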

Click Save when finished.
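
Since the script ships with -WhatIf on every change, here is a minimal sketch of what that switch does. Set-DemoVmSize is a hypothetical stand-in for a cmdlet such as Update-AzVM and touches nothing in Azure. It only shows that a command supporting -WhatIf reports the action and skips it, without raising an error.

```powershell
# Hypothetical stand-in for a cmdlet that changes a resource; nothing in Azure is touched.
function Set-DemoVmSize {
    [CmdletBinding(SupportsShouldProcess = $true)]
    param (
        [Parameter(Mandatory = $true)]
        [hashtable]$Vm,

        [Parameter(Mandatory = $true)]
        [string]$Size
    )

    # Under -WhatIf, ShouldProcess returns $false and prints a "What if:" message instead.
    if ($PSCmdlet.ShouldProcess($Vm.Name, "Resize to $Size")) {
        $Vm.Size = $Size
    }
}

$demoVm = @{ Name = 'vm-demo'; Size = 'Standard_D4s_v5' }

# With -WhatIf the size is left unchanged.
Set-DemoVmSize -Vm $demoVm -Size 'Standard_E4s_v5' -WhatIf
$demoVm.Size

# Without -WhatIf the change is applied.
Set-DemoVmSize -Vm $demoVm -Size 'Standard_E4s_v5'
$demoVm.Size
```

Because a -WhatIf call does not fail, the script's dry run logs success for the first SKU in the list and leaves the loop. The fallback to later SKUs only comes into play once the switches are removed and a real resize or start attempt throws.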

Create Event Grid Event Subscription

Event Subscriptions listen for events emitted by the topic resource and send them to the endpoint resource.

Go to your Subscription and click on Events.

Click on Event Subscription.

Specify the Name.

Specify the Resource and the System Topic Name. For more information on system topics, see “System topics in Azure Event Grid” on Microsoft Learn.

For the event type, choose Resource Action Failure only.

Click Endpoint Type drop down and select Function App.

Select the Resource Group, Function App and Function.

On the Filters tab, add an advanced filter with these settings:

Key: data.authorization.action

Operator: String contains

Value: Microsoft.Compute/virtualMachines/start/action

Click Create when finished.
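
For anyone who prefers to keep the subscription settings in source control, the portal choices above can be written down as data. The sketch below builds the filter as a PowerShell hashtable and prints it as JSON. The property names follow the filter section of an Event Grid event subscription in an ARM template as I understand it, and Microsoft.Resources.ResourceActionFailure is assumed to be the event type behind the portal's Resource Action Failure option, so treat the output as a starting point to check against the current schema rather than a finished template.

```powershell
# Portal settings expressed as data; property names assume the ARM event subscription filter shape.
$filter = [ordered]@{
    includedEventTypes = @('Microsoft.Resources.ResourceActionFailure')
    advancedFilters    = @(
        [ordered]@{
            operatorType = 'StringContains'
            key          = 'data.authorization.action'
            values       = @('Microsoft.Compute/virtualMachines/start/action')
        }
    )
}

# -Depth is raised so the nested filter entry is not truncated.
$filterJson = $filter | ConvertTo-Json -Depth 5
$filterJson
```

The filter narrows delivery to failed VM start actions, so the function is not invoked for every failed operation in the subscription. The script still re-checks the action itself, which keeps it safe if the filter is changed later.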

If zone or capacity issues do occur, you can see what the function did by opening its logs and switching the log view to Filesystem logs.
