AutoScaling in a Heat Stack

Sometimes we may need to have a stack that can respond when a group of servers are using a lot or little resources, such as memory usage. For example if a group of servers exceed a given memory usage threshold, we want that group of resources to scale up. This documentation will go through autoscaling, and how autoscaling could be implemented in a Heat stack.

To create an autoscaling stack we need:

  • AutoScaling Group: A group of servers defined so that the number of servers in the group and be increased or decreased.
  • Alarms: Alarms created using OpenStack Aodh to monitor the resource usage of the VMs in the autoscaling group. For example, we could create an alarm to monitor memory usage and alarm if the autoscaling group exceeds the alarm’s threshold.
  • Scaling Policies: Policies which are executed when an Aodh Alarm is triggered. When an alarm is triggered, the scaling policy attached to that alarm will instruct the autoscaling group to change in size, either increasing or decreasing the number of VMs.

Heat Resources

This section will cover the resources available in Heat that are required for creating an autoscaling stack.

AutoScaling involves resources from:

  • Heat: For creating an autoscaling group and defining the scaling policies
  • Aodh: For alarm creation
  • Gnocchi: For metrics that are used in threshold alarms

OS::Heat::AutoScalingGroup

This is an autoscaling group which can scale resources. This group can create the desired number of similar resources and we can define the minimum and maximum count for the given resource.

the_resource:
  type: OS::Heat::AutoScalingGroup
  properties:
    #required
    max_size: Integer # maximum number of resources in the group
    min_size: Integer # minimum number of resources in the group
    resource: {...} # resource definition for the resources in the group, written in HOT (Heat Orchestrated Template) format
    #optional
    desired_capacity: Integer # desired initial number of resources
    cooldown: Integer # cool down period in seconds
    rolling_updates: {"min_in_service": Integer, "max_batch_size": Integer, "pause_time": Number} # policy for rolling updates in the group, defaults to: {"min_in_service": 0, "max_batch_size": 1, "pause_time": 0}
      # min_in_service: minimum number of resources in service while rolling updates are executed
      # max_batch_size: maximum number of resources to replace at once
      # pause_time: number of seconds to wait between batches of updates

For example:

autoscaling-group:
  type: OS::Heat::AutoScalingGroup
  properties:
    min_size: 1
    max_size: 3
    resource:
      type: server.yaml #Refers to a Heat Template for creating a VM
      properties:
        flavor: {get_param: flavor}
        image: {get_param: image}
        key_name: {get_param: key_name}
        network: {get_param: network}
        metadata: {"metering.server_group": {get_param: "OS::stack_id"}}

OS::Heat::ScalingPolicy

the_resource:
  type: OS::Heat::ScalingPolicy
  properties:
    # required
    adjustment_type: String # Type of adjustment. Allowed values: “change_in_capacity”, “exact_capacity”, “percent_change_in_capacity”
    auto_scaling_group_id: String # AutoScaling Group ID to apply policy to
    scaling_adjustment: Number # Size of adjustment
    # Optional
    cooldown: Number # cooldown period, in seconds
    min_adjustment_step: Integer # minimum number of resources that are added or removed when the AutoScalingGroup scales up or down. Only used if specifying percent_change_in_capacity for adjustment_type property

For example:

scaleup_policy:
  type: OS::Heat::ScalingPolicy
  properties:
    adjustment_type: change_in_capacity
    auto_scaling_group_id: {get_resource: autoscaling-group}
    cooldown: 60
    scaling_adjustment: 1

scaledown_policy:
  type: OS::Heat::ScalingPolicy
  properties:
    adjustment_type: change_in_capacity
    auto_scaling_group_id: {get_resource: autoscaling-group}
    cooldown: 60
    scaling_adjustment: -1

OS::Aodh::GnocchiAggregationByResourcesAlarm

This resource creates an alarm as an aggregation of resources alarm. This alarm is a threshold alarm monitoring the aggregated metrics of the members of the autoscaling group defined above. Gnocchi provides the metrics which Aodh uses to determine whether an alarm should be triggered.

the_resource:
  type: OS::Aodh::GnocchiAggregationByResourcesAlarm
  properties:
    # required
    metric: String # metric name watched by the alarm
    query: String # query to filter the metrics
    resource_type: String # resource type
    threshold: Number # threshold to evaluate against

    # optional
    aggregation_method: String # method to compare to the threshold
    alarm_actions: [Value, Value, ...] # list of webhooks to invoke when state transitions to alarm
    alarm_queues: [String, String, ...] # list of Zaqar queues to post to when state transitions to alarm
    comparison_operator: String # operator used to compare specified statistic with threshold. Allowed values: “le”, “ge”, “eq”, “lt”, “gt”, “ne”
    description: String # alarm description
    enabled: Boolean # Defaults to true. Determines if alarm evaluation is enabled
    evaluation_periods: Integer # number of periods to evaluate over
    granularity: Integer # time range in seconds
    insufficient_data_actions: [Value, Value, ...] # list of webhooks to invoke when state transitions to insufficient data
    insufficient_data_queues: [String, String, ...] # list of Zaqar queues to post to when state transitions to alarm
    ok_actions: [Value, Value, ...] # list of webhooks to invoke when state transitions to ok
    ok_queues: [String, String, ...] # list of Zaqar queues to post to when state transitions to ok
    repeat_actions: Boolean # Defaults to True. False to trigger actions when the threshold is reached AND the alarm has changed state
    severity: String # severity of alarm. Allowed values: “low”, “moderate”, “critical”
    time_constraints: [{"name": String, "start": String, "description": String, "duration": Integer, "timezone": String}, {"name": String, "start": String, "description": String, "duration": Integer, "timezone": String}, ...] # Describe time constraints for alarm, defaults to []
      # description: description for time constraints
      # duration: duration for time constraint
      # name: name for time constraint
      # start: start time for time constraint. A CRON expression property
      # timezone: Timezone for the time constraint.

For example, for our autoscaling stack we could define the alarms in the following way:

memory_alarm_high:
  type: OS::Aodh::GnocchiAggregationByResourcesAlarm
  properties:
    description: Scale up if memory > 1000 MB
    metric: memory.usage
    aggregation_method: mean
    granularity: 300
    evaluation_periods: 1
    threshold: 1000
    resource_type: instance
    comparison_operator: gt
    query:
      list_join:
        - ''
        - - {'=': {server_group: {get_param: "OS::stack_id"}}}
    alarm_actions:
      - get_attr: [scaleup_policy, signal_url]

memory_alarm_low:
  type: OS::Aodh::GnocchiAggregationByResourcesAlarm
  properties:
    description: Scale down if memory < 200MB
    metric: memory.usage
    aggregation_method: mean
    granularity: 300
    evaluation_periods: 1
    threshold: 200
    resource_type: instance
    comparison_operator: lt
    query:
      list_join:
        - ''
        - - {'=': {server_group : {get_param: "OS::stack_id"}}}
    alarm_actions:
      - get_attr: [scaledown_policy, signal_url]