Understanding State

I want to discuss the basics of the Terraform state. It's not an overly complex subject but it's complex enough that I feel it needs a bit of explaining. I also believe that understanding how the state comes to be, how to store it and how to dissect it makes a person a better Terraform user.

What is State?

It's essentially a database. It contains everything Terraform builds for you and it's used for calculating what has to change whenever you change your code. Once a state file has been created it lives on for as long as your infrastructure does. Terraform uses it for virtually every command you execute.

It's pretty important.

Most database systems store information in a compact binary form. This makes it faster and easier to load into memory and then process. However the state file is a plain text JSON document. This is an interesting point because it means you can read the state file and analyse its contents, either manually or with some tool other than Terraform.

That being said you should absolutely avoid editing the file manually. Making a copy and then reading it is fine. No harm can be done there. But writing to it outside of Terraform is a bad idea indeed.
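To make the "read it, don't write it" point concrete, here's a minimal sketch in Python that only reads a state document and prints its headline facts. The state text is embedded as a string so the example is self-contained; the file name in the comment and the summarise_state helper are my own inventions for illustration, not anything Terraform ships.

```python
import json

# Stand-in for a *copy* of a state file. In practice you might load it with
# json.load(open("terraform.tfstate.copy")) - a hypothetical file name.
STATE_TEXT = """
{
  "version": 4,
  "terraform_version": "0.12.20",
  "serial": 1,
  "lineage": "2a7b84c9-75c0-054a-29df-217910f52a4c",
  "outputs": {},
  "resources": []
}
"""


def summarise_state(state_text):
    """Return the headline facts about a state document without modifying it."""
    state = json.loads(state_text)
    return {
        "format_version": state["version"],
        "terraform_version": state["terraform_version"],
        "serial": state["serial"],
        "lineage": state["lineage"],
        "resource_count": len(state.get("resources", [])),
        "output_names": sorted(state.get("outputs", {})),
    }


summary = summarise_state(STATE_TEXT)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Nothing here writes back to the file, which is exactly the point: analysing a copy is safe, editing the original is not.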

Terraform internally versions the state file using two properties: the serial and the lineage. The serial is a counter - it increments when you make changes to your infrastructure. The lineage property is a unique value that can be used to determine if you're editing the correct state file. This is to prevent a cross over between code and states.

If you're using a remote backend, as we'll be doing below, then the information about that backend is actually the only information stored in the state file on your local disk. My statements above about being able to read the state file go out the window when you store it remotely, or at least an extra step (such as terraform state pull) is required to be able to do so.

Diagram of Flow

This diagram demonstrates a simplified flow of how Terraform code goes from code to state. I'm using an S3 Bucket here because storing state locally isn't generally considered a good idea, nor is storing it alongside the code in a Git repository. So, S3 it is.

Terraform State Basics Flow

  1. We develop some Terraform code for AWS and GCP
  2. We run Terraform against that code
  3. Terraform creates the resources for us
  4. A state file is generated and eventually pushed to S3
  5. And then this state file is used during subsequent Terraform executions

State (File) Life Cycle

When you first start writing some Terraform code your state's life cycle hasn't yet begun. The moment you run terraform apply, however, it springs into life and it's pretty bare bones:

{
  "version": 4,
  "terraform_version": "0.12.20",
  "serial": 1,
  "lineage": "2a7b84c9-75c0-054a-29df-217910f52a4c",
  "outputs": {},
  "resources": []
}

No resources, no outputs. But if we were to define some resources, outputs, etc, then we'd end up with a much more "complete" JSON file.

Once a state file has been created Terraform utilises it for all subsequent executions. This is what enables Terraform to do its job. It's how Terraform builds a Directed Acyclic Graph and decides how your changes should be implemented and what their effect is on the rest of the entries in the state.

What's interesting about the state file is it never ceases to exist until you delete the file itself. Even executing terraform destroy doesn't delete the file. It does result in all the resources being deleted, but the file lives on. It simply contains information about its own life cycle.

Metadata

Above we saw an example state file. There are two key pieces of information inside the file:

  • The serial
  • The lineage

These are interesting pieces of metadata in that they're designed to live on for the life of the state and in fact are designed to protect the state file from bad overwrites made using the terraform state ... sub-commands.

The serial is a counter that keeps track of how many times the state has been computed. Not just changed, but actually interacted with. In the above case it's 1 but when I executed terraform apply without resources in my main.tf file nothing was created, yet the serial still incremented. This tells us the serial field is used to track the number of interactions with the state file. At least that's what you'd think. It's actually for more than that.

After adding some resources to my code I executed terraform apply again and the lineage property never changed. The resources were created, but the lineage, a UUID, never updated. That's because it's not designed to be updated. In fact, it's there for a specific reason: to prevent you from using terraform state push to overwrite the remote state with an unrelated local state.

The lineage property is used to determine if the state that's local to you, the state you're trying to push to the remote backend, shares its ancestry with the remote state. You could say it's literally a "DNA test" between the two states to determine they're related. If the lineage property isn't a match then you're pushing bad state to the remote backend and Terraform will reject the operation (to protect you from yourself).

And the serial property is designed to do a similar thing. It's not just a counter, it seems, but a means of making sure the state you're pushing (assuming it's passed the lineage test) isn't an older version of the remote state. If the local serial is lower than the remote serial then the operation will fail.
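The two checks described above can be sketched as a small Python function. To be clear, this is my own illustration of the lineage and serial rules as I've described them, not Terraform's actual implementation; check_push and the sample states are hypothetical.

```python
def check_push(local, remote):
    """Decide whether `local` may overwrite `remote`.

    A sketch of the two safety checks described in the text; the real
    Terraform logic may differ in its details.
    """
    if remote is None:
        return True, "no remote state yet"
    if local["lineage"] != remote["lineage"]:
        return False, "lineage mismatch: the two states are unrelated"
    if local["serial"] < remote["serial"]:
        return False, "local serial is older than the remote serial"
    return True, "ok"


remote = {"lineage": "2a7b84c9-75c0-054a-29df-217910f52a4c", "serial": 7}

# Hypothetical local states to try against the remote one.
candidates = {
    "same lineage, newer serial": {"lineage": remote["lineage"], "serial": 8},
    "same lineage, older serial": {"lineage": remote["lineage"], "serial": 3},
    "different lineage": {
        "lineage": "00000000-0000-0000-0000-000000000000",
        "serial": 9,
    },
}

for label, local in candidates.items():
    allowed, reason = check_push(local, remote)
    print(f"{label}: {'push' if allowed else 'reject'} ({reason})")
```

The order matters: lineage is tested first (are these two states even related?), and only then does the serial comparison make sense.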

Slicing and Dicing State

Now that we've not only got a good idea of what the state is, but also what it looks like and how it contains protective elements, we can take a look at some of the useful commands an engineer can use for working with a state directly.

Warning

Do note that the use of mv, rm and push (see above) can be dangerous. I recommend you back up your state (file) before you start using these advanced commands.

To begin working with the state using the built-in Terraform commands, we simply use the state sub-command. This sub-command offers us a few options but we'll focus on four of them:

  • list - will list out everything in the state file by its address (for example aws_instance.web)
  • show - can be used with a resource address to print the resource to the console
  • mv - can be used to move a resource{} to another state, module or just rename it
  • rm - can be used to delete items from within the state

Example Code & State

I'm going to be working with a state file created from the following HCL:

provider "aws" {
  region = "ap-southeast-2"
}

resource "aws_vpc" "main" {
  cidr_block = "10.99.0.0/16"
}

resource "aws_subnet" "a" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.99.1.0/24"
  availability_zone = "ap-southeast-2a"
}

resource "aws_subnet" "b" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.99.2.0/24"
  availability_zone = "ap-southeast-2b"
}

data "aws_ami" "ubuntu" {
  most_recent = true

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-*"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }

  owners = ["099720109477"] # Canonical
}

resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t2.micro"
  subnet_id     = aws_subnet.a.id
}

resource "aws_instance" "database" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t2.micro"
  subnet_id     = aws_subnet.b.id
}

Don't think about this code too much. It's just designed to give us a complex (enough) state file we can use our commands on.

Listing the State

All we get from this is a list of the resources that are present in the Terraform state. This can be a quick way of finding the address to a resource which can then be used with show to print out its finer details.

Here is what the list looks like for our example state:

$ terraform state list
data.aws_ami.ubuntu
aws_instance.database
aws_instance.web
aws_subnet.a
aws_subnet.b
aws_vpc.main

More complex states will have more entries, of course, but here we can see the internal (to Terraform) addresses of our resources. We can see more details about each of these.
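Since the state is JSON, those addresses can also be rebuilt outside of Terraform. Below is a rough Python sketch that does so from a trimmed-down, made-up state document. It assumes the version 4 layout, where each entry carries mode, type, name and an optional module field; that layout is internal to Terraform and can change between releases. The module.network entry is hypothetical and only there to show the prefix, and the sketch ignores count/for_each index keys.

```python
import json

# Trimmed, made-up state; only the fields needed for addressing are kept.
STATE_TEXT = """
{
  "version": 4,
  "resources": [
    {"mode": "data", "type": "aws_ami", "name": "ubuntu", "instances": [{}]},
    {"mode": "managed", "type": "aws_instance", "name": "web", "instances": [{}]},
    {"module": "module.network", "mode": "managed", "type": "aws_subnet",
     "name": "a", "instances": [{}]}
  ]
}
"""


def resource_address(resource):
    """Build an address like data.aws_ami.ubuntu from one state entry."""
    parts = []
    if resource.get("module"):
        parts.append(resource["module"])  # already in "module.<name>" form
    if resource.get("mode") == "data":
        parts.append("data")
    parts.extend([resource["type"], resource["name"]])
    return ".".join(parts)


def list_addresses(state):
    """Return the addresses of every entry, in the order the state stores them."""
    return [resource_address(r) for r in state.get("resources", [])]


addresses = list_addresses(json.loads(STATE_TEXT))
for address in addresses:
    print(address)
```

For day-to-day work terraform state list is still the right tool; a sketch like this is only handy when you want to analyse a copy of the state somewhere Terraform isn't installed.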

Show Me The Details

When we know the address of a resource (such as aws_vpc.main) we can then print out the finer details about this resource:

terraform state show aws_instance.web

Note

Did you notice that we didn't include a third segment in the address, like id, arn, or name? They're attributes and aren't required for Terraform to identify the resource you're addressing.

Note

We're not using any modules here other than the default root module. That means everything is at the top level and addressed directly. Should you have any modules in your code you'll see module.module_name.resource_type.name instead of just resource_type.name.

We'll get back HCL output (not JSON). This will contain all the properties for the resource present in the state. With our example state file at hand, we get back these details:

resource "aws_instance" "web" {
    ami                          = "ami-0caee806a0c3782c4"
    arn                          = "arn:aws:ec2:ap-southeast-2:000000000000:instance/i-001f4ac3ee0216ad5"
    associate_public_ip_address  = false
    availability_zone            = "ap-southeast-2a"
    cpu_core_count               = 1
    cpu_threads_per_core         = 1
    disable_api_termination      = false
    ebs_optimized                = false
    get_password_data            = false
    hibernation                  = false
    id                           = "i-001f4ac3ee0216ad5"
    instance_state               = "running"
    instance_type                = "t2.micro"
    ipv6_address_count           = 0
    ipv6_addresses               = []
    monitoring                   = false
    primary_network_interface_id = "eni-0054d40a299d794bd"
    private_dns                  = "ip-10-99-1-97.ap-southeast-2.compute.internal"
    private_ip                   = "10.99.1.97"
    security_groups              = []
    source_dest_check            = true
    subnet_id                    = "subnet-06d86824bf049cccc"
    tags                         = {}
    tenancy                      = "default"
    volume_tags                  = {}
    vpc_security_group_ids       = [
        "sg-0ade53f1e8bf8848e",
    ]

    credit_specification {
        cpu_credits = "standard"
    }

    root_block_device {
        delete_on_termination = true
        encrypted             = false
        iops                  = 100
        volume_id             = "vol-04a40165065506c98"
        volume_size           = 8
        volume_type           = "gp2"
    }
}

A few of these details will vary for obvious reasons, such as the IDs being unique, ARNs being unique, and more.

Note

You'll see properties and values you didn't specify in your code. These are defaults and computed values that the provider and AWS supply to fill in the "gaps".

Such a detailed view can be useful for collecting data such as IP addresses and ARNs, and for checking things are configured as you expect them to be. It's also useful for checking those default properties and values to see if there's anything you'd rather customise.
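If you'd rather collect those values with a script, the same attributes sit under each instance in the state JSON. Here's a small Python sketch, again assuming the internal version 4 layout (an instances list with an attributes object) and using a cut-down state built from the values shown above. The collect helper is hypothetical.

```python
import json

# Cut-down state using values from the example output above.
STATE_TEXT = """
{
  "version": 4,
  "resources": [
    {"mode": "managed", "type": "aws_instance", "name": "web",
     "instances": [{"attributes": {"id": "i-001f4ac3ee0216ad5",
                                   "private_ip": "10.99.1.97"}}]},
    {"mode": "managed", "type": "aws_vpc", "name": "main",
     "instances": [{"attributes": {"id": "vpc-0774b876c33ae71a9",
                                   "cidr_block": "10.99.0.0/16"}}]}
  ]
}
"""


def collect(state, resource_type, attribute):
    """Map each managed resource of `resource_type` to one of its attributes."""
    found = {}
    for resource in state.get("resources", []):
        if resource.get("mode") != "managed":
            continue
        if resource.get("type") != resource_type:
            continue
        address = f'{resource["type"]}.{resource["name"]}'
        for instance in resource.get("instances", []):
            found[address] = instance.get("attributes", {}).get(attribute)
    return found


state = json.loads(STATE_TEXT)
private_ips = collect(state, "aws_instance", "private_ip")
vpc_ids = collect(state, "aws_vpc", "id")
print(private_ips)
print(vpc_ids)
```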

Moving a Resource

Sometimes a resource might be in the wrong "location". Or perhaps you've named it incorrectly internally and would like something to be referred to by a different name. This can be solved with the mv sub-command:

$ terraform state mv aws_subnet.a aws_subnet.az-a
Move "aws_subnet.a" to "aws_subnet.az-a"
Successfully moved 1 object(s).

$ terraform state mv aws_subnet.b aws_subnet.az-b
Move "aws_subnet.b" to "aws_subnet.az-b"
Successfully moved 1 object(s).

Here I'm running two mv operations as I want my internal subnet names to be something a bit more meaningful and easier to work with. A name like "a" on its own didn't mean anything, but az-a tells me the subnet represents the AWS AZ A part of our architecture.

After performing the move, what does the state list now show?

$ terraform state list
data.aws_ami.ubuntu
aws_instance.database
aws_instance.web
aws_subnet.az-a
aws_subnet.az-b
aws_vpc.main

From the perspective of the state the resources have been renamed as we can see above. And what about the code?

$ grep 'aws_subnet' main.tf
resource "aws_subnet" "a" {
resource "aws_subnet" "b" {
  subnet_id     = aws_subnet.a.id
  subnet_id     = aws_subnet.b.id

So it seems our state file was updated, but the HCL (our code) was not. What does a terraform plan look like, then?

  # ...

  # aws_subnet.az-b will be destroyed
  - resource "aws_subnet" "az-b" {
      - arn                             = "arn:aws:ec2:ap-southeast-2:040925562967:subnet/subnet-0a10fcf479842e263" -> null
      - assign_ipv6_address_on_creation = false -> null
      - availability_zone               = "ap-southeast-2b" -> null
      - availability_zone_id            = "apse2-az1" -> null
      - cidr_block                      = "10.99.2.0/24" -> null
      - id                              = "subnet-0a10fcf479842e263" -> null
      - map_public_ip_on_launch         = false -> null
      - owner_id                        = "040925562967" -> null
      - tags                            = {} -> null
      - vpc_id                          = "vpc-0774b876c33ae71a9" -> null
    }

  # aws_subnet.b will be created
  + resource "aws_subnet" "b" {
      + arn                             = (known after apply)
      + assign_ipv6_address_on_creation = false
      + availability_zone               = "ap-southeast-2b"
      + availability_zone_id            = (known after apply)
      + cidr_block                      = "10.99.2.0/24"
      + id                              = (known after apply)
      + ipv6_cidr_block                 = (known after apply)
      + ipv6_cidr_block_association_id  = (known after apply)
      + map_public_ip_on_launch         = false
      + owner_id                        = (known after apply)
      + vpc_id                          = "vpc-0774b876c33ae71a9"
    }

  # ...

Plan: 4 to add, 0 to change, 4 to destroy.

Note

I've omitted output to keep things short and sweet here. That's what the # ... markers are: places where I've trimmed the output.

Or put another way: it wants to delete our subnets az-a and az-b because they "don't exist" in our code, which is technically true, and create a and b because they're missing from the state. This will result in our instances being replaced too, because the subnets will be recreated and their ARNs/IDs will change.

To prevent this all we have to do is update our HCL after we use the mv command. If I rename the subnets to match what I did in the mv command the plan won't be affected at all:

$ terraform plan
# ...

aws_vpc.main: Refreshing state... [id=vpc-0774b876c33ae71a9]
data.aws_ami.ubuntu: Refreshing state...
aws_subnet.az-a: Refreshing state... [id=subnet-06d86824bf049cccc]
aws_subnet.az-b: Refreshing state... [id=subnet-0a10fcf479842e263]
aws_instance.database: Refreshing state... [id=i-0e749fb20039b2efe]
aws_instance.web: Refreshing state... [id=i-001f4ac3ee0216ad5]

------------------------------------------------------------------------

No changes. Infrastructure is up-to-date.
# ...

Now our code matches our state, and the state matches AWS.
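The mismatch we just walked through, names in the code versus names in the state, is easy to spot with a quick comparison. Here's a rough Python sketch that pulls resource addresses out of some HCL text with a regular expression and compares them with a list of state addresses. The regex is a shortcut and not a real HCL parser, and both helper functions are hypothetical.

```python
import re

# Matches lines such as: resource "aws_subnet" "a" {
RESOURCE_BLOCK = re.compile(r'^\s*resource\s+"([^"]+)"\s+"([^"]+)"', re.MULTILINE)

# The resource headers from main.tf before the rename was applied to the code.
HCL_TEXT = '''
resource "aws_vpc" "main" {
}
resource "aws_subnet" "a" {
}
resource "aws_subnet" "b" {
}
resource "aws_instance" "web" {
}
resource "aws_instance" "database" {
}
'''

# What `terraform state list` showed after the two mv operations
# (the data source is left out because it isn't a resource block).
STATE_ADDRESSES = {
    "aws_instance.database",
    "aws_instance.web",
    "aws_subnet.az-a",
    "aws_subnet.az-b",
    "aws_vpc.main",
}


def declared_addresses(hcl_text):
    """Return the set of type.name addresses declared as resource blocks."""
    return {f"{rtype}.{name}" for rtype, name in RESOURCE_BLOCK.findall(hcl_text)}


def compare(declared, in_state):
    """Show which addresses a plan would want to create or destroy."""
    return {
        "only_in_code": sorted(declared - in_state),    # would be created
        "only_in_state": sorted(in_state - declared),   # would be destroyed
    }


drift = compare(declared_addresses(HCL_TEXT), STATE_ADDRESSES)
print(drift)
```

An empty result on both sides is the scripted equivalent of the "No changes" plan above, at least as far as names go; it says nothing about attribute changes.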

Removing Resources from State

Warning

Use this option with extreme care. If you remove one or more items from the state and later realise they do actually exist, or that you need to clean them up, you'll regret this decision.

It's very rare you need to do this, but you can actually remove resources from your state with a simple command:

terraform state rm <address>

The only time I've had to do this was when I knew something that was created by Terraform had been deleted outside of Terraform. Another case was having an entire network of managed objects being updated outside of Terraform and the drift between implementation and code was so great the resources had to be managed manually.

Info

Once you move to an Infrastructure As Code methodology you have to adhere to it at all times. It's an all-or-nothing switch over. You can't have both.

However, removing resources from a state is a simple affair:

$ terraform state rm aws_instance.web
Removed aws_instance.web
Successfully removed 1 resource instance(s).

As you might suspect at this point a terraform plan will reveal that the resource still exists in our code, because these commands don't update our HCL for us (rightly so), so now Terraform wants to create the resource:

Terraform will perform the following actions:

  # aws_instance.web will be created
  + resource "aws_instance" "web" {
      # ...
    }

Plan: 1 to add, 0 to change, 0 to destroy.

Sure. OK. Let's go ahead and apply that change:

$ terraform apply -auto-approve
aws_vpc.main: Refreshing state... [id=vpc-0774b876c33ae71a9]
data.aws_ami.ubuntu: Refreshing state...
aws_subnet.az-b: Refreshing state... [id=subnet-0a10fcf479842e263]
aws_subnet.az-a: Refreshing state... [id=subnet-06d86824bf049cccc]
aws_instance.database: Refreshing state... [id=i-0e749fb20039b2efe]
aws_instance.web: Creating...
aws_instance.web: Still creating... [10s elapsed]
aws_instance.web: Still creating... [20s elapsed]
aws_instance.web: Still creating... [30s elapsed]
aws_instance.web: Creation complete after 33s [id=i-0d9728ba499e06a26]

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

Great! So we have two EC2 instances:

$ aws ec2 describe-instances | jq '.Reservations | length'
3

Actually we have three instances now. What happened?

  1. We deleted the instance from our state;
  2. We told Terraform to apply the HCL we had on disk;
  3. Terraform created a new instance for us;
  4. Now we have an orphaned instance.

We might have deleted the instance from our state but we didn't delete it from EC2. Let's do that now, but let's also use the knowledge we've learned so far to make sure we're deleting the correct instance:

$ terraform state list
data.aws_ami.ubuntu
aws_instance.database
aws_instance.web
aws_subnet.az-a
aws_subnet.az-b
aws_vpc.main

$ terraform state show aws_instance.web
# aws_instance.web:
resource "aws_instance" "web" {
    # ...
    id                           = "i-0d9728ba499e06a26"
    # ...
}

$ terraform state show aws_instance.database
# aws_instance.database:
resource "aws_instance" "database" {
    # ...
    id                           = "i-0e749fb20039b2efe"
    # ...
}

We want to delete any EC2 instances that aren't i-0d9728ba499e06a26 or i-0e749fb20039b2efe. I'll do this in the background, but after doing so we'll have rid ourselves of an orphaned EC2 instance costing us money. A new terraform plan reveals that there's now nothing to be implemented and everything is aligned correctly.
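Finding the orphan is just a set difference: everything EC2 reports, minus everything the state knows about. Here's a minimal Python sketch of that idea. The describe_output dictionary is a hand-trimmed stand-in for what aws ec2 describe-instances returns (I've kept only InstanceId and State), and live_instance_ids is a hypothetical helper.

```python
def live_instance_ids(describe_output):
    """Collect the IDs of all instances that haven't been terminated."""
    ids = set()
    for reservation in describe_output.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            if instance.get("State", {}).get("Name") != "terminated":
                ids.add(instance["InstanceId"])
    return ids


# Hand-trimmed stand-in for the JSON from `aws ec2 describe-instances`.
describe_output = {
    "Reservations": [
        {"Instances": [{"InstanceId": "i-001f4ac3ee0216ad5",
                        "State": {"Name": "running"}}]},
        {"Instances": [{"InstanceId": "i-0e749fb20039b2efe",
                        "State": {"Name": "running"}}]},
        {"Instances": [{"InstanceId": "i-0d9728ba499e06a26",
                        "State": {"Name": "running"}}]},
    ]
}

# The two IDs we read out of the state with `terraform state show`.
managed_ids = {"i-0d9728ba499e06a26", "i-0e749fb20039b2efe"}

orphans = sorted(live_instance_ids(describe_output) - managed_ids)
print(orphans)
```

In a shared account you'd want to narrow this down first (by VPC or tags, say), otherwise every instance Terraform never managed would show up as an "orphan" too.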

If we were to remove a resource from our state by accident, or thinking it didn't exist in AWS when in fact it does, there's another way we could bring it back into the state file...

Importing Resources into State

If you find yourself in a situation in which you need to write Infrastructure As Code against an existing environment then Terraform has you (mostly) covered: it allows you to import existing infrastructure/resources to match up against the resources you define in your HCL. Let's review an example.

We have an example HCL file and an active state as it stands now. But I want a third subnet in AZ C. It seems that my colleague has already created it in AWS via the console, unaware that I've started the IAC process. So what do I do? First I write the HCL into my main.tf file:

resource "aws_subnet" "az-c" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.99.3.0/24"
  availability_zone = "ap-southeast-2c"
}

Then I find the AWS ID of the resource via the AWS Console, which is subnet-0e88257dfabf7884f for me. Now I can import it into the state so that Terraform doesn't attempt to create the resource but instead embraces the existing resource and begins the life cycle of managing it:

$ terraform import aws_subnet.az-c subnet-0e88257dfabf7884f
aws_subnet.az-c: Importing from ID "subnet-0e88257dfabf7884f"...
aws_subnet.az-c: Import prepared!
  Prepared aws_subnet for import
aws_subnet.az-c: Refreshing state... [id=subnet-0e88257dfabf7884f]

Import successful!

The resources that were imported are shown above. These resources are now in
your Terraform state and will henceforth be managed by Terraform.

Now a terraform plan will tell us that there's nothing to build... however, looking at our new subnet in our state we can see those default properties and values that Terraform manages for us:

$ terraform state show aws_subnet.az-c
# aws_subnet.az-c:
resource "aws_subnet" "az-c" {
    arn                             = "arn:aws:ec2:ap-southeast-2:040925562967:subnet/subnet-0e88257dfabf7884f"
    assign_ipv6_address_on_creation = false
    availability_zone               = "ap-southeast-2c"
    availability_zone_id            = "apse2-az2"
    cidr_block                      = "10.99.3.0/24"
    id                              = "subnet-0e88257dfabf7884f"
    map_public_ip_on_launch         = false
    owner_id                        = "040925562967"
    tags                            = {}
    vpc_id                          = "vpc-0774b876c33ae71a9"

    timeouts {}
}

What happens if I import a subnet that was created with values other than the defaults? In short if our HCL code doesn't match these non-default values then a plan will actually show changes. To resolve this you have to carefully review the existing resource in AWS to ensure the HCL you're writing matches its configuration to avoid a code-to-state mismatch.
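That review boils down to comparing what you've written against what was imported. As a rough illustration, here's a Python sketch that compares a dictionary of the arguments declared in HCL with the attributes recorded in the state after an import. The imported values are hypothetical: I'm pretending my colleague enabled map_public_ip_on_launch in the console. The mismatches helper is my own, not part of Terraform.

```python
def mismatches(declared, imported):
    """Return {argument: (declared_value, imported_value)} where they differ."""
    return {
        key: (value, imported.get(key))
        for key, value in declared.items()
        if imported.get(key) != value
    }


# What the HCL asks for (map_public_ip_on_launch is left at its default).
declared = {
    "cidr_block": "10.99.3.0/24",
    "availability_zone": "ap-southeast-2c",
    "map_public_ip_on_launch": False,
}

# Hypothetical attributes recorded in the state after the import.
imported = {
    "cidr_block": "10.99.3.0/24",
    "availability_zone": "ap-southeast-2c",
    "map_public_ip_on_launch": True,
    "id": "subnet-0e88257dfabf7884f",
}

differences = mismatches(declared, imported)
print(differences)
```

Each entry in the result is something a plan would want to change, so you either update the HCL to match reality or accept that the next apply will change the resource.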

Tainting State

With a state full of resources it's possible you'll want Terraform to recreate one of them from time to time. One way of achieving this would be to comment out the code, do an apply, then uncomment the code and do a second apply.

A better option would be to use the taint command followed by a single apply:

$ terraform taint aws_instance.web
Resource instance aws_instance.web has been marked as tainted.

$ terraform apply
# ...

Terraform will perform the following actions:

  # aws_instance.web is tainted, so must be replaced
-/+ resource "aws_instance" "web" {
    # ...
    }

Plan: 1 to add, 0 to change, 1 to destroy.

#...

aws_instance.web: Destroying... [id=i-0d9728ba499e06a26]
aws_instance.web: Still destroying... [id=i-0d9728ba499e06a26, 10s elapsed]
aws_instance.web: Still destroying... [id=i-0d9728ba499e06a26, 20s elapsed]
aws_instance.web: Destruction complete after 30s
aws_instance.web: Creating...
aws_instance.web: Still creating... [10s elapsed]
aws_instance.web: Still creating... [20s elapsed]
aws_instance.web: Creation complete after 23s [id=i-0e8d871315ce3913e]

Apply complete! Resources: 1 added, 0 changed, 1 destroyed.

We can see tainting the resource inside of the state resulted in a new one being produced, but not before destroying the first. This is a quick and easy way to refresh a resource if you've edited it outside of Terraform, something else has edited it outside of Terraform, or you want to refresh it to get the side effects, such as getting an ASG to refresh all of its instances with a new Launch Template.
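Between the taint and the apply, the mark lives in the state file itself. The Python sketch below lists which resources are currently tainted. It rests on an assumption about the internal version 4 state format, namely that a tainted instance carries "status": "tainted"; treat that as something to verify against your own state rather than a documented interface. The state text is made up from the IDs above.

```python
import json

# Made-up state as it might look after `terraform taint aws_instance.web`
# and before the apply. The "status" field is an assumption (see above).
STATE_TEXT = """
{
  "version": 4,
  "resources": [
    {"mode": "managed", "type": "aws_instance", "name": "web",
     "instances": [{"status": "tainted",
                    "attributes": {"id": "i-0d9728ba499e06a26"}}]},
    {"mode": "managed", "type": "aws_instance", "name": "database",
     "instances": [{"attributes": {"id": "i-0e749fb20039b2efe"}}]}
  ]
}
"""


def tainted_addresses(state):
    """Return the addresses of resources with at least one tainted instance."""
    tainted = []
    for resource in state.get("resources", []):
        instances = resource.get("instances", [])
        if any(i.get("status") == "tainted" for i in instances):
            tainted.append(f'{resource["type"]}.{resource["name"]}')
    return tainted


marked = tainted_addresses(json.loads(STATE_TEXT))
print(marked)
```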

State Outputs

Another cool feature of the state is being able to define specific values that you're interested in dealing with after the state has been processed and finalised. These are called outputs. Let's define a few now.

We'll want the private IP addresses of our EC2 instances as well as the VPC ID for use later on:

output "web_ip" {
  value = aws_instance.web.private_ip
}

output "database_ip" {
  value = aws_instance.database.private_ip
}

output "vpc_id" {
  value = aws_vpc.main.id
}

The syntax is very simple. We're simply defining the name of the output, such as web_ip, and then stating how the value is derived. In this case it's the private IP address assigned to the instance by AWS from our subnet. Before the outputs are available in the state we have to first apply them and then we can output them:

$ terraform apply
# ...

Apply complete! Resources: 0 added, 0 changed, 0 destroyed.

Outputs:

database_ip = 10.99.2.105
vpc_id = vpc-0774b876c33ae71a9
web_ip = 10.99.1.105

$ terraform output
database_ip = 10.99.2.105
vpc_id = vpc-0774b876c33ae71a9
web_ip = 10.99.1.105

We actually got the outputs twice! The first was when I applied the changes to our code and the values were computed, and the second was when I explicitly asked for them.

One cool feature about the output command is it supports the -json option:

$ terraform output -json
{
  "database_ip": {
    "sensitive": false,
    "type": "string",
    "value": "10.99.2.105"
  },
  "vpc_id": {
    "sensitive": false,
    "type": "string",
    "value": "vpc-0774b876c33ae71a9"
  },
  "web_ip": {
    "sensitive": false,
    "type": "string",
    "value": "10.99.1.105"
  }
}

I wonder what cool things one could do with that ;-)
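Here's one idea, as a minimal Python sketch: turn the JSON into a plain dictionary and then into environment-style lines another tool could consume. The JSON is embedded as a string to keep the example self-contained; the TF_OUT_ prefix and both helper functions are my own inventions.

```python
import json

# The output shown above. In a real script you might capture it with
# subprocess.run(["terraform", "output", "-json"], capture_output=True,
#                text=True, check=True).stdout
OUTPUT_JSON = """
{
  "database_ip": {"sensitive": false, "type": "string", "value": "10.99.2.105"},
  "vpc_id": {"sensitive": false, "type": "string", "value": "vpc-0774b876c33ae71a9"},
  "web_ip": {"sensitive": false, "type": "string", "value": "10.99.1.105"}
}
"""


def flatten_outputs(output_text):
    """Return {name: value}, skipping anything marked as sensitive."""
    outputs = json.loads(output_text)
    return {
        name: item["value"]
        for name, item in outputs.items()
        if not item.get("sensitive")
    }


def as_env_lines(values, prefix="TF_OUT_"):
    """Render the values as NAME=value lines (the prefix is hypothetical)."""
    return [f"{prefix}{name.upper()}={value}" for name, value in sorted(values.items())]


values = flatten_outputs(OUTPUT_JSON)
env_lines = as_env_lines(values)
print("\n".join(env_lines))
```

From there it's a short hop to feeding the values into a deployment script, an inventory file, or a test suite.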

What questions do you have?

I've got a Discord server for taking questions in near real-time. I'm available most days and will generally respond in an hour or so (Australian time).

Come join us.


Last update: March 19, 2020