Drupal on Amazon EC2 persistent storage problem
On a regular computer, the operating system and applications are loaded from a local hard disk into memory. Every application you launch runs in memory. When you want to keep data, you need to store it on the local hard disk. Memory is fast, but it is not persistent. The local hard disk is slow, but it is persistent. Memory is also much smaller than the hard disk: a machine might have 2GB of memory and a 160GB hard disk. An application such as a database runs in memory (for speed) but writes all data to the hard disk so that nothing is lost when the computer crashes.
With Amazon EC2, this problem is complicated even further. An EC2 virtual machine has 1.75GB of memory and a hard disk of 160GB. The virtual machine is launched from an Amazon Machine Image (AMI), and once the machine is running it can write to the hard disk. However, when the machine fails, the newly written information is not persisted. Restarting the machine reloads the original AMI, and any information you saved to the hard disk in a previous instance is lost. If you want to persist data, you need to store it on Amazon S3, which is slower than the hard disk. So Amazon S3 plays the role that the hard disk plays on a regular computer, but it adds an extra level of storage: data is only safe once it has gone from memory to the hard disk and from there to S3.
This means that regular applications, which think they are persisting data to the hard disk, lose all their information when the instance crashes. And instances can crash!
I think that is the major problem when working with Amazon EC2. The storage paradigm changes, and it takes time before applications are adapted to it.
In general, the problem can be solved by performing regular backups of the data to Amazon S3, and automatically loading the latest backup when the EC2 instance starts. But a backup every hour means you can still lose up to an hour of information. It can also be solved by running multiple EC2 instances, replicating the data to a second machine, and then backing up that second machine. For me, the ideal solution would be for applications to become aware of this problem and solve it themselves, by storing their information on S3. Some people have started doing this for a number of applications. For example, Mark Atwood built a MySQL storage engine for AWS S3. Other people are working on (or thinking about?) an Amazon S3 file store for Alfresco.
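To make the backup-and-restore idea concrete, here is a minimal Python sketch. It is only an illustration under assumptions of mine: the names `InMemoryStore`, `backup` and `restore_latest` are hypothetical, and the in-memory store merely stands in for an S3 bucket so the example can run without an Amazon account. A real setup would replace it with calls to an actual S3 client.

```python
import io
import zipfile
from pathlib import Path


class InMemoryStore:
    """Hypothetical stand-in for an S3 bucket: maps object keys to bytes."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = data

    def get(self, key):
        return self.objects[key]

    def keys(self):
        return sorted(self.objects)


def backup(data_dir, store, timestamp):
    """Zip every file under data_dir and store it under a timestamped key."""
    data_dir = Path(data_dir)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(data_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to data_dir so the restore is relocatable.
                archive.write(path, path.relative_to(data_dir).as_posix())
    key = f"backups/{timestamp}.zip"
    store.put(key, buffer.getvalue())
    return key


def restore_latest(store, data_dir):
    """Unpack the newest backup into data_dir; return its key, or None."""
    keys = [k for k in store.keys() if k.startswith("backups/")]
    if not keys:
        return None  # fresh instance, nothing to restore yet
    latest = keys[-1]  # ISO-style timestamps sort chronologically as strings
    with zipfile.ZipFile(io.BytesIO(store.get(latest))) as archive:
        archive.extractall(data_dir)
    return latest
```

The idea would be to call `backup` from a scheduled job (every hour, say) and `restore_latest` from the instance's startup script. Anything written after the last successful `backup` call is still lost when the instance crashes, which is exactly the one-hour window mentioned above.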
I haven’t tried any of these solutions yet, so I can’t tell you how well they work at this time. When I want to run Drupal on EC2, I will need to find some solution to this problem.