Content Migration

Imagine making the greatest ice cream sundae in the world. First, you start out with the best ingredients – sweet cream, cane sugar, real vanilla – and then you spend hours mixing them together. You add real whipping cream, homemade hot fudge, and the sweetest, most perfectly ripe banana and cherry the world has ever seen.

Then, to complete the masterpiece, you finish it off with a massive squirt of…ketchup.

You were doing great right up until the very end.

This is the story of many content migrations – the task of moving the existing content out of your old CMS and into your new CMS. Just as problems tend to occur when focusing too much on the software rather than the implementation, the exact same thing happens when those two are given too much precedence and the content migration is ignored. Organizations will find the perfect CMS, manage a fantastic implementation, and then completely botch the project at the very end with a disastrous content migration.

Remember, a CMS manages content. It’s only as good as the content you put into it. And the content your organization currently manages might be the result of years and years of creation, aggregation, and management. There could easily be millions of dollars invested in this content as a business asset. Ignoring this final phase of your project is disrespecting that content and devaluing all the effort that’s gone into it.

Migrations are often viewed as “extra” work. However, in some cases, the migration might be the majority of the project. A CMS implementation project might really be a migration project with a small development component attached to it.

More than one project has been canceled when the organization was confronted with the cost of moving all the existing content. In other cases, this cost drove the decision to simply upgrade and refresh an existing CMS, leaving the content where it was rather than moving it to a new CMS.

And as with content management itself, there is no Grand Unified Theory of Content Migration. Each one is idiosyncratic. Your current website uses a CMS that was modified through hundreds of implementation decisions. Your new website uses another CMS, which has likewise been modified by hundreds of different implementation decisions.

In this sense, both websites are unique little snowflakes, and there’s little way to generalize from migration to migration. The ability to perform a migration is less of a defined methodology, and more of a set of best practices and painful lessons that drive a unique plan for that specific migration.

Content migrations are simply an art and science all their own, and are chronically underestimated. Doing them effectively, on time, and within budget might be the hardest part of a project.

The Editorial Challenge

While the seemingly central challenge of a migration is to move bytes on disk from one system to another, the first challenge of a migration is actually editorial – what content is migrating, and how will it change?

For some projects (such as forklift implementations, discussed in The CMS Implementation), the answers are (1) all of the content, and (2) it won’t change at all. However, for many others, a migration presents a valuable opportunity to clean house, remove unwanted content, and change existing content to more effectively serve the organization’s goals.

A key point: the easiest content to migrate is content you don’t migrate. Now is the time to clean house. Reviewing your analytics for content that’s no longer accessed can remove an enormous amount of migration effort. I’ve seen intranet projects where 90% of the existing content was simply discarded. For any website that’s been in existence for multiple years, there’s little doubt that some of the content is simply no longer relevant.

Can these decisions be derived automatically? For instance, is it possible to say, “All news releases over three years old will be discarded”? Or will all these decisions require editorial input?
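
A rule like the news release example can be expressed in a few lines. The following is a minimal sketch, assuming each inventory record carries a content type and a publication date; the field names and the three-year cutoff are illustrative assumptions, not features of any particular CMS.

```python
from datetime import date

# Hypothetical rule: news releases older than three years are discarded.
MAX_NEWS_AGE_DAYS = 3 * 365

def disposition(item, today):
    """Return 'discard', 'migrate', or 'review' for one inventory record."""
    age_days = (today - item["published"]).days
    if item["type"] == "news_release":
        return "discard" if age_days > MAX_NEWS_AGE_DAYS else "migrate"
    return "review"  # no automatic rule applies, so a person decides

inventory = [
    {"id": 1, "type": "news_release", "published": date(2010, 5, 1)},
    {"id": 2, "type": "news_release", "published": date(2015, 3, 1)},
    {"id": 3, "type": "product_page", "published": date(2009, 1, 1)},
]
today = date(2015, 6, 1)
decisions = {item["id"]: disposition(item, today) for item in inventory}
```

Anything the rules can’t decide falls through to “review,” which keeps the editorial team in the loop for the content that actually needs their judgment.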

Many content migrations are preceded by content inventories, where all the existing content is identified and analyzed, and decisions are made regarding its future viability. Inventorying content is an art in itself, and far beyond the scope of this book, but the result of the inventory needs to be recorded somewhere, ideally in a form that can be used to make programmatic decisions about content.

Many inventories are accompanied by an unwieldy spreadsheet that is of little use to the developer trying to move content. A better idea is to record the intended disposition of content directly with the content itself, by adding Migrate to New Website or Requires Review checkboxes to the CMS, effectively making the existing CMS the record-keeping location for the content inventory. The developers performing the export of content from the existing CMS can then filter on those flags in their code, safely skipping any content not marked for migration.
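
Once the disposition lives with the content, the export code can act on it directly. This sketch assumes the two checkboxes surface as boolean fields named migrate_to_new_website and requires_review; those names are hypothetical.

```python
def select_for_export(items):
    """Keep only the items an editor has flagged to move."""
    return [item for item in items if item.get("migrate_to_new_website")]

def awaiting_review(items):
    """List the items that still need an editorial decision."""
    return [item for item in items if item.get("requires_review")]

items = [
    {"id": 1, "migrate_to_new_website": True, "requires_review": False},
    {"id": 2, "migrate_to_new_website": False, "requires_review": False},
    {"id": 3, "migrate_to_new_website": False, "requires_review": True},
]
to_export = [item["id"] for item in select_for_export(items)]
to_review = [item["id"] for item in awaiting_review(items)]
```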

The first milestone in a content migration will always be a definitive decision and recording of all the content that must be moved to the new system. It’s hard to plan any further steps in a migration without knowing at least this.

Automated or Manual?

The most low-tech method of migrating content will always consist of a person copying content from one browser window and pasting it into another. While admittedly tedious, it does have advantages:

The drawback, of course, is that manual migrations are labor intensive and tedious. However, they’re not always the wrong answer. For migrations of a small amount of content that will change significantly on the way over, a manual migration might be exactly the right answer.

The argument against manual migration often comes down to volume or cost. Decisions about manual content migrations need to take into account the cost and availability of personnel. Many manual migrations have been performed by interns or college work study students. It might not be glamorous work, but it’s often effective.

Crossing a certain threshold of content, however, will make automation the most cost-effective choice. Content migrations that don’t require significant editorial decisions during migration can usually be automated far more efficiently.
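
Where that threshold sits is a matter of arithmetic. The sketch below compares the two costs under entirely hypothetical numbers (10 minutes to copy a page by hand, 1 minute of manual cleanup per automated page, 80 hours to develop the script, and made-up hourly rates). Substitute your own figures; the point is the shape of the calculation, not the result.

```python
import math

def manual_cost(pages, minutes_per_page, editor_rate):
    """Cost of people copying and pasting every page."""
    return pages * minutes_per_page / 60 * editor_rate

def automated_cost(pages, script_hours, developer_rate,
                   cleanup_minutes_per_page, editor_rate):
    """Fixed cost of writing the script plus per-page manual cleanup."""
    return (script_hours * developer_rate
            + manual_cost(pages, cleanup_minutes_per_page, editor_rate))

def break_even_pages(minutes_per_page, editor_rate, script_hours,
                     developer_rate, cleanup_minutes_per_page):
    """Smallest page count at which automation is no more expensive."""
    saved_per_page = (minutes_per_page - cleanup_minutes_per_page) / 60 * editor_rate
    if saved_per_page <= 0:
        return None  # automation never pays for itself
    return math.ceil(script_hours * developer_rate / saved_per_page)

# All figures below are illustrative assumptions.
threshold = break_even_pages(10, 15.0, 80, 100.0, 1)
```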

That said, know that automation has limits, and it’s often easier to simply reconstruct selected content in the new CMS manually. Home pages, for example, tend to be very artisanal, with intricate content elements ordered and placed very carefully. Automating the extraction, importation, and placement of elements might be more trouble than it’s worth, especially when only a handful of pages need special handling. In these cases, be prepared for a hybrid approach, where certain content is simply rebuilt in place rather than automatically migrated.

The Migration Process

In a perfect world, you’d simply be able to open an administrative console in your new CMS and press a button that says, “Import content from [insert your existing CMS here].” This situation actually exists in some form for highly visible and competitive open source platforms like Drupal and WordPress, but it isn’t available for others.

Migrating content between two systems is usually a custom endeavor. There’s simply no standardization between systems, and even less standardization between methodologies of architecting content models. Even content coming out of and going into different installations of the same CMS might require significant changes, depending on how the content in each installation was modeled and aggregated. It would be rare for the content model in one installation to simply map directly to the content model in another installation.

Successfully migrating content is a loosely structured process, progressing through the following stages:

  1. Extraction. Content is extracted from the current environment.
  2. Transformation. Content is altered, to simply clean it up or to change it to work properly in the new environment.
  3. Reassembly. Content is aggregated to correctly fit the new environment.
  4. Import. Content is imported to the new environment.
  5. Resolution. Links between content objects are identified and resolved.
  6. QA. Imported content is checked for accuracy.

We’ll discuss each step in the process in greater depth in the following sections.
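
Before looking at the stages individually, it can help to see them as one chain. The following is a toy skeleton, assuming content objects are plain dictionaries with old_id, title, and parent_old_id keys (all hypothetical names). Each function stands in for work that is far more involved in a real migration.

```python
def extract(source):
    """1. Extraction: copy each object out of the old system."""
    return [dict(obj) for obj in source]

def transform(objects):
    """2. Transformation: clean values up for the new environment."""
    for obj in objects:
        obj["title"] = obj["title"].strip()
    return objects

def reassemble(objects):
    """3. Reassembly: order objects so the roots of the tree come first."""
    return sorted(objects, key=lambda obj: obj["parent_old_id"] is not None)

def import_objects(objects):
    """4. Import: store each object, keyed here by its old ID."""
    return {obj["old_id"]: obj for obj in objects}

def resolve(repository):
    """5. Resolution: turn parent IDs into references to imported objects."""
    for obj in repository.values():
        obj["parent"] = repository.get(obj["parent_old_id"])

def qa(repository):
    """6. QA: report objects whose parent could not be found."""
    return [old_id for old_id, obj in repository.items()
            if obj["parent_old_id"] is not None and obj["parent"] is None]

def run_migration(source):
    repository = import_objects(reassemble(transform(extract(source))))
    resolve(repository)
    return repository, qa(repository)

source = [
    {"old_id": 2, "title": " Team ", "parent_old_id": 1},
    {"old_id": 1, "title": "About", "parent_old_id": None},
    {"old_id": 3, "title": "Orphan", "parent_old_id": 99},
]
repository, problems = run_migration(source)
```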

Extraction

The content inside your current CMS will need to be accessed and transferred to a neutral format from which it can be transformed and imported. Content needs to be extracted at two levels: (1) individual content objects, which are broken into (2) individual content attributes.

So, you need to extract all of your articles, but also have those articles broken down by attribute.

So long as those two criteria are met, the actual target format doesn’t matter. XML is common, as is JSON. Even inserting the content into a simple database might work fine. It simply needs to be in a format that is free from presentation data (such as extra HTML inserted from a rendering template) and is easily manipulated and accessible. I’ve even seen extracted content simply stored in the new CMS, to be moved and refined later.
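
As an illustration of a neutral format, here is one possible JSON shape that satisfies both levels: one entry per content object, with each attribute named separately. The record layout and field names are assumptions made for this sketch.

```python
import json

def to_neutral(record):
    """Break one old-CMS record (hypothetical shape) into named attributes."""
    return {
        "old_id": record["id"],
        "type": record["type"],
        "attributes": {
            "title": record["title"],
            "body": record["body"],          # rich text, without template HTML
            "published": record["published"],
        },
    }

old_records = [
    {"id": 12, "type": "article", "title": "Launch Day",
     "body": "<p>We launched.</p>", "published": "2014-09-01"},
]
exported = json.dumps([to_neutral(r) for r in old_records], indent=2)
```

Keeping the old identifier in every exported object costs nothing at this stage and becomes essential later, when links between objects have to be reconnected.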

In a perfect world, your existing CMS has an export function that can give you all your content in a presentation-free format. Unfortunately, built-in export is often not supported, or it results in a format that isn’t workable for future steps in the migration process. Trying to work with a predefined export format over time might reveal that it would have been simpler to write your own export process in the first place.

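Without a usable export function, there are two other ways to extract content:

  • From the repository: Content is read directly out of the existing system’s repository, usually through the system’s API.
  • From the rendered site: Content is “screen scraped” from the HTML pages that the existing website outputs.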

While going directly to the repository might seem the simpler of the two methods, it depends greatly on the capabilities of the system’s API. The system might have a poor API, or be in a situation where the API is not available (a hosted system, for example, especially one with a vendor who doesn’t know their customer is planning to leave them).

Even if this is possible, there’s risk because the repository stores its content optimized for that particular system. It would be uncommon for the CMS to store content in a way designed specifically for export. From repository to screen, content might be transformed. How content sits inside the repository might not be how it’s output to the end user. This might be further changed by the templating code, making the actual HTML that is output substantially different from the HTML in the repository.

When screen scraping, you’re guaranteed to get the content in the correct output form (given that it is, in fact, being output at that exact moment), but you’re limited to the content that is actually output. There might be many unrendered, administrative content properties such as expiration dates, author names, permissions, and metadata that are not output to the end user.

Screen scraping is also limited by the quality of the current HTML. If the current site uses a CMS, then it’s probably templated, so you can expect at least a minimum amount of consistency. It’s even better when the templates can be modified to make this process easier – it can be very helpful, for instance, to temporarily put some content in clearly defined HTML structures, then simply hide those from the public via CSS during the extraction process. This content will still be available to the screen scraping process, but the page will not appear to have changed to the public.
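
Here is a minimal sketch of that technique using Python’s built-in html.parser module. It assumes the templates were modified to tag each value with a data-migrate attribute, which is a convention invented for this example, and that tagged elements contain only text.

```python
from html.parser import HTMLParser

class MarkedFieldParser(HTMLParser):
    """Collect the text of every element carrying a data-migrate attribute."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        name = dict(attrs).get("data-migrate")
        if name:
            self._current = name
            self.fields[name] = ""

    def handle_endtag(self, tag):
        self._current = None  # marked elements are assumed to hold only text

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] += data

page = """
<html><body>
<h1 data-migrate="title">Annual Report</h1>
<div style="display:none">
  <span data-migrate="author_id">634</span>
  <span data-migrate="expires">2016-01-01</span>
</div>
</body></html>
"""
parser = MarkedFieldParser()
parser.feed(page)
fields = parser.fields
```

The hidden block is invisible to visitors but hands the scraper properties, such as the author ID and expiration date, that would otherwise never reach the page.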

Sites that are currently static and not templated can be extremely problematic. When the HTML has been hand-coded, there’s usually much less consistency, and trying to extract data might be impossible. (Mercifully, sites that have been hand-coded are usually so small that it’s easier to just migrate them manually.)

No two extraction scenarios are the same. In any migration, a multitude of factors will need to be analyzed to determine the best method to extract content in a neutral format.

Transformation

When content has been extracted, it’s rarely in a form appropriate for your new CMS. There’s a good chance it came out with extra HTML tags or structure that is not appropriate for your new system and implementation standards.

For example, content that was created many years ago might be full of obsolete HTML tags, such as FONT and even BLINK. More commonly, styling information that was valid in your old implementation will have simply changed. The new implementation might have new CSS classes, new methods of specifying content headers, new methods of aligning images, etc.

HTML content will need to be changed to reflect these new standards. You will usually extract content that contains large blocks of HTML, and you can’t treat this HTML as an impenetrable unit. You will often need to “reach into” this HTML and change it in some way.

Common transformations include:

The end result should be HTML that can be imported into the new CMS and be compatible with new styles, coding standards, and rich text editors.
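
To make “reaching into” the HTML concrete, the following sketch performs two transformations of the kind described above: it unwraps obsolete FONT and BLINK tags (dropping the tag but keeping its contents) and renames CSS classes. It uses only Python’s standard library; the class mapping is hypothetical, and this simple version drops comments and doctype declarations.

```python
from html import escape
from html.parser import HTMLParser

OBSOLETE_TAGS = {"font", "blink"}            # unwrapped: tag removed, contents kept
CLASS_MAP = {"hdr-big": "heading-primary"}   # hypothetical old -> new class names

class Cleaner(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def _open(self, tag, attrs, closing=""):
        rendered = ""
        for name, value in attrs:
            if name == "class" and value:
                value = " ".join(CLASS_MAP.get(c, c) for c in value.split())
            rendered += f" {name}" if value is None else f' {name}="{escape(value)}"'
        self.out.append(f"<{tag}{rendered}{closing}>")

    def handle_starttag(self, tag, attrs):
        if tag not in OBSOLETE_TAGS:
            self._open(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        if tag not in OBSOLETE_TAGS:
            self._open(tag, attrs, closing=" /")

    def handle_endtag(self, tag):
        if tag not in OBSOLETE_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

def clean_html(html):
    parser = Cleaner()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

cleaned = clean_html(
    '<p class="hdr-big"><font color="red">Sale &amp; more</font> <blink>now</blink></p>'
)
```

A real migration would likely use a full parsing library, but the principle is the same: walk the markup, change what needs changing, and write the rest back untouched.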

While rich text requires the lion’s share of transformation, other data might need to be modified as well:

The number of potential transformations is limitless. Once the cleanest possible data has been extracted from the old CMS, the developer of the new implementation needs to evaluate it for potential problems and identify all the ways in which it must change before import.

Reassembly

When discussing content modeling in Content Modeling, we differentiated between discrete modeling and relational modeling. The former describes the information about content that is limited to the content object itself. The latter describes how that content fits in with (“relates to”) other content.

After you’ve extracted hundreds or thousands of content objects from your existing CMS, these objects will need to be assembled and organized to correctly reflect their relationships in the new system. It’s not only the content that has to be migrated, but the relationships between content as well.

Content trees, in particular, need to be transferred, which means content needs to be extracted in such a way that parent/child relationships remain intact or can be reconstructed. In some cases, this might mean exporting the parent ID with each content object. If you’re screen scraping, this might mean outputting the parent ID in a META tag, or even attempting to reverse engineer the hierarchy from the URL paths (assuming they correctly reflect the tree structure).
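
Reverse engineering the hierarchy from URL paths can look something like the sketch below. It assumes, as noted, that the paths really do mirror the tree; the example.com URLs are placeholders.

```python
from urllib.parse import urlparse

def parent_path(url):
    """Return the path of the parent page, or None for the site root."""
    path = urlparse(url).path.rstrip("/")
    if not path:
        return None
    parent = path.rsplit("/", 1)[0]
    return parent or "/"

def build_tree(urls):
    """Map each page path to the list of its child paths."""
    children = {}
    for url in urls:
        path = urlparse(url).path.rstrip("/") or "/"
        children.setdefault(path, [])
        parent = parent_path(url)
        if parent is not None:
            children.setdefault(parent, []).append(path)
    return children

tree = build_tree([
    "http://example.com/",
    "http://example.com/about",
    "http://example.com/about/team/",
])
```

Note that a parent inferred this way might not exist as a page at all (a folder with no index page, for example), so the result still needs to be checked against the content that was actually extracted.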

In some cases, there is simply no way to reconstruct the structure of content. This might be due to an inherent structural parameter (thousands of blog posts ordered by nothing but date, for example), or because of poor organization and architecture in a legacy site.

Sites that have grown organically over time often reflect poor and idiosyncratic navigation, where menu options were added on an ad hoc basis to create a desired navigation pattern without any thought to an overarching content geography. These sites can be notoriously hard to migrate since it’s hard to impose structure on something that was poorly structured at best, and wildly unstructured at worst.

In these cases, content might have to be imported without relational structure and then structured in the new system. Groups of content can be imported to a “holding area” on the new site, then organized using the tools of the new system.

Import

Up until this point, we’ve only been getting content out of the old system. Once content has been extracted, transformed, and reassembled into a workable structure, the content actually needs to be brought into the new CMS. This is usually a task involving custom programming.

The only exception would be when your new system has an import function, and it has a known, documented format where you can organize your exported content. This is rare.

In most systems, a developer will write a custom job to get new content into the system. This can either be in the form of a standalone program that uses a web service or similar API to “push” content, or as code that runs inside the new system that “pulls” content.

In many cases, the developer will not just have to import the content, but will have to create other data structures to support secondary geographies, such as tags, categories, or menus.

For example, if your content objects are assigned to categories in the old system, then these categories will need to be created in the new system in advance of a migration (perhaps through a separate “pre-import” script), or created in real time as content is imported. Either way, incoming content will have to be checked for category assignments, and any category that doesn’t yet exist will need to be created at that time.
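
The “created in real time” option is often implemented as a get-or-create lookup. In this sketch, CategoryStore is a hypothetical in-memory stand-in for whatever category API the new CMS provides.

```python
class CategoryStore:
    """Hypothetical stand-in for the new CMS's category API."""

    def __init__(self):
        self._ids_by_name = {}
        self._next_id = 1

    def get_or_create(self, name):
        key = name.strip().lower()   # treat "News" and " news" as one category
        if key not in self._ids_by_name:
            self._ids_by_name[key] = self._next_id
            self._next_id += 1
        return self._ids_by_name[key]

def assign_categories(item, store):
    """Return new-system category IDs for one incoming content object."""
    return [store.get_or_create(name) for name in item.get("categories", [])]

store = CategoryStore()
first = assign_categories({"categories": ["News", "Products"]}, store)
second = assign_categories({"categories": ["products ", "Events"]}, store)
```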

Also, given the iterative nature of content migrations (discussed more later in this chapter), an import job cannot assume the content hasn’t already been imported once before. Any particular execution of an import job might be a rerun to update or refine imported content. This being the case, any import job needs to determine if the content object being imported already exists. If so, the existing object should be updated in place.

There might be a temptation to simply delete the imported object and recreate it, but this becomes complicated when dealing with relational content. Once imported object X has a “resolved” relationship (see the next section) to imported object Y, a deletion and recreation will break that relationship. As such, once created, imported objects should be updated.
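
One way to make an import job safe to rerun is to key every write on the old identifier. The following sketch uses a hypothetical in-memory FakeCms class in place of a real repository; the important part is that a second import of the same object updates it and keeps its new ID stable.

```python
class FakeCms:
    """Hypothetical in-memory stand-in for the new CMS repository."""

    def __init__(self):
        self.objects = {}        # new ID -> content object
        self._new_by_old = {}    # old ID -> new ID
        self._next_id = 1000

    def upsert(self, old_id, attributes):
        """Create the object on first import; update it in place on reruns."""
        new_id = self._new_by_old.get(old_id)
        if new_id is None:
            new_id = self._next_id
            self._next_id += 1
            self._new_by_old[old_id] = new_id
            self.objects[new_id] = {"old_id": old_id}
        self.objects[new_id].update(attributes)
        return new_id

cms = FakeCms()
first_run = cms.upsert(634, {"title": "Bob Jones (draft)"})
second_run = cms.upsert(634, {"title": "Bob Jones"})
```

Because the new ID never changes, anything that already points at the object survives the rerun.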

Resolution

Content objects have links between them. They might exist in a geography that was recreated during the reassembly phase discussed earlier, but they might also have explicit references – the Author property of an Article object, for example, might link to another content object. Additionally, there might be numerous HTML links inside rich text.

These links will likely break during extraction and import. If an HTML link deep inside the rich text of content object X links to content object Y, you need to ensure that link is still valid once X and Y have moved into their new system. When migrating content, the URL structure of content often changes. These internal links need to be found and corrected to represent the new URL structure.

To do this, you must always store an old identifier with the new content object. The imported content object must know where it came from, which means it needs to know the ID or URL of the corresponding content object in the old CMS. It’s quite common to create temporary properties on content types in the new CMS to hold these values during development and migration, then delete these fields and their values after a successful launch, when they’re no longer needed.

The ability to discover links between content objects depends highly on the API of the existing system. When processing an Article, can you simply export the ID of the Author? Or does your existing CMS store that as the public URL to the author? Or does the API of the system give you the entire Author content object when that property is referenced?

For referential attributes, attempt to export an identifier if at all possible. If your article links to an author, bring over the ID of that Author object as the value of the attribute. You’d much rather know that the Author is content object #634 than that it’s “Bob Jones.” In the latter case, you’re going to have to search for authors named “Bob Jones,” and hope there’s only one of them.

The process of reconnecting or “resolving” all these references happens at the end of an import job. Content is imported with broken links, then once all the content is in the new system, those links are resolved to point back to the correct objects. This cannot be done as content is imported, because there’s no guarantee that the target object is already in the system – an Article might be imported before its Author is imported, for example.
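
A second pass of this kind might look like the sketch below. It assumes every imported object kept its old ID, and that references were imported as old IDs in a field with an _old_id suffix; both conventions are assumptions made for this example.

```python
def resolve_references(objects, ref_field):
    """Swap old-system IDs for new-system IDs once everything is imported.

    objects maps new ID -> content object. Returns the new IDs of objects
    whose reference could not be resolved.
    """
    old_to_new = {obj["old_id"]: new_id for new_id, obj in objects.items()}
    unresolved = []
    for new_id, obj in objects.items():
        old_ref = obj.get(ref_field + "_old_id")
        if old_ref is None:
            continue
        if old_ref in old_to_new:
            obj[ref_field] = old_to_new[old_ref]
        else:
            unresolved.append(new_id)
    return unresolved

# The article (1000) was imported before its author (1001).
objects = {
    1000: {"old_id": 12, "type": "article", "author_old_id": 634},
    1001: {"old_id": 634, "type": "author"},
    1002: {"old_id": 13, "type": "article", "author_old_id": 999},
}
unresolved = resolve_references(objects, "author")
```

The list of unresolved objects is worth keeping: it’s an early warning that something was skipped during export or import.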

In some cases, you might have to adjust your content model to allow weaker references during import. For example, if the Author property of your Article content type is intended to be required, you might have to relax this during import to allow Articles to be imported without an Author, then have the Author resolved later in the process. Once all content is in, the references can be resolved, and required restrictions can be reenabled.

To resolve HTML links, you will usually have to parse the HTML, which means finding a competent parsing library such as AngleSharp for .NET or Beautiful Soup for Python. All HTML needs to be processed, looking for all anchor or media tags, which then must be examined to determine if they link to external websites or internal content objects. For anchors linking to other objects that are imported, those objects need to be found based on the link and have the target of the link changed to reflect the new URL (or alternate method of linking). The URL should be inserted in the correct repository format for the new CMS, which might not be the public URL, but rather a placeholder URL intended for request-time resolution.
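
The sketch below shows the general shape of that process. To stay self-contained it uses Python’s built-in html.parser rather than one of the libraries named above, and it makes several assumptions: the old hostname, the cms:// placeholder format, and the map from old paths to new targets are all hypothetical. Query strings and fragments on rewritten links are dropped in this simple version.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlparse

OLD_HOST = "www.example.com"   # hypothetical hostname of the old site

class LinkRewriter(HTMLParser):
    def __init__(self, url_map):
        super().__init__(convert_charrefs=True)
        self.url_map = url_map   # old path -> new URL or placeholder
        self.out = []
        self.unmatched = []      # internal links with no imported target

    def _new_target(self, url):
        parts = urlparse(url)
        if parts.scheme not in ("", "http", "https") or not parts.path:
            return url           # mailto:, #fragment, and similar: leave alone
        if parts.netloc not in ("", OLD_HOST):
            return url           # external website: leave alone
        if parts.path in self.url_map:
            return self.url_map[parts.path]
        self.unmatched.append(url)
        return url

    def handle_starttag(self, tag, attrs):
        rendered = ""
        for name, value in attrs:
            if value is not None and (tag, name) in (("a", "href"), ("img", "src")):
                value = self._new_target(value)
            rendered += f" {name}" if value is None else f' {name}="{escape(value)}"'
        self.out.append(f"<{tag}{rendered}>")

    def handle_startendtag(self, tag, attrs):
        self.handle_starttag(tag, attrs)   # re-emitted without the trailing slash

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

def rewrite_links(html, url_map):
    parser = LinkRewriter(url_map)
    parser.feed(html)
    parser.close()
    return "".join(parser.out), parser.unmatched

body = ('<p><a href="/news/2014/launch.aspx">Launch</a> and '
        '<a href="https://other.example.org/x">elsewhere</a> '
        '<a href="/gone">gone</a></p>')
rewritten, unmatched = rewrite_links(body, {"/news/2014/launch.aspx": "cms://content/1000"})
```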

Normally, the resolution of content references doesn’t happen immediately. It’s common for several import jobs to occur before all the content is imported successfully and reference resolution can begin.

QA

Once content is in the new CMS and the links are resolved, migration QA can begin. Migration QA is designed to verify that content was moved into the new system successfully.

It has two levels:

  • Functional QA: This can be performed by someone with no domain knowledge, which means no knowledge of what the content actually means. All this tester is reviewing is whether or not the content is generally intact – whether all the content properties are populated, whether all the links work, whether any images are broken, etc. This person does not need to understand the content itself.
  • Domain QA: This needs to be performed by someone with domain knowledge, which means an understanding of the subject matter of the content. This tester is reviewing whether content is in the right place in the navigation, whether it was categorized correctly, if it’s responding to search queries correctly, etc. This person needs to be qualified to make editorial decisions about content.

Ideally, there will be a specific checklist of content to review and a highly structured method of recording problems. If a tester finds a problem with content, where is that information logged? In many cases, adding temporary content properties is helpful, such as a checkbox for Migration QA Complete or a text box to record migration defects directly in the content object itself. Alternately, the ticket or issue management system used for functional QA can be used for migration defects.

When a defect is found, it needs to be evaluated for scope. Defects can be one of two types:

  • Import defects: These are defects that need to be fixed at the import level, which means they’re likely widespread. Often, small defects are harbingers of a larger problem. Finding one or two articles that have no Author property populated might reveal that a large portion of Author objects were accidentally skipped during migration and the only solution is to rerun the import and start over. Import defects can be very disruptive, and the entire migration team might need to stop in the middle of what they’re doing while the import is corrected and rerun.
  • Object defects: These are defects specific to a particular content object. These aren’t the result of the import, but are issues that were either present on the old site and carried over, or resulted from something introduced through interaction with the new CMS – a missing style or JavaScript library, for instance. You do not have to reexecute the entire import for these, but they need to be marked for manual correction after the import has run for the final time.

Efficiency is key in these situations. Having the new website on one screen and the old website on another screen can ease the process of comparing versions of content. If the old URL is stored with the new content object, the old page could even be displayed under the new page in a temporary IFRAME, so testers can review both simultaneously.

Automated QA can be helpful during migration testing. Having a link checker running once a day and delivering a report of broken links can increase the testers’ ability to find problems.
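
A production link checker would request each URL over HTTP. As a simplified, offline illustration of the idea, the sketch below checks site-relative links in a set of rendered pages against the set of pages that actually exist; the page data is hypothetical.

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Gather the href of every anchor tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

def broken_internal_links(pages):
    """pages maps site-relative URL -> rendered HTML.

    Returns (page URL, broken href) pairs for site-relative links that
    point to a page not present in the set.
    """
    report = []
    for url, html in sorted(pages.items()):
        collector = HrefCollector()
        collector.feed(html)
        for href in collector.hrefs:
            is_internal = href.startswith("/") and not href.startswith("//")
            if is_internal and href not in pages:
                report.append((url, href))
    return report

pages = {
    "/": '<a href="/about">About</a> <a href="/news">News</a>',
    "/about": '<a href="/">Home</a> <a href="http://example.org/">Partner</a>',
}
report = broken_internal_links(pages)
```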

Migration Script Development

The process of automated migration tends to be iterative, with phases running in cycles. It’s very much a process of performing some action, reviewing the result, modifying the process, then repeating.

The goal is to develop a migration script that exports content, transforms it correctly, imports it, and resolves all the references in one uninterrupted execution that might take minutes or hours. Then this script can be executed immediately prior to launch. All prior work during the migration cycle might be considered a “dry run” for the actual migration to take place closer to launch.

The word “script” here has dual meaning: it usually takes the form of an actual programming script that is executed, and in a more generic sense, it refers to a choreographed series of actions – both machine-powered and human-powered – that are intended to be executed in sequence at a later time.

Migration script development often looks like this:

  1. Concurrently with the start of implementation of the new CMS, a developer begins investigating options for exporting content from the existing CMS. Multiple methods might be tested until one is identified that presents the fewest obstacles.

  2. Once a workable method of export is found, the developer performs a test export. The results are reviewed, often found to be deficient in some way (a property is missing, the references are not correct, etc.), and the export is repeated. The developer might iterate through this cycle for days or weeks until arriving at an export that is deemed acceptable.

  3. The exported content is compared against the requirements for the new CMS (which, in many cases, are still developing), and required transformations are identified. Methods of making these transformations are developed and incorporated into the export job, which can then be rerun with the transformations executed in real time.

  4. When the new CMS has reached a state where content can be imported (at the very least, the content model must be implemented), an import job is developed to bring the exported content into the new CMS. Like the export, the import is performed once, reviewed, often found to be lacking, modified, and run again. This process is repeated multiple times until the imported content is found to be satisfactory. Often, the process of importing reveals a defect with the export or transformation, which moves the developer backward in the process.

At a certain point, the migration script has been refined to the point where further work is inefficient. If the launch date is still far in the future, development on the migration script might halt for weeks or months at this stage until the launch date approaches.

Content Velocity and Migration Timing

The rate of content change on a particular website can be referred to as its “velocity.” A news website has a high velocity of content, meaning new content is added multiple times per day. A small website for, say, a dental office might have a slower velocity, with pages that change every few months at most.

Even different areas on the same website can have differing velocities. On a high-traffic media site, content like the privacy policy likely has an extremely low velocity. It may be reviewed once a year, at most, and change once every few years.
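
If the existing CMS can report when content was changed, velocity can be estimated rather than guessed. This sketch assumes a change log of (section, date) pairs, which is a hypothetical format, and computes the average number of changes per day for each section.

```python
from collections import Counter
from datetime import date

def velocity_by_section(changes, start, end):
    """Average content changes per day for each section, from start to end inclusive."""
    days = (end - start).days + 1
    counts = Counter(section for section, day in changes if start <= day <= end)
    return {section: count / days for section, count in counts.items()}

# Hypothetical log: news changes twice a day, the legal section once in ten days.
changes = [("news", date(2015, 6, d)) for d in range(1, 11) for _ in range(2)]
changes.append(("legal", date(2015, 6, 5)))
velocity = velocity_by_section(changes, date(2015, 6, 1), date(2015, 6, 10))
```

Sections with a velocity near zero are good candidates to migrate early; high-velocity sections are the ones that will drive the timing of the final run.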

The perfect content for migration has a velocity of zero, meaning the content will not change from the beginning of migration to the launch of the new website. Referring to the migration cycle we just discussed, a developer can begin exporting content and know that none of that content will change during the inevitable trial and error process that might take weeks or months.

In the real world, content will change. The content that is initially exported early in the cycle might change the very next day. Thus, the ideal situation is to refine the migration script to the point that nothing further is required to migrate content, and then run the completed script immediately prior to launch.

This type of “push-button migration” is a bit of a mirage. It can be done, but usually takes an enormous amount of work. Migrations can be idiosyncratic, in that specific content items might need intricate fine-tuning that’s not easily scriptable. These will surface as object defects in the QA process. These one-off content corrections are quite common in order to fix problems with individual content items that are not efficient to incorporate into the migration script.

What normally happens is that a developer refines the migration script until further refinement is impossible or inefficient. The developer might get the migration script to the point where the content is extremely close to a launchable state. Even so, there will almost always be some amount of manual correction that needs to take place after the script completes execution.

The goal is to run this script as close to launch as possible, in order to include the most recent content changes from the existing site, then plan and execute the manual interventions immediately between that moment and launch.

At a scheduled point prior to launch, the migration script is executed for the final time. Rehearsal is over, and this is the actual migration. Content brought over during this execution will be officially considered “rehomed” in the new CMS. Unless mass import defects are found during QA, the migration script will not be executed again.

This period of time starts what’s known as a “content freeze,” because the editorial team is told to cease content changes on the existing site. Once the migration script has executed for the final time, the old site should not be changed because those changes will never make it to the new site. Content on the existing site is considered frozen, and cannot be changed until the new site is launched and it is changed there.

Content freezes are always stressful, as the editorial team has their hands tied while the organization has one foot in the old system and one foot in the new system. The goal is to resolve the object defects and finish the fine-tuning required to launch the new site as soon as possible and to allow the editorial team to begin managing content in the new system.

Sadly, some projects can run into major problems right before launch that push the launch date back. In these cases, staying in a content freeze might not be reasonable, and it makes sense to allow content editing to resume in the existing system with the intention of rerunning the migration script closer to the new launch date. Any manual interventions that were already made to the migrated content might be lost and have to be repeated during the new content freeze prior to the new launch date.

For these reasons, the timing of a migration can be an intricate balance between the velocity of content changes and the intended launch date. The goal is to refine a migration script to the point where manual interventions are minimal, and to schedule and execute those interventions during a content freeze window that is kept as short as is reasonably possible.

A Final Word of Warning

Do not underestimate a content migration. It can easily be the most labor-intensive and riskiest portion of a CMS implementation.

As soon as the CMS project is identified, a content inventory should be started to identify which content is moving and how it needs to be changed. You do not need to even know the new CMS platform to start this. If you know a migration will have to occur, it’s time to start planning.

If you’re ambitious and have the capacity, a developer might even start on extraction prior to any activity on the new CMS. Remember, the content has to be extracted at some point, and the extraction is fairly universal and not particularly dependent on the new CMS.

Work on the actual migration script should begin concurrently with development, as show-stopping problems with content import, export, and transformation are common. Do not simply lump migration script development in with other development work. Development of the migration script should be an assigned task, just like any other, and the developer should be given adequate time to complete it. In migration-intensive projects, a developer might be assigned to migration work and nothing else.

The migration script can often be some of the most complicated code in the entire project. And while it is temporary code, resist the urge to treat it as such. Good development practices should still be followed, including source control, testing, and continuous integration. This code is just as important to the success of the implementation as anything else the development team does.
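As a minimal sketch of what testing migration code might look like, the example below pairs one small transformation step with two checks. The `strip_blink` function is hypothetical and stands in for whatever cleanup rules a real migration needs; it is not taken from any CMS or library.

```python
import re

# Hypothetical transformation step: drop BLINK tags but keep the text inside them.
_BLINK_TAG = re.compile(r"</?blink\b[^>]*>", re.IGNORECASE)

def strip_blink(html):
    """Remove opening and closing BLINK tags, leaving the enclosed text."""
    return _BLINK_TAG.sub("", html)

# Plain assert-based checks; a test runner such as pytest could collect these.
def test_removes_tags_and_keeps_text():
    assert strip_blink("<p><BLINK>Sale!</BLINK></p>") == "<p>Sale!</p>"

def test_leaves_other_markup_alone():
    assert strip_blink("<p>No change</p>") == "<p>No change</p>"

test_removes_tags_and_keeps_text()
test_leaves_other_markup_alone()
```

Keeping each transformation in a small function like this makes it possible to rerun the checks every time the script changes, which matters when the script is executed repeatedly in the run-up to launch.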

Editorial staff need to be acutely aware of the migration schedule. They need to know, long in advance, when a content freeze will be imposed. During this time, it usually becomes an “all hands on deck” environment as the team works to QA and fine-tune migrated content in preparation for launch. Having half your editorial team go on vacation during the final weeks prior to launch is a recipe for a failed migration attempt.

Finally, over-budget for your migration, in terms of both time and funding. Too many projects have fallen over right before launch because of a migration that simply wasn’t planned adequately. The industry is saturated with stories of new CMS implementations that stood idle for months, or even years, waiting for content to be migrated.

Footnote #1

Clearly, I know nothing about making ice cream.

Footnote #2

A technical reviewer noted, “Have a yard sale before you move.”

Footnote #3

Paula Land has written a handbook called Content Inventories and Audits on this subject (XML Press). Similarly, David Hobbs (see the sidebar at the end of the chapter) has written a report on the topic called “Rethinking the Content Inventory.”

Footnote #4

Some might say that a decoupled system is designed in exactly this way, and that the act of publishing is really a form of export.

Footnote #5

Not to mention prior versions of content, though I have yet to see a content migration that bothered to bring over any version other than the current, published version. Bringing over the entire version history of every content object would be extremely ambitious. Many systems don’t even have the ability to explicitly recreate older versions from the API (by design), so the content would have to be first imported as its oldest known version, then successively overwritten with newer versions, while hoping that all relational content references in use for a particular version would also be available at the time the object was being imported. Suffice it to say that most organizations are satisfied with simply keeping the old CMS available somewhere in case they have to refer to older versions of content.
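The oldest-first replay described above can be sketched as follows. This is an illustration under stated assumptions, not a real import routine: the `number` and `body` keys are hypothetical, and `save` stands in for whatever call writes a version into the new CMS.

```python
def replay_versions(versions, save):
    """Import an object's history oldest-first so the newest version is saved last.

    `versions` is a list of dicts with hypothetical `number` and `body` keys;
    `save` is any callable that writes one version into the new system.
    """
    for version in sorted(versions, key=lambda v: v["number"]):
        save(version["body"])

# Stand-in for the new CMS: a plain list records each save in order,
# and the last entry plays the role of the current version.
history = []
replay_versions(
    [
        {"number": 3, "body": "third draft"},
        {"number": 1, "body": "first draft"},
        {"number": 2, "body": "second draft"},
    ],
    history.append,
)
print(history)
```

The sketch deliberately ignores the harder part: any content referenced by a given version would also have to exist in the new system at the moment that version is saved.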

Footnote #6

Carrying a BLINK tag over to a new implementation might violate international treaties. Check with your attorney.

Footnote #7

I remember a particularly difficult project with an existing CMS that had no built-in hierarchy and a new CMS with a very strong content tree. Unfortunately, the URLs had been “SEO optimized” to make all content appear to be on the top level, containing just a single URL segment. With absolutely no other way to figure out content geography, we were reduced to parsing the HTML that formed the crumbtrails and reconstructing the hierarchy from that information.

Footnote #8

Which is, let’s face it, just another form of ID.

Footnote #9

In some cases, a CMS will use a GUID (a 128-bit value, usually written as 32 hexadecimal digits) as an ID. With these systems, explicitly specifying the ID on content creation is sometimes possible. If both CMSs have this format, it’s theoretically possible to retain the same IDs during a migration. Clearly, however, this would be rare, and even then, the actual text of the link (which is detected and replaced) would be different.
