Improve your workflow when using exports from Content Analysis Tool in Excel. Tips and tricks to clean-up its raw crawl data.

Content Inventory Example

Content audits and content inventories are essential documents for understanding the scale, scope, fate and quality of content. However they notoriously laborious to create and maintain.

I typically use Content Analysis Tool (CAT) to automate an initial inventory, saving considerable effort and improving accuracy over a manual inventorying process.

However a raw job-level export typically needs cleaning-up to begin to make its data more usable and useful; both for the lone content wrangler and for any collaborative audits with other content or domain experts.

These tips will help improve your workflow as you begin to work with CATs raw crawl data from your site.

Content inventory clean-up tips

  1. Transform bytes to kilobytes (Kb)
  2. Condensed, readable and usable URLs
  3. Find duplicates and junk via conditional formatting
  4. Make MIME content types understandable
Download example inventory (438Kb, Excel)

Transform bytes to kilobytes (Kb), megabytes (Mb) and gigabytes (Gb)

Problem:

CAT has provided you with a file size as a string of numbers for a content item. This is hard to read.

Example: Column D “Size” : Row 2 “22146”

Solution

Transform 22146 (bytes) into 21.63K (kilobytes), an understandable and qualified number, using this formula:

=IF(('Raw catExport'!D2>=POWER(2,30)),
TEXT(('Raw catExport'!D2/POWER(2,30)),"##0.00\G"),
IF(('Raw catExport'!D2>=POWER(2,20)),
TEXT(('Raw catExport'!D2/POWER(2,20)),"##0.00\M"),
IF(('Raw catExport'!D2>=1024),
TEXT(('Raw catExport'!D2/1024),"##0.00\K"),
TEXT(('Raw catExport'!D2),"##0.00\B")
)
)
)

Tip: change 'Raw catExport' and !D1 to match where you are working in your inventory.

Condensed, readable and usable URLs via CONCATENATE and HYPERLINK functions

Problem

You find it frustrating to have long and complex URLs that:

  • take up too much horizontal screen space.
  • repeat useless data about the protocol, domain, parent folders.
  • do not link to the live content for easy referencing of production.

Example: 

  • http://www.domain.com/root-folder/interesting-folder/file1.html
  • http://www.domain.com/root-folder/interesting-folder/file2.html
  • http://www.domain.com/root-folder/interesting-folder/file3.html

versus

Solution

You will be working with three worksheets:

  1. The original worksheet from CAT.
  2. A new worksheet with a complete copy of the original worksheet.
  3. An additional worksheet to store the part of the URL that you don’t want to see (but which a formula must reference).

These are Cleaned catExport, Raw catExport and URL in the example.

The process
  1. Create a copy of your export in a new worksheet.
  2. In both worksheets use Find and Replace to remove the generic parts of the URL that you’d like to hide, e.g., “http://www.domain.com/root-folder”.
  3. In a new, third, worksheet create the part of the URL that you removed in step two so that it can be referenced by a formula, i.e., paste “http://www.domain.com/root-folder” into cell A1.
  4. In the cloned worksheet use the following formula to create the concatenated hyperlink:

=HYPERLINK(CONCATENATE(URL!$A$1,'Raw catExport'!A2),'Raw catExport'!A2)

Tip: If your worksheet names are a single word or phrase then you don’t need to wrap them in 'single quotes' in the formula.

Find duplicates and junk via conditional formatting

Problem

CATs crawl of your content has returned duplicate items or junk that you want to rapidly identity and exclude from your inventory.

Example: ../_edit type link suffixes for the Editors in your CMS:

  • ../root-folder/interesting-folder/file1.html
  • ../root-folder/interesting-folder/file1.html /_edit 

Solution

  1. Identify a pattern of duplication or junk that you’d like to highlight or Filter, like /_edit.
  2. From the Home menu ribbon choose Conditional Formatting.
  3. Then select Highlight Cell Rules, Text that Contains… .
  4. Consider setting the beginning with, or ending with rules.
  5. Enter your pattern, like: ending with “ /_edit ”.
  6. Run the Conditional Formatting.
  7. Filter the column where the pattern occurs and review.

Tips:

  • Use Conditional Formatting to find duplicate items.
  • Use  colour  to flag items and to Filter them.

Make MIME content types understandable

Problem

CAT returns full MIME content type descriptions in Column C “Type”. These are hard to read and could confuse a lay audience.

Examples:

  • application/x-javascript
  • text/html
  • application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
  • image/jpeg

Solution

  • Select a content type
  • use Find and Replace to create an understandable equivalent

Examples (from above):

  • Script
  • Page
  • Spreadsheet
  • Image

Tips:

  • Types marked “unknown” are often redirected items or dead links
  • use Filters to reduce rows of irrelevant content types

Summary

Let me know in the comments if you have any improvements or suggestions for working with CAT exports, or with content inventories and audits more generally. Likewise if you need help with any of the advice in this blog post.

Further Reading

About The Author

I blog about how to collaborate to design simple, usable and inclusive information experiences that make the lives of customers easier. Read more in the blog Categories and Tags.

Leave a Reply

Close