
As I have had quite a few complex Plone sites, none of the methods I found on the net would have worked for me. Some of my sites had a complex subsystem built with Plomino – a kind of CMS within the CMS – while others had a forum or a multilingual structure, so I had to fire up my beloved web automation tools and build my own scripts to export the Plone content in a structured format and import it into my WordPress sites. This article describes a unique method of exporting a Plone site with web scraping tools and importing it into a WordPress site – a process which could be adapted to practically any CMS and any type of website migration.

Why is this method so unique?

I am imitating a human being during the whole migration process: visiting every web page, its editing interface and the site’s management interface to extract the necessary information. Then, in the second step, the content is entered into the new site just as an ordinary web editor would do it: by pushing the appropriate buttons, typing in the text and filling in the other fields of a page/post edit screen. I am merely speeding things up by automating them with my web automation tools. All in all, instead of querying databases, executing SQL commands, filtering and normalizing data or using import/export add-ons, I am dealing with the ordinary web interfaces of both the old and the new site during the whole migration. The disadvantage of this approach is that it is way slower; the advantage is that you can do whatever you want.

Complexity of Plone and the export process

You might have read the logic behind how I exported the whole content of a Zwiki-based site, but exporting all kinds of content from a Plone site proved to be a much more difficult process. The main difference between a simple CMS – like Zwiki – and Plone is that the latter has various content types, both folder-like ones (content which can contain other content items) and document-like ones, and each of them has a bunch of special fields. Listing every piece of content was not that easy either, as a Plone site – unlike a wiki-based site – does not have an extensive Table of Contents-like page where every content item is listed.

Step #1: Scraping the old Plone site

a.) getting a list of every content item

First I thought that I would create a script which could scrape every piece of published content from the site without administrator access. In that case, the script would have gone to the advanced search form, checked all available content types and run a search without a search keyword, then walked through the result pages by pressing the next button over and over. But I had to realize that there is a setting for each content type which controls whether that type is searchable or not, so to get a full list of every content item you would first have to go to the settings page and make every necessary content type searchable. Then I found out that the advanced search was still omitting a couple of content items – the portal_catalog must have some bugs – and I still could not figure out why.

Finally, I decided to get the list of every content item’s URL directly: my script first visits the Zope Management Interface and creates a Python script with the following content:

# Zope through-the-web Python script: "print" collects its output into the
# built-in "printed" variable, which is then returned as the response body.
for i in context.portal_catalog.searchResults(Language=''):
    print i.getURL()
return printed

Then it runs the above script and saves its output – the full list of URLs – in a text file.

b.) getting all the attributes

First I thought it would be enough to open the edit page of every content item ( {-Variable.url-}/edit ), because every attribute could be extracted from there, but it turned out that, for instance, the Last Modified date is not listed there – or at least I could not find it – so the script first has to open the ZMI at the {-Variable.url-}/manage_metadata screen to get this information. The workflow status was also easier to get from the {-Variable.url-}/manage_workflowsTab page.

Then the script opens the {-Variable.url-}/edit page and scrapes basically every input, textarea and select field except the hidden ones, plus it gets the list of parents from the breadcrumb menu and computes the hierarchy level.
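My actual bot was built with desktop automation software, but here is a rough, hypothetical sketch of what this step boils down to in Python, using requests and BeautifulSoup with an already authenticated session; the breadcrumb element id is an assumption and may differ between Plone versions:

import requests
from bs4 import BeautifulSoup

def scrape_edit_form(url, session):
    """Return {field name: value} for every visible input/textarea/select on /edit."""
    soup = BeautifulSoup(session.get(url + "/edit").text, "html.parser")
    fields = {}
    for inp in soup.find_all("input"):
        if inp.get("type") in ("hidden", "submit", "button"):
            continue
        fields[inp.get("name", "")] = inp.get("value", "")
    for ta in soup.find_all("textarea"):
        fields[ta.get("name", "")] = ta.get_text()
    for sel in soup.find_all("select"):
        chosen = sel.find("option", selected=True)
        fields[sel.get("name", "")] = chosen.get("value", "") if chosen else ""
    # Parents from the breadcrumb menu; the id below is the stock Plone one
    crumbs = [a.get_text(strip=True) for a in soup.select("#portal-breadcrumbs a")]
    fields["_parents"] = "/".join(crumbs)
    fields["_level"] = str(len(crumbs))
    return fields

# session = requests.Session() that has been logged in beforehand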

In the next step, it opens the folder contents page of its parent ( {-Variable.url-}/../@@folder_contents?show_all=true ) to figure out the position of that content item within its folder, so that the menu order can be set later.

It also downloads the images stored as leadImages of News Items and similar content types, plus the image itself for the Image content type, into a folder structure mirroring the site’s original folders.

Finally, it adds a new line to the .csv file containing all the common attributes plus the fields that are specific to certain content types. Because of these type-specific fields, I always save them as {field name}::{field value} pairs separated by tab characters.
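To illustrate the format – the exact column order and file name here are made up for the example, not taken from my production script – each row could be assembled like this:

def format_row(common_cells, extra_fields):
    """Common attributes first, then the type-specific fields as name::value pairs."""
    cells = list(common_cells) + [f"{k}::{v}" for k, v in extra_fields.items()]
    # Tabs or newlines inside values would break the format, so flatten them
    cells = [str(c).replace("\t", " ").replace("\n", " ") for c in cells]
    return "\t".join(cells) + "\n"

with open("plone-export.csv", "a", encoding="utf-8") as log:
    log.write(format_row(
        ["2", "5", "http://old.example.org/about/team", "http://old.example.org/about", "published"],
        {"title": "Our team", "leadImage": "team.jpg"}))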

Step #2: Uploading content to the new site

a.) working with the scraped data

Once every content item has been scraped with the first script, it is time to fire up the second one, which processes the log file produced by the scraper. As the first columns contain the hierarchy level and the position within the folder, the whole list of items can be sorted so that the script starts by uploading the items located in the site root and then continues with the contained items. There is also a check in the script: if a parent item is not found, the item is skipped and logged, so that you can identify the content items which somehow (don’t ask me why) were not listed when the Plone catalog was queried in the very first step.
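A simplified sketch of that ordering and parent check could look like this in Python; the column indexes, file name and the create_wp_page helper are all illustrative assumptions, not my actual bot:

import csv

def create_wp_page(row, parent_id):
    """Placeholder for the real upload step, which drives the WordPress admin UI."""
    ...

with open("plone-export.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f, delimiter="\t"))

# Sort by hierarchy level, then by position inside the folder, so that
# parents are always uploaded before their children.
rows.sort(key=lambda r: (int(r[0]), int(r[1])))

uploaded = {"": "0"}          # old parent URL -> new WordPress page ID ("" = site root)
skipped = []
for row in rows:
    level, position, url, parent = row[0], row[1], row[2], row[3]
    if parent not in uploaded:
        skipped.append(url)   # parent never made it into the export: log it and move on
        continue
    uploaded[url] = create_wp_page(row, parent_id=uploaded[parent])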

b.) deciding what to do with different content types

For some content types, like Document and News Item, it is quite obvious what to do: you can map the Title, Description/Excerpt and Body text directly. With some ”folderish” content types like Folder and Ploneboard, the process is quite similar.

I decided not to create a separate Page or Post for Images. Where they are embedded in documents, I simply upload them to WordPress’ Media Library and insert them as you would normally do in WordPress. But where images were just listed in a thumbnail-based folder view, I add a WordPress Gallery manually to the parent page.

I also had a couple of Link items. I opted not to create a separate page/post for them either, but rather to list them on their parent page, that is, to add only a header, the link and a description to the Body text of the parent page. I did something similar with the Ploneboard forum conversations: while these were separate content items in Plone, I simply concatenated them into one page in WordPress.

Apart from these modifications, I decided to skip the Topic and Collage items: on the one hand, both serve for listing content on a single page, and on the other hand they represent quite a complex setup to reproduce. Perhaps you can create something similar with the AutoNav plugin, but it is far from obvious, so it is better to recreate them manually – if you need to do it at all.

I decided to use Pages for every kind of content, even for the News Item content type used in the Blog section of the old Plone site. I opted for this because pages can have parents and a menu (folder) order, similarly to Plone’s structure, but of course, depending on the logic of your old site, such items could be migrated as Posts as well.

c.) Logging what has been uploaded

Once a post/page has been created, it is important to get its ID and log it along with the attributes of the original content. Later on, you can use this information to replace internal links, and it is also important in case something goes wrong: you can then remove all the uploaded content by WordPress ID and restart the process.
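In its simplest form this log can be a two-column file mapping the old URL to the new WordPress ID – something like this hypothetical helper (the file name is an assumption):

def log_uploaded(old_url, wp_id, logfile="uploaded-ids.csv"):
    """Append one 'old Plone URL -> new WordPress ID' pair to the upload log."""
    with open(logfile, "a", encoding="utf-8") as f:
        f.write(f"{old_url}\t{wp_id}\n")

# e.g. log_uploaded("http://old.example.org/about/team", 123)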

d.) Debugging and error handling

One of the peculiar problems with this ”imitating manual data extraction / manual data entry” approach is rooted in the fact that the process takes quite a lot of time. Web pages fail to load more often than you would think – you normally just do not notice it, because you are not downloading hundreds of pages from a server. But in our case, if a page does not load, we can miss a piece of content – and believe me, this happens quite often. Therefore every script has to be written with this in mind.
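In practice this mostly means wrapping every page load in a retry. A rough sketch of the idea (timeouts and retry counts are arbitrary, and my real bots do this inside the automation tool rather than in Python):

import time
import requests

def fetch_with_retries(session, url, attempts=3, pause=10):
    """Load a page, retrying a few times before giving up on it."""
    for _ in range(attempts):
        try:
            resp = session.get(url, timeout=60)
            if resp.ok:
                return resp.text
        except requests.RequestException:
            pass
        time.sleep(pause)     # give the server a short break before retrying
    raise RuntimeError(f"Could not load {url} after {attempts} attempts")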

The first, scraper script, for instance, can be restarted when an error occurs, but most importantly it does not start the whole process over from the beginning: it continues where it stopped, finding out the next thing to do by analysing the previously written log files.
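The resume logic itself is simple: on startup the script re-reads its own output and skips whatever is already there. A rough Python equivalent, assuming the URL sits in the third column of the export file and the URL list from step 1a is in a plain text file (both file names are illustrative):

import csv
import pathlib

LOG = pathlib.Path("plone-export.csv")

def already_scraped():
    """URLs that already have a row in the output file from a previous run."""
    if not LOG.exists():
        return set()
    with LOG.open(newline="", encoding="utf-8") as f:
        return {row[2] for row in csv.reader(f, delimiter="\t") if len(row) > 2}

done = already_scraped()
for url in open("all-urls.txt", encoding="utf-8").read().split():
    if url in done:
        continue              # handled in a previous run, skip it
    # ... scrape this URL and append its row to plone-export.csv ...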

But when you start uploading the content you want to migrate, errors occur even more frequently. On the one hand, you will be surprised how many times your website fails to respond or fails to save your content – simply because you normally never request that many pages or save that many new pages in a row, so you never see your site dropping requests. On the other hand, only when you start creating the content on the new site will you see whether something went wrong during scraping, or notice that you forgot to scrape a necessary piece of information, and so on. For all these reasons, it is best to run the upload process in debug mode, so you can see exactly which step is happening at any moment, and if a page does not load for some reason, you can check on the site whether the page was in fact uploaded and only the confirmation failed to display, or whether you should restart the upload of that item.

In addition to that, I had to create a short script for the case when something went entirely wrong: it simply deletes every previously uploaded page (when there are already other content items on the new site, it is not always straightforward to find out which content should be deleted).

Step #3: Post-processing content

a.) updating internal links and image references

There are certain steps you can only perform once every content item has been uploaded to the new site: for instance, replacing the URLs of old internal links with the new WordPress ID-based links (which are better to use than permalinks, because if you later fancy restructuring the migrated site, you will not break any internal links).

Therefore the post-processing script opens every uploaded content item for editing, gathers all the URLs in href (and src) attributes, and replaces them with {-Variable.domainprefix-}{-Variable.domain-}/?p={-Variable.postid-} URLs. Things become tricky if relative URLs were originally used – even more tricky when you consider Zope’s interesting concept called acquisition – so it is better to translate relative URLs into absolute URLs before you try to find the WordPress ID belonging to the referenced content item in the log of uploaded content.
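As a hypothetical illustration of this pass – the mapping file name, the new domain and the simple attribute regex are assumptions, and my actual script does the replacement inside the WordPress editor – the rewriting could work like this:

import re
from urllib.parse import urljoin

def load_mapping(path="uploaded-ids.csv"):
    """old Plone URL -> new WordPress ID, as written by the upload script."""
    mapping = {}
    for line in open(path, encoding="utf-8"):
        old_url, wp_id = line.rstrip("\n").split("\t")
        mapping[old_url.rstrip("/")] = wp_id
    return mapping

def rewrite_links(html, page_old_url, mapping, new_base="https://new.example.org"):
    def repl(match):
        attr, target = match.group(1), match.group(2)
        absolute = urljoin(page_old_url, target).rstrip("/")   # relative -> absolute
        wp_id = mapping.get(absolute)
        if wp_id is None:
            return match.group(0)        # external or unknown target: leave it as is
        return f'{attr}="{new_base}/?p={wp_id}"'
    return re.sub(r'(href|src)="([^"]+)"', repl, html)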

When it comes to replacing the URL of an embedded image (that is, changing the reference from the old site to the new one), it is advisable to open the media item’s edit page – {-Variable.domain-}/wp-admin/post.php?post={-Variable.postid-}&action=edit – and figure out the file name under which the image was uploaded.

b.) handling private content items

There is one more thing to do if you happen to have content on the old site that is hidden from the public. Unfortunately, there is a WordPress bug which has not been fixed in the last eight years ( https://core.trac.wordpress.org/ticket/8592 ), and it prevented me from setting the private status right when uploading the content in the previous step. The problem is that if a parent page’s visibility is set to private, it will not show up in the dropdown list of parent pages on the edit screen, either when you open one of the private page’s siblings or when you create a new sibling page. Therefore editing a page whose parent is private will move that page to the root – and similarly, a freshly uploaded child page will be created in the site’s root.

In theory, you could install a plugin called Inclusive Parents, but in practice, your site will throw frequent errors when you resort to this kind of ”hacking a bug with a plugin” solution.

Summary

Translating the logic and the structure of a Plone website to a WordPress-based site is not that straightforward. Even if you do not have to deal with content types created by add-ons such as Plomino or Ploneboard, you might want to upload Links, Folders or News Items in a format other than a plain WordPress Post. Images are also handled very differently in the two content management systems, causing a couple of headaches too. This might be one big reason why there are no simple export/import add-ons between these two CMSes. Luckily, with the web automation approach, using software packages like ZennoPoster (or UBot Studio), you can build your own Plone to WordPress migration process. Should you need my scripts as a basis for that, don’t hesitate to drop me a line!


In this article I will show you how to migrate a site created with an old, nowadays deprecated content management system to a contemporary CMS using web automation tools – that is, grabbing the site’s content by walking through it with bots imitating human visitors, and uploading it to the new site in the same way, acting just like a human editor.

More than a decade ago I started to build a very successful website with a simple yet powerful Zope-based wiki engine called Zwiki. As both wikis (with the notable exception of Wikipedia) and Zope have been in decline for many years, I have not actively developed that site anymore, but as I did not want to lose its content, I decided to migrate it to WordPress.

Step #1: Scraping a Zwiki site

Getting structured content from a wiki

When moving a site, the first challenge is to download the website’s content in a structured format. As we are talking about a wiki-type site, there is a strong emphasis on the word structured: the basic philosophy of a wiki is to add content to one single page body field and to use certain formatting notations inside that one big content field to make the information look structured. Zwiki, for instance, has a handy feature which allows commenting and following others’ comments on any wiki page, but all the comments are added to the very same field where the actual page content is stored, so I had to find where each comment begins in the pages’ text and store the comments separately from the content.
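Conceptually, the splitting relies on the anchors Zwiki inserts before each comment. Here is only a minimal sketch of the idea in Python (the exact patterns I ended up using are visible in the code view listing further below):

import re

def split_page(html):
    """Cut the page body off from the comments appended after it."""
    # Zwiki precedes each comment with an <a name="msg..."> anchor;
    # the zero-width lookahead split needs Python 3.7+.
    parts = re.split(r'(?=<p><a[^>]+name="msg)', html)
    return parts[0], parts[1:]            # (body, list of raw comments)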

Dealing with the special content markup

Yet another challenge was that Zwiki, just like many of its counterparts, uses a specific, simpler-than-HTML markup which no contemporary content management system can interpret, so I could not rely on the content I would get by opening the pages for editing; instead, I had to scrape the public web pages, where the special markup has already been interpreted and translated into ordinary HTML.

Imitating human visitors with web automation tools

As I have experience with a few web scraping/web automation packages, my obvious choice was to scrape the Zwiki site as if a human visitor were clicking through each and every link on the page and downloading its content. This way you are not limited by the export/import formats a certain CMS offers when it comes to acquiring and uploading content: you can grab whatever part of the content you want, process it with whatever regular expressions you want, and log the results in any format. If a human visitor can walk through the entire site, you can grab all the information.

The logic behind the scraper script

Wiki-based content management systems tend to have a feature which greatly simplifies content scraping: they usually have a contents page where all the pages of the wiki are listed. Therefore it seemed to be a very easy task to get all the content I needed to move: just open the contents page, scrape all the links, then go through the list and visit, download and post-process each page. As output, I generated one .csv file where the page hierarchy – that is, all the parent pages – is logged, and another .csv file where the actual content of each page is logged together with a few pieces of key information such as the title, URL and last modified date. This last piece of information could be obtained by visiting each wiki page’s history sub-page and reading the dates of the previous changes listed there. A third file had every comment in a separate row, extracted from the page content with regular expressions. I also generated a file with the raw content for debugging purposes: it records the page content plus the comments in their original format, so that if something went wrong with the processing of the comments, the original source would be at hand.

Putting it all together with UBot Studio

As the whole process did not seem to be too difficult, I opted for UBot Studio for downloading and structuring the site’s content. It is marketed as an automation tool for internet marketers, and to be honest its main purpose was once to scrape and spam websites with link submissions, comments, etc. Nevertheless, it can be used for various web automation purposes, and one of its key features is that the bots I create can be compiled into an .exe, which can be run on any Windows computer without having to buy the software itself. I will not publish this executable, as I do not want anyone to play around with scraping Zwiki sites and putting unnecessary load on their servers, but feel free to contact me by commenting on this page or dropping me a mail (kedves /at/ oldalgazda /dot/ hu) if you need that .exe file to migrate your own Zwiki site.

Another interesting feature of UBot is that although its primary interface is a visual programming UI, you can switch to code view, where you can edit the script as if it were written in an ”ordinary” programming language. The Zwiki scraper script, for instance, looks like this in code view. If you have some patience, you can go through the script, understand what each step does, and see which regular expressions I used to structure the data:

 ui text box("Domain to scrape (without http(s)://):",#domain)
 allow javascript("No")
 navigate("{#domain}/FrontPage/contents","Wait")
 wait for browser event("Everything Loaded","")
 wait(5)
 set(#scraped,$scrape attribute(<class="formcontent">,"innerhtml"),"Global")
 add list to list(%pageurls,$find regular expression(#scraped,"(?<=href=\")[^\"]+"),"Delete","Global")
 loop($list total(%pageurls)) {
     set(#pageurl,$list item(%pageurls,0),"Global")
     navigate(#pageurl,"Wait")
     wait for browser event("Everything Loaded","")
     wait(5)
     set(#content,$scrape attribute(<class="content">,"innerhtml"),"Global")
     set(#content,$replace regular expression(#content,"<a\\ class=\"new\\ .+?(?=</a>)</a>","<!-- no wikipage yet -->"),"Global")
     set(#content,$replace(#content,$new line,$nothing),"Global")
     set(#content,$replace regular expression(#content,"\\t"," "),"Global")
     set(#contentonly,$replace regular expression(#content,"<p><div\\ class=\"subtopics\"><a\\ name=\"subtopics\">.+",$nothing),"Global")
     set(#contentonly,$replace regular expression(#contentonly,"<p><a name=\"comments\">.+",$nothing),"Global")
     set(#contentonly,$replace regular expression(#contentonly,"<a name=\"bottom\">.+",$nothing),"Global")
     add list to list(%parents,$scrape attribute(<class="outline expandable">,"innertext"),"Delete","Global")
     set(#parentlist,$list item(%parents,0),"Global")
     clear list(%parents)
     add list to list(%parents,$list from text(#parentlist,$new line),"Delete","Global")
     set(#parentlist,$replace(#parentlist,$new line,";"),"Global")
     set(#posttitle,$list item(%parents,$eval($subtract($list total(%parents),1))),"Global")
     set(#posttitle,$replace(#posttitle," ...",$nothing),"Global")
     if($comparison($list total(%parents),"> Greater than",1)) {
         then {
             set(#parent,$list item(%parents,$eval($subtract($list total(%parents),2))),"Global")
         }
         else {
             set(#parent,$nothing,"Global")
         }
     }
     append to file("{$special folder("Desktop")}\\{#domain}-page-hierarchy.csv","{#pageurl}    {#posttitle}    {#parent}    {#parentlist}    {$new line}","End")
     clear list(%parents)
     add list to list(%comments,$find regular expression(#content,"<p><a[^>]+name=\"msg.+?(?=<p><a[^>]+name=\"msg.+)"),"Delete","Global")
     loop($list total(%comments)) {
         set(#comment,$list item(%comments,0),"Global")
         set(#date,$find regular expression(#comment,"(?<=name=\"msg)[^@]+"),"Global")
         set(#title,$find regular expression(#comment,"(?<=<b>).+?(?=</b>\\ --)"),"Global")
         set(#title,$replace regular expression(#title,"<[^>]+>",$nothing),"Global")
         set(#author,$find regular expression(#comment,"(?<=</b>\\ --).+?(?=<a\\ href=\"{#pageurl})"),"Global")
         set(#author,$replace regular expression(#author,"<[^>]+>",$nothing),"Global")
         set(#author,$replace regular expression(#author,",\\ *$",$nothing),"Global")
         set(#comment,$find regular expression(#comment,"(?<=<br(|\\ /)>).+"),"Global")
         set(#comment,"<p>{#comment}","Global")
         set(#comment,$replace regular expression(#comment,"\\t"," "),"Global")
         append to file("{$special folder("Desktop")}\\{#domain}-page-comments.csv","    {#pageurl}    {#date}    {#title}    {#author}    {#comment}    {$new line}","End")
         remove from list(%comments,0)
     }
     navigate("{#pageurl}/history","Wait")
     wait for browser event("Everything Loaded","")
     wait(5)
     scrape table(<outerhtml=w"<table>*">,&edithistory)
     set(#lastedited,$table cell(&edithistory,0,4),"Global")
     clear table(&edithistory)
     append to file("{$special folder("Desktop")}\\{#domain}-page-content-raw.csv","{#pageurl}    {#lastedited}    {#content}    {$new line}","End")
     append to file("{$special folder("Desktop")}\\{#domain}-page-content-only.csv","{#pageurl}    {#posttitle}    {#lastedited}    {#contentonly}    {$new line}","End")
     remove from list(%pageurls,0)
 }

Step #2: Uploading the content to WordPress

Now that I had all the necessary data downloaded into .csv files in a structured format, I needed to create other scripts to upload the content to a WordPress site. Here I opted for the same technique of imitating a human editor: the bot hits the ”Create a new page” button each and every time and fills in all the edit fields with the data grabbed from the downloaded .csv files. More details about this part can be read here: From Plone to WordPress — Migrating a site with web automation tools
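For the sake of illustration, here is what the same ”act like a human editor” idea looks like as a rough Selenium sketch in Python – this is not my UBot script, and it assumes the classic editor with the Text tab active, the stock WordPress field ids and a made-up login and domain:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()

# Log in like a human editor would (stock WordPress login field ids)
driver.get("https://new.example.org/wp-login.php")
driver.find_element(By.ID, "user_login").send_keys("editor")
driver.find_element(By.ID, "user_pass").send_keys("secret")
driver.find_element(By.ID, "wp-submit").click()

# "Create a new page", fill in the fields, publish
driver.get("https://new.example.org/wp-admin/post-new.php?post_type=page")
driver.find_element(By.ID, "title").send_keys("Title from the scraped .csv")
driver.find_element(By.ID, "content").send_keys("<p>Body scraped from the wiki page</p>")
driver.find_element(By.ID, "publish").click()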

Having read my previous post about how I, an online marketer, found myself messing around with visual programming, you might have wondered what the everyday uses of such scripts are when dealing with ad campaigns and websites. Let me share some examples from the last few years of working with these web automation tools to illustrate this:

Overcoming limitations of AdWords: finding more manual display network placements

Have you ever wondered whether AdWords suggests all the relevant display network placements, YouTube videos or YouTube playlists when you try to add them by entering relevant keywords? The answer is that by relying only on the ad management interface of AdWords, you will miss a lot of relevant placements. Fortunately, with some web automation skills you can quickly build a script which finds even more relevant placements, for instance by executing site searches on YouTube for a list of keywords, automatically pressing the next buttons and generating a simple list of URLs based on what is displayed on the search result pages.

Analyzing your data the way you want: exporting external link data in a meaningful format

Although Google Webmaster Tools (a.k.a. Google Search Console) lets you browse through a huge list of web pages where a certain site of yours is linked, you cannot really export that data in a usable format, such as linking domain, linking web page and linked page in the same row. You could click through the list of linking domains, then the list of linking pages, and export a bunch of tables based on this hierarchy, but that is exactly the kind of repetitive task which can fairly easily be automated. By adding a few more steps, like scraping the title of the linking page plus the anchor text of the link, you can end up with a really informative list of your external links – at least of those which are reported by Google.

Analyzing your data the way you want: obtain raw engagement data

While Facebook shows you some insights about how your pages or posts are performing, you cannot simply grab the raw data behind these statistics, such as the number of visitors who liked or shared certain posts in a given timeframe. But you can build a script which automatically scrolls and scrolls and scrolls – and extracts any data about the posts displayed. Having all the data in a spreadsheet format, you can visualize it the way you want. As a bonus, you can even do this with your competitors’ Facebook pages.

Automating repetitive tasks: checking link building results

Way back when we were building tens of thousands of links on web directory sites, no link submission software could provide us with detailed and reliable data about which directories had accepted and published our link submissions and which had not. Without knowing how many links were eventually generated and where those links were located, we could not create detailed reports for our clients. Another big problem of directory link building was that at the time of submission you never knew where the submitted link would eventually be displayed in the directory, so the challenge was not only going through a list of URLs and checking whether our link was found on those pages, but looking through the entire directory to figure out where exactly that link was. All in all, this task was more complicated than running curl or wget on a list of URLs and grepping the results. Before I knew how to automate this process with visual scripting, we had to do this highly repetitive task by hand – so scripting saved us a lot of manual work.
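For illustration only, the ”look through the entire directory” part could be sketched as a small breadth-first crawl like the one below; the page limit, timeout and single-host restriction are simplifications, not the logic of the bot we actually used:

from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def find_our_link(directory_root, our_url, max_pages=500):
    """Crawl one directory site until a page containing our_url is found."""
    host = urlparse(directory_root).netloc
    seen, queue = set(), deque([directory_root])
    while queue and len(seen) < max_pages:
        page = queue.popleft()
        if page in seen:
            continue
        seen.add(page)
        try:
            html = requests.get(page, timeout=30).text
        except requests.RequestException:
            continue
        if our_url in html:
            return page                   # this directory page lists our link
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            nxt = urljoin(page, a["href"])
            if urlparse(nxt).netloc == host:
                queue.append(nxt)         # stay on the directory site
    return None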

Process spreadsheet data: check and merge what’s common in two tables

When you have to work with email lists and related data coming from different sources, you can quickly diff or merge two .csv files with Unix command line tools. But not everyone in your organization possesses the ”geeky” skills to fire up awk for that, and many times you are also too lazy to find the best solution on StackOverflow. In these cases, with automation software you can even create an .exe file with an easy-to-use interface which asks for the two files and a few more parameters, such as which column’s data should be matched against the other spreadsheet, and then merges the tables into unified rows based on those matches – or does whatever else you can achieve with regular expressions, if/then statements and loops.
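Stripped of the UI, the core of such a merge bot is tiny. A hypothetical Python sketch, assuming both files have a header row and share an email column (file names and column names are placeholders):

import csv

def merge_on_column(file_a, file_b, key_a="email", key_b="email"):
    """Return the rows of file_a extended with the matching row of file_b."""
    with open(file_b, newline="", encoding="utf-8") as f:
        lookup = {row[key_b]: row for row in csv.DictReader(f)}
    merged = []
    with open(file_a, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            match = lookup.get(row[key_a])
            if match:                     # keep only rows present in both files
                merged.append({**row, **match})
    return merged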

Extracting structured data from unstructured source: list of products in a website

Unfortunately, there are still many web shop owners running their sites on proprietary webshop management systems which are not prepared for simply exporting the list of products, or cannot export it in the appropriate format, with all the desired data, etc. In these cases it is very handy if you can quickly build a script which scrapes the entire webshop and outputs a spreadsheet of every product, containing all the important attributes and product data. Based on the result, you can start working on either the on-site SEO or on importing those lists into Google AdWords or Facebook Ads.

Migrating web sites: exporting and importing from/to any CMS

There are quite a few ways of importing data into WordPress, but you might still miss some features which are normally only available if you upload the content manually, such as attaching images to a certain post or setting the featured image. Not to mention that, before that, you have to get to the point of having the data extracted from the old website into a structured format such as XML or CSV. As many older and proprietary content management systems do not have such data exporting features, this part of the job can also be quite complicated, if not impossible. With some web automation skills, on the other hand, you can extract any data in any format from the original site and imitate a human being filling in the corresponding fields, simply by automating the new site’s administration interface – so you do not have to rely on any export/import plugin’s features and shortcomings.

Web spamming: black hat SEO, fake Facebook accounts…

The tools I am using for automating the above tasks were originally meant for creating accounts and posting content to a wide range of sites – thus spamming the entire web – but this is something I have never used them for, believe it or not 🙂

Coming next: