scripts – József Jároli

From Plone to WordPress – Migrating a site with web automation tools

As I have had quite a few complex Plone sites, none of the proposed methods I found on the net would have been useful for me. Some of my sites had a complex subsystem built with Plomino – a kind of CMS in a CMS – while other sites had a forum or a multilingual structure. Therefore I had to resort to firing up my beloved web automation tools and building my own scripts to export Plone content in a structured format and import it into my WordPress sites. This article describes a unique method of exporting a Plone site with web scraping tools and importing it into a WordPress site – a process which could be adapted to practically any kind of CMS and any type of website migration.

Why is this method so unique?

I am imitating a human being during the whole migration process, that is, visiting every web page or its editing interface and the site’s management interface to extract the necessary information. Then, in the second step, the content is entered into the new site just like an ordinary web editor would do it: by pushing the appropriate buttons, typing in the text and filling in other fields of a page/post edit screen. I am just speeding up the whole thing by automating it with my web automation tools. All in all, instead of querying databases, executing SQL commands, filtering and normalizing data, or using import/export add-ons, I am dealing with the ordinary web interfaces of both the new and the old site during the whole migration process. The disadvantage of this process is that it is way slower, but its advantage is that you can do whatever a human editor could do by hand.

Complexity of Plone and the export process

You might have read the logic behind how I had exported the whole content of a Zwiki-based site, but exporting all kinds of content from a Plone site turned out to be a much more difficult process. The main difference between a simple CMS – like Zwiki – and Plone is that the latter has various content types – both folder-like content types (that is, content which can contain other content elements) and document-like content types – and each of them has a bunch of special fields. Listing every piece of content was not that easy either, as a Plone site does not have an extensive Table of Contents-like page where every content item is listed, unlike the wiki-based sites.

Step #1: Scraping the old Plone site

a.) getting a list of every content item

First I thought that I would create a script which could scrape every published content item from the site without administrator access. In that case my script would have gone to the advanced search form, checked all available content types, run a search without a keyword, and then gone through the result pages by pressing the next button again and again. But I had to realize that there is a setting for each content type which controls whether that content type is searchable or not. Therefore, to get a full list of every content item, you should go to the settings page first and make every necessary content type searchable. Then I found out that the portal_catalog must have some bugs, because the advanced search was omitting a couple of content items – and I still could not figure out why.

Finally, I decided to get the list of every content item’s URL straight from the catalog: my script first visits the Zope Management Interface and creates a Python script with the following content:

# Runs as a Zope "Script (Python)"; `printed` holds everything printed so far
for brain in context.portal_catalog.searchResults(Language=''):
    print brain.getURL()
return printed

Then it runs the above script and saves the output of the whole list of URLs in a text file.

b.) getting all the attributes

First I thought that it would be enough to open the edit page of every content item ( {-Variable.url-}/edit ), because every attribute could be extracted from there, but it turned out that for instance the Last Modified date is not listed there, or at least I could not find it, so first I had to open up the ZMI at the {-Variable.url-}/manage_metadata screen to figure out this information. Also, the workflow status was easier to get from the {-Variable.url-}/manage_workflowsTab page.

Then the script opens the {-Variable.url-}/edit page and scrapes basically every input, textarea and select field except the hidden ones; it also gets the list of parents from the breadcrumb menu and computes the hierarchy level.
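
To illustrate the field scraping step, here is a minimal Python sketch. The original scripts were built with visual web automation tools, so this is only a stand-in under stated assumptions: the edit page HTML is already downloaded, and the class and function names (EditFormFields, scrape_fields) are hypothetical.

```python
from html.parser import HTMLParser


class EditFormFields(HTMLParser):
    """Collect name/value pairs of visible input, textarea and select fields."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._textarea = None  # name of the textarea currently being read
        self._select = None    # name of the select currently being read

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        name = a.get("name")
        if tag == "input" and name and a.get("type", "text") != "hidden":
            # Unchecked checkboxes and radio buttons carry no submitted value
            if a.get("type") in ("checkbox", "radio") and "checked" not in a:
                return
            self.fields[name] = a.get("value") or ""
        elif tag == "textarea" and name:
            self._textarea = name
            self.fields[name] = ""
        elif tag == "select" and name:
            self._select = name
        elif tag == "option" and self._select and "selected" in a:
            self.fields[self._select] = a.get("value") or ""

    def handle_data(self, data):
        if self._textarea:
            self.fields[self._textarea] += data

    def handle_endtag(self, tag):
        if tag == "textarea":
            self._textarea = None
        elif tag == "select":
            self._select = None


def scrape_fields(html):
    """Return a dict of field name -> value for one edit page (hypothetical helper)."""
    parser = EditFormFields()
    parser.feed(html)
    return parser.fields
```

The breadcrumb parsing and the hierarchy level are left out here, as they depend on the markup of the site's theme.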

In the next step, it opens the folder content page of its parent ( {-Variable.url-}/../@@folder_contents?show_all=true ) to figure out the position of that content item in its folder, so that later the menu order could be set.

It also downloads images stored as leadImages for news items and similar content types, plus the image itself for the image content type, into a folder structure mirroring the site’s original folders.

Finally, it adds a new line to the .csv file containing all the common attributes plus the ones that are specific to certain content types. Because of these type-specific fields, I always save each line in the following format: {field name}::{field value} pairs separated by tab characters.
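
The line format can be sketched in Python as follows. How the original script treated tabs and line breaks inside field values is not stated; replacing them with spaces is an assumption of this sketch, and to_line / from_line are hypothetical names.

```python
def to_line(fields):
    """Serialise one content item as tab-separated name::value pairs."""
    def clean(value):
        # Assumption: tabs and line breaks inside a value are flattened to
        # spaces so that every item stays on exactly one line
        return str(value).replace("\t", " ").replace("\r", " ").replace("\n", " ")
    return "\t".join(f"{name}::{clean(value)}" for name, value in fields.items())


def from_line(line):
    """Parse a line written by to_line back into a dict of strings."""
    cells = (cell for cell in line.rstrip("\n").split("\t") if cell)
    return dict(cell.split("::", 1) for cell in cells)
```

Because every cell names its own field, lines for different content types can carry different sets of fields in the same file.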

Step #2: Uploading content to the new site

a.) working with the scraped data

Once every content item is scraped with the first script, it’s time to fire up the second one. It will process the log file produced by the scraper script. As the first column contains the hierarchy level and the position in the folder, the whole list of items can be sorted so that the script will start uploading the items located in the site root, and continue with the contained items. There is also a check in the script: if an item’s parent is not found, the item is skipped and logged, so that you can identify those content items which were somehow not listed (don’t ask me why) in the very first step by querying the Plone database.
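
The ordering and the orphan check could look roughly like this in Python. The keys url, parent, level and position are assumed names for the scraped fields, and an empty parent stands for the site root.

```python
def upload_order(items):
    """Return (ordered, skipped) so that parents always precede their children."""
    ordered, skipped = [], []
    known = {""}  # assumption: "" marks the site root as parent
    for item in sorted(items, key=lambda i: (i["level"], i["position"])):
        if item["parent"] in known:
            ordered.append(item)
            known.add(item["url"])
        else:
            # The parent was never scraped: skip the item and keep it for the log
            skipped.append(item)
    return ordered, skipped
```

Sorting by level first guarantees that a folder is created before anything it contains, and children of a skipped item end up skipped as well.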

b.) decide what to do with different content types

For some content types, like the Document and News Item, it is quite obvious what to do: you can pair the Title, Description/Excerpt and Body text. With some “folderish” content types like the Folder and Ploneboard, the process can be quite similar.

I decided that I would not create a separate Page or Post for Images. Where they are embedded in documents, I will just upload them to WordPress’ Media Library and insert them as you would normally do with WordPress. But where images were just listed with a thumbnail-based folder view, there I would manually add a WordPress Gallery item to the parent page.

I had a couple of Link items. There I also opted for not creating a separate page/post for them, but rather listing these link items on their parent page, that is, only adding a header, link and a description to the Body text of parent pages. I also did something similar with the Ploneboard forum conversations: while these were separate content items in Plone, I just concatenated them to one page in WordPress.

Apart from these modifications, I decided to skip the Topic and Collage items. On the one hand, both only serve to list other content on one page; on the other hand, they are quite difficult to reproduce. Perhaps you can create something similar with the AutoNav plugin, but it is far from obvious, so it is better to do it manually – if you need to do it at all.

I decided to use Pages for every kind of content, even for the News Item content types used in the Blog section of the old Plone site. I opted for this because pages can have parents and menu (folder) order, similarly to Plone’s structure, but of course, based on the logic of your old site, they could be translated as Posts as well.

c.) Logging what has been uploaded

Once the post/page has been created, it is important to get its ID and log it along with the attributes of the original content. Later on, you can use this information if you want to replace internal links, and it is also important in case something goes wrong: then you can remove all the uploaded content by their WordPress IDs, and restart the process.

d.) Debugging and error handling

One of the peculiar problems with this “imitating manual data extraction / manual data entry” approach is rooted in the fact that the process takes quite a lot of time. I guess web pages fail to load more often than we think; you just do not notice it, as you are usually not downloading hundreds of pages from a server. But in our case, if a page is not loaded, then we can miss a certain piece of content – and believe me, this happens quite often. Therefore every script has to be written with this in mind.

The first script, the scraper, can for instance be restarted when an error occurs. Most importantly, it won’t start the whole process from the beginning: it continues where it stopped, finding out the next thing to do by analysing the previously written log files.
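
A resumable scraper of this kind can be sketched in Python as below. It assumes that each log line contains a url::… cell in the name::value format and that the full URL list is available; remaining_urls is a hypothetical name.

```python
from pathlib import Path


def remaining_urls(all_urls, log_path):
    """Return the URLs that do not yet appear in the scraper's log file."""
    log = Path(log_path)
    done = set()
    if log.exists():
        for line in log.read_text(encoding="utf-8").splitlines():
            for cell in line.split("\t"):
                # Assumption: the item's address is stored as url::<address>
                if cell.startswith("url::"):
                    done.add(cell[len("url::"):])
    return [url for url in all_urls if url not in done]
```

Calling this at every start makes a restart cheap: a first run gets the full list, a later run only what is still missing.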

But when you start uploading the stuff you want to migrate, errors might occur even more frequently. On the one hand, you will be surprised how many times your website fails to respond or fails to save your content; normally you never request or save so many pages in a row that you would see your site dropping requests. On the other hand, only when you start creating the content on the new site will you see whether something went wrong while scraping, or whether you forgot to scrape a necessary piece of information. For all these reasons, the best thing is to run the upload process in debug mode, so you can see which step is happening right now. If a page does not load for some reason, you can go to the site and check whether the page has been uploaded and just the confirmation was not displayed, or whether you should restart the upload of that item.

In addition to that, I had to create a short script for the case when something went entirely wrong: it would just delete every previously uploaded page (when there are already other content items on the new site, it is not always straightforward to find out which content should be deleted).

Step #3: postprocessing content

a.) updating internal links and image references

There are certain steps you can accomplish only when every content item is already uploaded to the new site: for instance, replacing the URLs of old internal links with the new WordPress ID-based links (which are better to use than permalinks, because if you later fancy restructuring the migrated site, you will not break any internal link).

Therefore the post-processing script will open every uploaded content item for editing, gather all the URLs in href (and src) attributes, and replace them with the {-Variable.domainprefix-}{-Variable.domain-}/?p={-Variable.postid-} URLs. Things become tricky if relative URLs were used originally – even more tricky when you consider Zope’s interesting concept called acquisition – so it’s better to translate relative URLs to absolute URLs before you attempt to find the WordPress ID belonging to the referenced content item in the log of the uploaded content.
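
A simplified Python sketch of this replacement step follows. It assumes a dictionary built from the upload log that maps old absolute URLs to WordPress IDs; rewrite_links and its parameters are hypothetical names, and the Zope-specific URL resolution mentioned above is not handled.

```python
import re
from urllib.parse import urljoin


def rewrite_links(html, page_url, id_by_old_url, new_site):
    """Point href/src values at ?p=<ID> when the target was migrated."""
    def swap(match):
        attr, quote, target = match.group(1), match.group(2), match.group(3)
        # Resolve relative URLs against the old address of the page itself
        absolute = urljoin(page_url, target).rstrip("/")
        post_id = id_by_old_url.get(absolute)
        if post_id is None:
            return match.group(0)  # external or unknown target: leave untouched
        return f"{attr}={quote}{new_site}/?p={post_id}{quote}"

    return re.sub(r'\b(href|src)=(["\'])(.*?)\2', swap, html)
```

A regular expression is enough for a sketch; for messy real-world markup a proper HTML parser would be the safer choice.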

When it comes to replacing the URL of an embedded image (that is, changing the reference from the old site to the new one), it is advisable to open the media item’s edit page, {-Variable.domain-}/wp-admin/post.php?post={-Variable.postid-}&action=edit, and figure out the image’s file name as it was uploaded.

b.) handling private content items

There is also one more thing to do if you happen to have some content on the old site hidden from the public. Unfortunately, there is a WordPress bug, which has not been fixed during the last eight years: https://core.trac.wordpress.org/ticket/8592 . That prevents me from setting private status right when I upload the content in the previous step. The problem is rooted in the fact that if a parent page’s visibility is set to private, it will not show up in the dropdown list of parent pages on the edit screen either when you open one of the private page’s siblings, or if you plan to create a new sibling page. Therefore editing a page with a parent page having a private status will change the page’s parent to root – and similarly, your recently uploaded page will be created in the site’s root.

In theory, you could install a plugin called Inclusive Parents, but in practice, your site will throw frequent errors when you resort to this kind of “hacking a bug with a plugin” solution.

Summary

Translating the logic and the structure of a Plone website to a WordPress-based site is not that straightforward. Even if you don’t have to deal with specific content types created by plugins such as Plomino or Ploneboard, you might want to upload Links, Folders and News Items in a specific format other than a WordPress Post item. Also, Images are handled very differently in the two content management systems, causing a couple of headaches too. This might be one big reason why there are no simple export/import add-ons between these two CMSes. Luckily enough, with the web automation approach, using software packages like ZennoPoster (or Ubot Studio), you can build your own Plone to WordPress migration process. Should you need my scripts as a basis for that, don’t hesitate to drop me a line!

Free link checker script for directory link building

Six years ago, when we were heavily into mass link building and link directory submissions, this script literally substituted for a half-time employee, as it helped us to automatically check the presence of our links on various link directory sites. There are a couple of link checkers covering the use cases where you know the specific URLs to check, or where an entire site is to be crawled in search of a link, but my directory link checker script does something different. You just have to enter the list of link directory domains where your links have been submitted, and the script tells you whether those links have already been approved or not. This proved to be a very handy script for creating link building reports for our clients.

My very first web automation script

It took more than a week to write this script, and when it was finally ready, it took another week to rewrite the whole thing from the ground up. Back then the challenge was that in many link directories you would never know the URL of the subpage where your submitted link showed up once it had been approved. The only thing you could do was to check the mailboxes used for link building and go through the emails about submission approvals, plus open the sites and check whether the links were already online – all of this by hand. Obviously, gathering the list of successful submissions meant a lot of manual labour.

(The link submission process itself was already highly automated, as we had been using a semi-automated link directory submission software, which offered a good compromise between speed and accuracy: before submitting the links, we could choose the best category by hand, or enter appropriate data in a couple of directory-specific submission form fields.)

The logic of the link checking process

Using search engines to find the links

As the above, simplified diagram outlines, the script first scrapes the search engine search results restricted to the link directory domain, using search expressions like:

promoteddomain.com site:linkdirectorydomain.com

It crawls all the URLs listed as results for these queries, and if a link is found there, it logs the data of the successful link submission, loads the next link directory domain name and starts the process again.

Searching with the site search function of link directories

But if it fails to find traces of the submitted links with a web search engine (for instance Google or Bing), then it attempts to make a site search on the link directory page itself. As some directories are already linking the promoted domain from the search result page, the script might already succeed here. But if not, it starts to loop through the subpages listed as search results, loading each of these pages and checking the presence of the outgoing links pointing to the promoted domains.

Figuring out which internal links are search results

I could have created a database for each link directory, specifying the parameters necessary for doing a site search and evaluating it, such as the search query URL schema and the regular expression that identifies which internal URLs are search results on a site search result page. In that case the script could quickly and efficiently check the presence of links in the already known link directories. The problem was that we had been constantly adding a lot of new directories to our database, and as we had been doing link building in many different languages, it would have meant a lot of extra work to create and maintain such a database of parameters for thousands of link directories.

Therefore, I opted for a slower, brute-force solution which meant at least one extra query: searching for a nonsense, random string (such as the date and time), which ensured that no search results were displayed for it. Comparing the internal links of this empty result page with those of the normal search result page, the difference shows which internal links are search results and which are navigational links, e.g. header or footer links of the page template.
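
The comparison boils down to a set difference. Here is a minimal Python sketch, assuming the HTML of both result pages has already been fetched; the names LinkCollector, internal_links and result_links are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the absolute URLs of all <a href> links on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.links.add(urljoin(self.base_url, href))


def internal_links(html, base_url):
    """Return the links that stay on the same host as base_url."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    return {url for url in collector.links if urlparse(url).netloc == host}


def result_links(results_html, empty_html, base_url):
    """Links present on the real result page but not on the nonsense-query page."""
    return internal_links(results_html, base_url) - internal_links(empty_html, base_url)
```

Whatever survives the subtraction is a candidate subpage to load and check for an outgoing link to the promoted domain.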

The detailed link checking process

The below process seems to be a lot more complicated than the previous one, but there are good reasons for it. Mostly because I wanted to include certain modules only once, and re-use them as much as possible so that later on I could easily add improvements to the process. Interacting with many different kinds of websites is always a tricky issue, something you have to constantly refine. For instance, your regular expression does a perfect job on hundreds of sites but eventually fails on a specific web page.

Identifying search forms

There is a second trick apart from figuring out the search result links: how to find out which form is a search form, and what should be done to submit the search query. Here again, I opted for the brute-force solution: the script enumerates all the forms on the link directory home page and tries to submit them using search queries like promoteddomainname.com or {specific keyword combinations used in link submission texts}. If submitting the first form yields no results (perhaps just because the submitted link is not yet listed in the directory), it attempts to submit the second form, even if it is a login form.

Attempting to submit forms

Similarly, if the submit button cannot be easily found, the script attempts to submit the form either by clicking on other elements which are likely to submit the search form, or by emulating a keystroke of the Enter key while the cursor is in the search field, etc.

Loops

As you can see, there are a lot of loops in this process, given the brute-force nature of the script. First, it loads the link directory to work with, then it loops through the search keywords you use to find traces of your submitted links. There should be at least two expressions: the promoted domain name itself and a specific brand or a very specific keyword combination you have included in every link submission text you have spun, something to take into consideration as early as setting up your link submission project.

Controls

There are subsequent steps like home, search, list (not always a meaningful nomenclature, but never mind). In every step, a different module is called. If the module exits with success, the process continues on the OK branch; if not, on the NO branch. Depending on the value of the step variable, these two branches point to different modules; that is, if the script could not succeed with one search method, it goes on to try another one.
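
The control flow can be pictured as a small state machine. The Python sketch below is only an analogy to the visual script: the step names come from the text, but the transition table, the end states found and give_up, and run_steps itself are assumptions made for illustration.

```python
def run_steps(modules, transitions, start="home"):
    """Call one module per step and follow its OK or NO branch until an end state."""
    step, trail = start, []
    while step not in ("found", "give_up"):
        trail.append(step)
        branch = "OK" if modules[step]() else "NO"
        step = transitions[step][branch]
    return step, trail


# Hypothetical wiring: where each step leads on success (OK) and on failure (NO)
TRANSITIONS = {
    "home": {"OK": "search", "NO": "give_up"},
    "search": {"OK": "found", "NO": "list"},
    "list": {"OK": "found", "NO": "give_up"},
}
```

Keeping the branches in one table means a new fallback method only needs a new row, not changes to the modules themselves.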

Your very own free link checker script

The script has been created with ZennoPoster, a powerful web automation tool, so first, you have to get and install this wonderful piece of software. Then click here to download the script.

Feel free to use, adapt and re-share it under the following terms: https://creativecommons.org/licenses/by-sa/4.0/ and please point a link to this page or to www.jaroli.hu.

Notes

If you already know ZennoPoster and/or find some of the solutions quite odd: well, the script was originally written with ZennoPoster 3 way back in 2012. Soon afterwards an entirely rewritten ZennoPoster was brought to the market, with a lot of new concepts and many advancements in debugging, so I had to rewrite the entire script. I wanted to keep the changes to the minimum, like using lists instead of files or using switch boxes instead of a series of if boxes, as you can observe in the screen capture below, which shows the imported Zenno3 script along with the recently rewritten one.

Systematic Job Search on LinkedIn

LinkedIn is a really cool platform when it comes to job search, or looking for the next step in your career, whatever. But what can you do when the site lacks a couple of vital features, and therefore just wastes your time unnecessarily? Plus, how can you make sure that you have seen all the potentially interesting job offers when there are a couple of hundred positions available in your region?

Reading through hundreds of job offers?

Yes, I might be in a unique situation, where many circumstances just do not matter that much, so perhaps I keep more on my radar and browse through more adverts than an ordinary or casual job hunter. And this is where my problem is rooted: it is just too cumbersome to regularly check new job adverts on LinkedIn. Imagine that you have to enter many keywords one by one, then enter the location, set the desired distance radius, sort the results by date, start scrolling through the list, and click each job offer which might seem interesting, judged by the excerpt on the search result page. Usually, you end up checking many ads you have already seen, again and again.

Save button for Job offers is just not enough

The whole process could be much easier with a simple feature like a button next to the Save/Unsave button with the text “That’s not for me” or simply “Hide”. Luckily, I can quickly write a script with a web automation software to implement some of the missing functions of LinkedIn, or any other web site. Although the vast majority of ZennoPoster or Ubot Studio users are using these pieces of software to build bots which scrape a considerable amount of information from LinkedIn (something that LinkedIn hates and tries to prevent as much as possible), you can use these tools for legitimate purposes too: to hack together something which provides you with the missing features.

It will not work with Ubot Studio alone

Ubot Studio has a nice feature I needed. It allows you to combine a web browser window with an additional user interface for data input, therefore I had initially started to implement this simple script in Ubot. Unfortunately, I again ran into some very basic obstacles which prevented me from building anything usable. <rant begins> To be honest, Ubot is just the least stable piece of software I have ever seen. If you take its price into consideration too, I think Ubot could be nominated for the title of the most time and money wasting application ever. I have already wasted so much time because of its frequent crashes, inability to deal with certain types of websites, unpredictable behavior, etc., that I always regret it when I pay for an upgrade again. As I mentioned, there are only two features for which I still keep on struggling with this tool: the additional user interface and the ability to easily compile standalone .exe files for the bots. All in all, when I logged in to LinkedIn, Ubot just could not detect anything on the loaded web page, and tech support had no solution for that either, which is a shame and gives me the feeling that something is just screwed up with this tool from the ground up. <rant ends>

Let’s write two scripts then!

This is why I fired up ZennoPoster, and put together a simple script without any problems. It logs in to my LinkedIn profile, sets the location and enters the keywords for job search, like online marketing, digital marketing, social media, google, facebook, adwords, seo, hongarije, hongaars, archicad, spanish, hungarian, hungary, marketing, marketeer, etc. Then it goes through the search results list by clicking on the pagination links, and scrapes all the job advert links to a plain text list, ensuring that the link is not already listed among those which have been checked previously. It also checks for the presence of the keywords on the exclusion list, such as recruit, stage, stagiair, intern, php(\ |-), javascript, frontend, backend, front-end, back-end, webdeveloper, \.net\, etc., and omits these job offers, which are obviously not made for me.
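
The filtering part can be sketched in a few lines of Python. This is an illustration with assumptions: only a subset of the exclusion keywords is used, the pattern is matched against the advert title (the text does not say which part of the advert was checked), and new_candidates is a hypothetical name.

```python
import re

# Subset of the exclusion list, written as one case-insensitive pattern
EXCLUDE = re.compile(
    r"recruit|stagiair|intern|php( |-)|javascript|front-?end|back-?end",
    re.IGNORECASE,
)


def new_candidates(offers, already_checked):
    """Return URLs of offers that are neither seen before nor excluded.

    offers is an iterable of (url, title) pairs.
    """
    seen = set(already_checked)
    keep = []
    for url, title in offers:
        if url in seen or EXCLUDE.search(title):
            continue
        seen.add(url)  # also drops duplicates within the same run
        keep.append(url)
    return keep
```

Note that a bare pattern like intern also matches words such as international, so a real exclusion list needs some care with word boundaries.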

Step #2 – back to Ubot Studio!

Now I have a URL list of all the job offers which have a certain chance of being promising for me. The next step is to read through them as quickly as possible, as we are talking about hundreds of job offers. Sometimes there are many false positives: for instance, if a recruiter company includes phrases like “visit our Facebook page” in each of their job descriptions, then the fact that I am looking for Facebook-related jobs with the “facebook” keyword just makes everything much more difficult. In addition to that, I could find a couple of interesting job offers which did not match any of the specific keywords like “online marketing”, only the very generic ones like “marketing”.

I have already had similar, very simple bots written in Ubot for going through a long list of URLs to add feedback to every loaded web page and save back the URLs with the manually added data. (Think about quickly tagging or writing titles for hundreds of products in a web shop while understanding what those products are really about.) So I just quickly modified an older script. It loads one job offer page, waits until I click the appropriate check box, either “Interesting” or “Delete”, and then automatically loads the next web page while administering the process: it adds the URL to the list of URLs already checked, plus to the list of interesting job adverts if I clicked the corresponding box. With this method I could go through hundreds of job offers so quickly that at one point LinkedIn thought I was a bot (as I was logged in, since I also wanted to save some of the interesting jobs), and practically did not let me do anything without solving those silly captchas about cars, roads and road signs again, so I had to give up using LinkedIn for a day or so.

For someone who cannot write web automation scripts that quickly, it might not have been worth setting up an automation hack for this case, but for me, I think it was worth it. On the one hand, I guess the fact that I don’t speak Dutch (yet) will make my job hunt a little bit longer than average, so it will save me a lot of time. On the other hand, by reading through so many job descriptions, I now have a better understanding of what kind of jobs are available in The Hague area.

How can simple scripts help online marketers?

Having read my previous post about how I found myself messing around with visual programming as an online marketer, you might have wondered: and what would be the everyday uses of those scripts when dealing with ad campaigns and web sites? Well, let me share some examples from the last few years I was working with these web automation tools to illustrate this:

Overcoming limitations of AdWords: finding more manual display network placements

Have you ever wondered whether AdWords suggests all the relevant display network placements, YouTube videos or YouTube playlists when you try to add them by entering relevant keywords? Well, the answer is that relying only on the ad management interface of AdWords, you will miss a lot of relevant placements. Fortunately, with some web automation skills, you can quickly build a script which finds even more relevant placements, for instance by executing site searches on YouTube for a certain list of keywords, automatically pressing the next-next buttons and generating a simple list of URLs based on what has been displayed on the search result pages.

Analyzing your data the way you want: exporting external link data in a meaningful format

Although Google Webmaster Tools (a.k.a. Google Search Console) lets you browse through a huge list of web pages where a certain site of yours is linked, you cannot really export that data in a usable format, such as linking domain, linking web page, linked page in the same row. Although you could click through the list of linking domains, then the list of linking pages and export a bunch of tables based on this hierarchy, this sounds like a kind of repetitive task which can be fairly easily automated. Adding a few more steps like scraping the title of the linking page plus the anchor of the link, you can end up having a really informative list of your external links – at least of those which are displayed by Google.

Analyzing your data the way you want: obtain raw engagement data

While Facebook shows you some insights about how your pages or posts are performing, you cannot simply grab the raw data of these statistics, such as the number of visitors who liked or shared certain posts in a given timeframe. But you can build a script which automatically scrolls and scrolls and scrolls – and extracts any data about the posts displayed. Having all the data in a spreadsheet format, you can visualize it the way you want. As a bonus: you can even do this with your competitors’ Facebook pages.

Automating repetitive tasks: checking link building results

Way back when we were building tens of thousands of links on web directory sites, no link submission software could provide us with detailed and reliable data about which directories had accepted and published our link submissions and which had not. Without knowing how many links were eventually generated and where those links were located, we could not create detailed reports for our clients. On the other hand, the biggest problem of directory link building was that you never knew at the time of submission where the submitted link would be displayed in the directory, so the challenge was not only going through a list of URLs and seeing whether our link was found on those pages or not: you had to look through the entire directory to figure out where exactly that link was. All in all, this task was more complicated than running curl or wget on a list of URLs and grepping the results. Before I knew how to automate this process with visual scripting, we had to do this highly repetitive task by hand – so scripting could save us a lot of manual work.

Process spreadsheet data: check and merge what’s common in two tables

When you have to work with email lists and related data coming from different sources, you could quickly diff or merge two .csv files with Unix command line tools. But not everyone in your organization possesses the “geeky” skills to fire up awk for that, and many times you are also too lazy to find the best solution on StackOverflow. In these cases, with automation software, you can even create an .exe file with an easy to use interface which asks for two files and a few more parameters, such as which column’s data should be matched in the other spreadsheet, and then merges the tables into unified rows based on those matches – or does whatever you can achieve with regular expressions, if / then statements and loops.
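
For comparison, the core of such a merge is only a few lines in a scripting language. The Python sketch below keeps the rows whose key column value appears in both tables; merge_on and the column names in the usage example are hypothetical.

```python
import csv
import io


def merge_on(left_csv, right_csv, key):
    """Join two CSV texts on a shared column, keeping only matching rows."""
    right = {row[key]: row for row in csv.DictReader(io.StringIO(right_csv))}
    merged = []
    for row in csv.DictReader(io.StringIO(left_csv)):
        match = right.get(row[key])
        if match:
            # Columns of the right table are appended to the left row
            merged.append({**row, **match})
    return merged
```

With a left table of email,name and a right table of email,city, calling merge_on(left, right, "email") yields rows containing email, name and city for the addresses found in both.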

Extracting structured data from unstructured source: list of products in a website

Unfortunately, there are still many web shop owners who run their sites on proprietary web shop management systems which cannot simply export the list of products, or not in the appropriate format, not with all the desired data, etc. In these cases, it is very handy if you can quickly build a script which scrapes the entire web shop and outputs a spreadsheet of every product, containing all the important attributes and product data. Based on the result, you can start working on either the on-site SEO or importing those lists to Google AdWords or Facebook Ads.

Migrating web sites: exporting and importing from/to any CMS

There are quite a few ways of importing data into WordPress, but still, you might miss some features which can normally be accessed only if you upload the content manually, such as attaching images to a certain post or setting the featured images. Not to mention that before that, you’ll have to get to the point of having the data extracted from the old web site in a structured format such as XML or CSV. As many older CMSes and proprietary content management systems do not have such data exporting features, this part of the job could also be quite complicated, if not impossible. On the other hand, with some web automation skills you can extract any data in any format from the original site and imitate a human being filling out the corresponding data by simply automating the new site’s administration interface – you don’t have to rely on any export-import plugin’s features – and shortcomings.

Web spamming: black hat SEO, fake Facebook accounts…

The tools I’m using for automating the above tasks are originally meant for creating accounts and posting content to a wide range of sites, thus spamming the entire web. But this is something I have never used these tools for – believe it or not 🙂
