
A deep dive into Texport – Alfresco Exports & Imports

Texport is a tool built to migrate content between Alfresco repositories in the easiest and cleanest way possible. It was developed with a special focus on fast migrations, even for repositories with large data sets.

As a “deep dive”, the objective of this article is to explain how the tool works in a more specific and technical way. If you prefer a more general overview of the Texport tool, you can check it out here!

First export and then import

For a migration tool to work seamlessly for the end user, it needs two tools combined: one to extract content and another to import it. We’ll follow the natural flow of Texport: first we’ll talk a little bit about the export, and then a little bit about the import.

A deep dive into Texport – The export


Right from the beginning, our focus was on performance. While architecting the tool, we decided that the best programming language to use was Python, since it works closely with the operating system and gives us room to fully optimize the tool.

Avoid the single file package solution

Alfresco already had a solution to extract content: the ACP (Alfresco Content Package). However, it doesn’t work very well for large repositories, since any operating system will struggle to process very large files. It is common for users to run into memory problems while generating and/or consuming these ACP files.

Texport follows a different approach. Instead of saving everything in a single package file, the repository folder structure is replicated on the file system: you end up with all of your content on the OS file system, in the same folder structure as the repository. This way, there are no memory problems while generating the extraction package.

Taking advantage of the Alfresco REST API

To extract the content, we use the available Alfresco REST API. We crawl the repository starting at the root node and calling the list node children endpoint; whenever we find a node of type cm:content, we download it and retrieve its properties and metadata.
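To make this concrete, here is a minimal sketch of those two calls. The endpoints are from the public Alfresco REST API v1; the host and credentials are placeholder assumptions, and the real tool adds error handling, result paging and metadata retrieval on top of this.

```python
import os
import requests

# Placeholder host and credentials; the endpoints below are the public
# Alfresco REST API v1 "list node children" and "get node content" calls.
BASE = "http://source-acs:8080/alfresco/api/-default-/public/alfresco/versions/1"
AUTH = ("admin", "admin")

def list_children(node_id):
    """Return the child entries of a node via the list-node-children endpoint."""
    resp = requests.get(f"{BASE}/nodes/{node_id}/children", auth=AUTH)
    resp.raise_for_status()
    return [e["entry"] for e in resp.json()["list"]["entries"]]

def download_node(node, local_dir):
    """Replicate a node on the local file system: create a directory for
    folder nodes, download the binary for content nodes."""
    path = os.path.join(local_dir, node["name"])
    if node["isFolder"]:
        os.makedirs(path, exist_ok=True)
    else:
        content = requests.get(f"{BASE}/nodes/{node['id']}/content", auth=AUTH)
        with open(path, "wb") as f:
            f.write(content.content)
    return path
```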

Crawling the repository efficiently

The most time-consuming operation when extracting large repositories is crawling through all the content. A natural solution is a recursive approach. On its own, though, it’s not the most efficient one, unless we can run the recursive function in parallel. Python allows us to do this in a couple of ways.

Multithreading vs Multiprocessing

Just to give a little context, the main difference between these two options is that threads run in the same memory space, while processes run in separate memory spaces.

Since processes run in separate memory spaces, it’s not easy to share objects between processes with multiprocessing. And since threads share the same memory, precautions have to be taken by the Python libraries, or two threads could write to the same memory at the same time. Python’s global interpreter lock (GIL) exists for exactly this reason, and it prevents true parallel multithreading.

In our specific case, there’s no need to share objects between processes: each process calls a specific, independent REST endpoint and downloads the content to a specific path. We therefore decided on a multiprocessing approach, which gives us true parallel processing and better performance than multithreading.

Adding multiprocessing to the recursive function

For each recursive iteration, the tool spawns a new independent process. It connects to the REST API and creates a folder (if the node is a folder) or downloads the file (if it’s a file with content). After this, it calls the same function, spawning yet another independent process.

The problem with this is that spawning processes is much faster than executing them. So, on each recursive iteration, before spawning a new process, the tool has to check whether it can spawn one or whether it should continue the execution in the current process.

To achieve this, we defined a property named maxProcesses that determines the maximum number of processes allowed to run. Before each recursive iteration, the tool compares the number of processes currently running with maxProcesses; if the current count is lower, the tool can spawn a new process. The following diagram shows what was just described:

[Diagram: adding multiprocessing to the recursive function]

This way, we can tune the performance of the export for the machine it is running on, simply by editing a property in the properties file.
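As a rough illustration of that check, the sketch below reuses list_children and download_node from the earlier snippet and forks a new process per folder only while the count of live child processes is below the limit. Note that multiprocessing.active_children() only counts the children of the calling process, so this is an approximation of a global limit, not Texport’s exact bookkeeping.

```python
import multiprocessing
import os

MAX_PROCESSES = 8  # mirrors the maxProcesses property from the properties file

def crawl(node_id, local_dir):
    """Recursively export a subtree, forking a new process per folder
    while the process budget allows it."""
    for child in list_children(node_id):        # from the REST sketch above
        path = download_node(child, local_dir)  # folder created or file downloaded
        if child["isFolder"]:
            # Spawn only while below the limit; active_children() counts the
            # children of *this* process, so the limit is approximate.
            if len(multiprocessing.active_children()) < MAX_PROCESSES:
                multiprocessing.Process(target=crawl, args=(child["id"], path)).start()
            else:
                crawl(child["id"], path)        # stay in the current process

if __name__ == "__main__":
    os.makedirs("./export-package", exist_ok=True)
    crawl("-root-", "./export-package")
```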

 

A deep dive into Texport – The import

Since Alfresco already provides a well-built import tool out of the box, we saw no need to reinvent the wheel. We used the bulk import tool by Peter Monks as a base, adapted it to work seamlessly with our export tool, and added some features that weren’t available. The original tool uses the Alfresco Java API to perform the import and offers two import options.

Streaming

In this option, the content is located somewhere on the hard drive, in no particular location (it does not need to be inside the content store). Before starting the import, the user must define a property with the path to the extracted package. The tool then streams the content and imports it. It’s essentially a plain upload, and not very fast.

In-place

On the other hand, when the content is in-place, meaning the export package is already located inside the destination Alfresco content store, there is no normal content upload: the import only converts the files into binaries, making it much faster. In fact, the possibility of doing an in-place import is one of the main features of the import tool:

    • The data is already placed in the destination content store
    • No copying data
    • No moving data

Since we are building a migration tool, we have total control over where the export package is stored, and the in-place option performs much better, we adapted the bulk import tool to work only with the in-place import.

Export with the eyes on the import

This export is made in the context of a migration. It’s not just an export that efficiently retrieves data and downloads it to the filesystem as a simple backup; its objective is to retrieve and prepare the data so it can be imported as soon as the export finishes. That’s why Texport needs to be installed on the Alfresco destination instance, the instance where the migration will take place. The following image shows a view of the architecture:

[Image: Texport architecture, with the ACS source and destination instances]

In blue, we have two ACS (Alfresco Content Services) instances: the source on the left and the destination on the right. In orange, we have the Texport tool installed on the ACS destination instance. The migration can be summarized in the following steps:

  1. The export reaches the ACS source instance using the Alfresco REST API.
  2. The content (files, folders, folder structure) is downloaded into the ACS destination instance content store.
  3. After the export is concluded, the import starts and uses the Java API to convert the content from files/folders into binaries.

After the import finishes, the content will be available in the ACS destination instance.

 

A deep dive into Texport – Besides the content

We now have the main content covered, meaning the files and the folders. But the Texport tool can also migrate other things, such as:

  • Properties
  • Permissions
  • Versions
  • Relationships
  • Categories
  • Tags
  • Users
  • Sites
  • Groups
  • Site and Group memberships

Properties

Properties are the file properties, i.e. the metadata (like created date, title or description). Upon export, a new XML file is created containing this data. It has the same name as the file being downloaded, plus the suffix .metadata.properties.xml. For example, if the export is downloading a file named ReadMe – Alfresco in the Cloud.pdf, a new metadata file named ReadMe – Alfresco in the Cloud.pdf.metadata.properties.xml is created, containing the properties of ReadMe – Alfresco in the Cloud.pdf. To get the properties, we use this REST API.
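As an illustration, such a sidecar can be produced in the Java properties-XML format that the bulk import tool consumes; the function name and the example property values below are assumptions, not Texport’s exact code.

```python
import xml.etree.ElementTree as ET

def write_metadata_sidecar(content_path, node_type, properties):
    """Write <file>.metadata.properties.xml next to the downloaded file,
    in the Java properties-XML format the bulk import tool expects."""
    root = ET.Element("properties")
    ET.SubElement(root, "entry", key="type").text = node_type
    for key, value in properties.items():
        ET.SubElement(root, "entry", key=key).text = str(value)
    with open(content_path + ".metadata.properties.xml", "wb") as f:
        f.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write(b'<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">\n')
        ET.ElementTree(root).write(f)

# Example usage with illustrative values:
write_metadata_sidecar("ReadMe - Alfresco in the Cloud.pdf", "cm:content",
                       {"cm:title": "ReadMe", "cm:description": "Example file"})
```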

Permissions

Using the same logic as before, but now for permissions, we create a JSON file with the permission information. Using the same example, for a file named ReadMe – Alfresco in the Cloud.pdf the tool will create a file named ReadMe – Alfresco in the Cloud.pdf.permissions.json.
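A minimal sketch of that sidecar follows; the JSON shape shown is illustrative, not necessarily Texport’s exact schema.

```python
import json

def write_permissions_sidecar(content_path, inherits, locally_set):
    """Write <file>.permissions.json next to the downloaded file.
    `locally_set` is a list of {authority, role, allowed} dicts
    (an assumed shape for illustration)."""
    doc = {"inheritsPermissions": inherits, "locallySet": locally_set}
    with open(content_path + ".permissions.json", "w") as f:
        json.dump(doc, f, indent=2)

write_permissions_sidecar(
    "ReadMe - Alfresco in the Cloud.pdf", False,
    [{"authority": "GROUP_marketing", "role": "Consumer", "allowed": True}])
```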

Versions

The same goes for versions, with the difference that each version now gets its own file. For instance, if we have a file ReadMe – Alfresco in the Cloud.pdf, each version will live in the same path and will have the suffix .v[version number]. The next image shows an example of a package containing a file with versions:

[Image: export package containing a file with versions]

Each of these files will have a corresponding .metadata.properties.xml file as well. After the import, we will have:

[Image: the file versions after the import]

The versions correspond to the number after the .v suffix. The last version is 3.2, which corresponds to the file “ReadMe – Alfresco in the Cloud.pdf”; because it’s the latest version, it doesn’t need the .v3.2 suffix.
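As a quick illustration of how these suffixes order the versions (file names shortened from the example above), the head file without a .v suffix sorts last:

```python
import re

# Order version files by their .v<major>.<minor> suffix; the file without
# a suffix is the head (latest) version, so it sorts after all the others.
files = ["ReadMe.pdf.v1.0", "ReadMe.pdf.v3.1", "ReadMe.pdf.v2.0", "ReadMe.pdf"]

def version_key(name):
    m = re.search(r"\.v(\d+)\.(\d+)$", name)
    return (int(m.group(1)), int(m.group(2))) if m else (float("inf"), 0)

print(sorted(files, key=version_key))
# ['ReadMe.pdf.v1.0', 'ReadMe.pdf.v2.0', 'ReadMe.pdf.v3.1', 'ReadMe.pdf']
```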

Relationships

So far, each of the previous features is exported and then imported in real time, meaning it is imported while the importer thread crawls the content store. Relationships are not as easy: a node can have a relationship with a node in a completely different place of the repository, so all nodes must already be imported before the relationships can be set.

With this in mind, the solution was to store all the relationship information of the nodes in a folder at the root of the export package, referencing the node ID. On the import side, we added a post-processing function that reads the contents of this folder and sets up the relationships after the import ends.
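A sketch of this two-phase idea follows; the folder name, the JSON shape and the create_assoc callback are hypothetical, standing in for the real export writer and the Java-side post-processing.

```python
import json
import os

REL_DIR = "export-package/relationships"  # hypothetical folder at the package root

def save_relationships(node_id, assocs):
    """Export side: persist a node's associations, keyed by its node ID.
    Each entry might look like {"targetId": "...", "assocType": "cm:references"}."""
    os.makedirs(REL_DIR, exist_ok=True)
    with open(os.path.join(REL_DIR, f"{node_id}.json"), "w") as f:
        json.dump(assocs, f)

def replay_relationships(create_assoc):
    """Import side, post-processing: once every node exists, replay each
    stored association through the provided callback."""
    for name in os.listdir(REL_DIR):
        with open(os.path.join(REL_DIR, name)) as f:
            for assoc in json.load(f):
                create_assoc(name[:-len(".json")], assoc["targetId"], assoc["assocType"])
```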

Tags and Categories

We decided to group tags and categories in this section because we migrate them in the same way. The only difference is that categories have a hierarchical structure, unlike tags.

Still, to export this information we continue to use the Alfresco REST API, but instead of creating a new file, we add the information to the metadata properties file. Upon importing, we create the values if they don’t exist yet, and then associate them with the node right after it is created.
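The import itself goes through the Java API, but as an illustration, the equivalent tagging call in the public REST API v1 both creates the tag if it is missing and associates it with the node in one request (host and credentials are placeholders):

```python
import requests

BASE = "http://dest-acs:8080/alfresco/api/-default-/public/alfresco/versions/1"
AUTH = ("admin", "admin")  # placeholder credentials

def tag_node(node_id, tags):
    """Attach tags to a node; POST /nodes/{id}/tags creates the tag on demand."""
    for tag in tags:
        resp = requests.post(f"{BASE}/nodes/{node_id}/tags",
                             json={"tag": tag}, auth=AUTH)
        resp.raise_for_status()
```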

Sites

All sites and site information are exported into a JSON file located at the root of the package. After the export finishes, the import tool checks for the existence of this JSON file and creates all the sites.
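A minimal sketch of that step follows; the sites.json file name and its shape are assumptions, while the createSite endpoint is from the public REST API v1.

```python
import json
import requests

BASE = "http://dest-acs:8080/alfresco/api/-default-/public/alfresco/versions/1"
AUTH = ("admin", "admin")  # placeholder credentials

def import_sites(package_root):
    """Create every site described by the JSON file at the package root."""
    with open(f"{package_root}/sites.json") as f:  # assumed file name
        for site in json.load(f):
            requests.post(f"{BASE}/sites", auth=AUTH,
                          json={"id": site["id"],
                                "title": site["title"],
                                "visibility": site["visibility"]})
```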

Users and Groups

The export tool can export information related to users and groups, but the import cannot create them. Since the REST API doesn’t return passwords, due to security constraints, it’s not possible for Texport to migrate users or groups. The only thing it can do is update/synchronize user and group properties.

Site and Group memberships

It’s not possible to create users and groups automatically, but if they already exist in the destination instance with the same names as in the source instance, Texport can associate users with their groups and add users to the sites they belonged to on the source instance.
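Again using the public REST API v1 for illustration (host and credentials are placeholders), re-adding a pre-existing user to a site and to a group looks like this:

```python
import requests

BASE = "http://dest-acs:8080/alfresco/api/-default-/public/alfresco/versions/1"
AUTH = ("admin", "admin")  # placeholder credentials

def add_site_member(site_id, user_id, role):
    """Give a pre-existing user the role it had on the source site,
    e.g. SiteConsumer, SiteContributor, SiteCollaborator or SiteManager."""
    requests.post(f"{BASE}/sites/{site_id}/members", auth=AUTH,
                  json={"id": user_id, "role": role})

def add_group_member(group_id, user_id):
    """Add a pre-existing user to the group of the same name, e.g. GROUP_marketing."""
    requests.post(f"{BASE}/groups/{group_id}/members", auth=AUTH,
                  json={"id": user_id, "memberType": "PERSON"})
```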

           

And this is it! A deep dive into a tool that exports and imports at the object level, supporting all Alfresco object types, and that carries the spirit of a true export. If you liked it, please send your comments; your feedback is very important to me.