Get information from an XWiki database

Introduction

It is really a problem when a file contains characters displayed as something like ^@ or <E5>. Indeed, yesterday, my XWiki (installed locally on my computer) was crashed by a computer freeze. I had to hard-reboot my computer and was never able to start the wiki again after that.

On this wiki, I was developing some new Velocity scripts, and if I could not get them back, it would mean a lot of lost work. I used the default database system for XWiki, which is HSQLDB.

Step 1 - Back up the work

My first action was to back up all my work before deploying a new XWiki. On this step, I was lucky because all my work was stored in a subdirectory. Everything concerning the database was in an xwiki.data directory, and this is what this directory contains.

xwiki.data
|--> database/
|--> extension/
|--> jobs/
|--> lucene/
|--> solr/

Step 2 - The database

Exporting the database

The database content is inside the database/xwiki_db.lobs file. I copied it into a new directory where I would be able to parse it and recover my work.

For example, you can do this.

$ cd ~/
$ mkdir restore-xwiki/
$ cd restore-xwiki/
$ cp <path-of-xwiki-database>/xwiki_db.lobs xwiki.db

What’s in this database file?

This file is almost a simple text file containing invalid XML. Almost a text file, because there are a lot of ^@ characters in it; indeed, between each readable character there is a ^@ character. Invalid XML, because it is in fact a concatenation of multiple well-formed XML documents (i.e. there are multiple <?xml …​?> declarations in the file).
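To see this for yourself, you can dump the first bytes of the file with the od command (assuming a standard GNU/Linux toolchain); the ^@ characters show up as \0 in the output.

$ od -c xwiki.db | head -n 5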

Step 3 - Clean the file

Your file is now called xwiki.db. We will parse it to remove all the ^@ characters. You may want to try a file editor like gedit, emacs or vim. However, the file is a database and, in my case, was more than 300MB. Opening such a large file may be a problem (and parsing it to replace characters an even bigger one).

To parse this file, we will use the tr command, which is way more efficient. The only thing to do before using tr is to identify the ASCII/UTF-8/whatever code of this ^@ character.

Identifying the character

To identify this character, I used a trick which consists of taking a small part of the file, opening it with my favorite editor, vim, and reading the code of the character under the cursor.

To create a small part of the database file, use the following, then open it with vim.

$ cat xwiki.db | head > temp.db
$ vim temp.db

Once you’re inside vim, you can see the ^@ characters. Place your cursor on one and type ga on your keyboard (ga is for Get ASCII). At the bottom of the window, you should see the code of the character we are looking for: it is the one with code 0.

Bonus - another way to identify the character

We could have used an easier method to identify the character (but this method may not work in other cases). Open vim and type the following: CTRL+v then CTRL+@. You should obtain the character we are looking for. Then, to identify it, you can still use the ga shortcut.

Clean the file

Now, we will use the tr command, which can replace each character of a file with another (this functionality doesn’t interest us) and also delete specific characters from a file (this functionality seems more interesting!).

tr reads from standard input, and we will save the output in the xwiki.xml file (which is not a real XML file).

$ tr -d '\000' < xwiki.db > xwiki.xml
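As a sanity check (not a required step), you can count the XML declarations in the result, one per saved document version, and compare the file sizes: since roughly every other byte was a ^@, the cleaned file should be about half the size of the original.

$ grep -c '<?xml' xwiki.xml
$ ls -lh xwiki.db xwiki.xml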

Step 4 - Find the XWiki document

Each page of the XWiki is stored in xwiki.xml as an XML document which looks like this.

<?xml version="1.0" encoding="UTF-8"?>
<xwikidoc>
<web>Main</web>
<name>WebHome</name>
[...]
<contentUpdateDate>1383876486000</contentUpdateDate>
[...]
<content>[...]</content>
</xwikidoc>

<xwikidoc> is the root element of each of these XML documents. After this element, the <web> and <name> elements respectively describe the space and the name of your XWiki page. After the <name> element, there are a few header elements that describe the page. For example, you may find a few <object> elements that describe the objects stored in that page. One element that is interesting for us is <contentUpdateDate>, which contains the date of the last content update of the page. Finally, the <content> element contains the content of our page (in my case, the Velocity scripts I am trying to restore).

Given this information, it seems pretty easy to find the document we want and copy/paste its content to save it. However, XWiki stores each saved version of each page. So if I look for the page Main.WebHome, I will find multiple instances, and that’s where <contentUpdateDate> will help us.
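As a quick illustration of this versioning, you can count how many times the page name appears in the file (a rough count, since other spaces may contain a page with the same name).

$ grep -c '<name>WebHome</name>' xwiki.xml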

Step 4.1 - Search a multiline pattern

The first thing we want to do is identify the right page. And because a page is identified by its name and its space, we need to look in the file for a multiline pattern that contains both <web> and <name>.

For a multiline search, people on forums say you should be able to use the -P option of grep. But for some mysterious reason, the -P option doesn’t work for me and ends with the following message.

[1]    2618 abort (core dumped)  /usr/bin/grep [...]

So I used another solution proposed on the same forum: pcregrep.

$ pcregrep -M -n -A 11 '<web>Main<\/web>[^<]*\n[^<]*<name>WebHome<\/name>' xwiki.xml

I also added the -n option, which prefixes each line with its line number: this will help us find the line number of what we are looking for. Finally, I added the -A 11 option in order to display the 11 lines after each matching pattern. Why 11 lines? Because 11 lines later, we can find the <contentUpdateDate> element, which is the information we are interested in.
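For example, if the page you want to recover were Sandbox.TestPage instead (an illustrative page name), you would only need to change the space and the name in the pattern.

$ pcregrep -M -n -A 11 '<web>Sandbox<\/web>[^<]*\n[^<]*<name>TestPage<\/name>' xwiki.xml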

Step 4.2 - Filter the <contentUpdateDate> elements

With the previous command, we have searched for documents with a particular name and space. We have also displayed each match with some extra lines which contain the following <contentUpdateDate> element. So each <contentUpdateDate> element displayed in this output is a potentially interesting one. Everything else in the output is of no interest now.

$ <output-4.1> | grep '<contentUpdateDate>'
Note - About the output: in the following commands, I will use the <output-4.?> notation to designate the output of the command from the previous step.
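For example, with the page from step 4.1, the full command for this step expands to the following.

$ pcregrep -M -n -A 11 '<web>Main<\/web>[^<]*\n[^<]*<name>WebHome<\/name>' xwiki.xml | grep '<contentUpdateDate>'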

Step 4.3 - Keep the numbers we want

In this step, I will remove the XML elements and keep only the line number and the date (the data inside the <contentUpdateDate> element). I will also place the date at the beginning of each line so that we can sort the lines in the next step.

$ <output-4.2> | sed 's/^\([0-9]\+\).*<contentUpdateDate>\([0-9]\+\)<\/.*$/\2-\1/g'
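To see what the sed expression does, here is a worked example on a single illustrative line (the line number 2356 and the date are made up; the number prefix comes from the -n option of pcregrep).

$ echo '2356-<contentUpdateDate>1383876486000</contentUpdateDate>' | sed 's/^\([0-9]\+\).*<contentUpdateDate>\([0-9]\+\)<\/.*$/\2-\1/g'
1383876486000-2356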

Step 4.4 - Sort by date and keep the last one

This is pretty easy since the sort command exists. To keep only the last one, we can use the tail command.

$ <output-4.3> | sort | tail -n 1

Step 4.5 - Get the line number

Now, we just need to get the line number and save it in a variable. We will use the awk command (we could also use the sed command) to get the second field of our line, which should look like 1383876486000-2356 (the first field is the date and the second is the line number).

$ LINE_NUMBER=`<output-4.4> | awk -F '-' '{print $2}'`

Step 4.6 - Find the end of the XWiki document

Now that we have the line number of the <contentUpdateDate> of the document we are looking for, we will parse the file from this element to find the first </xwikidoc> element (which is the closing element of the document).

First of all, we will use grep -n '' in order to add the line numbers to our input. Then we will remove all the lines before our LINE_NUMBER with tail, because we want the first </xwikidoc> after this line number. We will filter all the </xwikidoc> elements with grep and keep only the first with head. Finally, we will get the line number from this output with awk and store it in the END_LINE variable.

$ END_LINE=`cat xwiki.xml | grep -n '' | tail -n +$LINE_NUMBER | grep '<\/xwikidoc>' | head -n 1 | awk -F ':' '{print $1}'`

Step 4.7 - Find the start of the XWiki document

The operation is nearly the same as for step 4.6; we just reverse some of the sub-steps. In this case, we only keep the part of the file that comes before END_LINE with head. Then we look for the <?xml pattern, keep the last match, and store its line number in the BEGIN_LINE variable.

$ BEGIN_LINE=`cat xwiki.xml | grep -n '' | head -n $(($END_LINE - 1)) | grep '<?xml' | tail -n 1 | awk -F ':' '{print $1}'`

Step 4.8 - Get the XWiki document

Now that we have BEGIN_LINE and END_LINE, we can keep only this part of the database with head and tail and a bit of bash arithmetic.

$ cat xwiki.xml | tail -n +$BEGIN_LINE | head -n $(($END_LINE - BEGIN_LINE + 1 )) > Main.WebHome.xml
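As a recap, here is the whole of step 4 as a single sequence of commands. This is just a sketch under the same assumptions as above (the page is Main.WebHome and the cleaned dump is xwiki.xml); adapt the space and page names to your own case.

$ # 4.1 to 4.5: line number of the latest <contentUpdateDate> for Main.WebHome
$ LINE_NUMBER=`pcregrep -M -n -A 11 '<web>Main<\/web>[^<]*\n[^<]*<name>WebHome<\/name>' xwiki.xml | grep '<contentUpdateDate>' | sed 's/^\([0-9]\+\).*<contentUpdateDate>\([0-9]\+\)<\/.*$/\2-\1/g' | sort | tail -n 1 | awk -F '-' '{print $2}'`
$ # 4.6: first </xwikidoc> after that line
$ END_LINE=`cat xwiki.xml | grep -n '' | tail -n +$LINE_NUMBER | grep '<\/xwikidoc>' | head -n 1 | awk -F ':' '{print $1}'`
$ # 4.7: last <?xml before END_LINE
$ BEGIN_LINE=`cat xwiki.xml | grep -n '' | head -n $(($END_LINE - 1)) | grep '<?xml' | tail -n 1 | awk -F ':' '{print $1}'`
$ # 4.8: extract the document
$ cat xwiki.xml | tail -n +$BEGIN_LINE | head -n $(($END_LINE - BEGIN_LINE + 1)) > Main.WebHome.xml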

You can now get what’s in the <content> element. It is also possible to get the information from the <object> elements (where you may also have code and configuration). If you want, you can probably create a XAR file from this XML file and import it into your new XWiki.
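For example, if you have xmllint (from libxml2) installed, one possible way to extract just the content of the page into a separate file (content.txt is just an example name):

$ xmllint --xpath 'string(/xwikidoc/content)' Main.WebHome.xml > content.txt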

You may still need to be careful about the encoding of the file: I noticed that the encoding was latin1 in my case, but copy/pasting the content part into a new XWiki page needed UTF-8 encoding.
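To check the encoding and convert the file, you can use the file and iconv commands (assuming the file really is latin1, as it was in my case).

$ file -i Main.WebHome.xml
$ iconv -f latin1 -t utf-8 Main.WebHome.xml > Main.WebHome.utf8.xml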
