Introduction
It is a real problem when a file displays characters as something like ^@ or <E5>. Indeed, yesterday my XWiki (installed locally on my computer) was crashed by a computer freeze. I had to hard-reboot my computer and was never able to start the wiki after that.
On this wiki, I was developing some new Velocity scripts; if I could not get them back, a lot of work would be lost. I used the default database system for XWiki, which is HSQLDB.
Step 1 - Back up the work
My first action was to back up all my work before deploying a new XWiki. At this step, I was lucky because all my work was stored in a subdirectory: everything concerning the database was in a xwiki.data directory, and this is what that directory contains.
xwiki.data
|--> database/
|--> extension/
|--> jobs/
|--> lucene/
|--> solr/
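In practice, this backup can be as simple as copying that directory somewhere safe before touching anything (the path below is an example; adapt it to your installation):
$ # keep a pristine copy of the data directory before any manipulation
$ cp -r <path-to-xwiki>/xwiki.data ~/xwiki.data.backup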
Step 2 - The database
Exporting the database
The database content is in the database/xwiki_db.lobs file. I copied it into a new directory where I would be able to parse it and recover my work.
For example, you can do this.
$ cd ~/
$ mkdir restore-xwiki/
$ cd restore-xwiki/
$ cp <path-of-xwiki-database>/xwiki_db.lobs xwiki.db
What’s in this database file?
This file is almost a plain text file with invalid XML in it. Almost a plain text file because there are a lot of ^@ characters in it; indeed, between each readable character there is a ^@ character. Invalid XML because it is in fact a concatenation of multiple well-formed XML documents (i.e. you have multiple <?xml …?> declarations in the file).
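If you want to see this for yourself before cleaning anything, a quick byte-level dump shows the pattern (a small sketch; od should be available wherever tr is):
$ # dump the first bytes as characters; the \0 entries are the ^@ characters
$ od -c xwiki.db | head -n 5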
Step 3 - Clean the file
Your file is now called xwiki.db. We will parse it to remove all the ^@ characters. You may want to try with any file editor like gedit, emacs or vim. However, the file is a database dump and, in my case, was more than 300MB. Opening such a large file may be a problem (and parsing it to replace characters an even bigger problem).
To parse this file, we will use the tr command, which is way more efficient. The only thing to do before using tr is to identify the ASCII/UTF-8/whatever code of this ^@ character.
Identifying the character
To identify this character, I used a trick: take a small part of the file, open it with my favorite editor vim, and read the code of the character under the cursor.
To create this small part of the database file, use the following, then open it with vim.
$ cat xwiki.db | head > temp.db
$ vim temp.db
Once you are inside vim, you can see the ^@ characters. Place your cursor on one of them and type ga on your keyboard (ga is for Get ASCII). You should see at the bottom of the window the code of the character we are looking for: it is the one with code 0.
Bonus: another way to identify the character
There is an easier method to identify the character (but this method may not work in other cases). You can open vim and type the following: CTRL+v then CTRL+@. You should obtain the character we are looking for. Then, to identify it, you can still use the ga shortcut.
Clean the file
Now we will use the tr command, which can replace each character of a file with another (this functionality does not interest us) and can also delete specific characters from a file (this functionality seems more interesting!). tr reads from the standard input, and we will save the output in the xwiki.xml file (which is not a real XML file).
$ tr -d '\000' < xwiki.db > xwiki.xml
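A quick sanity check never hurts here: the cleaned file should be roughly half the size of the original (one ^@ was removed between each readable character), and it should contain no NUL byte at all. A small sketch (tr -cd keeps only the listed characters, so the last count should be 0):
$ # compare the sizes before and after the cleanup
$ wc -c xwiki.db xwiki.xml
$ # count the remaining NUL bytes; this should print 0
$ tr -cd '\000' < xwiki.xml | wc -c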
Step 4 - Find the XWiki document
Each page of the XWiki is stored in xwiki.xml as an XML document which looks like this.
<?xml version="1.0" encoding="UTF-8"?>
<xwikidoc>
  <web>Main</web>
  <name>WebHome</name>
  [...]
  <contentUpdateDate>1383876486000</contentUpdateDate>
  [...]
  <content>[...]</content>
</xwikidoc>
<xwikidoc> is the root element of each of these XML documents. After this element, the <web> and <name> elements respectively describe the space and the name of your XWiki page. After the <name> element, there are a few header elements that describe the page. For example, you may find a few <object> elements that describe the objects stored in that page. One element that is interesting for us is <contentUpdateDate>, which contains the update date of the page. Finally, the <content> element contains the content of our page (in my case, the Velocity scripts I am looking for; remember, I am trying to restore them because I lost my Velocity work).
Given this information, it seems pretty easy to find the document we want and copy/paste its content to save it. However, XWiki stores each saved version of each page. So if I look for the page Main.WebHome, I will find multiple instances, and that is where <contentUpdateDate> will help us.
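To get an idea of how many revisions you are dealing with before searching, a rough count helps (Main.WebHome is the example page used in this article; adapt the name to your own page):
$ # how many XML documents are concatenated in the dump?
$ grep -c '<?xml' xwiki.xml
$ # how many lines mention the page we are looking for?
$ grep -c '<name>WebHome</name>' xwiki.xml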
Step 4.1 - Search a multiline pattern
The first thing we want to do is identify the right page. Because a page is identified by its name and its space, we need to look in the file for a multiline pattern that contains both <web> and <name>.
For a multiline search, people say (see references) you should be able to use the -P option of grep. But for a mysterious reason, the -P option does not work for me and ends up with the following message.
[1] 2618 abort (core dumped) /usr/bin/grep [...]
So I used another solution proposed in the same forum reference: pcregrep.
$ pcregrep -M -n -A 11 '<web>Main<\/web>[^<]*\n[^<]*<name>WebHome<\/name>' xwiki.xml
I also add the -n option, which numbers each line: this will help us find the line number of what we are looking for. And finally, I add the -A 11 option in order to display the 11 lines after each matching pattern. Why 11 lines? Because 11 lines later we find the <contentUpdateDate> element, which is the information we are interested in.
Step 4.2 - Filter the <contentUpdateDate> elements
With the previous command, we have searched for documents with a particular name and space. We have also displayed each match with a few extra lines, which contain the following <contentUpdateDate> element. So each <contentUpdateDate> element displayed in this output is a potentially interesting one. Everything else in the output is of no interest now.
$ <output-4.1> | grep '<contentUpdateDate>'
Note - About the output: in the following commands, I will use the <output-4.x> placeholder to stand for the output of the command from step 4.x; in practice, you pipe the previous command into the next one.
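For example, the command of this step expands to the following full pipeline (written out once so the shorthand is unambiguous):
$ pcregrep -M -n -A 11 '<web>Main<\/web>[^<]*\n[^<]*<name>WebHome<\/name>' xwiki.xml | grep '<contentUpdateDate>'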
Step 4.3 - Keep the numbers we want
In this step, I will remove the XML elements and keep only the line number and the date (the data inside the <contentUpdateDate> element). I will also place the date at the beginning of the line so that we can sort by date in the next step.
$ <output-4.2> | sed 's/^\([0-9]\+\).*<contentUpdateDate>\([0-9]\+\)<\/.*$/\2-\1/g'
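You can check the expression quickly by running it on a single made-up line; with the input below, it should print 1383876486000-2356:
$ # test the sed expression on one fabricated input line
$ echo '2356:<contentUpdateDate>1383876486000</contentUpdateDate>' | sed 's/^\([0-9]\+\).*<contentUpdateDate>\([0-9]\+\)<\/.*$/\2-\1/g'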
Step 4.4 - Sort by date and keep the last one
This is pretty easy since the sort command exists. (A plain lexicographic sort is enough here because all these timestamps have the same number of digits.) To keep only the last one, we can use the tail command.
$ <output-4.3> | sort | tail -n 1
Step 4.5 - Get the line number
Now we just need to get the line number and save it in a variable. We will use the awk command (we could also use sed) to get the second field of our line, which should look like 1383876486000-2356 (the first field is the date and the second the line number).
$ LINE_NUMBER=`<output-4.4> | awk -F '-' '{print $2}'`
Step 4.6 - Find the end of the XWiki document
Now that we have the line number of the <contentUpdateDate> of the document we are looking for, we will parse the file from this element to find the first </xwikidoc> element (which is the closing element of the document).
First of all, we will use grep -n '' in order to add the line numbers to our input. Then we will remove all the lines before LINE_NUMBER with tail, because we want the first </xwikidoc> after this line number. We will filter the </xwikidoc> elements with grep and keep only the first one with head. Finally, we will get the line number from this line with awk and store it in the END_LINE variable.
$ END_LINE=`cat xwiki.xml | grep -n '' | tail -n +$LINE_NUMBER | grep '<\/xwikidoc>' | head -n 1 | awk -F ':' '{print $1}'`
Step 4.7 - Find the start of the XWiki document
The operation is nearly the same as for step 4.6; we just reverse some of the sub-steps. In this case, we only keep the part of the file that comes before END_LINE with head. Then we look for the <?xml pattern, keep the last match, and store its line number in the BEGIN_LINE variable.
$ BEGIN_LINE=`cat xwiki.xml | grep -n '' | head -n $(($END_LINE - 1)) | grep '<?xml' | tail -n 1 | awk -F ':' '{print $1}'`
Step 4.8 - Get the XWiki document
We now have BEGIN_LINE and END_LINE, so we can keep only this part of the database with head and tail plus a bit of bash arithmetic.
$ cat xwiki.xml | tail -n +$BEGIN_LINE | head -n $(($END_LINE - BEGIN_LINE + 1 )) > Main.WebHome.xml
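A quick check that the extraction caught a whole document: the first line should be the <?xml declaration and the last line the closing </xwikidoc> tag.
$ head -n 1 Main.WebHome.xml
$ tail -n 1 Main.WebHome.xml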
You can now get what is in the <content> element. It is also possible to get the information from the <object> elements (where you may also have code and configuration). If you want, you can probably even create a XAR file from this XML file and import it into your new XWiki.
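If xmllint (from libxml2) is installed, it can pull the content out for you; a sketch, assuming the extracted file is now a single well-formed XML document (content.vm is just an example name, and depending on your xmllint version you may still need to unescape XML entities afterwards):
$ # extract the text of the <content> element into a file
$ xmllint --xpath '//content/text()' Main.WebHome.xml > content.vm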
You may still need to be careful about the encoding of the file: I noticed that the encoding was latin1 in my case, but a copy/paste of the content part into a new XWiki page needs utf-8 encoding.
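If the pasted content shows mangled accented characters, converting the file before opening it may help (a sketch with iconv, assuming the source really is latin1):
$ iconv -f latin1 -t utf-8 Main.WebHome.xml > Main.WebHome.utf8.xml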