Each page of the XWiki database is stored in xwiki.xml as an XML document that
looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<xwikidoc>
<web>Main</web>
<name>WebHome</name>
[...]
<contentUpdateDate>1383876486000</contentUpdateDate>
[...]
<content>[...]</content>
</xwikidoc>
<xwikidoc> is the root element of each of these XML documents. After this element, the
<web> and <name> elements respectively describe the space and the name of your
XWiki page. After the <name> element, there are a few header elements that
describe the page. For example, you may find a few <object> elements that
describe the objects stored in that page. One element that is interesting for
us is <contentUpdateDate>, which contains the update date of the page.
Finally, the <content> element contains the content of our page (in my case,
the Velocity code I am looking for; remember, I'm trying to restore it because I lost
my Velocity work).
Given this information, it seems pretty easy to find the document we want
and copy/paste its content to save it. However, XWiki stores each saved version
of each page. So if I'm looking for the page Main.WebHome, I will find
multiple instances, and that's where <contentUpdateDate> will help us.
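For illustration, two saved versions of the same page sit side by side in the
dump, each as a full XML document with its own <?xml declaration (the second
date here is made up for the example):
<?xml version="1.0" encoding="UTF-8"?>
<xwikidoc>
<web>Main</web>
<name>WebHome</name>
[...]
<contentUpdateDate>1383876486000</contentUpdateDate>
[...]
</xwikidoc>
<?xml version="1.0" encoding="UTF-8"?>
<xwikidoc>
<web>Main</web>
<name>WebHome</name>
[...]
<contentUpdateDate>1383912345000</contentUpdateDate>
[...]
</xwikidoc>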
Step 4.1 - Search for a multiline pattern
The first thing we want to do is identify the right page. And because a page
is identified by its name and its space, we need to look in the file for a
multiline pattern that contains <web> and <name>.
For a multiline search, people say (see references) you should be able to use the -P option
of grep. But for some mysterious reason, the -P option doesn't work for me
and ends up with the following message.
[1] 2618 abort (core dumped) /usr/bin/grep [...]
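The attempted command was something along these lines (a sketch, not the exact
invocation elided above; the -z option makes GNU grep treat the whole file as a
single line so that \n can match):
$ grep -Pzo '<web>Main<\/web>[^<]*\n[^<]*<name>WebHome<\/name>' temp.xml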
So I used another solution proposed in the same forum reference: pcregrep.
$ pcregrep -M -n -A 11 '<web>Main<\/web>[^<]*\n[^<]*<name>WebHome<\/name>' temp.xml
I also add the -n option, which prefixes each line with its number: this will help us find the
line number of what we are looking for. And finally, I add the -A 11 option
in order to display the 11 lines after each matching pattern. Why 11
lines? Because 11 lines later we find the <contentUpdateDate>
element, which is the information we are interested in.
Step 4.2 - Filter the <contentUpdateDate> elements
With the previous command, we have searched for documents with a particular name
and space. We have also displayed each match with some extra lines which
contain the following <contentUpdateDate> element. So each
<contentUpdateDate> element displayed in this output is a potentially
interesting one. Everything else in the output is of no interest now.
$ <output-4.1> | grep '<contentUpdateDate>'
Note - About the output
In the following commands, I will use the <output-4.?> notation to designate
the output of the previous step's command.
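For example, with the notation expanded, the step 4.2 pipeline is simply the
step 4.1 command piped into grep:
$ pcregrep -M -n -A 11 '<web>Main<\/web>[^<]*\n[^<]*<name>WebHome<\/name>' temp.xml | grep '<contentUpdateDate>'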
Step 4.3 - Keep the numbers we want
In this step, I will remove the XML elements and keep only the line number and
the date (the data inside the <contentUpdateDate> element). I will also place
the date at the beginning of the line in order to be able to sort the lines in the
next step. For example, the line numbered 2356 carrying the date 1383876486000
becomes 1383876486000-2356.
$ <output-4.2> | sed 's/^\([0-9]\+\).*<contentUpdateDate>\([0-9]\+\)<\/.*$/\2-\1/g'
Step 4.4 - Sort by date and keep the last one
This is pretty easy since the sort command exists. To keep only the last one,
we can use the tail command.
$ <output-4.3> | sort | tail -n 1
Step 4.5 - Get the line number
Now, we just need to get the line number and save it in a variable. We will use
the awk command (we could also use the sed command) to get the second field
of our line which should look like 1383876486000-2356 (the first field is the
date and the second the line number).
$ LINE_NUMBER=`<output-4.4> | awk -F '-' '{print $2}'`
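If you prefer the sed alternative mentioned above, a sketch that does the same
job would be:
$ LINE_NUMBER=`<output-4.4> | sed 's/^[0-9]*-//'`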
Step 4.6 - Find the end of the XWiki document
Now that we have the line number of the <contentUpdateDate> of the document we
are looking for, we will parse the file from this element to find the first
</xwikidoc> element (which is the close element of the document).
First of all, we will use grep -n '' in order to add the line numbers to our
input. Then we will remove all the lines before our LINE_NUMBER with tail
because we want the first </xwikidoc> after this line number. We will filter
all the </xwikidoc> elements with grep and keep only the first with head.
Finally, we will get the line number from this input with awk and store it in
the END_LINE variable.
$ END_LINE=`cat xwiki.xml | grep -n '' | tail -n +$LINE_NUMBER | grep '<\/xwikidoc>' | head -n 1 | awk -F ':' '{print $1}'`
Step 4.7 - Find the start of the XWiki document
The operation is nearly the same as for step 4.6; we will just reverse
some of the sub-steps. In this case, we will only keep the part of the file
that comes before END_LINE with head. Then we will look for the <?xml
pattern, keep the last match found, and store its line number in the
BEGIN_LINE variable.
$ BEGIN_LINE=`cat xwiki.xml | grep -n '' | head -n $(($END_LINE - 1)) | grep '<?xml' | tail -n 1 | awk -F ':' '{print $1}'`
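As an optional sanity check, printing these two lines should show the <?xml
declaration and the closing </xwikidoc> element:
$ sed -n "${BEGIN_LINE}p;${END_LINE}p" xwiki.xml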
Step 4.8 - Get the XWiki document
We now have BEGIN_LINE and END_LINE, we will be able to keep only this part
of the database with head and tail with a bit of bash arithmetic.
$ cat xwiki.xml | tail -n +$BEGIN_LINE | head -n $(($END_LINE - BEGIN_LINE + 1 )) > Main.WebHome.xml
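Putting it all together, the whole extraction can be scripted. Here is a
minimal sketch, assuming the dump lives in a single file named xwiki.xml (the
same content that step 4.1 reads from temp.xml) and that we are still after the
latest version of Main.WebHome:
#!/bin/bash
# Sketch: extract the latest saved version of Main.WebHome from an xwiki.xml dump.
DUMP=xwiki.xml

# Steps 4.1 to 4.5: line number of the most recent <contentUpdateDate>
LINE_NUMBER=$(pcregrep -M -n -A 11 '<web>Main<\/web>[^<]*\n[^<]*<name>WebHome<\/name>' "$DUMP" \
  | grep '<contentUpdateDate>' \
  | sed 's/^\([0-9]\+\).*<contentUpdateDate>\([0-9]\+\)<\/.*$/\2-\1/g' \
  | sort | tail -n 1 | awk -F '-' '{print $2}')

# Step 4.6: first </xwikidoc> after that line
END_LINE=$(grep -n '' "$DUMP" | tail -n +$LINE_NUMBER \
  | grep '<\/xwikidoc>' | head -n 1 | awk -F ':' '{print $1}')

# Step 4.7: last <?xml declaration before the end of the document
BEGIN_LINE=$(grep -n '' "$DUMP" | head -n $((END_LINE - 1)) \
  | grep '<?xml' | tail -n 1 | awk -F ':' '{print $1}')

# Step 4.8: extract the document
tail -n +$BEGIN_LINE "$DUMP" | head -n $((END_LINE - BEGIN_LINE + 1)) > Main.WebHome.xml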
You can now get what's in the <content> element. It is also possible to get
information from the <object> elements (where you may also have code and
configuration). If you want, you can probably create a XAR file from this XML
file and import it into your new XWiki.
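For what it's worth, a XAR is essentially a ZIP archive containing the page XML
files plus a package.xml descriptor; a rough, untested sketch (the internal
layout and the hand-written package.xml are assumptions here) could be:
$ mkdir -p xar/Main && cp Main.WebHome.xml xar/Main/WebHome.xml
$ # write a package.xml at the root of xar/ listing Main.WebHome, then:
$ cd xar && zip -r ../restored.xar package.xml Main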
You may still need to be careful about the encoding of the file: I noticed
that the encoding was latin1 in my case, but a copy/paste of the content part
into a new XWiki page needs UTF-8 encoding.
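One way to convert the extracted file before pasting, assuming the iconv tool
is available:
$ iconv -f latin1 -t utf-8 Main.WebHome.xml > Main.WebHome.utf8.xml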