Each page of the XWiki is stored in xwiki.xml as an XML document that looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<xwikidoc>
<web>Main</web>
<name>WebHome</name>
[...]
<contentUpdateDate>1383876486000</contentUpdateDate>
[...]
<content>[...]</content>
</xwikidoc>
<xwikidoc> is the root element of each of these XML documents. After it, the <web> and <name> elements respectively describe the space and the name of your XWiki page. After the <name> element, there are a few header elements that describe the page. For example, you may find a few <object> elements that describe the objects stored in that page. One element that is interesting for us is <contentUpdateDate>, which contains the update date of the page. Finally, the <content> element contains the content of our page (in my case, the Velocity code I am looking for; remember, I'm trying to restore it because I lost my Velocity work).
Given this information, it seems pretty easy to find the document we want and copy/paste its content to save it. However, XWiki stores each saved version of each page. So if I'm looking for the page Main.WebHome, I will find multiple instances, and that's where the <contentUpdateDate> element will help us.
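A quick way to convince yourself of this is to count the occurrences of the page name in the file; each stored version contributes one match (note that this also counts WebHome pages from other spaces, so it is only a rough count, and yours will of course differ):
$ grep -c '<name>WebHome</name>' xwiki.xml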
Step 4.1 - Search for a multiline pattern
The first thing we want to do is identify the right page. And because a page is identified by its name and its space, we need to look in the file for a multiline pattern that contains both <web> and <name>.
For a multiline search, people say (see references) that you should be able to use the -P option of grep. But for some mysterious reason, the -P option doesn't work for me and ends up with the following message.
[1] 2618 abort (core dumped) /usr/bin/grep [...]
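For the record, the kind of command I was trying looked something like this (a reconstruction, not my exact invocation; -z makes grep treat the input as one big NUL-separated record so that \n can match, and -o prints only the matching part):
$ grep -Pzo '<web>Main<\/web>[^<]*\n[^<]*<name>WebHome<\/name>' xwiki.xml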
So I used another solution proposed in the same forum reference: pcregrep.
$ pcregrep -M -n -A 11 '<web>Main<\/web>[^<]*\n[^<]*<name>WebHome<\/name>' xwiki.xml
I also add the -n option, which prefixes each output line with its line number: this will help us find the line number of what we are looking for. And finally, I add the -A 11 option in order to display the 11 lines following each matching pattern. Why 11 lines? Because 11 lines later, we can find the <contentUpdateDate> element, which is the information we are interested in.
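On my dump, the output looks roughly like this (the line numbers and the date are illustrative, and I have cut most of the 11 context lines):
2345:<web>Main</web>
2346:<name>WebHome</name>
[...]
2356-<contentUpdateDate>1383876486000</contentUpdateDate>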
Step 4.2 - Filter the <contentUpdateDate> elements
With the previous command, we have searched for documents with a particular name and space. We have also displayed each matching pattern with some extra lines which contain the following <contentUpdateDate> element. So each <contentUpdateDate> element that appears in this output is a potentially interesting one. Everything else in the output is of no interest now.
$ <output-4.1> | grep '<contentUpdateDate>'
Note - About the output
In the following commands, I will use the <output-4.?> notation to designate the output of the previous step's command.
Step 4.3 - Keep the numbers we want
In this step, I will remove the XML elements and keep only the line number and the date (the data inside the <contentUpdateDate> element). I will also place the date at the beginning of the line in order to be able to sort the lines in the next step.
$ <output-4.2> | sed 's/^\([0-9]\+\).*<contentUpdateDate>\([0-9]\+\)<\/.*$/\2-\1/g'
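To see the substitution in action on a single line (illustrative numbers again):
$ echo '2356-<contentUpdateDate>1383876486000</contentUpdateDate>' | sed 's/^\([0-9]\+\).*<contentUpdateDate>\([0-9]\+\)<\/.*$/\2-\1/g'
1383876486000-2356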
Step 4.4 - Sort by date and keep the last one
This is pretty easy since the sort
command exists. To keep only the last one,
we can use the tail
command.
$ <output-4.3> | sort | tail -n 1
Step 4.5 - Get the line number
Now, we just need to get the line number and save it in a variable. We will use
the awk
command (we could also use the sed
command) to get the second field
of our line which should look like 1383876486000-2356
(the first field is the
date and the second the line number).
$ LINE_NUMBER=`<output-4.4> | awk -F '-' '{print $2}'`
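To recap, here is the whole chain from step 4.1 to 4.5 glued together in a single command (nothing new, just the previous steps piped into each other):
$ LINE_NUMBER=`pcregrep -M -n -A 11 '<web>Main<\/web>[^<]*\n[^<]*<name>WebHome<\/name>' xwiki.xml | grep '<contentUpdateDate>' | sed 's/^\([0-9]\+\).*<contentUpdateDate>\([0-9]\+\)<\/.*$/\2-\1/g' | sort | tail -n 1 | awk -F '-' '{print $2}'`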
Step 4.6 - Find the end of the XWiki document
Now that we have the line number of the <contentUpdateDate>
of the document we
are looking for, we will parse the file from this element to find the first
</xwikidoc>
element (which is the close element of the document).
First of all, we will use grep -n '' in order to add the line numbers to our input. Then we will remove all the lines before our LINE_NUMBER with tail, because we want the first </xwikidoc> after this line number. We will filter the </xwikidoc> elements with grep and keep only the first one with head. Finally, we will extract the line number from this input with awk and store it in the END_LINE variable.
$ END_LINE=`cat xwiki.xml | grep -n '' | tail -n +$LINE_NUMBER | grep '<\/xwikidoc>' | head -n 1 | awk -F ':' '{print $1}'`
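If you wonder what grep -n '' does exactly: the empty pattern matches every line, so the command simply prefixes each line with its number.
$ printf 'foo\nbar\n' | grep -n ''
1:foo
2:bar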
Step 4.7 - Find the start of the XWiki document
The operation is nearly the same that for the step 4.6. We will just revert
some of the sub-steps. In this case, we will only keep the part of the document
that is before LINE_NUMBER
with head
. Then we will look for <?xml
pattern, keep the last found pattern and get then store the line number into the
BEGIN_LINE
variable.
$ BEGIN_LINE=`cat xwiki.xml | grep -n '' | head -n $(($END_LINE - 1)) | grep '<?xml' | tail -n 1 | awk -F ':' '{print $1}'`
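At this point, a quick sanity check doesn't hurt: the two bounds should delimit a plausible range (a few dozen lines for a small page).
$ echo "Document found between lines $BEGIN_LINE and $END_LINE"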
Step 4.8 - Get the XWiki document
We now have BEGIN_LINE
and END_LINE
, we will be able to keep only this part
of the database with head
and tail
with a bit of bash
arithmetic.
$ cat xwiki.xml | tail -n +$BEGIN_LINE | head -n $(($END_LINE - BEGIN_LINE + 1 )) > Main.WebHome.xml
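If you have xmllint (shipped with libxml2) installed, you can check that the extracted document is well-formed before going further:
$ xmllint --noout Main.WebHome.xml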
You can now get what's in the <content> element. It is also possible to get the information from the <object> elements (where you may also have code and configuration). If you want, you can probably create a XAR file from this XML file and import it into your new XWiki.
You may still need to be careful about the encoding of the file: I noticed that the encoding was latin1 in my case, but a copy/paste of the content part into a new XWiki page needs utf-8 encoding.
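If you hit the same problem, iconv can do the conversion in one step (assuming the same latin1 source encoding as in my case):
$ iconv -f latin1 -t utf-8 Main.WebHome.xml > Main.WebHome.utf8.xml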