Greymeister.net

Jackrabbit Clustering Primer

Introduction

I’ve worked with Apache Jackrabbit, an implementation of the Java Content Repository API (also known as JCR, originally specified as JSR-170), for some time now, and found it a bit confusing to set up a load-balanced deployment. There are plenty of guides and documentation on the Jackrabbit wiki, but piecing them together in a way that makes sense took significant effort. The purpose of this blog post is to describe my approach in the hope that it makes things easier for others with the same goal. I have created a GitHub project with some of the key configuration files; please feel free to check it out and refer to it as I go along.

Technology stack

I will be describing how to use Jackrabbit 2.2.x with Apache Tomcat 6.0. I’ll also be using an NFS shared filesystem and a PostgreSQL 8.4 database. Depending on what you are using, my steps may not apply directly. If you find something that gives you problems, please let me know and I’ll see if I can help. I don’t think it matters much, but this will be performed on VMs running CentOS 5.x 64-bit. It should also work on any other system that Java SE runs on, but you may have to alter configuration details such as file paths and mount points.

Setting up the host systems

Assuming we have a vanilla GNU/Linux system, the first priority will be installing Tomcat. You may install using apt-get or some other package-based installation, but I tend to use the binary distribution, especially for test purposes. It essentially rules out mistakes I make when trying to conform to the directory layout of some other distribution. I initially create two directories, one for the Tomcat server and another for the repository. To follow along, I use:

/srv/tomcat                 # Location for Tomcat 6.x installation
/srv/repository/datastore   # Location for NFS mount point

I assign the owner as a user I’ve created named “tomcat”, mainly to avoid using the root user for running the Tomcat servers. Generally, this is just a user who does not have sudo privileges but who can SSH into the system. If your setup is similar, your /srv directory will probably look something like this:

drwxr-xr-x 5 tomcat users 4096 Nov 21 02:25 repository
drwxr-xr-x 9 tomcat users 4096 Nov 21 01:51 tomcat
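
The layout above can be created with a couple of commands. What follows is only a minimal sketch, assuming a POSIX shell; make_srv_layout is a hypothetical helper name, not something shipped with Tomcat or Jackrabbit.

```shell
# make_srv_layout is a hypothetical helper (an assumption for illustration).
# It creates the Tomcat directory and the repository directory, including the
# datastore subdirectory that will later serve as the mount point.
make_srv_layout() {
    base="$1"
    mkdir -p "$base/tomcat" "$base/repository/datastore"
}
```

Run as root, this would be called as make_srv_layout /srv, followed by chown -R tomcat:users /srv/tomcat /srv/repository before the remote share is mounted. The chown step assumes the tomcat user and users group already exist.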

I’ll also create a subdirectory “datastore” underneath the repository directory for the shared datastore mount point. Here is the relevant entry in my /etc/fstab file for each system:

//10.20.1.3/Public/datastore    /srv/repository/datastore   cifs    password="",uid=tomcat,gid=users    0 0

Any remote mount point will do, whether NFS or, as in the entry above, CIFS. It must be accessible by all nodes in the Jackrabbit cluster. Make sure that the mount is active and that the tomcat user has write privileges to it before proceeding.
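
One way to confirm the write access is to create and remove a probe file as the tomcat user. This is a sketch under assumptions: check_datastore is a hypothetical helper name, and it only proves that the directory is writable, not that it is actually a remote mount.

```shell
# check_datastore is a hypothetical helper (an assumption for illustration).
# It verifies that the given directory exists and that the current user can
# create a file inside it.
check_datastore() {
    dir="$1"
    # The directory must exist before it can be probed.
    [ -d "$dir" ] || { echo "missing: $dir" >&2; return 1; }
    probe="$dir/.write-probe.$$"
    # The subshell keeps a failed redirection from terminating the caller.
    if ( : > "$probe" ) 2>/dev/null; then
        rm -f "$probe"
        echo "writable: $dir"
    else
        echo "not writable: $dir" >&2
        return 1
    fi
}
```

On each node, run check_datastore /srv/repository/datastore as the tomcat user. To confirm the share is really mounted, grep /srv/repository/datastore /proc/mounts should print a matching line on Linux.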

Tomcat configuration

I’ll assume you’ve unzipped the Apache Tomcat binary and have the default layout. A directory listing should be similar to this one:

drwxr-xr-x 2 tomcat tomcat   4096 Nov 20 02:29 bin
drwxr-xr-x 3 tomcat tomcat   4096 Nov 21 01:09 conf
drwxr-xr-x 2 tomcat tomcat   4096 Nov 20 23:24 lib
-rw-r--r-- 1 tomcat tomcat  38657 Jan 10  2011 LICENSE
drwxr-xr-x 2 tomcat tomcat   4096 Nov 21 01:42 logs
-rw-r--r-- 1 tomcat tomcat    574 Jan 10  2011 NOTICE
-rw-r--r-- 1 tomcat tomcat   8672 Jan 10  2011 RELEASE-NOTES
-rw-r--r-- 1 tomcat tomcat   6836 Jan 10  2011 RUNNING.txt
drwxr-xr-x 8 tomcat tomcat   4096 Nov 28 06:55 temp
drwxr-xr-x 6 tomcat tomcat   4096 Nov 21 01:52 webapps
drwxr-xr-x 3 tomcat tomcat   4096 Nov 20 02:29 work

From now on, I’ll refer to this as CATALINA_HOME. Mine will be located in /srv/tomcat but yours can be anywhere else. I will not be referencing a CATALINA_BASE because I am not using a split Tomcat deployment. The Tomcat configuration consists of exposing both a PostgreSQL datasource and the Jackrabbit repository using JNDI. Inside of CATALINA_HOME/conf there are two XML files to edit. The first is server.xml. Edit the section “GlobalNamingResources” to contain a reference to your JDBC connection that the Jackrabbit repository will use.

  <!-- Global JNDI resources
       Documentation at /docs/jndi-resources-howto.html
  -->
  <GlobalNamingResources>
    <!-- Editable user database that can also be used by
         UserDatabaseRealm to authenticate users
    -->
    <Resource name="UserDatabase" auth="Container"
        type="org.apache.catalina.UserDatabase"
        description="User database that can be updated and saved"
        factory="org.apache.catalina.users.MemoryUserDatabaseFactory"
        pathname="conf/tomcat-users.xml" />

    <Resource name="jdbc/repository" auth="Container"
        type="javax.sql.DataSource" driverClassName="org.postgresql.Driver"
        url="jdbc:postgresql://192.168.0.8:5432/jr_repository"
        username="jackrabbit" password="jackrabbit"
        validationQuery="select version();"
        maxActive="20" maxIdle="10" maxWait="-1"/>

  </GlobalNamingResources>
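
The jdbc/repository resource above expects the database and role to exist already. The following sketch generates the two statements needed; repository_db_sql is a hypothetical helper name, the values jr_repository and jackrabbit are taken from the resource definition, and feeding the output to psql as the postgres superuser is an assumption about your setup.

```shell
# repository_db_sql is a hypothetical helper (an assumption for illustration).
# It prints the SQL that creates a login role and a database owned by it.
# Arguments are inserted verbatim, so they must not contain quote characters.
repository_db_sql() {
    db="$1"
    user="$2"
    pass="$3"
    cat <<EOF
CREATE USER $user WITH PASSWORD '$pass';
CREATE DATABASE $db OWNER $user;
EOF
}
```

For example, repository_db_sql jr_repository jackrabbit jackrabbit | psql -U postgres would create both on the database server. You may also need to allow connections from each cluster node in pg_hba.conf.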

You’ll need to alter the configuration to suit your needs. If you are using PostgreSQL like I am, all you need to do is create a database and user for the repository cluster. These must be the same for each cluster node. The UserDatabase section is not required; I left it in as a reference to the location in the server.xml file. The next file to edit is context.xml. You’ll need to add another JNDI resource, this time for the repository.

<!-- The contents of this file will be loaded for each web application -->
<Context>

    <!-- Default set of monitored resources -->
    <WatchedResource>WEB-INF/web.xml</WatchedResource>

    <!-- Uncomment this to disable session persistence across Tomcat restarts -->
    <!--
    <Manager pathname="" />
    -->

    <!-- Uncomment this to enable Comet connection tacking (provides events
         on session expiration as well as webapp lifecycle) -->
    <!--
    <Valve className="org.apache.catalina.valves.CometConnectionManagerValve" />
    -->

    <ResourceLink global="jdbc/repository"
      name="jdbc/repository"
      type="javax.sql.DataSource"/>

    <Resource name="jcr/repository"
        auth="Container"
        type="javax.jcr.Repository"
        factory="org.apache.jackrabbit.core.jndi.BindableRepositoryFactory"
        configFilePath="/srv/repository/repository.xml"
        repHomeDir="/srv/repository"/>

</Context>

You’ll notice that I left some of the default configuration in there as a reference. The only tags relevant to the Jackrabbit configuration are the ResourceLink to the jdbc/repository resource and the jcr/repository Resource definition. The configFilePath and repHomeDir attributes of that Resource must point to where the repository will live on the node; I am still using my /srv/repository location.

The last step is to make sure that the proper libraries are available for Tomcat to start the shared resources. I had some problems getting exactly the right jar files into my CATALINA_HOME/lib directory, so I’m just going to show a directory listing. Note that several of these jars are already present in the default Tomcat installation.

-rw-r--r-- 1 tomcat tomcat  481535 Nov 20 23:24 log4j-1.2.16.jar
-rw-r--r-- 1 tomcat tomcat    9753 Nov 20 23:24 slf4j-log4j12-1.6.1.jar
-rw-r--r-- 1 tomcat tomcat   62086 Nov 20 23:23 commons-pool-1.3.jar
-rw-r--r-- 1 tomcat tomcat 2512189 Nov 20 23:22 derby-10.5.3.0_1.jar
-rw-r--r-- 1 tomcat tomcat  740930 Nov 20 23:21 jackrabbit-spi-commons-2.2.0.jar
-rw-r--r-- 1 tomcat tomcat   26822 Nov 20 23:20 jackrabbit-spi-2.2.0.jar
-rw-r--r-- 1 tomcat tomcat  286499 Nov 20 23:20 jackrabbit-jcr-commons-2.2.0.jar
-rw-r--r-- 1 tomcat tomcat   25496 Nov 20 23:20 slf4j-api-1.6.1.jar
-rw-r--r-- 1 tomcat tomcat  575389 Nov 20 23:19 commons-collections-3.2.1.jar
-rw-r--r-- 1 tomcat tomcat  121757 Nov 20 23:19 commons-dbcp-1.2.2.jar
-rw-r--r-- 1 tomcat tomcat  109043 Nov 20 23:19 commons-io-1.4.jar
-rw-r--r-- 1 tomcat tomcat 4326608 Nov 20 23:19 netcdf-4.2-min.jar
-rw-r--r-- 1 tomcat tomcat  189284 Nov 20 23:19 concurrent-1.3.4.jar
-rw-r--r-- 1 tomcat tomcat   23861 Nov 20 23:19 jackrabbit-api-2.2.0.jar
-rw-r--r-- 1 tomcat tomcat 2117338 Nov 20 23:18 jackrabbit-core-2.2.0.jar
-rw-rw-r-- 1 tomcat tomcat  539510 Nov 20 22:51 postgresql-8.4-702.jdbc4.jar
-rw-r--r-- 1 tomcat tomcat   69246 Nov 20 22:46 jcr-2.0.jar
-rw-r--r-- 1 tomcat tomcat   15239 Jan 10  2011 annotations-api.jar
-rw-r--r-- 1 tomcat tomcat   53756 Jan 10  2011 catalina-ant.jar
-rw-r--r-- 1 tomcat tomcat  129739 Jan 10  2011 catalina-ha.jar
-rw-r--r-- 1 tomcat tomcat 1208895 Jan 10  2011 catalina.jar
-rw-r--r-- 1 tomcat tomcat  237317 Jan 10  2011 catalina-tribes.jar
-rw-r--r-- 1 tomcat tomcat 1563059 Jan 10  2011 ecj-3.3.1.jar
-rw-r--r-- 1 tomcat tomcat   33410 Jan 10  2011 el-api.jar
-rw-r--r-- 1 tomcat tomcat  112550 Jan 10  2011 jasper-el.jar
-rw-r--r-- 1 tomcat tomcat  526946 Jan 10  2011 jasper.jar
-rw-r--r-- 1 tomcat tomcat   76692 Jan 10  2011 jsp-api.jar
-rw-r--r-- 1 tomcat tomcat   88210 Jan 10  2011 servlet-api.jar
-rw-r--r-- 1 tomcat tomcat  762878 Jan 10  2011 tomcat-coyote.jar
-rw-r--r-- 1 tomcat tomcat  253526 Jan 10  2011 tomcat-dbcp.jar
-rw-r--r-- 1 tomcat tomcat   70034 Jan 10  2011 tomcat-i18n-es.jar
-rw-r--r-- 1 tomcat tomcat   51965 Jan 10  2011 tomcat-i18n-fr.jar
-rw-r--r-- 1 tomcat tomcat   55036 Jan 10  2011 tomcat-i18n-ja.jar
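
To compare your own lib directory against the listing above, a small check along these lines may help. It is a sketch only: missing_jars is a hypothetical helper name, and the glob patterns you pass in are your choice, based on the jar names shown above.

```shell
# missing_jars is a hypothetical helper (an assumption for illustration).
# Usage: missing_jars LIB_DIR PATTERN...
# Prints one line per glob pattern that matches no file in LIB_DIR and
# returns non-zero if anything is missing.
missing_jars() {
    lib_dir="$1"
    shift
    status=0
    for pattern in "$@"; do
        # $pattern is left unquoted on purpose so the shell expands the glob;
        # ls fails when the pattern matches nothing.
        if ! ls "$lib_dir"/$pattern >/dev/null 2>&1; then
            echo "missing: $pattern"
            status=1
        fi
    done
    return $status
}
```

For example: missing_jars /srv/tomcat/lib 'jcr-*.jar' 'jackrabbit-api-*.jar' 'jackrabbit-core-*.jar' 'jackrabbit-spi-*.jar' 'jackrabbit-jcr-commons-*.jar' 'postgresql-*.jar' 'slf4j-api-*.jar'. No output means every pattern matched at least one file.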

Repository configuration

The repository configuration resides mostly in one file, repository.xml, which must be at the root of each node’s repository location. The complete repository.xml file I’m using is in the linked GitHub project, so check that out for the full copy. Here I will describe only the sections I think are relevant.

<Repository>
    <!--
        virtual file system where the repository stores global state
        (e.g. registered namespaces, custom node types, etc.)    
    -->

    <FileSystem class="org.apache.jackrabbit.core.fs.db.DbFileSystem">
       <param name="driver" value="javax.naming.InitialContext" />
       <param name="url" value="java:comp/env/jdbc/repository" />
       <param name="schemaObjectPrefix" value="rep_"/>
       <param name="schema" value="postgresql"/>
   </FileSystem>

According to the Jackrabbit wiki clustering article, this repository-level FileSystem must be shared by all nodes. I am accessing it via the JNDI datasource set up in the previous section.

<!--
    data store configuration
-->
<DataStore class="org.apache.jackrabbit.core.data.FileDataStore">
    <param name="path" value="${rep.home}/datastore"/>
    <param name="minRecordLength" value="100"/>
</DataStore>

The DataStore implementation I’m using requires the file system to be shared amongst all the nodes. Here, I am pointing it at the mount point I created earlier on the file server, in a subdirectory of the repository home named “datastore”.

<Workspaces rootPath="${rep.home}/workspaces" defaultWorkspace="default"/>
<!--
    workspace configuration template:
    used to create the initial workspace if there's no workspace yet
-->
<Workspace name="${wsp.name}">
    <!--
        virtual file system of the workspace:
        class: FQN of class implementing the FileSystem interface
    -->
    <FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
        <param name="path" value="${wsp.home}"/>
    </FileSystem>
    <!--
        persistence manager of the workspace:
        class: FQN of class implementing the PersistenceManager interface
    -->
    <PersistenceManager class="org.apache.jackrabbit.core.persistence.pool.PostgreSQLPersistenceManager">
        <param name="driver" value="javax.naming.InitialContext"/>
        <param name="url" value="java:comp/env/jdbc/repository"/>
        <param name="schemaObjectPrefix" value="ws_"/>
        <param name="schema" value="postgresql"/>
    </PersistenceManager>

</Workspace>

This is the workspace configuration I’m using. It is pretty bare bones, but it uses a PersistenceManager based on the recommendations of the Jackrabbit wiki. Again, it will be pointed at the PostgreSQL JNDI datasource we set up in Tomcat.

   <Versioning rootPath="${rep.home}/version">
    <!--
        Configures the filesystem to use for versioning for the respective
        persistence manager
    -->
    <FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
        <param name="path" value="${rep.home}/version" />
    </FileSystem>

    <!--
        Configures the persistence manager to be used for persisting version state.
        Please note that the current versioning implementation is based on
        a 'normal' persistence manager, but this could change in future
        implementations.
    -->
    <PersistenceManager class="org.apache.jackrabbit.core.persistence.pool.PostgreSQLPersistenceManager">
        <param name="driver" value="javax.naming.InitialContext"/>
        <param name="url" value="java:comp/env/jdbc/repository"/>
        <param name="schemaObjectPrefix" value="version_"/>
        <param name="schema" value="postgresql"/>
    </PersistenceManager>
</Versioning>

This is the versioning configuration that I’m using. Again, make sure it is pointed at the PostgreSQL datasource using JNDI.

<!--
    Cluster configuration with system variables.

-->
<Cluster id="node1" syncDelay="2000">
    <Journal class="org.apache.jackrabbit.core.journal.DatabaseJournal">
        <param name="revision" value="${rep.home}/revision.log"/>
        <param name="driver" value="javax.naming.InitialContext"/>
        <param name="url" value="java:comp/env/jdbc/repository"/>
        <param name="databaseType" value="postgresql"/>
        <param name="schemaObjectPrefix" value="journal_"/>
    </Journal>
</Cluster>

The last section of the repository.xml file, and one of the most important, is the Cluster configuration. Again we point to the PostgreSQL datasource using JNDI, this time to store the journal. Each node records its changes in the journal and checks it for changes made by the other nodes every syncDelay milliseconds (2000 here), which is how the nodes arrive at a consistent view. The one piece of information that changes from node to node is the id attribute of the Cluster tag. This must be unique for every node.
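
Since only the Cluster id differs between nodes, each node’s repository.xml can be derived from a single master copy. The following is a minimal sketch, assuming the id attribute immediately follows the Cluster tag name as in the fragment above; with_cluster_id is a hypothetical helper name.

```shell
# with_cluster_id is a hypothetical helper (an assumption for illustration).
# It reads a repository.xml on standard input and writes it to standard
# output with the id attribute of the <Cluster> tag replaced by the given
# node id. Everything else passes through unchanged.
with_cluster_id() {
    node_id="$1"
    sed "s/<Cluster id=\"[^\"]*\"/<Cluster id=\"$node_id\"/"
}
```

For example, with_cluster_id node2 < repository.xml > repository.node2.xml produces the file for a second node, which you would then copy to /srv/repository/repository.xml on that machine.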

Testing

Start the Tomcat server on one of the nodes. The logs should show that Jackrabbit has started, and you should see some directories created in your repository home directory for workspaces and versioning. I would recommend deploying something like JCR-Explorer and connecting to your JCR using JNDI. You should be able to browse and add files to the repository. Note: be sure to use the JNDI name that we created, which within Tomcat will be java:comp/env/jcr/repository.
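
A quick way to check for those directories from the command line is sketched below. repo_home_initialized is a hypothetical helper name, and the two subdirectory names come from the Workspaces and Versioning rootPath settings in the configuration shown earlier.

```shell
# repo_home_initialized is a hypothetical helper (an assumption for
# illustration). It checks that the repository home contains the workspaces
# and version directories configured in repository.xml.
repo_home_initialized() {
    home="$1"
    for sub in workspaces version; do
        if [ ! -d "$home/$sub" ]; then
            echo "not created yet: $home/$sub"
            return 1
        fi
    done
    echo "repository home looks initialized: $home"
}
```

With my layout this would be called as repo_home_initialized /srv/repository once Tomcat has finished starting.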


Adding additional nodes

At this point, all we’ve really created is a single Jackrabbit server running on Tomcat. However, the next step allows a load balanced configuration. The Jackrabbit Wiki notes that there are some limitations, but based on this default configuration, it is very easy to add additional nodes. If you have been using a VM like me, all you need to do is:

  • Shut down Tomcat
  • Create a clone of the GNU/Linux machine with the configuration on it
  • Update networking/etc to not conflict with the first node
  • Get the current revision number from your first node. Here’s an SQL query on the PostgreSQL datasource:
select * from journal_local_revisions where journal_id = 'node1';
  • Start up both GNU/Linux machines, and start Tomcat on the first node
  • Update the repository.xml file on the second node and give it a new Cluster ID, for example “node2”
  • Insert the new node’s local revision into the JOURNAL_LOCAL_REVISIONS table, again with my configuration:
insert into journal_local_revisions (journal_id, revision_id) values ('node2', 1);
  • Note: replace the revision_id value in the insert statement with the number you got from the select statement.
  • Start Tomcat on the second node.
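
The revision bookkeeping in the steps above can be scripted so the number is not copied by hand. This is a sketch under assumptions: join_node_sql is a hypothetical helper name, and the psql connection details in the comments are taken from the datasource defined earlier in this post.

```shell
# join_node_sql is a hypothetical helper (an assumption for illustration).
# Usage: join_node_sql NEW_NODE_ID REVISION
# Prints the insert statement that registers a new cluster node at the given
# journal revision. It refuses anything that is not a plain number.
join_node_sql() {
    node_id="$1"
    revision="$2"
    case "$revision" in
        ''|*[!0-9]*)
            echo "revision must be a non-negative integer" >&2
            return 1
            ;;
    esac
    printf "insert into journal_local_revisions (journal_id, revision_id) values ('%s', %s);\n" \
        "$node_id" "$revision"
}

# Possible usage (not executed here); capture the revision while Tomcat is
# shut down, as in the steps above:
#   rev=$(psql -h 192.168.0.8 -U jackrabbit -d jr_repository -At \
#         -c "select revision_id from journal_local_revisions where journal_id = 'node1'")
#   join_node_sql node2 "$rev" | psql -h 192.168.0.8 -U jackrabbit -d jr_repository
```

The -A and -t options make psql print the bare value without alignment or headers, so it can be passed straight to the helper.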

Now, whenever you modify either repository directly (for example with JCR-Explorer), you will see that the nodes synchronize using the journal. You can now put any type of load balancing technology you wish in front of any number of Jackrabbit nodes, and have a fault-tolerant repository.