Automatic recovery concepts

Automatic (warm) recovery uses the information stored inside the LIXA state server to decide which transactions must be recovered and how each one should be completed (committed or rolled back).

The paragraphs above explain what happens when automatic recovery starts and completes (rolls back or commits) a transaction marked as recovery pending.

An equivalent Application Program starts and activates the LIXA Transaction Manager with tx_open(). The LIXA Transaction Manager autonomously coordinates the transaction completion, and the Application Program is not aware of this behind-the-scenes operation.

Application Program equivalence

From the LIXA Transaction Manager's point of view, two Application Programs are equivalent when they are associated with the same job.

The job associated with an Application Program is one of the following:

  • the content of the environment variable LIXA_JOB if it is set

  • a string computed in this way if the environment variable LIXA_JOB is not set:

    branch qualifier + / + IP address

    where branch qualifier is computed as:

    MD5(lixac_conf.xml + $(LIXA_PROFILE) + gethostid())

An example of a branch qualifier is 0fc29445b1d4c3f4ed6be2fea20f918b, while an example of a job automatically associated with an Application Program is 0fc29445b1d4c3f4ed6be2fea20f918b/127.0.0.1.
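
The derivation above can be illustrated with a short Python sketch. It is not LIXA's implementation: the byte encoding, the way the three inputs are concatenated before hashing, and the helper names branch_qualifier and default_job are assumptions made for illustration, so the digests it produces are not expected to match the ones LIXA computes.

```python
import hashlib


def branch_qualifier(config_content: bytes, profile: str, host_id: int) -> str:
    """MD5 over the config file content, the LIXA_PROFILE value and the host id.

    Assumption: the inputs are simply concatenated; LIXA may encode them
    differently.
    """
    digest = hashlib.md5()
    digest.update(config_content)
    digest.update(profile.encode("utf-8"))
    digest.update(str(host_id).encode("utf-8"))
    return digest.hexdigest()


def default_job(config_content: bytes, profile: str, host_id: int, ip: str) -> str:
    """Job used when LIXA_JOB is not set: branch qualifier + / + IP address."""
    return branch_qualifier(config_content, profile, host_id) + "/" + ip


if __name__ == "__main__":
    config = b"<client>...</client>"  # placeholder, not a real lixac_conf.xml
    host_id = 0x007F0101              # hypothetical gethostid() value
    print(default_job(config, "PQL_STA_ORA_DYN", host_id, "127.0.0.1"))
```

The point of the sketch is that changing any one of the four inputs changes the resulting job, while identical inputs always give the same job.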

Note

If you don't set the environment variable LIXA_JOB, all the Application Programs that meet these requirements:

  • they use a config file (lixac_conf.xml) with the same content

  • they use a LIXA_PROFILE environment variable with the same content

  • they run on a host that returns the same value from the gethostid() function

  • they call the LIXA state server from the same IP address

are associated with the same job.

To find the job associated with an Application Program, you can activate tracing using the bit associated with the label LIXA_TRACE_MOD_CLIENT_CONFIG. See the section called “Tracing modules” for more information. This is an excerpt from the trace:

[...]	  
2011-12-03 17:00:59.746036 [6021/1078050640] client_config_job
2011-12-03 17:00:59.746073 [6021/1078050640] client_config_job: acquiring exclusive mutex
2011-12-03 17:00:59.746120 [6021/1078050640] client_config_job: 'LIXA_JOB' environment variable not found, computing job string...
2011-12-03 17:00:59.746175 [6021/1078050640] lixa_job_set_source_ip
2011-12-03 17:00:59.746275 [6021/1078050640] lixa_job_set_source_ip/excp=1/ret_cod=0/errno=0
2011-12-03 17:00:59.746339 [6021/1078050640] client_config_job: job value for this process is '0fc29445b1d4c3f4ed6be2fea20f918b/127.0.0.1      '
2011-12-03 17:00:59.746379 [6021/1078050640] client_config_job: releasing exclusive mutex
2011-12-03 17:00:59.746514 [6021/1078050640] client_config_job/excp=3/ret_cod=0/errno=0
[...]
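
If you save the trace to a file, the job can be pulled out of it with a few lines of Python. This is a minimal sketch that relies only on the message text shown in the excerpt above; the helper name job_from_trace is hypothetical, and stripping the trailing blanks of the traced value is a convenience chosen here for display.

```python
import re
from typing import Iterable, Optional

# Matches the message written by client_config_job in the excerpt above.
JOB_LINE = re.compile(
    r"client_config_job: job value for this process is '([^']*)'"
)


def job_from_trace(lines: Iterable[str]) -> Optional[str]:
    """Return the job reported in a LIXA client trace, or None if absent."""
    for line in lines:
        match = JOB_LINE.search(line)
        if match:
            # The traced value is padded with trailing blanks; drop them.
            return match.group(1).rstrip()
    return None


if __name__ == "__main__":
    sample = [
        "2011-12-03 17:00:59.746339 [6021/1078050640] client_config_job: "
        "job value for this process is "
        "'0fc29445b1d4c3f4ed6be2fea20f918b/127.0.0.1      '",
    ]
    print(job_from_trace(sample))
```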
	

Important

Setting the environment variable LIXA_JOB allows you to associate any Application Program with a custom, user-defined job. This can be useful in a workload balanced environment, but it is dangerous if you associate Application Programs that use different sets of Resource Managers with the same job.

If you don't set the LIXA_JOB environment variable, the default behavior should be robust enough to avoid issues when LIXA is used under standard conditions.

Automatic Recovery in a distributed environment

The previous section (see the section called “Application Program equivalence”) explains the conditions that must be met to enable automatic recovery. A typical scenario that needs tuning is a workload balanced Application Server environment, as shown in the picture below:

Figure 9.3. Workload balanced Application Server



The same program (Application Program 1) is executed by two different Application Servers: this is a typical configuration used to improve service availability and scalability. If Application Server 1 runs on a different host than Application Server 2 (this is a de facto standard), by default LIXA will associate the two instances with two different jobs.

Important

The LIXA default behavior is not the optimal one when you are using a workload balanced environment.

If the host of Application Server 1 crashes, the Application Program running inside Application Server 2 cannot automatically recover the transactions that Application Server 1 left in the prepared/in-doubt/recovery pending state, because they are associated with a different job.

This is a scenario in which setting LIXA_JOB is strongly recommended.
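
The effect of LIXA_JOB in this scenario can be sketched in Python. The function effective_job, the job name app_program_1 and the two default job strings are hypothetical; the sketch also assumes that "set" means the variable is present in the environment.

```python
from typing import Mapping


def effective_job(environ: Mapping[str, str], computed_default: str) -> str:
    """Return LIXA_JOB when it is set, otherwise the computed default job."""
    # Assumption: "set" means the variable is present in the environment.
    if "LIXA_JOB" in environ:
        return environ["LIXA_JOB"]
    return computed_default


if __name__ == "__main__":
    # Hypothetical default jobs of two Application Servers on different hosts.
    default_1 = "0fc29445b1d4c3f4ed6be2fea20f918b/192.0.2.1"
    default_2 = "5d41402abc4b2a76b9719d911017c592/192.0.2.2"

    # Without LIXA_JOB the jobs differ: server 2 cannot recover for server 1.
    print(effective_job({}, default_1) == effective_job({}, default_2))  # False

    # With the same LIXA_JOB on both servers the jobs coincide.
    shared = {"LIXA_JOB": "app_program_1"}
    print(effective_job(shared, default_1) == effective_job(shared, default_2))  # True
```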

Warning

When you set the LIXA_JOB environment variable to control the LIXA automatic recovery feature, you must not associate the same job with Application Programs that use different sets of Resource Managers, or that use the same set of Resource Managers but with different options for any Resource Manager. If you break this rule, you will probably face issues that are difficult to troubleshoot: automatic recovery could fail and you would have to work out why.

Forcing automatic recovery

Sometimes you need to force automatic recovery because the crashed Application Program is a one-shot program and you cannot execute it a second time due to some functional constraint.

Any Application Program meeting the requirements described above can be used, including the lixat utility command. The following example shows how it works using the PostgreSQL and Oracle Resource Managers.

First of all, you must configure, build and install the LIXA project software enabling PostgreSQL, Oracle and crash simulation features:

tiian@ubuntu:~/lixa$ ./configure --with-oracle=/usr/lib/oracle/xe/app/oracle/product/10.2.0/server \
> --with-postgresql-include=/usr/include/postgresql --with-postgresql-lib=/usr/lib \
> --enable-crash
	

then you must follow the steps described in the section called “An example with PostgreSQL & Oracle” to prepare the scenario environment. Open three different terminal sessions as explained in the above example, and try to insert/delete a row:

[Shell terminal session]
tiian@ubuntu:~/tmp$ echo $LIXA_PROFILE
PQL_STA_ORA_DYN
tiian@ubuntu:~/tmp$ echo $ORACLE_HOME
/usr/lib/oracle/xe/app/oracle/product/10.2.0/server
tiian@ubuntu:~/tmp$ echo $ORACLE_SID
XE
tiian@ubuntu:~/tmp$ echo $LD_LIBRARY_PATH
/usr/lib/oracle/xe/app/oracle/product/10.2.0/server/lib:
tiian@ubuntu:~/tmp$ ./example6_pql_ora insert
Inserting a row in the tables...
Oracle INSERT statement executed!
tiian@ubuntu:~/tmp$ ./example6_pql_ora delete
Deleting a row from the tables...
Oracle DELETE statement executed!
	  

To simulate a crash after xa_prepare() has completed successfully, you can set the environment variable LIXA_CRASH_POINT to the value of LIXA_CRASH_POINT_PREPARE_2 (see src/common/lixa_crash.h); the session below uses the numeric value 15:

[Shell terminal session]
tiian@ubuntu:~/tmp$ export LIXA_CRASH_POINT=15
tiian@ubuntu:~/tmp$ echo $LIXA_CRASH_POINT
15
tiian@ubuntu:~/tmp$ ./example6_pql_ora insert
Inserting a row in the tables...
Oracle INSERT statement executed!
Aborted
	  

You can check that there is a prepared (in-doubt) transaction inside Oracle:

[Oracle terminal session]
SQL> select * from dba_pending_transactions;

  FORMATID
----------
GLOBALID
--------------------------------------------------------------------------------
BRANCHID
--------------------------------------------------------------------------------
1279875137
97DD30A150604AFDBFA5FDC94B611FD5
9BAC7BE1C129EA6EE31F2D71B318120C
	  

And the same transaction inside PostgreSQL:

[PostgreSQL terminal session]
testdb=> select * from pg_prepared_xacts;
 transaction |                                    gid                                       |           prepared            | owner | database 
-------------+------------------------------------------------------------------------------+-------------------------------+-------+----------
         874 | 1279875137.97dd30a150604afdbfa5fdc94b611fd5.9bac7be1c129ea6ee31f2d71b318120c | 2011-12-14 22:02:50.462682+01 | tiian | testdb
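
The two listings show the same XID in different notations: Oracle prints three upper-case columns, PostgreSQL a single dot-separated lower-case string. The following Python sketch compares them; the helper names parse_xid and same_transaction are hypothetical, and it assumes the gid always has the three-part layout seen above.

```python
from typing import NamedTuple


class Xid(NamedTuple):
    format_id: int
    gtrid: str  # global transaction id, lower-case hex
    bqual: str  # branch qualifier, lower-case hex


def parse_xid(text: str) -> Xid:
    """Split a 'formatID.gtrid.bqual' string as shown in the gid column."""
    format_id, gtrid, bqual = text.strip().split(".")
    return Xid(int(format_id), gtrid.lower(), bqual.lower())


def same_transaction(pg_gid: str, ora_formatid: int,
                     ora_globalid: str, ora_branchid: str) -> bool:
    """Compare a PostgreSQL gid with the three Oracle columns, ignoring case."""
    xid = parse_xid(pg_gid)
    return (xid.format_id == ora_formatid
            and xid.gtrid == ora_globalid.lower()
            and xid.bqual == ora_branchid.lower())


if __name__ == "__main__":
    gid = ("1279875137.97dd30a150604afdbfa5fdc94b611fd5."
           "9bac7be1c129ea6ee31f2d71b318120c")
    print(same_transaction(gid, 1279875137,
                           "97DD30A150604AFDBFA5FDC94B611FD5",
                           "9BAC7BE1C129EA6EE31F2D71B318120C"))
    # The four bytes of the format ID spell "LIXA" in ASCII.
    print(parse_xid(gid).format_id.to_bytes(4, "big"))
```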
	  

We suggest activating the trace related to the client recovery module (see the section called “Tracing modules”) before running the lixat program:

[Shell terminal session]
tiian@ubuntu:~/tmp$ export LIXA_TRACE_MASK=0x00040000
tiian@ubuntu:~/tmp$ /opt/lixa/bin/lixat 
2011-12-14 22:22:01.740634 [27735/3073944240] client_recovery
2011-12-14 22:22:01.740771 [27735/3073944240] client_recovery: sending 197 bytes ('000191<?xml version="1.0" encoding="UTF-8" ?><msg level="0" verb="8" step="8"><client job="9bac7be1c129ea6ee31f2d71b318120c/127.0.0.1      " config_digest="9bac7be1c129ea6ee31f2d71b318120c"/></msg>') to the server for step 8
2011-12-14 22:22:01.759352 [27735/3073944240] client_recovery: receiving 561 bytes from the server |<?xml version="1.0" encoding="UTF-8" ?><msg level="0" verb="8" step="16"><answer rc="0"/><client job="9bac7be1c129ea6ee31f2d71b318120c/127.0.0.1      " config_digest="9bac7be1c129ea6ee31f2d71b318120c"><last_verb_step verb="5" step="16"/><state finished="0" txstate="3" will_commit="1" will_rollback="0" xid="1279875137.97dd30a150604afdbfa5fdc94b611fd5.9bac7be1c129ea6ee31f2d71b318120c"/></client><rsrmgrs><rsrmgr rmid="0" next_verb="0" r_state="1" s_state="33" td_state="10"/><rsrmgr rmid="1" next_verb="0" r_state="1" s_state="33" td_state="20"/></rsrmgrs></msg>|
2011-12-14 22:22:01.759776 [27735/3073944240] client_recovery_analyze
2011-12-14 22:22:01.759857 [27735/3073944240] client_recovery_analyze: the TX was committing
2011-12-14 22:22:01.759873 [27735/3073944240] client_recovery_analyze: rmid=0, r_state=1, s_state=33, td_state=10
2011-12-14 22:22:01.759884 [27735/3073944240] client_recovery_analyze: rmid=1, r_state=1, s_state=33, td_state=20
2011-12-14 22:22:01.759902 [27735/3073944240] client_recovery_analyze/excp=1/ret_cod=0/errno=0
2011-12-14 22:22:01.759921 [27735/3073944240] client_recovery: transaction '1279875137.97dd30a150604afdbfa5fdc94b611fd5.9bac7be1c129ea6ee31f2d71b318120c' must be committed
2011-12-14 22:22:01.759937 [27735/3073944240] client_recovery_commit
2011-12-14 22:22:01.759971 [27735/3073944240] client_recovery_commit: committing transaction '1279875137.97dd30a150604afdbfa5fdc94b611fd5.9bac7be1c129ea6ee31f2d71b318120c'
2011-12-14 22:22:01.759998 [27735/3073944240] client_recovery_commit: xa_commit for rmid=0, name='PostgreSQL_stareg', xa_name='PostgreSQL[LIXA]'...
2011-12-14 22:22:02.143764 [27735/3073944240] client_recovery_commit: rc=0
2011-12-14 22:22:02.143866 [27735/3073944240] client_recovery_commit: xa_commit for rmid=1, name='OracleXE_dynreg', xa_name='Oracle_XA'...
2011-12-14 22:22:03.188211 [27735/3073944240] client_recovery_commit: rc=0
2011-12-14 22:22:03.188272 [27735/3073944240] client_recovery_commit/excp=1/ret_cod=0/errno=0
2011-12-14 22:22:03.188318 [27735/3073944240] client_recovery: sending 187 bytes ('000181<?xml version="1.0" encoding="UTF-8" ?><msg level="0" verb="8" step="24"><recovery failed="0" commit="1"/><rsrmgrs><rsrmgr rmid="0" rc="0"/><rsrmgr rmid="1" rc="0"/></rsrmgrs></msg>') to the server for step 24
2011-12-14 22:22:03.188496 [27735/3073944240] client_recovery: sending 197 bytes ('000191<?xml version="1.0" encoding="UTF-8" ?><msg level="0" verb="8" step="8"><client job="9bac7be1c129ea6ee31f2d71b318120c/127.0.0.1      " config_digest="9bac7be1c129ea6ee31f2d71b318120c"/></msg>') to the server for step 8
2011-12-14 22:22:03.228361 [27735/3073944240] client_recovery: receiving 95 bytes from the server |<?xml version="1.0" encoding="UTF-8" ?><msg level="0" verb="8" step="16"><answer rc="1"/></msg>|
2011-12-14 22:22:03.228544 [27735/3073944240] client_recovery: the server answered LIXA_RC_OBJ_NOT_FOUND; there are no more transactions to recover
2011-12-14 22:22:03.228589 [27735/3073944240] client_recovery/excp=12/ret_cod=0/errno=0
tx_open(): 0
tx_close(): 0
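
The server answer traced above carries the decision in the will_commit and will_rollback attributes of the state element, and an answer with rc="1" means there is nothing left to recover. A minimal Python sketch of reading that answer follows. It is based only on the messages visible in this trace; the function name recovery_decision is hypothetical, and treating every non-zero rc as "nothing to do" is a simplification.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def recovery_decision(answer_xml: str) -> Optional[str]:
    """Return 'commit' or 'rollback', or None if no decision is carried."""
    msg = ET.fromstring(answer_xml.encode("utf-8"))
    answer = msg.find("answer")
    # Simplification: any rc other than "0" is treated as nothing to recover.
    if answer is None or answer.get("rc") != "0":
        return None
    state = msg.find("client/state")
    if state is None:
        return None
    if state.get("will_commit") == "1":
        return "commit"
    if state.get("will_rollback") == "1":
        return "rollback"
    return None


if __name__ == "__main__":
    no_more = ('<?xml version="1.0" encoding="UTF-8" ?>'
               '<msg level="0" verb="8" step="16"><answer rc="1"/></msg>')
    print(recovery_decision(no_more))
```

In the trace, the first answer has will_commit="1", which matches the "must be committed" message that follows it, and the second answer has rc="1", after which the client stops asking.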
	  

You can now verify there are no more prepared/in-doubt transactions inside the Resource Managers:

[Oracle terminal session]
SQL> select * from dba_pending_transactions;

no rows selected
	  

[PostgreSQL terminal session]
testdb=> select * from pg_prepared_xacts;
 transaction | gid | prepared | owner | database 
-------------+-----+----------+-------+----------
(0 rows)
	  

Important

The automatic (warm) recovery process completed successfully because ./example6_pql_ora and /opt/lixa/bin/lixat were associated with the same job and the LIXA state server (lixad) kept the state of the transaction in the meantime.

In the next paragraphs you can explore what happens if the previous conditions are not satisfied.