Dealing with workflow methods. Part II

Going by the diagram from the previous post, what are the main problems with the implementation of the workflow engine in Documentum? Timeouts and error handling! Let me explain why.

Actually, I have no idea what EMC has been doing all this time, but the current implementation of the workflow engine is completely unreliable. The workflow agent manages the execution of automatic activities in an extremely odd way: it just sends HTTP requests to JMS (the Java Method Server) and waits for a response. On timeout it pauses the execution of the workflow, but meanwhile JMS continues to execute the automatic task, and sooner or later you will get something like:

DfException:: THREAD: http-0.0.0.0-9080-1; MSG: [DM_WORKFLOW_E_ACTION_NOT_ALLOWED]error:  "This operation is not allowed when the state is 'finished' for workitem '4a0011ec8004f500'."; ERRORCODE: 100; NEXT: null
    at com.documentum.fc.client.impl.docbase.DocbaseExceptionMapper.newException(DocbaseExceptionMapper.java:57)
    at com.documentum.fc.client.impl.connection.docbase.MessageEntry.getException(MessageEntry.java:39)
    at com.documentum.fc.client.impl.connection.docbase.DocbaseMessageManager.getException(DocbaseMessageManager.java:137)
    at com.documentum.fc.client.impl.connection.docbase.netwise.NetwiseDocbaseRpcClient.checkForMessages(NetwiseDocbaseRpcClient.java:310)
    at com.documentum.fc.client.impl.connection.docbase.netwise.NetwiseDocbaseRpcClient.applyForBool(NetwiseDocbaseRpcClient.java:354)
    at com.documentum.fc.client.impl.connection.docbase.DocbaseConnection$1.evaluate(DocbaseConnection.java:1151)
    at com.documentum.fc.client.impl.connection.docbase.DocbaseConnection.evaluateRpc(DocbaseConnection.java:1085)
    at com.documentum.fc.client.impl.connection.docbase.DocbaseConnection.applyForBool(DocbaseConnection.java:1144)
    at com.documentum.fc.client.impl.connection.docbase.DocbaseConnection.apply(DocbaseConnection.java:1129)
    at com.documentum.fc.client.impl.docbase.DocbaseApi.witemComplete(DocbaseApi.java:1193)
    at com.documentum.fc.client.DfWorkitem.completeEx2(DfWorkitem.java:505)
    at com.documentum.fc.client.DfWorkitem.completeEx(DfWorkitem.java:499)
    at com.documentum.bpm.DfWorkitemEx___PROXY.completeEx(DfWorkitemEx___PROXY.java)

Such errors are extremely painful because, before restarting failed workflow activities, you always need to investigate whether you actually need to re-execute the activity's body or not. For example, if an auto-activity failed due to a timeout and its body does something like i=i+i, you will get wrong data upon restart. And it is not a joke: when restarting failed auto-activities you can specify whether the activity's body should be executed or not. Webtop does allow this:

There is just a mistake in the API reference manual:

In order to skip execution of the activity's body you need to do something like:

API> fetch,c,4a024be980001502
...
OK
API> get,c,l,r_runtime_state
...
5
API> get,c,l,r_act_seqno
...
0
API> get,c,l,r_workflow_id
...
4d024be980001101
-- this places auto-activity into 
-- DM_INTERNAL_MANUAL_COMPLETE queue
-- and workflow agent won't pick it up
API> restart,c,4d024be980001101,0,T
...
OK
API> revert,c,4a024be980001502
...
OK
API> get,c,l,a_wq_name
...
DM_INTERNAL_MANUAL_COMPLETE
API> complete,c,4a024be980001502
...
OK
API> 

So far, so good: now we know how to skip execution of the activity's body, but we still need to investigate the root cause of why the auto-activity failed. Is it possible to prevent these painful timeouts at all? I do think that timeouts are a design gap in the workflow engine, because the workflow agent does not run inside the JMS context. However, we are forced to work with the current odd implementation and somehow try to resolve such issues. Typically, the Java code that serves auto-activity execution looks like:

public final int execute(Map params, PrintWriter printWriter) throws Exception {
	parseArguments(params);
	IDfSession session = null; 
			
	try {
		session = getSession();
		IDfWorkitem workitem = getWorkItem();
		if (workitem.getRuntimeState() == IDfWorkitem.DF_WI_STATE_DORMANT) {
			workitem.acquire();
		}
		
		// perform business logic
		
		workitem.complete();
		
		return 0;
	} finally {
		if (session != null) {
			release(session);
		}
	}
}

but the correct one is:

public final int execute(Map params, PrintWriter printWriter) throws Exception {
	parseArguments(params);
	IDfSession session = null;

	try {
		session = getSession();
		session.beginTrans();
		IDfWorkitem workitem = getWorkItem();
		if (workitem.getRuntimeState() == IDfWorkitem.DF_WI_STATE_DORMANT) {
			// this puts exclusive lock on workitem
			// in underlying database and prevents
			// workflow agent from pausing workitem
			workitem.acquire();
		} else if (workitem.getRuntimeState() == IDfWorkitem.DF_WI_STATE_ACQUIRED) {
			// in case of restart workitem state is already
			// acquired, so, we are unable to call acquire,
			// but still need to put exclusive lock in database
			workitem.lock();
		} else {
			throw new DfException("Invalid workitem state");
		}

		// perform business logic

		workitem.complete();
		session.commitTrans();

		return 0;
	} finally {
		if (session != null) {
			if (session.isTransactionActive()) {
				session.abortTrans();
			}
			release(session);
		}
	}
}
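
The race that the transaction and lock guard against can be reproduced in miniature without Documentum at all. The following is only an analogy in plain Java, not DFC code: a single-thread executor stands in for JMS, the calling thread stands in for the workflow agent, and the class and method names are made up for this sketch. It shows that a caller-side timeout does not stop the work on the other side.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical demo class: a plain-Java analogy, no DFC involved.
final class TimeoutRaceDemo {

	// Returns true when the caller timed out AND the task body
	// still ran to completion afterwards.
	static boolean callerTimesOutButTaskCompletes() {
		// stands in for JMS executing the method
		ExecutorService methodServer = Executors.newSingleThreadExecutor();
		CountDownLatch callerGaveUp = new CountDownLatch(1);
		AtomicBoolean bodyExecuted = new AtomicBoolean(false);
		try {
			Future<?> request = methodServer.submit(() -> {
				// the "slow" body: finishes only after the caller gave up
				callerGaveUp.await();
				bodyExecuted.set(true);
				return null;
			});

			boolean timedOut = false;
			try {
				// stands in for the agent waiting for the HTTP response
				request.get(50, TimeUnit.MILLISECONDS);
			} catch (TimeoutException e) {
				// the caller gives up, but nothing cancels the task
				timedOut = true;
			}

			callerGaveUp.countDown();
			// wait for the body to finish so the result is deterministic
			request.get(5, TimeUnit.SECONDS);
			return timedOut && bodyExecuted.get();
		} catch (InterruptedException | ExecutionException | TimeoutException e) {
			throw new IllegalStateException(e);
		} finally {
			methodServer.shutdownNow();
		}
	}
}
```

In this sketch the caller has already decided that the request failed by the time the body completes, which is the same situation as a workitem paused by the agent while JMS is still about to call complete on it.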

The next challenge is error handling. The problem is that when dealing with Documentum we may face a lot of weird errors. Some of these errors are soft (for example, DM_SYSOBJECT_E_VERSION_MISMATCH): to resolve them we just need to re-run the code. Others are not, and we need to investigate the root cause. In the case of soft errors it is a good idea to restart failed auto-activities automatically, so I invented the following pattern:

@Override
public final int execute(Map params, PrintWriter printWriter) throws Exception {
	parseArguments(params);
	IDfSession session = null;
	IDfWorkitem workitem = null;
	try {
		try {
			session = getSession();
			session.beginTrans();
			workitem = getWorkItem();
			if (workitem.getRuntimeState() == IDfWorkitem.DF_WI_STATE_DORMANT) {
				// this puts exclusive lock on workitem
				// in underlying database and prevents
				// workflow agent from pausing workitem
				workitem.acquire();
			} else if (workitem.getRuntimeState() == IDfWorkitem.DF_WI_STATE_ACQUIRED) {
				// in case of restart workitem state is already
				// acquired, so, we are unable to call acquire,
				// but still need to put exclusive lock in database
				workitem.lock();
			} else {
				throw new DfException("Invalid workitem state");
			}

			// perform business logic

			if (isSomethingWrong()) {
				haltWorkitem(workitem);
				session.commitTrans();
				return 0;
			}

			workitem.complete();
			session.commitTrans();

			return 0;
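		} finally {
			// session is null if getSession() itself failed
			if (session != null && session.isTransactionActive()) {
				session.abortTrans();
			}
		}
	} catch (DfException ex) {
		// without a workitem there is nothing to halt
		if (!isSoftException(ex) || workitem == null) {
			throw ex;
		}
		haltWorkitem(workitem);
		return 0;
	} finally {
		// release the session, as in the previous examples
		if (session != null) {
			release(session);
		}
	}
}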

protected void haltWorkitem(IDfWorkitem workitem) throws DfException {
	IDfSession session = workitem.getSession();
	IDfWorkflow workflow = (IDfWorkflow) session.getObject(workitem.getWorkflowId());
	// here transaction may be already inactive
	boolean txStartsHere = !session.isTransactionActive();
	try {
		// we need to start new transaction
		// in order to lock workitem
		if (txStartsHere) {
			session.beginTrans();
		}
		// exclusive access to workitem
		workitem.lock();
		workitem.revert();
		// restarting workitem - we are in transaction,
		// so workflow agent won't pick it up;
		// actually we need to check both workitem
		// and workflow states
		workflow.restart(workitem.getActSeqno());
		workitem.revert();
		// let the dm_WFSuspendTimer job restart
		// our workitem
		workflow.haltEx(workitem.getActSeqno(), 1);
		if (txStartsHere) {
			session.commitTrans();
			txStartsHere = false;
		}
	} finally {
		if (txStartsHere && session.isTransactionActive()) {
			session.abortTrans();
		}
	}
}
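
The isSoftException helper used in the pattern is not shown here. A minimal sketch of one way to write it, assuming that soft errors can be recognized by the bracketed message id in the exception text (the same format as [DM_WORKFLOW_E_ACTION_NOT_ALLOWED] in the stack trace above). The class name SoftErrorClassifier and the set of ids passed to it are hypothetical, and the sketch works on plain Throwable and getCause() so that it runs without DFC on the classpath. A real implementation would probably read the id from the DfException itself instead of parsing text.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decides whether an error is "soft" (safe to retry).
final class SoftErrorClassifier {

	// matches the bracketed id in messages such as
	// "[DM_SYSOBJECT_E_VERSION_MISMATCH]error: ..."
	private static final Pattern MESSAGE_ID =
			Pattern.compile("\\[(DM_[A-Z0-9_]+)\\]");

	// guard against very long or cyclic cause chains
	private static final int MAX_DEPTH = 16;

	private final Set<String> softIds;

	SoftErrorClassifier(Set<String> softIds) {
		this.softIds = new HashSet<>(softIds);
	}

	// returns the first DM_* id found in the message, or null if none
	static String extractMessageId(String message) {
		if (message == null) {
			return null;
		}
		Matcher matcher = MESSAGE_ID.matcher(message);
		return matcher.find() ? matcher.group(1) : null;
	}

	// true if the error or any of its causes carries a soft message id
	boolean isSoft(Throwable error) {
		Throwable current = error;
		for (int depth = 0; current != null && depth < MAX_DEPTH; depth++) {
			String id = extractMessageId(current.getMessage());
			if (id != null && softIds.contains(id)) {
				return true;
			}
			current = current.getCause();
		}
		return false;
	}
}
```

Keeping the list of soft ids outside the code (for example in the method arguments) would make it possible to extend it without redeploying the method, but that is a design choice rather than something the pattern requires.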
