Why you need dfc.session.keepalive.enable

I have spent two days troubleshooting yet another DFC issue; however, it was worth it.

Two days ago my colleagues started complaining about weird Webtop performance – sometimes the response time was good, sometimes it got stuck (actually, for their configuration, i.e. 48 WebLogic nodes, this is a kind of “expected” behaviour – the challenge is to find the failing nodes). Further investigation revealed that the problem nodes got stuck with the following stack:

"[STUCK] ExecuteThread: '114' for queue: 'weblogic.kernel.Default (self-tuning)'" daemon prio=10 tid=0x00007f3624096000 nid=0x7606 runnable [0x00007f34ae6e3000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.FileDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
        at sun.nio.ch.IOUtil.read(IOUtil.java:171)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
        - locked <0x000000061f780c10> (a java.lang.Object)
        at com.documentum.fc.impl.util.io.MessageChannel.readSocket(MessageChannel.java:123)
        at com.documentum.fc.impl.util.io.MessageChannel.readLength(MessageChannel.java:100)
        at com.documentum.fc.impl.util.io.MessageChannel.getIncomingMessageLength(MessageChannel.java:92)
        at com.documentum.fc.impl.util.io.MessageChannel.read(MessageChannel.java:77)
        - locked <0x000000061f77edb8> (a com.documentum.fc.impl.util.io.MessageChannel)
        at com.documentum.fc.client.impl.connection.netwise.AbstractNetwiseRpcClient.receiveMessage(AbstractNetwiseRpcClient.java:183)
        at com.documentum.fc.client.impl.connection.docbase.netwise.NetwiseDocbaseRpcClient.applyForObject(NetwiseDocbaseRpcClient.java:638)
        - locked <0x000000061f77edd8> (a com.documentum.fc.client.impl.connection.docbase.netwise.NetwiseDocbaseRpcClient)
        at com.documentum.fc.client.impl.connection.docbase.DocbaseConnection$8.evaluate(DocbaseConnection.java:1372)
        - locked <0x000000061f77ee28> (a com.documentum.fc.client.impl.connection.docbase.DocbaseConnection)
        at com.documentum.fc.client.impl.connection.docbase.DocbaseConnection.evaluateRpc(DocbaseConnection.java:1131)
        - locked <0x000000061f77ee28> (a com.documentum.fc.client.impl.connection.docbase.DocbaseConnection)
        at com.documentum.fc.client.impl.connection.docbase.DocbaseConnection.applyForObject(DocbaseConnection.java:1364)
        at com.documentum.fc.client.impl.docbase.DocbaseApi.authenticateUser(DocbaseApi.java:1867)
        at com.documentum.fc.client.impl.connection.docbase.DocbaseConnection.authenticate(DocbaseConnection.java:436)
        at com.documentum.fc.client.impl.connection.docbase.DocbaseConnectionManager.authenticateConnection(DocbaseConnectionManager.java:356)
        at com.documentum.fc.client.impl.connection.docbase.DocbaseConnectionManager.assignConnection(DocbaseConnectionManager.java:202)
        at com.documentum.fc.client.impl.connection.docbase.DocbaseConnectionManager.getDocbaseConnection(DocbaseConnectionManager.java:105)
        at com.documentum.fc.client.impl.session.SessionFactory.newSession(SessionFactory.java:23)
        at com.documentum.fc.client.impl.session.PrincipalAwareSessionFactory.newSession(PrincipalAwareSessionFactory.java:44)
        at com.documentum.fc.client.impl.session.PooledSessionFactory.newSession(PooledSessionFactory.java:49)
        at com.documentum.fc.client.impl.session.SessionManager.getSessionFromFactory(SessionManager.java:134)
        at com.documentum.fc.client.impl.session.SessionManager.newSession(SessionManager.java:72)
        at com.documentum.fc.client.impl.session.SessionManager.getSession(SessionManager.java:191)
        - locked <0x0000000619d5a840> (a com.documentum.fc.client.impl.session.SessionManager)
        at com.documentum.web.formext.docbase.FormatService.queryFormats(FormatService.java:529)
        at com.documentum.web.formext.docbase.FormatService.getFormats(FormatService.java:581)
        - locked <0x00000005edaa41d8> (a java.lang.String)
        at com.documentum.web.formext.docbase.FormatService.getFormatsByFileExtension(FormatService.java:293)
        at com.documentum.web.formext.docbase.FormatService.guessFormatFromFileExtension(FormatService.java:376)
        at com.documentum.web.formext.docbase.FormatService.guessFormatFromFileExtension(FormatService.java:357)
        at com.documentum.web.servlet.FileFormatIconResolver.service(FileFormatIconResolver.java:122)

I have no idea why, but EMC developers love the following anti-pattern:

synchronized (sharedResource) {
    // perform docbase calls here without having
    // exclusive access to the underlying docbase session
}
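For contrast, a minimal sketch of the safer shape: perform the expensive remote call with no shared lock held, and take the lock only to publish the in-memory result. The names here (FormatCache, the Supplier plumbing) are illustrative, not Webtop's actual API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

class FormatCache {
    private final Object lock = new Object();
    private volatile Map<String, String> cache;

    Map<String, String> get(Supplier<Map<String, String>> remoteCall) {
        Map<String, String> local = cache;
        if (local != null) {
            return local;               // fast path: no lock, no network I/O
        }
        local = remoteCall.get();       // slow remote round-trip runs UNLOCKED
        synchronized (lock) {           // lock guards only the cheap publish
            if (cache == null) {
                cache = local;
            }
            return cache;
        }
    }

    public static void main(String[] args) {
        FormatCache c = new FormatCache();
        Map<String, String> m = c.get(() -> {
            Map<String, String> r = new HashMap<>();
            r.put("pdf", "application/pdf");
            return r;
        });
        System.out.println(m.get("pdf")); // prints "application/pdf"
    }
}
```

A thread stuck in the remote call no longer holds a monitor that every other thread needs, so a single dead socket cannot serialize the whole application server.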

For that reason we had already replaced plenty of Webtop classes with our own correct, non-blocking implementations, and this was exactly that case – the wrong pattern. BUT there was also something unusual: as is obvious from the stack, the problem thread got stuck on authentication (“DocbaseConnection.authenticate(DocbaseConnection.java:436)”). Typically such an operation does not take much time, but WebLogic had marked the thread as stuck, which means the thread had not changed its state for at least the last 15 minutes (the actual period was 15 hours :)). Sounds weird, doesn’t it? So, what is the issue? It took me about a day to create a semi-reproducible testcase:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

import com.documentum.com.DfClientX;
import com.documentum.fc.client.DfQuery;
import com.documentum.fc.client.IDfClient;
import com.documentum.fc.client.IDfCollection;
import com.documentum.fc.client.IDfQuery;
import com.documentum.fc.client.IDfSession;
import com.documentum.fc.client.IDfSessionManager;
import com.documentum.fc.common.DfException;
import com.documentum.fc.common.DfLoginInfo;

/**
 * @author Andrey B. Panfilov <andrey@panfilov.tel>
 */
public class Test implements Runnable {

	private static final Map<Integer, Thread> THREADS = new HashMap<Integer, Thread>();

	private static final List<AtomicLong> COUNTERS = new ArrayList<AtomicLong>();

	public static final IDfClient CLIENT;

	public static final int THREAD_COUNT = 20;

	public static final AtomicInteger INPROGRESS = new AtomicInteger(THREAD_COUNT);

	public static final Semaphore SEMAPHORE = new Semaphore(THREAD_COUNT);

	static {
		try {
			CLIENT = new DfClientX().getLocalClient();
		} catch (DfException ex) {
			throw new RuntimeException(ex);
		}
	}

	private int _threadNo;

	public Test(int threadNo) {
		_threadNo = threadNo;
	}

	public static void main(String[] args) throws Exception {

		SEMAPHORE.acquire(THREAD_COUNT);

		for (int i = 0; i < THREAD_COUNT; i++) {
			Thread t = new Thread(new Test(i));
			THREADS.put(i, t);
			COUNTERS.add(new AtomicLong(0));
			t.start();
		}

		while (INPROGRESS.get() > 0) {
			Thread.sleep(1000);
		}

		SEMAPHORE.release(THREAD_COUNT);

		int it = 0;
		while (true) {
			it++;
			long[] l = new long[COUNTERS.size()];
			for (int i = 0, n = COUNTERS.size(); i < n; i++) {
				l[i] = COUNTERS.get(i).get();
			}
			Thread.sleep(10000);
			int alive = 0;
			int stuck = 0;
			for (int i = 0, n = COUNTERS.size(); i < n; i++) {
				Thread t = THREADS.get(i);
				boolean isAlive = t.isAlive();
				if (isAlive) {
					alive++;
					long val = COUNTERS.get(i).get();
					if (val == l[i]) {
						stuck++;
					}
				}
			}
			System.out.println("Iteration: " + it + ", alive: " + alive + ", stuck: " + stuck);
			if (alive == 0) {
				return;
			}
		}

	}

	@Override
	public void run() {
		IDfSession session = null;
		try {
			IDfSessionManager sessionManager = CLIENT.newSessionManager();
			DfLoginInfo loginInfo = new DfLoginInfo("dmadmin", "dmadmin");
			loginInfo.setForceAuthentication(false);
			loginInfo.setPeriodicAuthentication(false);
			sessionManager.setIdentity("DCTM_DEV", loginInfo);
			session = sessionManager.getSession("DCTM_DEV");
			INPROGRESS.decrementAndGet();
			SEMAPHORE.acquire();
			while (true) {
				doJob(session);
				COUNTERS.get(_threadNo).incrementAndGet();
			}
		} catch (Exception ex) {
			// ignore
			// do not disconnect because we need to count time properly
		}
	}

	public static void doJob(IDfSession session) throws DfException {
		IDfQuery query = new DfQuery("SELECT user_name FROM dm_user ENABLE(RETURN_TOP 10)");
		IDfCollection collection = query.execute(session, IDfQuery.DF_EXEC_QUERY);
		while (collection.next()) {
			collection.getString("user_name");
		}
		// no finally block because we need to count time properly
		collection.close();
	}

}

which requires a somewhat tricky execution procedure:

  1. start the testcase
  2. after 10-20 seconds perform a hard shutdown of the CS host (echo o > /proc/sysrq-trigger)
  3. if after 20 minutes the testcase reports stuck threads:
    Iteration: 114, alive: 2, stuck: 2
    Iteration: 115, alive: 2, stuck: 2
    Iteration: 116, alive: 2, stuck: 2
    Iteration: 117, alive: 2, stuck: 2
    Iteration: 118, alive: 2, stuck: 2

    congratulations, you have caught it!

Now, the interesting thing: when the testcase reproduces, we see the following execution stack:

"Thread-26" prio=10 tid=0x00007f84682f9000 nid=0x15d06 runnable [0x00007f84617bd000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.FileDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
        at sun.nio.ch.IOUtil.read(IOUtil.java:171)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
        - locked <0x00000000f20e45e8> (a java.lang.Object)
        at com.documentum.fc.impl.util.io.MessageChannel.readSocket(MessageChannel.java:123)
        at com.documentum.fc.impl.util.io.MessageChannel.readLength(MessageChannel.java:100)
        at com.documentum.fc.impl.util.io.MessageChannel.getIncomingMessageLength(MessageChannel.java:92)
        at com.documentum.fc.impl.util.io.MessageChannel.read(MessageChannel.java:77)
        - locked <0x00000000f242b850> (a com.documentum.fc.impl.util.io.MessageChannel)
        at com.documentum.fc.client.impl.connection.netwise.AbstractNetwiseRpcClient.receiveMessage(AbstractNetwiseRpcClient.java:183)
        at com.documentum.fc.client.impl.connection.docbase.netwise.NetwiseDocbaseRpcClient.applyForCollection(NetwiseDocbaseRpcClient.java:414)
        - locked <0x00000000f242b7c8> (a com.documentum.fc.client.impl.connection.docbase.netwise.NetwiseDocbaseRpcClient)
        at com.documentum.fc.client.impl.connection.docbase.DocbaseConnection$3.evaluate(DocbaseConnection.java:1277)
        - locked <0x00000000f23f5e08> (a com.documentum.fc.client.impl.connection.docbase.DocbaseConnection)
        at com.documentum.fc.client.impl.connection.docbase.DocbaseConnection.evaluateRpc(DocbaseConnection.java:1139)
        - locked <0x00000000f23f5e08> (a com.documentum.fc.client.impl.connection.docbase.DocbaseConnection)
        at com.documentum.fc.client.impl.connection.docbase.DocbaseConnection.applyForCollection(DocbaseConnection.java:1265)
        at com.documentum.fc.client.impl.docbase.DocbaseApi.exec(DocbaseApi.java:83)
        at com.documentum.fc.client.impl.session.Session.query(Session.java:3630)
        at com.documentum.fc.client.impl.session.SessionHandle.query(SessionHandle.java:2322)
        at com.documentum.fc.client.DfQuery.runQuery(DfQuery.java:167)
        at com.documentum.fc.client.DfQuery.execute(DfQuery.java:216)
        at Test.doJob(Test.java:123)
        at Test.run(Test.java:112)
        at java.lang.Thread.run(Thread.java:662)

and the following connection state from the OS perspective:

[dmadmin@docu72dev02 ~]$ netstat -na --timers| sed -e '1,2p;/10000/!d'
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address               Foreign Address             State       Timer
tcp        0      0 ::ffff:172.16.49.138:47424  ::ffff:172.16.49.131:10000  ESTABLISHED off (0.00/0/0)
tcp        0      0 ::ffff:172.16.49.138:47406  ::ffff:172.16.49.131:10000  ESTABLISHED off (0.00/0/0)

And this is weird: the operating system thinks the connections to the CS are idle, but Java is still waiting for a CS response! It looks like a mismatch between Java and the operating system. Note that it is much easier to reproduce the testcase on SSL connections – with plain connections I rarely get more than one or two stuck sockets, but with SSL even the first attempt produced plenty of them:

[dmadmin@docu72dev02 ~]$  netstat -na --timers| sed -e '1,2p;/10001/!d'
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address               Foreign Address             State       Timer
tcp        0      0 ::ffff:172.16.49.138:55960  ::ffff:172.16.49.131:10001  ESTABLISHED off (0.00/0/0)
tcp        0      0 ::ffff:172.16.49.138:55951  ::ffff:172.16.49.131:10001  ESTABLISHED off (0.00/0/0)
tcp        0      0 ::ffff:172.16.49.138:55954  ::ffff:172.16.49.131:10001  ESTABLISHED off (0.00/0/0)
tcp        0    122 ::ffff:172.16.49.138:55958  ::ffff:172.16.49.131:10001  ESTABLISHED on (73.96/10/0)
tcp        0    122 ::ffff:172.16.49.138:55961  ::ffff:172.16.49.131:10001  ESTABLISHED on (70.89/10/0)
tcp        0      0 ::ffff:172.16.49.138:55957  ::ffff:172.16.49.131:10001  ESTABLISHED off (0.00/0/0)
tcp        0      0 ::ffff:172.16.49.138:55959  ::ffff:172.16.49.131:10001  ESTABLISHED off (0.00/0/0)
tcp        0      0 ::ffff:172.16.49.138:55950  ::ffff:172.16.49.131:10001  ESTABLISHED off (0.00/0/0)
tcp        0    314 ::ffff:172.16.49.138:55949  ::ffff:172.16.49.131:10001  ESTABLISHED on (72.94/10/0)
tcp        0      0 ::ffff:172.16.49.138:55967  ::ffff:172.16.49.131:10001  ESTABLISHED off (0.00/0/0)
tcp        0    122 ::ffff:172.16.49.138:55955  ::ffff:172.16.49.131:10001  ESTABLISHED on (71.91/10/0)
tcp        0    122 ::ffff:172.16.49.138:55952  ::ffff:172.16.49.131:10001  ESTABLISHED on (69.87/10/0)
tcp        0    122 ::ffff:172.16.49.138:55965  ::ffff:172.16.49.131:10001  ESTABLISHED on (70.89/10/0)
tcp        0    122 ::ffff:172.16.49.138:55948  ::ffff:172.16.49.131:10001  ESTABLISHED on (70.89/10/0)
tcp        0    122 ::ffff:172.16.49.138:55964  ::ffff:172.16.49.131:10001  ESTABLISHED on (70.89/10/0)
tcp        0      0 ::ffff:172.16.49.138:55956  ::ffff:172.16.49.131:10001  ESTABLISHED off (0.00/0/0)
tcp        0    122 ::ffff:172.16.49.138:55962  ::ffff:172.16.49.131:10001  ESTABLISHED on (71.91/10/0)
tcp        0    122 ::ffff:172.16.49.138:55953  ::ffff:172.16.49.131:10001  ESTABLISHED on (70.89/10/0)
tcp        0    122 ::ffff:172.16.49.138:55963  ::ffff:172.16.49.131:10001  ESTABLISHED on (71.91/10/0)
tcp        0      0 ::ffff:172.16.49.138:55966  ::ffff:172.16.49.131:10001  ESTABLISHED off (0.00/0/0)

Now let’s read EMC’s documentation (I have criticized EMC’s whitepapers before):

DFC application hangs on the Linux machine

If the client machine running the DFC client application is a Linux machine, and if the server to which it is connected goes off the network, the socket takes a very long time to time out. Occasionally, this causes the application server hosting the DFC client application to stop responding. This problem is related to low-level operating system socket handling. The developer must change the operating system level networking kernel settings and enable the dfc.session.keepalive.enable property in the dfc.properties file to resolve this problem. It is also recommended that application administrators collaborate with server administrators before making the change. Server administrators must identify appropriate settings for the kernel parameters based on their performance requirements and deployment.

Operating system level settings for Linux:

  • net.ipv4.tcp_keepalive_time = 1
  • net.ipv4.tcp_keepalive_intvl = 5
  • net.ipv4.tcp_keepalive_probes = 3
  • net.ipv4.tcp_retries2 = 3

What is the problem here? First, the value EMC suggests for the last parameter (net.ipv4.tcp_retries2) is wrong – this parameter governs RTO retransmissions and is irrelevant to the situation where “the socket takes very long time to timeout” – 15 minutes is not a “long time”. Below is the full description of this parameter:

tcp_retries2 - INTEGER
    This value influences the timeout of an alive TCP connection,
    when RTO retransmissions remain unacknowledged.
    Given a value of N, a hypothetical TCP connection following
    exponential backoff with an initial RTO of TCP_RTO_MIN would
    retransmit N times before killing the connection at the (N+1)th RTO.

    The default value of 15 yields a hypothetical timeout of 924.6
    seconds and is a lower bound for the effective timeout.
    TCP will effectively time out at the first RTO which exceeds the
    hypothetical timeout.

    RFC 1122 recommends at least 100 seconds for the timeout,
    which corresponds to a value of at least 8.

Second, fiddling with keepalives looks like an attempt not to solve the problem but to apply a crude hack: Java/the operating system does not fire a timer for our connections, so let’s introduce keepalive timeouts instead. However, the setting really does work – with dfc.session.keepalive.enable enabled I get the following connection state:

[dmadmin@docu72dev02 ~]$  netstat -na --timers| sed -e '1,2p;/10001/!d'
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address               Foreign Address             State       Timer
tcp        0      0 ::ffff:172.16.49.138:55988  ::ffff:172.16.49.131:10001  ESTABLISHED keepalive (7142.37/0/0)
tcp        0      0 ::ffff:172.16.49.138:55991  ::ffff:172.16.49.131:10001  ESTABLISHED keepalive (7142.37/0/0)
tcp        0      0 ::ffff:172.16.49.138:55999  ::ffff:172.16.49.131:10001  ESTABLISHED keepalive (7143.37/0/0)
tcp        0      0 ::ffff:172.16.49.138:55996  ::ffff:172.16.49.131:10001  ESTABLISHED keepalive (7142.37/0/0)
tcp        0      0 ::ffff:172.16.49.138:55994  ::ffff:172.16.49.131:10001  ESTABLISHED keepalive (7142.37/0/0)
tcp        0    314 ::ffff:172.16.49.138:55989  ::ffff:172.16.49.131:10001  ESTABLISHED on (3.09/5/0)
tcp        0      0 ::ffff:172.16.49.138:55998  ::ffff:172.16.49.131:10001  ESTABLISHED keepalive (7143.37/0/0)
tcp        0      0 ::ffff:172.16.49.138:55986  ::ffff:172.16.49.131:10001  ESTABLISHED keepalive (7142.37/0/0)
tcp        0      0 ::ffff:172.16.49.138:56003  ::ffff:172.16.49.131:10001  ESTABLISHED keepalive (7143.37/0/0)
tcp        0      0 ::ffff:172.16.49.138:55997  ::ffff:172.16.49.131:10001  ESTABLISHED keepalive (7142.37/0/0)
tcp        0      0 ::ffff:172.16.49.138:56000  ::ffff:172.16.49.131:10001  ESTABLISHED keepalive (7143.37/0/0)
tcp        0      0 ::ffff:172.16.49.138:55992  ::ffff:172.16.49.131:10001  ESTABLISHED keepalive (7142.37/0/0)
tcp        0    314 ::ffff:172.16.49.138:55985  ::ffff:172.16.49.131:10001  ESTABLISHED on (3.11/5/0)
tcp        0      0 ::ffff:172.16.49.138:55990  ::ffff:172.16.49.131:10001  ESTABLISHED keepalive (7142.37/0/0)
tcp        0      0 ::ffff:172.16.49.138:56001  ::ffff:172.16.49.131:10001  ESTABLISHED keepalive (7143.37/0/0)
tcp        0      0 ::ffff:172.16.49.138:56002  ::ffff:172.16.49.131:10001  ESTABLISHED keepalive (7143.37/0/0)
tcp        0      0 ::ffff:172.16.49.138:55984  ::ffff:172.16.49.131:10001  ESTABLISHED keepalive (7142.36/0/0)
tcp        0      0 ::ffff:172.16.49.138:55993  ::ffff:172.16.49.131:10001  ESTABLISHED keepalive (7142.37/0/0)
tcp        0      0 ::ffff:172.16.49.138:55995  ::ffff:172.16.49.131:10001  ESTABLISHED keepalive (7142.37/0/0)
tcp        0      0 ::ffff:172.16.49.138:55987  ::ffff:172.16.49.131:10001  ESTABLISHED keepalive (7142.37/0/0)

which promises that the operating system will tear down stuck connections after (tcp_keepalive_time + tcp_keepalive_probes * tcp_keepalive_intvl) seconds.
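Presumably, all dfc.session.keepalive.enable does under the hood is ask the JVM to set SO_KEEPALIVE on the DFC socket, which puts the connection under the control of the kernel keepalive sysctls quoted above (this is an assumption – DFC's internals are not documented). A minimal sketch with plain java.net sockets:

```java
import java.net.Socket;
import java.net.SocketException;

class KeepaliveSketch {
    // Enables SO_KEEPALIVE on the socket's file descriptor; probe timing is
    // NOT configurable from classic java.net – it comes from the kernel
    // sysctls (tcp_keepalive_time/intvl/probes).
    static Socket enableKeepalive(Socket s) throws SocketException {
        s.setKeepAlive(true);
        return s;
    }

    public static void main(String[] args) throws Exception {
        Socket s = enableKeepalive(new Socket());
        System.out.println(s.getKeepAlive()); // prints "true"
        s.close();
        // With the settings quoted above, a dead peer is detected after
        // tcp_keepalive_time + tcp_keepalive_probes * tcp_keepalive_intvl
        // = 1 + 3 * 5 = 16 seconds (vs. 7200 + 9 * 75 s with Linux defaults).
    }
}
```

This explains why the fix requires both pieces: the DFC flag turns the keepalive machinery on per socket, while the sysctls decide how quickly a dead peer is actually noticed.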
