Fighting with DEV ENV. Part II

It seems that the talented team decided to move/upgrade JMS from JBoss AS to WildFly as early as the 7.2 release.

Below is an adapted config for the DEV ENV:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<server xmlns="urn:jboss:domain:3.0">
    <extensions>
        <extension module="org.jboss.as.deployment-scanner" />
        <extension module="org.jboss.as.ee" />
        <extension module="org.jboss.as.naming" />
        <extension module="org.jboss.as.remoting" />
        <extension module="org.jboss.as.security" />
        <extension module="org.wildfly.extension.io"/>
        <extension module="org.wildfly.extension.undertow" />
    </extensions>

    <profile>
        <subsystem xmlns="urn:jboss:domain:deployment-scanner:2.0">
            <deployment-scanner deployment-timeout="300" path="deployments" relative-to="jboss.server.base.dir" runtime-failure-causes-rollback="${jboss.deployment.scanner.rollback.on.failure:false}" scan-interval="5000" />
        </subsystem>
        <subsystem xmlns="urn:jboss:domain:ee:3.0" />
        <subsystem xmlns="urn:jboss:domain:io:1.1">
            <worker name="default"/>
            <buffer-pool name="default"/>
        </subsystem>
        <subsystem xmlns="urn:jboss:domain:naming:2.0">
            <remote-naming />
        </subsystem>
        <subsystem xmlns="urn:jboss:domain:remoting:3.0" />
        <subsystem xmlns="urn:jboss:domain:security:1.2" >
            <security-domains>
                <security-domain name="other" cache-type="default">
                    <authentication>
                        <login-module code="Remoting" flag="optional">
                            <module-option name="password-stacking" value="useFirstPass"/>
                        </login-module>
                        <login-module code="RealmDirect" flag="required">
                            <module-option name="password-stacking" value="useFirstPass"/>
                        </login-module>
                    </authentication>
                </security-domain>
            </security-domains>
        </subsystem>
        <subsystem xmlns="urn:jboss:domain:undertow:2.0">
            <buffer-cache name="default" />
            <server name="default-server">
                <http-listener max-post-size="0" name="default" redirect-socket="https" socket-binding="http" />
                <host alias="localhost" name="default-host">
                    <location handler="welcome-content" name="/" />
                    <filter-ref name="server-header" />
                    <filter-ref name="x-powered-by-header" />
                </host>
            </server>
            <servlet-container name="default">
                <jsp-config />
                <websockets />
            </servlet-container>
            <handlers>
                <file name="welcome-content" path="${jboss.home.dir}/welcome-content" />
            </handlers>
            <filters>
                <response-header header-name="Server" header-value="WildFly/9" name="server-header" />
                <response-header header-name="X-Powered-By" header-value="Undertow/1" name="x-powered-by-header" />
            </filters>
        </subsystem>
    </profile>

    <interfaces>
        <interface name="public">
            <inet-address value="${jboss.bind.address:0.0.0.0}" />
        </interface>
    </interfaces>

    <socket-binding-group default-interface="public" name="standard-sockets" port-offset="${jboss.socket.binding.port-offset:0}">
        <socket-binding name="ajp" port="9089" />
        <socket-binding name="http" port="9080" />
        <socket-binding name="https" port="9082" />
    </socket-binding-group>
</server>

Set up dfc.properties properly

I can't stop catching myself thinking about the JMS high-availability misfeature, or more precisely: why is it supported for workflow methods only? It seems that all CS instances are simply misconfigured. Let me explain.

Below is a brand new CS installation, which already has one dm_client_rights record:

Connected to Documentum Server running Release 7.3.0050.0039  Linux64.Postgres
Session id is s0
API> ?,c,select r_object_id, client_id from dm_client_rights
r_object_id       client_id                               
----------------  ----------------------------------------
0802994680000582  dfc_xxpa7jhtGFkRfHvrQmUIyaffxlka        
(1 row affected)

API> dump,c,0802994680000582
...
USER ATTRIBUTES

  object_name                     : dfc_docu73dev01_ffxlka
  title                           : 
  subject                         : 
  authors                       []: <none>
  keywords                      []: <none>
  resolution_label                : 
  owner_name                      : dmadmin
  owner_permit                    : 7
  group_name                      : docu
  group_permit                    : 1
  world_permit                    : 1
  log_entry                       : 
  acl_domain                      : dmadmin
  acl_name                        : dm_4502994680000222
  language_code                   : 
  client_id                       : dfc_xxpa7jhtGFkRfHvrQmUIyaffxlka
  public_key_identifier           : 77016FB9066276A0EF4801918F27F52C7176CD2F
  host_name                       : docu73dev01
  allowed_roles                 []: <none>
  allow_all_roles                 : T
  allow_all_priv_modules          : F
  principal_auth_priv             : T
  server_trust_priv               : T
  app_name                        : 
  is_globally_managed             : F

Where did this dm_client_rights record come from? According to the log file $DM_HOME/install/logs/install.log, it was created by the installer:

10:25:56,168  INFO [main] com.documentum.install.server.installanywhere.actions.DiWAServerCreateBofRegistryUser - Registering Client Roles.
10:25:56,198  INFO [main] com.documentum.fc.client.security.impl.JKSKeystoreUtilForDfc - keystore file name is /u01/documentum/cs/shared/config/dfc.keystore
10:25:56,382  INFO [main] com.documentum.fc.client.impl.connection.docbase.DocbaseConnection - Object protocol version 2
10:25:56,818  INFO [main] com.documentum.fc.client.security.impl.JKSKeystoreUtilForDfc - keystore file name is /u01/documentum/cs/shared/config/dfc.keystore
10:25:56,844  INFO [main] com.documentum.fc.client.security.impl.DfcIdentityPublisher - found client registration: false
10:25:57,148  INFO [main] com.documentum.fc.client.privilege.impl.PublicKeyCertificate - stored certificate for CN 
10:25:57,272  INFO [main] com.documentum.fc.client.security.impl.IpAndRcHelper - filling in DCTM_PSQL a new record with this persistent certificate:
-----BEGIN CERTIFICATE-----
MIIDHzCCAgcCEDM7pl2LftisOKZ3mYFjNigwDQYJKoZIhvcNAQELBQAwTjETMBEG
A1UECwwKRG9jdW1lbnR1bTEMMAoGA1UECgwDRU1DMSkwJwYDVQQDDCBkZmNfeHhw
YTdqaHRHRmtSZkh2clFtVUl5YWZmeGxrYTAeFw0xNzAzMTExMDEyMDVaFw0yNzAz
MDkxMDE3MDVaME4xEzARBgNVBAsMCkRvY3VtZW50dW0xDDAKBgNVBAoMA0VNQzEp
MCcGA1UEAwwgZGZjX3h4cGE3amh0R0ZrUmZIdnJRbVVJeWFmZnhsa2EwggEiMA0G
CSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQDPY3I1TzGodyI7Q7oHuLlA4N0IdLnp
oUEZqONy5Ev1f8RJLIJGowjSQme1HPrk6ZgdyBlGkLqguRdnn6hLrBIUiZSc7XRr
OM8xCYp6wEXUuxDTpX58Q32QLInjTjftHOblav201lOQStQUcrEAcUVZ+UK/Xt1t
Q9hQUpOvXWDRxBRPpN7VdTg1lCNuNC/BjYO/yBc2zUPYsarmnM1BcyeTi6RmlfME
PUsVPMqS8muBKP/o7ZUqXVMWNFNRHVbnOCX8KHZgO4DQLp7pcgYq0ak4vQ6BEfx8
fxo/egAS84JiemOxhslxytN5cBnFmc3NCdKKVzRcZE9fecC3DMt41dYfAgMBAAEw
DQYJKoZIhvcNAQELBQADggEBAARMI7W7ggw9GJZtfsqre5+idsfcxtuMPhI+bmU/
gAb3pyZug11z29AvCcprh545agcUQgm9wlgaJFUtiktd4n6hE12G46Vu/boqxy4J
iBs3kWQd2Qeh4Qobm8xvBu0VKSiHJRmbm5xslnq3yJorBZiNjvuoFVsaYtY24kiy
AxUB5y2vgUhZeLe+0WPrBEA3/I+ciGO/Jk6KXyL9vz8+04Hx6sBfkMsY1l8aa1HH
PtQdgfasysgVkIqCZ70zAXd5ARC4CXEwfhj6v/eq7X3CM4KCP4TiPqmzzapsLPn3
i0Or+fnwrOy/rYybndj0pgpnCbtinUZ7ZXmVtWDevMWey/Q=
-----END CERTIFICATE-----
10:25:57,280  INFO [main] com.documentum.fc.client.security.impl.DfcIdentityPublisher - found client registration: false
10:25:57,659  INFO [main] com.documentum.fc.client.security.impl.IpAndRcHelper - filling a new registration record for dfc_xxpa7jhtGFkRfHvrQmUIyaffxlka
10:25:57,672  INFO [main] com.documentum.fc.client.security.impl.DfcIdentityPublisher - [DFC_SECURITY_GR_REGISTRATION_PUBLISH] this dfc instance is now published in the global registry DCTM_PSQL
10:25:57,695  INFO [main] com.documentum.fc.client.security.impl.DfcRightsCreator - assigning rights to all roles for this client on DCTM_PSQL
10:25:57,701  INFO [main] com.documentum.fc.client.security.impl.DfcRightsCreator - found client rights: false
10:25:57,733  INFO [main] com.documentum.fc.client.security.impl.DfcIdentityPublisher - found client registration: true
10:25:57,746  INFO [main] com.documentum.fc.client.security.impl.DfcRightsCreator - found client rights: false
10:25:57,989  INFO [main] com.documentum.fc.client.security.impl.IpAndRcHelper - filling a new rights record for dfc_xxpa7jhtGFkRfHvrQmUIyaffxlka
10:25:58,015  INFO [main] com.documentum.fc.client.security.impl.DfcRightsCreator - [DFC_SECURITY_DOCBASE_RIGHTS_REGISTER] this dfc instance has now escalation rights registered with docbase DCTM_PSQL

How many dfc.keystore files do we have?

~]$ find /u01/documentum/cs/ -name dfc.keystore
/u01/documentum/cs/shared/config/dfc.keystore
.../ServerApps.ear/APP-INF/classes/dfc.keystore
.../com.emc.ide.external.dfc_1.0.0/documentum.config/dfc.keystore

How many dfc.properties files do we have?

~]$ find /u01/documentum/cs/ -name dfc.properties
/u01/documentum/cs/shared/config/dfc.properties
.../ServerApps.ear/APP-INF/classes/dfc.properties
.../com.emc.ide.external.dfc_1.0.0/documentum.config/dfc.properties

You might say the second one (the JMS one) is not actually a dfc.properties file, because it looks like this:

#include /u01/documentum/cs/shared/config/dfc.properties
dfc.bof.classloader.enable_extension_loader_first=false

but it is. Moreover, the read-only property dfc.config.file defines the path to the dfc.properties file, and the read-only property dfc.config.dir defines the directory containing it. Now, the only option whose default value depends on dfc.config.dir is:

# Fully qualified file name of the keystore file holding the PKI credentials for 
# DFC. 
# 
# Defaults to dfc.keystore in the same directory where the property file 
# (dfc.properties) is found.
# 
dfc.security.keystore.file = ${dfc.config.dir}/dfc.keystore

I think it is obvious that $DOCUMENTUM_SHARED/config/dfc.properties is misconfigured because it lacks an explicit dfc.security.keystore.file entry: the JMS copy of dfc.properties merely #includes it, so for the JMS ${dfc.config.dir} resolves to APP-INF/classes, and the JMS ends up creating and registering its own, different dfc.keystore (which is exactly what the find output above shows).
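So the proper setup (an assumption on my part, but it matches the behaviour described above) is to pin the keystore path explicitly in $DOCUMENTUM_SHARED/config/dfc.properties, so that every dfc.properties that #includes it resolves to the same identity:

# explicit path instead of the ${dfc.config.dir}/dfc.keystore default,
# so the JMS (which #includes this file) stops registering its own identity
dfc.security.keystore.file=/u01/documentum/cs/shared/config/dfc.keystore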

Weird DQL grammar. Part III

Today I noticed another odd behaviour:

API> ?,c,select count(*) from dm_sysobject 
  where a_is_template=FALSE enable(GENERATE_SQL_ONLY)
generate_sql
-------------------------------------------------
select all count(*) from dm_sysobject_sp  dm_sysobject 
  where ((dm_sysobject.a_is_template=0)) 
  and (dm_sysobject.i_has_folder = 1 and dm_sysobject.i_is_deleted = 0) 
(1 row affected)

API> ?,c,select count(*) from dm_sysobject 
  where a_is_template=0 enable(GENERATE_SQL_ONLY)
generate_sql
-------------------------------------------------
select all count(*) from dm_sysobject_sp  dm_sysobject 
  where ((dm_sysobject.a_is_template=0)) 
  and (dm_sysobject.i_has_folder = 1 and dm_sysobject.i_is_deleted = 0) 
(1 row affected)

BUT:

API> ?,c,select count(*) from dm_dbo.dm_sysobject_s 
  where a_is_template=FALSE enable(GENERATE_SQL_ONLY)
generate_sql
-------------------------------------------------
select all count(*) from docbase.dm_sysobject_s 
  where ((a_is_template=0 OR a_is_template IS NULL)) 
(1 row affected)

API> ?,c,select count(*) from dm_dbo.dm_sysobject_s 
  where a_is_template=0 enable(GENERATE_SQL_ONLY)
generate_sql
-------------------------------------------------
select all count(*) from docbase.dm_sysobject_s 
  where ((a_is_template=0 OR a_is_template IS NULL)) 
(1 row affected)

i.e. Content Server thinks it is smart enough, but in the case of registered tables it generates wrong SQL (note the extra "OR a_is_template IS NULL" predicate). There are two options to "override" this behaviour:

  • write odd DQL, like:
    API> ?,c,select count(*) from dm_dbo.dm_sysobject_s 
      where (a_is_template=FALSE and a_is_template IS NOT NULL) enable(GENERATE_SQL_ONLY)
    generate_sql                                                                                                                             
    -------------------------------------------------
    select all count(*) from bee_ssc_dev.dm_sysobject_s 
      where (((a_is_template=0 OR a_is_template IS NULL) and a_is_template is not null)) 
    (1 row affected)
    
  • write tricky DQL:
    API> ?,c,select count(*) from dm_dbo.dm_sysobject_s 
      where a_is_template IN (FALSE) enable(GENERATE_SQL_ONLY)
    generate_sql                                                                     
    -------------------------------------------------
    select all count(*) from bee_ssc_dev.dm_sysobject_s 
      where (a_is_template in (0)) 
    (1 row affected)
    

Never happened before and happened again

As I described previously, in 2012 we were implementing a large Documentum project, and at that time we observed a lot of performance and stability issues (actually, it was a bit weird to observe such issues in a mature product maintained by a talented team). One of those issues was the following: it was not possible to restart JMS during business hours (from my perspective it is not a big deal to do that if your application is properly designed: business users do not consume JMS, so bouncing JMS should not affect them):

  • about 5 of 10 restarts ended with deadlocks in Java
  • about 4 of 10 restarts ended with a crash of the java process with a SIGBUS error in libzip.so, something like:
    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  SIGBUS (0x7) at pc=0x00007f881a262982, pid=19993, tid=140222193350400
    #
    # JRE version: Java(TM) SE Runtime Environment (7.0_72-b14) (build 1.7.0_72-b14)
    # Java VM: Java HotSpot(TM) 64-Bit Server VM (24.72-b04 mixed mode linux-amd64 compressed oops)
    # Problematic frame:
    # C  [libzip.so+0x4982]  newEntry+0x62
    #
    
  • and only 1 of 10 restarts was successful

So, we filed a couple of SRs, and after a while EMC support started claiming that the first two issues had been addressed in recent Documentum patches. Frankly speaking, during the last 4 years I haven't seen JMS crash with a SIGBUS error in libzip.so, but now I'm not sure the talented team ever really fixed the issue, because two days ago our DevOps engineer claimed that Documentum 7.2 is still affected and he had witnessed three JMS crashes with a SIGBUS error in libzip.so. I have just lost faith in the talented team 😦

Well, let's discuss this odd error and try to fix it 🙂 If you google for java + sigbus + libzip.so you will find plenty of bugs in the OpenJDK JIRA, and all of them state something like:

  • the JVM maps the central directory of a zip file into memory using the mmap system call for performance reasons
  • the application is expected to receive a SIGBUS error if the mapped file gets overwritten:
           SIGBUS Attempted access to a portion of the buffer that does not
                  correspond to the file (for example, beyond the end of the
                  file, including the case where another process has truncated
                  the file).
    
  • there is a -Dsun.zip.disableMemoryMapping=true system property which was designed to “prevent crashes”

So, the JDK seems to "work as designed" and nobody is going to fix this behaviour (no longer true for JDK 9, but for current JDK versions we must take it as a given). The real root cause of our problem is that "somebody" changes jar files on JMS while they are in use and thereby causes the crashes, and this "somebody" is the odd TBO implementation. Is it possible to fix? Actually, I have no idea what the talented team was doing four years ago, but I found a robust solution within four hours (if my memory serves me right, they spent four months trying to fix the issue). The good starting point here is to find/write a good testcase, and I found one in JDK-8017777 (SIGBUS in java.util.zip.ZipFile.getEntry).
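The idea of the testcase is simple: writer threads keep rewriting a zip file in place while reader threads iterate over its entries through the mmap-ed central directory. Below is a condensed standalone sketch of mine along those lines (my own code, not the verbatim JDK testcase that the patch further down modifies):

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class ZipCrashSketch {
    public static void main(String[] args) throws Exception {
        File file = File.createTempFile("zipfile", ".zip");
        file.deleteOnExit();
        writeZip(file); // initial content, so readers have something to map

        // writer: overwrites the zip in place, invalidating the mapped pages
        new Thread(() -> {
            while (true) {
                try {
                    writeZip(file);
                } catch (IOException ignored) {
                }
            }
        }).start();

        // readers: walk the entries via the mmap-ed central directory
        for (int t = 0; t < 4; t++) {
            new Thread(() -> {
                while (true) {
                    try (ZipFile zip = new ZipFile(file)) {
                        for (Enumeration<? extends ZipEntry> e = zip.entries(); e.hasMoreElements(); ) {
                            e.nextElement().getName();
                        }
                    } catch (IOException ignored) {
                    }
                }
            }).start();
        }
    }

    private static void writeZip(File f) throws IOException {
        try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(f))) {
            for (int i = 0; i < 1001; i++) {
                out.putNextEntry(new ZipEntry("entry-" + i));
                out.write(new byte[100]);
                out.closeEntry();
            }
        }
    }
}

Running the original ZipFileTest reproduces the crash: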

$ java ZipFileTest
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0xa) at pc=0x00000001054cfe92, pid=5775, tid=0x0000000000004d03
#
# JRE version: Java(TM) SE Runtime Environment (8.0_111-b14) (build 1.8.0_111-b14)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.111-b14 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# C  [libzip.dylib+0x2e92]  newEntry+0x154
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /Users/apanfilov/hs_err_pid5775.log
[thread 21251 also had an error]
[thread 22275 also had an error]
[thread 23299 also had an error]
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
[thread 24323 also had an error]
Abort trap: 6

Cool! Now let's check the behaviour of the JVM with the -Dsun.zip.disableMemoryMapping=true system property:

$ java -Dsun.zip.disableMemoryMapping=true ZipFileTest
Exception in thread "Thread-2" java.util.zip.ZipError: jzentry == 0,
 jzfile = 140315557132880,
 total = 1001,
 name = /var/folders/9d/m0q0kvvx3fj54mr_m3qs3twh0000gn/T/zipfile.zip ,
 i = 1,
 message = null
	at java.util.zip.ZipFile$ZipEntryIterator.next(ZipFile.java:519)
	at java.util.zip.ZipFile$ZipEntryIterator.nextElement(ZipFile.java:499)
	at java.util.zip.ZipFile$ZipEntryIterator.nextElement(ZipFile.java:480)
	at ZipFileTest$ZipFileReader.run(ZipFileTest.java:65)
	at java.lang.Thread.run(Thread.java:745)
Exception in thread "Thread-0" java.util.zip.ZipError: jzentry == 0,
 jzfile = 140315557132880,
 total = 1001,
 name = /var/folders/9d/m0q0kvvx3fj54mr_m3qs3twh0000gn/T/zipfile.zip ,
 i = 1,
 message = null
	at java.util.zip.ZipFile$ZipEntryIterator.next(ZipFile.java:519)
	at java.util.zip.ZipFile$ZipEntryIterator.nextElement(ZipFile.java:499)
	at java.util.zip.ZipFile$ZipEntryIterator.nextElement(ZipFile.java:480)
	at ZipFileTest$ZipFileReader.run(ZipFileTest.java:65)
	at java.lang.Thread.run(Thread.java:745)

Well, the JVM didn't crash, but we got plenty of errors, so the -Dsun.zip.disableMemoryMapping=true system property indeed only prevents crashes and seems to be unusable 😦 Here we need to stop blaming the JDK and start thinking about the following: what does "mapped file gets overwritten" actually mean, i.e. do the two scenarios below do the same thing or not:

$ file 1.txt
1.txt: empty
$ echo "talented team" > 1.txt
$ file 1.txt
1.txt: ASCII text
$ echo "talented team" > 1.txt.tmp
$ mv 1.txt.tmp 1.txt

Actually, the second scenario is safe from the JVM's perspective: echo > rewrites the file in place, invalidating the mapped pages, whereas mv only replaces the directory entry, so the data of the old file stays intact for as long as it is mapped. A small patch to the initial testcase proves this:

@@ -79,9 +78,12 @@ public ZipFileWriter(File aFile) {
         public void run() {
             int i = 1;
             ZipOutputStream zOut;
+            File tmp;
             try {
-                zOut = new ZipOutputStream(new FileOutputStream(file));
-            } catch (FileNotFoundException e1) {
+                tmp = File.createTempFile(ZipFileTest.class.getSimpleName(),
+                        null, file.getParentFile());
+                zOut = new ZipOutputStream(new FileOutputStream(tmp));
+            } catch (IOException e1) {
                 e1.printStackTrace();
                 return;
             }
@@ -102,6 +104,9 @@ public void run() {
             } finally {
                 try {
                     zOut.close();
+                    if(!tmp.renameTo(file)) {
+                        System.err.println("Can't rename");
+                    }
                 } catch (IOException e) {
                 }
             }

and here we are:

$ java ZipFileTest
$ 
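As a side note, with NIO the rename step of the fix can be expressed more explicitly; a sketch (not the code from the patch above, which uses File.renameTo):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public final class SafeReplace {
    // ATOMIC_MOVE maps to rename(2) on POSIX: the target directory entry is
    // replaced atomically, and the data of the old file stays intact for
    // readers (and mmap-ers) that still hold it open
    public static void replace(Path tmp, Path target) throws IOException {
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }
}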

What about DFC? Javassist rocks!
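Purely as an illustration of the mechanism (the target class and method below are made up; the real ones would have to be located by inspecting DFC, and the rename-on-close part is not shown), a Javassist patch redirecting in-place jar writes could look like this:

import javassist.CannotCompileException;
import javassist.ClassPool;
import javassist.CtClass;
import javassist.expr.ExprEditor;
import javassist.expr.NewExpr;

public class SafeJarWritePatcher {
    public static void main(String[] args) throws Exception {
        ClassPool pool = ClassPool.getDefault();
        // placeholder name: whatever DFC class overwrites jars in place
        CtClass cc = pool.get("com.documentum.hypothetical.JarWriter");
        cc.instrument(new ExprEditor() {
            @Override
            public void edit(NewExpr e) throws CannotCompileException {
                if ("java.io.FileOutputStream".equals(e.getClassName())) {
                    // $1 is the first constructor argument (the target path);
                    // write to a sibling temp file instead of the jar itself
                    e.replace("$_ = new java.io.FileOutputStream($1 + \".tmp\");");
                }
            }
        });
        cc.writeFile("patched-classes"); // emit the patched .class file
    }
}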

Weird DQL grammar. Part II

Typically, Documentum ignores DQL hints that are present in subqueries (though in most cases the return_top hint is useless anyway):

1> select count(*) from
2> (select user_name from dm_user enable(return_top 10))
3> go
count(*)              
----------------------
                    66
(1 row affected)

Documentum also fails to parse the union keyword in a subquery:

1> select count(*) from
2> (select user_name from dm_user union select group_name from dm_group)
3> go
[DM_QUERY_E_SYNTAX]error:  
  "A Parser Error (syntax error) has occurred in the vicinity of:  
   select count(*) from (select user_name from dm_user union"

but:

1> select count(*) from
2> (select user_name from dm_user enable(return_range 1 10 'user_name'))
3> go
count(*)              
----------------------
                    10
(1 row affected)

BugMakers :)

Frankly speaking, when I was writing the previous blogpost I was surprised to discover that the DQL update statement preserves the value of the r_lock_owner attribute:

API> revert,c,09024be98006b104
...
OK
API> get,c,09024be98006b104,r_lock_owner
...
dmadmin
API> ?,c,update dm_document objects set object_name='xxx' where r_object_id='09024be98006b104'
objects_updated
---------------
              1
(1 row affected)
[DM_QUERY_I_NUM_UPDATE]info:  "1 objects were affected by your UPDATE statement."


API> revert,c,09024be98006b104
...
OK
API> get,c,09024be98006b104,r_lock_owner
...
dmadmin
API> 

Unfortunately, this is no longer true when you update objects whose behaviour is customized via a TBO:

API> retrieve,c,bs_doc_cash
...
09bc2c71806d6ffe
API> checkout,c,l
...
09bc2c71806d6ffe
API> ?,c,update bs_doc_cash objects set object_name='xxx' where r_object_id='09bc2c71806d6ffe'
objects_updated
---------------
              1
(1 row affected)
[DM_QUERY_I_NUM_UPDATE]info:  "1 objects were affected by your UPDATE statement."


API> revert,c,09bc2c71806d6ffe
...
OK
API> get,c,09bc2c71806d6ffe,r_lock_owner
...

API>

but if we run the same UPDATE through the EXEC apply method with BOF_DQL set to F (i.e. bypassing the TBO), the lock owner survives:

API> checkout,c,09bc2c71806d6ffe
...
09bc2c71806d6ffe
API> apply,c,,EXEC,QUERY,S,update bs_doc_cash objects set object_name='xxx' where r_object_id='09bc2c71806d6ffe',BOF_DQL,B,F
...
q0
API> next,c,q0
...
OK
API> dump,c,q0
...
USER ATTRIBUTES

  objects_updated                 : 1

SYSTEM ATTRIBUTES


APPLICATION ATTRIBUTES


INTERNAL ATTRIBUTES


API> revert,c,09bc2c71806d6ffe
...
OK
API> get,c,09bc2c71806d6ffe,r_lock_owner
...
dmadmin
API> 

Database connections. Part II

Well, previously we came up with an estimate for database connections: about twice the number of concurrent Documentum sessions (maybe less, maybe more, depending on the application). Now the question is: how many connections can we actually create in the database? OpenText seems to think the number of database connections is unlimited:

D2 runs on content server which has good scalability by adding additional content server nodes. Content server is often the bottleneck of the whole D2 topology when the system is running under a heavy load condition. When content server reaches its bottleneck, we could monitored the CPU usage of content server is up to 90% and the number of active sessions grows very fast. To add one additional content server node to the existing environment could improve system throughput significantly.
Officially we suggests adding one additional content server node on every 300 concurrent users’ growth. The mileage

which is actually not true. On the other hand, had OpenText written something like "our product fails to take advantage of best practices and does not pool database connections", it would have looked ridiculous, so instead of improving the product they preferred to publish another piece of marketing bullshit.

So, why is database connection pooling important? If you ask Google you will find something like: creating a database connection is an expensive and time-consuming operation (the application needs to perform a TCP, or even TLS, handshake and authenticate, the database needs to start a new server process, etc.), so it is recommended to pool database connections. Unfortunately, that is only half of the truth: pools also limit the number of concurrent database connections, and that matters just as much. Let me quote the best Oracle database expert ever:

In looking at your Automatic Workload Repository report, I see that the longest-running events at the system level are latch-related: cache buffers chains and library cache. Additionally, your CPU time was way up there. Concurrency-based waits are caused by one thing and one thing only: having many concurrently active sessions. If you had fewer concurrently active sessions, you would by definition have fewer concurrency-based waits (fewer people contending with each other). I see that you had 134 sessions in the database running on a total of 4 CPU cores. Because it is not really possible for 4 CPU cores to allow 134 sessions to be concurrently active, I recommend that you decrease the number of concurrent sessions by reducing the size of your connection pool—radically. Cut your connection pool to something like 32. It is just not possible for 4 cores to run 134 things at the same time; 4 cores can run only 4 things at exactly the same time. If you reduce the concurrent user load, you’ll see those waits go down, response time go down, and your transaction rates improve. In short, you’ll find out that you can actually do more work with fewer sessions than you can with more.

I know that this fewer-does-more suggestion sounds counterintuitive, but I encourage you to watch this narrated Real World Performance video.

In this video, you’ll see what happens in a test of a 12-core machine running transactions and decreasing the size of a connection pool from an unreasonable number (in the thousands) to a reasonable number: 96. At 96 connections on this particular machine, the machine was able to do 150 percent the number of transactions per second and took the response time for these transactions from ~100 milliseconds down to ~5 milliseconds.

Short of reducing your connection pool size (and therefore changing the way the application is using the database by queuing in the middle-tier connection pool instead of queuing hundreds of active sessions in the database), you would have to change your queries to make them request cache buffers chains latches less often. In short: tune the queries and the algorithms in the application. There is literally no magic here. Tweaking things at the system level might not be an option. Touching the application might have to be an option.
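To make the queue-in-the-middle-tier idea concrete, here is a minimal sketch of a bounded pool in plain JDBC (URL, credentials and size are illustrative; a real application would use an existing pool such as HikariCP): callers block inside the application when all connections are busy, so the database never sees more than the configured number of sessions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BoundedPool {
    private final BlockingQueue<Connection> idle;

    public BoundedPool(String url, String user, String password, int size) throws SQLException {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(DriverManager.getConnection(url, user, password));
        }
    }

    // blocks when all connections are busy: the caller queues in the
    // middle tier instead of opening yet another database session
    public Connection borrow() throws InterruptedException {
        return idle.take();
    }

    public void release(Connection connection) {
        idle.add(connection);
    }
}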

And from the Documentum perspective, the only option to limit the number of database connections is to use the shared server feature (fuck yeah, Oracle has supported Database Resident Connection Pooling since 11g, but the mature product does not). And do not pay much attention to EMC documents like "Optimizing Oracle for EMC Documentum": such documents are wrong from beginning to end.
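For reference, a sketch of what the shared server setup looks like on the Oracle side (values are purely illustrative, not a sizing recommendation):

-- database side: start shared server processes and TCP dispatchers
SQL> alter system set shared_servers=16;
SQL> alter system set dispatchers='(PROTOCOL=TCP)(DISPATCHERS=2)';

and the client-side connect descriptor must request a shared server:

DCTM =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = dbhost)(PORT = 1521))
    (CONNECT_DATA = (SERVICE_NAME = DCTM)(SERVER = SHARED))
  )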