Encryption madness

My colleagues ran into the ridiculous security settings implemented by EMC, so I decided to write this blog post. Starting from Documentum 7.2, EMC switched from resolving security issues to treating symptoms (first attempt, second attempt). The security issue is: if an attacker manages to steal the aek.key file from the Content Server filesystem, he gains superuser access to the repository (the basic scenario: we decrypt i_ticket_crypto_key from the docbase config using aek.key, and after that we can issue any login ticket). To “mitigate” this attack EMC proposes two options:

As we already know, both options have nothing to do with security but, as it turned out, cause a lot of pain. The problem is: if you protect aek.key with a passphrase or use the lockbox, DFC is no longer able to use aek.key to decrypt passwords (a basic example: the file with the password for the LDAP server is stored in the config directory on the Content Server, and the LDAP synchronization job needs to decrypt this password), so another approach for decrypting passwords on the DFC side had to be invented, and EMC did something weird. Starting from 7.2, DFC behaves as follows:

  • the installer adds the dfc.crypto.repository parameter to dfc.properties (it is worth mentioning that the installer does this in the wrong way: installing a second repository overwrites the parameter)
  • when encrypting text/passwords, if DFC sees the dfc.crypto.repository parameter in dfc.properties, it establishes a trusted session to that repository, calls the ENCRYPT_TEXT/ENCRYPT_PASSWORD RPC command and appends the repository name to the end of the encrypted text/password:
    Connected to Documentum Server running Release 7.2.0060.0222  Linux64.Oracle
    Session id is s0
    API> encryptpass,c,mycoolpassword
    ...
    DM_ENCR_PASS_V2=AAAAEKfc...RYjoGt::DCTM_DEV
    API> encrypttext,c,mycoolpassword
    ...
    DM_ENCR_TEXT_V2=AAAAELQ...O+x5g::DCTM_DEV
    
  • when decrypting text, if DFC sees the dfc.crypto.repository parameter in dfc.properties, it tries to establish a trusted session to this repository or, if that fails, to the repository mentioned at the end of the encrypted text
  • when decrypting a password, DFC uses the old technique based on reading the aek.key file
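Judging by the output above, the repository name is simply appended after a “::” separator. A minimal Python sketch of splitting such a value (the function name and handling are mine, not DFC code):

```python
# Split a DFC-encrypted value of the form
# "DM_ENCR_PASS_V2=<blob>::<repository>" into its parts.
def split_encrypted_value(value: str):
    prefix, _, payload = value.partition("=")
    blob, sep, repository = payload.rpartition("::")
    if not sep:                       # no repository suffix (pre-7.2 format)
        return prefix, payload, None
    return prefix, blob, repository

print(split_encrypted_value("DM_ENCR_PASS_V2=AAAAEKfc...RYjoGt::DCTM_DEV"))
# ('DM_ENCR_PASS_V2', 'AAAAEKfc...RYjoGt', 'DCTM_DEV')
```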

I believe the behaviour described above is extremely logical; unfortunately, I am unable to understand this logic. For example, let's imagine I installed two repositories; in this case dfc.crypto.repository points to the second one. Now I want to set up LDAP synchronization. In order to encrypt the LDAP password, Documentum Administrator calls the replicate_setup_methods docbase method, and the password will be encrypted using the second repository. Now, what happens if the second repository goes down? Obviously, LDAP synchronization stops working for the first repository: DFC is unable to connect to the second repository to decrypt the password (actually, all DFC encryption/decryption operations on the CS side will fail because the repository configured in dfc.properties is down). I have also noticed that some users (and my colleagues too) have the imprudence to copy the dfc.properties file from the Content Server to the Application Server, after which some things stop working: the copied dfc.properties contains the dfc.crypto.repository parameter, but the Application Server has no trusted access to the repository. Below are some examples from ECN:

Time in Documentum. MSSQL challenge

Interestingly, when I was writing the post about time in Documentum, I noted that storing dates in the UTC timezone would cause difficulties for reporting software, but I couldn't imagine that things were really so bad. The problem is that among all supported databases only Oracle works correctly with timezones, i.e. I am able to compensate any timezone offset in SQL selects by writing something like:

all other databases do not support timezone regions out of the box and work only with fixed “(-|+)HH:MM” offsets; for example, in the case of MSSQL, in order to work correctly with UTC time (i.e. properly handle daylight saving) you need to install T-SQL Toolbox, which is slow by design. I believe this is why Documentum's DATETOSTRING_LOCAL function has such weird behaviour:

i.e. Documentum translates DATETOSTRING(r_creation_date, ‘yyyy-mm-dd’) directly to SQL's TO_CHAR(dm_sysobject.r_creation_date, ‘yyyy-mm-dd’), but it is unable to do the same for DATETOSTRING_LOCAL(r_creation_date, ‘yyyy-mm-dd’); instead, it converts the date to a string preserving the time information (i.e. TO_CHAR(dm_sysobject.r_creation_date, ‘mm/dd/yyyy hh24:mi:ss’)) and then converts the resulting string to the requested format and the server's timezone. Due to this implementation you are unable (actually you can, but the results are unreliable) to use the DATETOSTRING_LOCAL function in WHERE clauses and subqueries.
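To illustrate why only timezone regions (as opposed to fixed “HH:MM” offsets) handle daylight saving correctly, here is a small Python demonstration using the stdlib zoneinfo module (the example timezone is my choice):

```python
from datetime import datetime, timezone, timedelta
from zoneinfo import ZoneInfo

utc_winter = datetime(2015, 1, 15, 12, 0, tzinfo=timezone.utc)
utc_summer = datetime(2015, 7, 15, 12, 0, tzinfo=timezone.utc)

region = ZoneInfo("America/New_York")    # timezone region: DST-aware
fixed = timezone(timedelta(hours=-5))    # fixed "-05:00" offset: not DST-aware

# The region yields different wall-clock times in winter and summer...
print(utc_winter.astimezone(region).hour)   # 7  (UTC-5, EST)
print(utc_summer.astimezone(region).hour)   # 8  (UTC-4, EDT)
# ...while the fixed offset is wrong for half of the year:
print(utc_summer.astimezone(fixed).hour)    # 7
```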

Now, I do think that for most installations storing dates in the UTC timezone is evil, and enabling this feature by default was a big mistake on EMC's part.

ACL computations

Yesterday my skypemate complained to me about Scott Roth's latest post:

Is something changed in the Documentum platform or he is completely wrong, but he has many years of experience. I cannot believe that. I supposed that the 7.2 version changed the most permissive logic, the “standard” most permissive, it seems a post written by a Documentum newbie. I need to unsubscribe from his blog, there are not useful blogs about documentum.

The last point, about useful blogs about Documentum, was insulting, so I think it is necessary to clarify how ACL computations work in Documentum.

Three years ago I was engaged in investigating the following performance problem:

The customer's application associates every user with a group and grants privileges to the associated group rather than to the user directly. This idea is quite sensible: when you need to grant “the same privileges as a certain user has” to a specific user, you just modify group memberships. Unfortunately, Documentum has a couple of performance issues related to such a design:

  • if an ACL contains a lot of group entries, computing the effective privileges of a specific user (i.e. “get,c,l,_permit,user_name”) becomes extremely slow (I have seen cases when the IDfSysObject.getPermitEx(“username”) call took minutes): if an ACL entry contains a group, Content Server has to figure out whether the user belongs to that group, i.e. Content Server performs one SQL select per group entry
  • if an ACL contains a lot of entries, saving the ACL becomes extremely slow: when saving an ACL, Content Server determines whether each ACL entry relates to a user or a group and validates the existence of the corresponding user or group; to do that it performs selects against the database: one select if the ACL entry relates to a group and three selects if it relates to a user (so it's more profitable to put group entries into ACLs, isn't it?)

The first issue seems to be resolved in recent releases of Content Server: the IDfSysObject.getPermitEx(“username”) call is now backed by the following single SQL query:

SELECT acl.r_accessor_name,
       acl.r_accessor_permit,
       acl.r_permit_type,
       acl.r_is_group
  FROM dm_acl_r acl, dm_group_r gr1, dm_group_r gr2
 WHERE     acl.r_object_id = :acl_id
       AND acl.r_is_group = 1
       AND gr1.users_names = :user_name
       AND gr1.r_object_id = gr2.r_object_id
       AND gr2.i_nondyn_supergroups_names IS NOT NULL
       AND gr2.i_nondyn_supergroups_names = acl.r_accessor_name
UNION
SELECT acl.r_accessor_name,
       acl.r_accessor_permit,
       acl.r_permit_type,
       acl.r_is_group
  FROM dm_acl_r acl
 WHERE     acl.r_object_id = :acl_id
       AND (   acl.r_accessor_name = :user_name
            OR acl.r_accessor_name = 'dm_world')

The second issue is still not resolved; moreover, there is no ETA for a fix, which is embarrassing because there is no need to perform that many SQL selects at all. Let's clarify. Content Server performs the following SQL selects:

-- determine whether ACL entry relates
-- to user or group
SELECT s.is_dynamic, s.group_class
  FROM dm_group_s s
 WHERE s.group_name = :p0;

-- query to get r_object_id of dm_user
SELECT r_object_id
  FROM dm_user_s
 WHERE user_name = :p0;

-- fetching user
  SELECT *
    FROM DM_USER_RV dm_dbalias_B, DM_USER_SV dm_dbalias_C
   WHERE (    dm_dbalias_C.R_OBJECT_ID = :dmb_handle
          AND dm_dbalias_C.R_OBJECT_ID = dm_dbalias_B.R_OBJECT_ID)
ORDER BY dm_dbalias_B.R_OBJECT_ID, dm_dbalias_B.I_POSITION;

the last two queries are intended to check whether the user exists; actually, to achieve the same it is enough to replace the first query with:

SELECT u.r_is_group
  FROM dm_user_s u
 WHERE u.user_name = :p0;

the first query is intended to populate the following attributes in dm_acl:

  • r_is_group – not required at all due to ACL computation algorithm, see explanation below
  • i_has_required_groups – required for MACL entries only
  • i_has_required_group_set – required for MACL entries only

Why is r_is_group in dm_acl not required at all? The explanation is very simple: when Content Server determines the permissions of the current user, it retrieves from the database the information about all groups the user belongs to, so it already knows the user's groups, and it is enough to check whether the r_accessor_name of an ACL entry exists in the user's group set. When Content Server determines the permissions of a non-current user (see the SQL select above), it is enough to use the following SQL select:

SELECT acl.r_accessor_name,
       acl.r_accessor_permit,
       acl.r_permit_type,
       acl.r_is_group
  FROM dm_acl_r acl, dm_group_r gr1, dm_group_r gr2
 WHERE     acl.r_object_id = :acl_id
       -- AND acl.r_is_group = 1
       AND gr1.users_names = :user_name
       AND gr1.r_object_id = gr2.r_object_id
       AND gr2.i_nondyn_supergroups_names IS NOT NULL
       AND gr2.i_nondyn_supergroups_names = acl.r_accessor_name
UNION
SELECT acl.r_accessor_name,
       acl.r_accessor_permit,
       acl.r_permit_type,
       acl.r_is_group
  FROM dm_acl_r acl
 WHERE     acl.r_object_id = :acl_id
       AND (   acl.r_accessor_name = :user_name
            OR acl.r_accessor_name = 'dm_world')

Now about ACL computation logic.

The general logic is:

Sysobject’s owner:

+ implicit default permit (DM_PERMIT_READ)
+ permit granted to dm_owner
+ permits granted to the specific user
+ permits granted to any groups the user belongs to
- restrictions on specific user
- restrictions on any groups the user belongs to
- required group restrictions
- required group set restriction
_________________________________________________
RESULT: MAX(Calculated Permit, DM_PERMIT_BROWSE)

superuser:

+ implicit default permit (DM_PERMIT_READ)
+ permit granted to dm_owner
+ permits granted to the specific user
+ permits granted to any groups the user belongs to
_________________________________________________
RESULT: MAX(Calculated Permit, DM_PERMIT_READ)

regular user:

+ permits granted to the specific user
+ permits granted to any groups the user belongs to
- restrictions on specific user
- restrictions on any groups the user belongs to
- required group restrictions
- required group set restriction
_________________________________________________
RESULT: Calculated Permit

Non-common logic:

  • for regular users dm_escalated_* groups override restrictions on user/group but not required group/group set restrictions
  • dm_read_all, dm_browse_all group membership overrides all MACLs
  • the minimum owner's permission is actually MAX(dm_docbase_config.minimum_owner_permit, DM_PERMIT_BROWSE)
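The computation rules above can be condensed into a short Python model (my own simplified sketch: permits are plain integers 1–7, the user's group set is precomputed, and the MACL required-group and dm_escalated_* special cases are omitted):

```python
# Simplified model of the permit computation described above (not CS code).
# acl_entries: list of (accessor_name, permit, is_restriction) tuples;
# user_groups: precomputed set of groups the user belongs to.
DM_PERMIT_BROWSE, DM_PERMIT_READ = 2, 3

def effective_permit(user, user_groups, acl_entries,
                     is_owner=False, is_superuser=False):
    granted, restricted = 0, 7
    for accessor, permit, is_restriction in acl_entries:
        applies = (accessor == user or accessor == 'dm_world'
                   or accessor in user_groups
                   or (accessor == 'dm_owner' and is_owner))
        if not applies:
            continue
        if is_restriction:
            restricted = min(restricted, permit)
        else:
            granted = max(granted, permit)
    if is_owner or is_superuser:
        granted = max(granted, DM_PERMIT_READ)   # implicit default permit
    if is_superuser:                             # restrictions do not apply
        return max(granted, DM_PERMIT_READ)
    result = min(granted, restricted)
    if is_owner:
        return max(result, DM_PERMIT_BROWSE)
    return result

acl = [('dm_world', 3, False), ('power_users', 6, False), ('contractors', 3, True)]
print(effective_permit('alice', {'power_users'}, acl))   # 6
```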

What is wrong with Scott Roth's last post? The implication “I'm getting weird results, so I think the product behaves in a certain way” is wrong; the correct one is “I'm getting weird results, and here is my proof of concept”.

DM_FOLDER_E_CONCUR_LINK_OPERATION_FAILURE

It seems that Content Server 7.2 got a new weird behavior: if you perform create/update/delete operations on dm_folder objects inside a transaction, you may get one of the DM_FOLDER_E_CONCUR_LINK_OPERATION_FAILURE, DM_FOLDER_E_CONCUR_UNLINK_OPERATION_FAILURE or DM_FOLDER_E_CONCUR_RENAME_OPERATION_FAILURE errors:

--
-- Session #1
--
API> begintran,c,
...
OK
API> create,c,dm_folder
...
0b024be98000a900
API> set,c,l,object_name
SET> folder 1
...
OK
API> save,c,l
...
OK

--
-- Session #2
--
API> begintran,c,
...
OK
API> create,c,dm_folder
...
0b024be98000a90b
API> set,c,l,object_name
SET> folder 2
...
OK
API> save,c,l
...
-- 10 sec timeout
[DM_FOLDER_E_CONCUR_LINK_OPERATION_FAILURE]error:  
      "Cannot perfrom the link operation on folder (0b024be98000a90b), 
      as some concurrent operation is being performed on the folder or 
      decendant folder or ancesstor folder with folder id 0c024be980000105."


API> commit,c,
...
[DM_SESSION_E_TRANSACTION_ERROR]error:  
      "Transaction invalid due to errors, please abort transaction."

it seems that the new behavior originates from the following bugs/CRs addressed in 7.2 (check the release notes):

  • CS-46175 – r_link_cnt on folder is not showing the correct number of objects held by the folder.
  • CS-40838 – When two users perform a move operation of two folders simultaneously, the r_folder_path and i_ancestor_id parameters contain incorrect values causing folder inconsistencies in Oracle and SQL Server. Workaround: Add disable_folder_synchronization = T in the server.ini file. By default, the value is F.

The interesting thing here is the fact that the new behavior has nothing to do with consistency: EMC developers are not familiar with the common double-checked locking pattern:

  if (condition) {
    acquire lock
    if (condition) {
      do work
    }
    release lock
  }
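For reference, the same double-checked locking pattern as runnable Python (a sketch with threading; all names are mine):

```python
import threading

# Double-checked locking: test the condition before taking the lock
# (cheap fast path) and again after taking it (correctness under races),
# and always release the lock, even if the work raises.
_lock = threading.Lock()
_initialized = False
_resource = None

def get_resource():
    global _initialized, _resource
    if not _initialized:            # first check: skip locking once done
        with _lock:                 # 'with' guarantees the release
            if not _initialized:    # second check: another thread may have won
                _resource = object()    # the actual (expensive) work, done once
                _initialized = True
    return _resource
```

Note that in languages without a strong memory model guarantee this pattern needs extra care (volatile/atomics), but the shape of the logic is exactly the one above.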

and they make mistakes that even junior developers do not make:

--
-- Session #1
--
API> create,c,dm_folder
...
0b024be98000c2dc
API> set,c,l,object_name
SET> test_folder
...
OK
API> link,c,l,/dmadmin
...
OK
API> link,c,l,/Temp
...
OK
API> link,c,l,/System
...
OK
API> save,c,l
...
OK

--
-- Session #2
--
API> begintran,c,
...
OK
API> create,c,dm_folder
...
0b024be98000c2e8
API> set,c,l,object_name
SET> f1
...
OK
API> link,c,l,/dmadmin/test_folder
...
OK
API> save,c,l
...
OK

--
-- Session #1
--
API> destroy,c,l
... waiting

--
-- Session #2
--
API> commit,c,
...
OK

--
-- Session #1
-- 
OK

--
-- Session #2
-- here we get zombie folder
--
API> get,c,0b024be98000c2e8,r_folder_path[0]
...
/dmadmin/test_folder/f1
API> retrieve,c,dm_folder where any r_folder_path='/dmadmin/test_folder'
...
[DM_API_E_NO_MATCH]error:  
    "There was no match in the docbase for the qualification: 
    dm_folder where any r_folder_path='/dmadmin/test_folder'"

What a shame!

Fighting with DEV ENV

About a month ago I thought about upgrading my current T420s laptop to something more modern and powerful; unfortunately, it turned out that the laptop market cannot offer anything suitable for me, though I do not expect anything extraordinary, just:

So, I gave up on the idea of buying a new laptop and decided to spend the money on booze instead, but that does not actually solve my problem: I need to run a couple of virtual machines with Documentum on my current laptop. What to do? The answer is obvious: take the problem and solve it. Below you can find some suggestions on how to decrease the memory footprint of Documentum and improve response time. In the end, I got something like:

Decrease number of concurrent jobs

By default agentexec executes up to three jobs per polling cycle, and some docbase jobs are too heavy to run simultaneously, so add -max_concurrent_jobs 1 to the method_verb of agent_exec_method:

API> retrieve,c,dm_method where object_name='agent_exec_method'
...
10024be980000171
API> get,c,l,method_verb
...
./dm_agent_exec -max_concurrent_jobs 1
API>

Disable saving job’s logs into repository

Actually, this suggestion is very controversial and may not work in some cases. The problem is that the vendor assumes that putting an “ECM” label on a software product automatically turns the product into a scrap heap, and agentexec is a good confirmation of this point: every time agentexec executes a job it saves the job's log into the repository, and, as far as I remember, I have checked those logs maybe four or five times during the last eight years, so why not disable this useless feature? I believe it's fucking simple to add an extra attribute to the dm_job object to control this behavior of agentexec; instead, the vendor created a dumb job intended for clearing obsolete logs. Fortunately, while playing with dm_job's method_trace_level attribute I found that setting method_trace_level to -1 prevents agentexec from saving the job's log into the repository; unfortunately, standard jobs do not recognize this value:

[com.documentum.mthdservlet.DoMethod] - Exception invoking com.documentum.bpm.method.XCPAutoTasKMgmt.
DfMethodArgumentException:: THREAD: http--0.0.0.0-9080-1; 
       MSG: [DFC_METHOD_BAD_ARGUMENT_VALUE] Argument method_trace_level 
       has an invalid value (4294967295); ERRORCODE: ff; NEXT: null
    at com.documentum.fc.methodserver.DfMethodArgumentException.invalidArgument(DfMethodArgumentException.java:29)
    at com.documentum.fc.methodserver.DfMethodArgumentManager.getInt(DfMethodArgumentManager.java:122)
    at com.documentum.fc.methodserver.DfStandardJobArguments.<init>(DfStandardJobArguments.java:60)
    at com.documentum.fc.methodserver.DfMethodArgumentManager.getJobArguments(DfMethodArgumentManager.java:248)
    at com.documentum.fc.methodserver.DfMethodArgumentManager.<init>(DfMethodArgumentManager.java:97)
    at com.documentum.bpm.Utils.GenericJobMethod.execute(GenericJobMethod.java:41)
    at com.documentum.mthdservlet.DfMethodRunner.runIt(Unknown Source)
    at com.documentum.mthdservlet.AMethodRunner.runAndReturnStatus(Unknown Source)
    at com.documentum.mthdservlet.DoMethod.invokeMethod(Unknown Source)
    at com.documentum.mthdservlet.DoMethod.doPost(Unknown Source)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:754)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:847)

What to do? Hex editor to the rescue: I replaced the “-docbase_name %s -user_name %s -job_id %s -method_trace_level %s” pattern in the dm_agent_exec binary with “-docbase_name %s -user_name %s -job_id %s -method_trace_level 0”, and now agentexec does not save job logs into the repository, and jobs are happy to parse 0 as the trace level:

API> ?,c,select count(*) from dm_document where FOLDER('/Temp/Jobs',DESCEND)
count(*)
----------------------
                     0
(1 row affected)
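A note on the hex-editing trick: patching strings in a binary is only safe when the replacement occupies exactly as many bytes as the original (padding with a trailing NUL where needed, as in the filename patch below). A hedged Python sketch of such an in-place patch (the function name and example pattern are illustrative, not the exact bytes I changed):

```python
# Replace a pattern in a binary file in place, refusing to change its size.
def patch_binary(path, old: bytes, new: bytes):
    if len(old) != len(new):
        raise ValueError("replacement must have the same length as the original")
    data = open(path, "rb").read()
    if old not in data:
        raise ValueError("pattern not found")
    open(path, "wb").write(data.replace(old, new))

# e.g. (illustrative): replace the two-byte "%s" with "0" plus a NUL terminator
# patch_binary("dm_agent_exec", b"%s", b"0\x00")
```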

Another problem with agentexec is that it loves to store log files with stupid names in the $DOCUMENTUM/dba/log/<docbaseid>/agentexec directory:

agentexec]$ ls | grep save.
agentexec.log.save.07.17.15.18.31.17
job_08024be9800018bc.save.07.17.15.18.46.22
job_08024be9800050cf.save.07.17.15.18.32.49
job_08024be9800050cf.save.07.17.15.18.37.20
job_08024be9800050cf.save.07.17.15.18.44.50
job_08024be980005ce1.save.07.17.15.18.34.19
job_08024be980005ce1.save.07.17.15.18.38.50
job_08024be980006763.save.07.17.15.18.31.20
job_08024be980006763.save.07.17.15.18.40.20
job_08024be980006766.save.07.17.15.18.35.49
job_08024be980006766.save.07.17.15.18.41.50
job_08024be9800067c1.save.07.17.15.18.43.20

To solve this problem I replaced the “%s.%s.%02d.%02d.%02d.%02d.%02d.%02d” pattern in the dm_agent_exec binary with “%s.%s\0%02d.%02d.%02d.%02d.%02d.%02d” (\0 is a binary zero).

Disable useless jobs

By default, the Documentum installer enables the following jobs:

  • dm_usageReport
  • dm_WfmsTimer
  • dm_ContentWarning
  • dm_DBWarning
  • dm_StateOfDocbase
  • dm_UpdateStats
  • dm_DataDictionaryPublisher
  • dm_bpm_XCPAutoTaskMgmt
  • dce_Clean
  • dm_QmPriorityAging
  • dm_QmPriorityNotification
  • dm_QmThresholdNotification
  • dm_WFReporting
  • dm_WFSuspendTimer

and all of them are useless in a development environment:

API> ?,c,update dm_job objects set is_inactive=TRUE 
   where object_name like 'dm\_%' escape '\' or object_name like 'dce\_%' escape '\'
objects_updated
---------------
             50
(1 row affected)
[DM_QUERY_I_NUM_UPDATE]info:  "50 objects were affected by your UPDATE statement."

Improve startup time of DMCL applications

java.ini:

#
# java_options      - Options for the Java VM.
#
java_options = "-Xms4m -Xmx64m -XX:PermSize=4m -XX:MaxPermSize=256m -XX:+UseSerialGC -Xrs"

java.security:

#
# List of providers and their preference orders (see above):
#
security.provider.1=sun.security.provider.Sun
security.provider.2=sun.security.rsa.SunRsaSign
security.provider.3=sun.security.ec.SunEC
security.provider.4=com.sun.net.ssl.internal.ssl.Provider
security.provider.5=com.sun.crypto.provider.SunJCE
security.provider.6=sun.security.jgss.SunProvider
security.provider.7=com.sun.security.sasl.Provider
security.provider.8=org.jcp.xml.dsig.internal.dom.XMLDSigRI
security.provider.9=sun.security.smartcardio.SunPCSC

Improve startup time of Java Method Server

  1. undeploy acs.ear
  2. tune heap size – default settings: -Xms1024m -Xmx1024m -XX:PermSize=64m -XX:MaxPermSize=256m are too greedy
  3. replace standalone.xml with:
    <?xml version='1.0' encoding='UTF-8'?>
    
    <server xmlns="urn:jboss:domain:1.2">
    
        <extensions>
            <extension module="org.jboss.as.deployment-scanner"/>
            <extension module="org.jboss.as.ee"/>
            <extension module="org.jboss.as.naming"/>
            <extension module="org.jboss.as.remoting"/>
            <extension module="org.jboss.as.security"/>
            <extension module="org.jboss.as.web"/>
        </extensions>
    
        <profile>
            <subsystem xmlns="urn:jboss:domain:deployment-scanner:1.1">
                <deployment-scanner path="deployments" relative-to="jboss.server.base.dir" scan-interval="0" deployment-timeout="300"/>
            </subsystem>
            <subsystem xmlns="urn:jboss:domain:ee:1.0"/>
            <subsystem xmlns="urn:jboss:domain:naming:1.1"/>
            <subsystem xmlns="urn:jboss:domain:remoting:1.1"/>
            <subsystem xmlns="urn:jboss:domain:security:1.1">
                <security-domains>
                    <security-domain name="other" cache-type="default"/>
                </security-domains>
            </subsystem>
            <subsystem xmlns="urn:jboss:domain:web:1.1" default-virtual-server="default-host" native="false">
                <connector name="http" protocol="HTTP/1.1" scheme="http" socket-binding="http"/>
                <virtual-server name="default-host" enable-welcome-root="true">
                    <alias name="localhost"/>
                    <alias name="example.com"/>
                </virtual-server>
            </subsystem>
        </profile>
    
        <interfaces>
            <interface name="public">
                <inet-address value="${jboss.bind.address:0.0.0.0}"/>
            </interface>
        </interfaces>
    
        <socket-binding-group name="standard-sockets" default-interface="public" port-offset="${jboss.socket.binding.port-offset:0}">
            <socket-binding name="ajp" port="9089"/>
            <socket-binding name="http" port="9080"/>
            <socket-binding name="https" port="9082"/>
        </socket-binding-group>
    
    </server>
    

Disable SSL

It seems that starting from 7.2(?) the Documentum installer enables SSL by default:

API> retrieve,c,dm_server_config
...
3d024be980000102
API> get,c,l,secure_connect_mode
...
secure

disable it to improve response time:

API> set,c,l,secure_connect_mode
SET> native
...
OK
API> save,c,l
...
OK

Disable email notifications

server.ini:

mail_notification = F

Disable updating last_login_utc_time attribute of dm_user objects

API> ?,c,exec exec_sql with 
    query='update dm_user_s set i_vstamp=i_vstamp+1, 
    last_login_utc_time=to_date(''01.01.0001'',''dd.mm.yyyy'')'
result
------------
T
(1 row affected)

Disable updating r_access_date attribute of dm_sysobject objects

server.ini:

update_access_date = F

Disable MACL security

API> retrieve,c,dm_docbase_config
...
3c024be980000103
API> set,c,l,macl_security_disabled
SET> T
...
OK
API> save,c,l
...
OK

Disable dmbasic method server

server.ini:

# This controls the dmbasic method server.
method_server_enabled = F
method_server_threads = 5

Disable auditing

API> unaudit,c,,dm_default_set
...
OK
API> unaudit,c,,dm_logon_failure
...
OK

Q & A. VIII

I am grateful to you for creating dctmpy which I am planning to heavily use in my icinga2 monitoring environment. This work you have done is commendable. Earlier I was planning to write my own custom plugins using perl, then I figured out it is not gonna be easy considering the effort required to make Db::Documentum (which Scott created) work in D6+ environments. That is when I came across your wonderful work. I have successfully tested functionalities like login, sessioncount, targets, login etc.

I believe you have recently introduced a few modes. And as the documentation in emc community about dctmpy (https://community.emc.com/people/aldago-zF7Lc/blog/2014/05/19/monitoring-documentum-with-nagios-and-dctmpy-plugin) is not up to date, could you please provide some details about different modes like jobs, indexagents, acsstatus, timeskew, xplorestatus, query, method, indexqueue, serverworkqueue, countquery and how to use them. I will be especially interested in query and countquery. If this means I can run any query in docbase and compare the output against the thresholds we supply, that will be awesome.

Eagerly looking forward to your response,
Once again, thank you so much!
– Vishnu

First of all, I strongly recommend not treating Alvaro's post as a guide to action: not because it's wrong, but because configuring nagios *.cfg files is the same as writing sendmail.cf without m4; my preference is to use opsview.

Before describing the capabilities of dctmpy, I think it is worth defining what needs to be monitored and why; otherwise, the monitoring objectives are not clear. For example, I tried to understand what ReveilleSoftware really does, and after checking some presentations and YouTube clips I came to the conclusion that ReveilleSoftware just draws bars and pies :). So, dctmpy seems to be the only reliable monitoring solution for Documentum; others, if they exist, are either based on DFC, which requires extra setup or hacks, or on other Documentum services, which makes them dependent on the underlying service.

Docbroker service

Typical Docbroker issues are:

  1. Docbroker is down – somebody forgot to start it, the Docbroker failed, there are connectivity issues, or an attacker stopped it
  2. Content Server is not registered on the Docbroker – misconfiguration on the CS side or connectivity issues
  3. The wrong Content Server is registered on the Docbroker – I have seen some stupid cases when infrastructure guys clone PROD to UAT (EMC does not provide any reliable solution for loading data into a repository, so cloning is the most reliable way to do it) but forget to modify the network settings, after which users work with the wrong environment
  4. An attacker poisoned the registration information
  5. Docbroker is under DoS – for some weird reason the Docbroker implementation is extremely ugly, and even a telnet connection to the Docbroker port causes a DoS, for example:
    # session 1
     ~]$ nc 192.168.13.131 1489
    <just enter here>
    
    # session 2
    ~]$ time timeout 20 dmqdocbroker -c getdocbasemap
    dmqdocbroker: A DocBroker Query Tool
    dmqdocbroker: Documentum Client Library Version: 7.2.0000.0054
    Targeting current host
    Targeting port 1489
    
    real    0m20.002s
    user    0m0.002s
    sys     0m0.000s
     ~]$ echo $?
    124
    
  6. DoS caused by a slow client or network problems – yes, it's weird, but a client or server with network issues can affect the entire Documentum infrastructure, so it is always a good idea to use different docbrokers for different services

I believe all these situations are covered by nagios_check_docbroker; some examples:

Basic check of availability:

nagios_check_docbroker -H 192.168.13.131:1489
CHECKDOCBROKER OK - docbase_map_time is 6ms, Registered docbases: DCTM_DEV
| docbase_map_time=6ms;100;;0

The same for SSL connection (note -s flag and increased response time):

nagios_check_docbroker -H 192.168.13.131:1490 -s
CHECKDOCBROKER OK - docbase_map_time is 423ms, Registered docbases: DCTM_DEV
| docbase_map_time=423ms;;;0

Adding response time thresholds:

nagios_check_docbroker -H 192.168.13.131:1490 -s -w 100
CHECKDOCBROKER WARNING - docbase_map_time is 490ms (outside range 0:100),
      Registered docbases: DCTM_DEV
| docbase_map_time=490ms;100;;0

nagios_check_docbroker -H 192.168.13.131:1490 -s -w 100 -c 200
CHECKDOCBROKER CRITICAL - docbase_map_time is 442ms (outside range 0:200),
           Registered docbases: DCTM_DEV
| docbase_map_time=442ms;100;200;0

Checking registration of certain docbase(s):

nagios_check_docbroker -H 192.168.13.131:1489 -d DCTM_DEV
CHECKDOCBROKER OK - docbase_map_time is 7ms,
      Server DCTM_DEV.DCTM_DEV is registered on 192.168.13.131:1489
| docbase_map_time=7ms;;;0

nagios_check_docbroker -H 192.168.13.131:1489 -d DCTM_DEV1
CHECKDOCBROKER CRITICAL - 
      Docbase DCTM_DEV1 is not registered on 192.168.13.131:1489,
      docbase_map_time is 7ms
| docbase_map_time=7ms;;;0

# multiple docbases
nagios_check_docbroker -H 192.168.13.131:1489 -d DCTM_DEV1,DCTM_DEV
CHECKDOCBROKER CRITICAL - 
      Docbase DCTM_DEV1 is not registered on 192.168.13.131:1489,
      docbase_map_time is 5ms,
      Server DCTM_DEV.DCTM_DEV is registered on 192.168.13.131:1489
| docbase_map_time=5ms;;;0

Checking registration of certain server(s):

nagios_check_docbroker -H 192.168.13.131:1489 -d DCTM_DEV.DCTM_DEV
CHECKDOCBROKER OK - docbase_map_time is 6ms,
       Server DCTM_DEV.DCTM_DEV@192.168.13.131 is registered on 192.168.13.131:1489
| docbase_map_time=6ms;;;0

nagios_check_docbroker -H 192.168.13.131:1489 -d DCTM_DEV.DCTM
CHECKDOCBROKER CRITICAL - 
       Server DCTM_DEV.DCTM is not registered on 192.168.13.131:1489,
       docbase_map_time is 11ms
| docbase_map_time=11ms;;;0

#multiple servers
nagios_check_docbroker -H 192.168.13.131:1489 -d DCTM_DEV.DCTM,DCTM_DEV.DCTM_DEV
CHECKDOCBROKER CRITICAL - 
       Server DCTM_DEV.DCTM is not registered on 192.168.13.131:1489,
       docbase_map_time is 7ms, 
       Server DCTM_DEV.DCTM_DEV@192.168.13.131 is registered on 192.168.13.131:1489
| docbase_map_time=7ms;;;0

Checking IP addresses of registered servers:

nagios_check_docbroker -H 192.168.13.131:1489 -d DCTM_DEV.DCTM_DEV@192.168.13.131
CHECKDOCBROKER OK - docbase_map_time is 8ms,
       Server DCTM_DEV.DCTM_DEV@192.168.13.131 is registered on 192.168.13.131:1489
| docbase_map_time=8ms;;;0

nagios_check_docbroker -H 192.168.13.131:1489 -d DCTM_DEV.DCTM_DEV@192.168.13.132
CHECKDOCBROKER CRITICAL - 
       Server DCTM_DEV.DCTM_DEV (status: Open) is registered on 192.168.13.131:1489 
        with wrong ip address: 192.168.13.131, expected: 192.168.13.132,
       docbase_map_time is 7ms
| docbase_map_time=7ms;;;0

Checking malicious registrations (note -f flag):

nagios_check_docbroker -H 192.168.13.131:1489 -f -d DCTM_DEV.DCTM_DEV@192.168.13.132
CHECKDOCBROKER CRITICAL - 
       Server DCTM_DEV.DCTM_DEV (status: Open) is registered on 192.168.13.131:1489
         with wrong ip address: 192.168.13.131, expected: 192.168.13.132, 
       Malicious server DCTM_DEV.DCTM_DEV@192.168.13.131 (status: Open)
         is registered on 192.168.13.131:1489,
       docbase_map_time is 9ms
| docbase_map_time=9ms;;;0

Repository services

Actually, there are a lot of things to monitor; nagios_check_docbase covers the most common issues. The common command line pattern for all checks is:

nagios_check_docbase -H <hostname> -p <port> -i <docbaseid> -l <username>
 -a <password> -m <mode> -n <name> [-s] [-t <timeout>] <specific arguments>

where:

  • hostname – hostname or IP address where Documentum is running
  • port – the TCP port Documentum is listening on (this is not the docbroker port)
  • docbaseid – docbase identifier (see docbase_id in server.ini; it may be omitted, but in that case you will get stupid exceptions in the repository log)
  • username – username used to connect to Documentum
  • password – password used to connect to Documentum
  • -s – defines whether to use an SSL connection
  • timeout – timeout in seconds after which the check fails; the default is 60 seconds (useful for query checks), for example:
    nagios_check_docbase -H dctms://dmadmin:dmadmin@192.168.13.131:10000 \
      -m countquery \
      --query "select count(*) from dm_folder a, dm_folder b, dm_folder c"
    COUNTQUERY UNKNOWN: Timeout: check execution aborted after 60s
    
    nagios_check_docbase -H dctms://dmadmin:dmadmin@192.168.13.131:10001 \
      -m countquery -t 3600 \
      --query "select count(*) from dm_folder a, dm_folder b, dm_folder c"
    COUNTQUERY OK - countquery is 14544652121
    | countquery=14544652121;;;0 query_time=2703163ms;;;0
    
  • name – name of the check displayed in the output; the default is the uppercased check name, for example:
    nagios_check_docbase -H dctms://dmadmin:dmadmin@192.168.13.131:10001 -m login
    LOGIN OK - user: dmadmin, connection: 1229ms, authentication: 136ms
    | authentication_time=136ms;;;0 connection_time=1229ms;;;0
    
    nagios_check_docbase -H dctms://dmadmin:dmadmin@192.168.13.131:10001 \
     -m login -n superuser_login
    SUPERUSER_LOGIN OK - user: dmadmin, connection: 941ms, authentication: 86ms
    | authentication_time=86ms;;;0 connection_time=941ms;;;0
    
  • mode – one of:
    • sessioncount – checks the count of active sessions in the repository, i.e. hot_list_size in the COUNT_SESSIONS RPC command result; example (the last number in the performance output is the value of concurrent_sessions in server.ini):
      nagios_check_docbase -H dctms://dmadmin:dmadmin@192.168.13.131:10001 \
        -m sessioncount
      SESSIONCOUNT OK - sessioncount is 4
      | sessioncount=4;;;0;100
      
      # critical threshold
      nagios_check_docbase -H dctms://dmadmin:dmadmin@192.168.13.131:10001 \
         -m sessioncount -c 2
      SESSIONCOUNT CRITICAL - sessioncount is 4 (outside range 0:2)
      | sessioncount=4;;2;0;100
      
      # warning and critical thresholds:
      nagios_check_docbase -H dctms://dmadmin:dmadmin@192.168.13.131:10001 \
         -m sessioncount -w 2 -c 6
      SESSIONCOUNT WARNING - sessioncount is 4 (outside range 0:2)
      | sessioncount=4;2;6;0;100
      
    • targets – checks whether the repository is registered on all configured docbrokers, example:
      nagios_check_docbase -H dctms://dmadmin:dmadmin@192.168.13.131:10001 -m targets
      TARGETS OK - DCTM_DEV.DCTM_DEV has status Open on docu72dev01:1489
      
    • indexagents – checks the status of configured index agents, i.e. checks that the status returned by the FTINDEX_AGENT_ADMIN RPC is 100, example:
      # no index agents configured in docbase:
      nagios_check_docbase -H dctms://dmadmin:dmadmin@192.168.13.131:10001 \
         -m indexagents
      INDEXAGENTS WARNING - No indexagents
      
      # stopped index agent
      nagios_check_docbase -H 192.168.2.56:12000/131031 -l dmadmin -a dmadmin \
           -m indexagents
      INDEXAGENTS WARNING - Indexagent docu70dev01_9200_IndexAgent is stopped
      
    • jobs – checks job scheduling, i.e. checks whether the job is in an active state (so it might be picked up by agentexec), checks the last return code of the job method, and checks whether agentexec honors the schedule (the last check is very inaccurate because of the weird agentexec implementation, so checking jobs that are supposed to run frequently might produce unexpected results), example:
      # single job
      nagios_check_docbase -H dctms://dmadmin:dmadmin@192.168.13.131:10001 -m jobs \
         --job dm_UpdateStats
      JOBS OK - dm_UpdateStats last run - 1 days 02:31:34 ago
      
      # multiple jobs (comma-separated list)
      nagios_check_docbase -H dctms://dmadmin:dmadmin@192.168.13.131:10001 -m jobs \
         --job dm_ConsistencyChecker,dm_UpdateStats
      JOBS CRITICAL - dm_ConsistencyChecker is inactive,
          dm_UpdateStats last run - 1 days 02:35:22 ago
      
      # job with bad last return code
      nagios_check_docbase -H dctms://dmadmin:dmadmin@192.168.13.131:10001 -m jobs \
         --job dm_usageReport
      JOBS CRITICAL - dm_usageReport has status: FAILED:  
         Could not launch method dm_usageReport:  OS error: (No Error), DM error: ()
      
    • nojobs – checks that a certain job is not scheduled (i.e. the reverse of the “jobs” mode) – the default Documentum installation schedules certain jobs which consume a lot of resources but do nothing useful, and such jobs must be disabled, example:
      nagios_check_docbase -H dctms://dmadmin:dmadmin@192.168.13.131:10001 \
          -m nojobs --job dm_DBWarning
      NOJOBS CRITICAL - dm_DBWarning is active
      
    • timeskew – checks the time difference in seconds between the Documentum host and the monitoring server, example:
      nagios_check_docbase -H dctms://dmadmin:dmadmin@192.168.13.131:10001 \
          -m timeskew
      TIMESKEW OK - timeskew is 66.02
      | timeskew=66.0209999084;;;0
      
      # critical threshold
      nagios_check_docbase -H dctms://dmadmin:dmadmin@192.168.13.131:10001 \
          -m timeskew -c 60
      TIMESKEW CRITICAL - timeskew is 66.23 (outside range 0:60)
      | timeskew=66.2279999256;;60;0
      
      # warning and critical thresholds
      nagios_check_docbase -H dctms://dmadmin:dmadmin@192.168.13.131:10001 \
          -m timeskew -w 60 -c 120
      TIMESKEW WARNING - timeskew is 66.17 (outside range 0:60)
      | timeskew=66.1689999104;60;120;0
      
    • query – executes a select statement and checks whether the count of returned rows is inside the specified threshold ranges (for the checks described previously the threshold ranges were trivial (i.e. “less than”), but for this check you may want to specify more complex conditions like “the count of returned rows must be greater than the specified threshold”; see the nagios-plugin documentation for threshold formats); additionally, the output may be formatted by specifying the --format argument, example:
      # no thresholds, just formatted output
      nagios_check_docbase -H dctms://dmadmin:dmadmin@192.168.13.131:10001 \
         --query "select user_name,user_state from dm_user where user_state<>0" \
         -m query --format {user_name}:{user_state}
      QUERY OK - hacker:1 - 3ms
      | count=1;;;0 query_time=3ms;;;0
      
      # count of rows does not exceed critical threshold 
      nagios_check_docbase -H dctms://dmadmin:dmadmin@192.168.13.131:10001 \
          --query "select user_name,user_state from dm_user where user_state<>0" \
          -m query --format {user_name}:{user_state} -w 0 -c 1
      QUERY WARNING - hacker:1 - 3ms (outside range 0:0)
      | count=1;0;1;0 query_time=3ms;;;0
      
      # count of rows is greater than or equal to critical threshold
      nagios_check_docbase -H dctms://dmadmin:dmadmin@192.168.13.131:10001 \
          --query "select user_name,user_state from dm_user where user_state<>0" \
          -m query --format {user_name}:{user_state} -c 2:
      QUERY CRITICAL - hacker:1 - 3ms (outside range 2:)
      | count=1;;2:;0 query_time=3ms;;;0
      
      # also check query execution time against thresholds
      nagios_check_docbase -H dctms://dmadmin:dmadmin@192.168.13.131:10001 \
          --query "select user_name,user_state from dm_user where user_state<>0" \
          -m query --format {user_name}:{user_state} -c 2: --criticaltime 2
      QUERY CRITICAL - hacker:1 - 3ms (outside range 2:)
      | count=1;;2:;0 query_time=3ms;;2;0
      
    • method – technically it is the same as the “query” mode, but it accepts only “execute do_method” queries and additionally checks the value of the launch_failed result attribute; I believe such an approach to checking the health of JMS is more reliable than the “jmsstatus” mode (see below), example:
      nagios_check_docbase -H dctms://dmadmin:dmadmin@192.168.13.131:10001 \
          -m method --query "execute do_method with method='JMSHealthChecker'"
      METHOD OK
      | query_time=14ms;;;0
      
    • countquery – technically it is the same as the “query” mode, but this mode assumes that the query returns a single row with a single attribute (actually it just picks the first attribute of the first row), example:
      nagios_check_docbase -H dctms://dmadmin:dmadmin@192.168.13.131:10001 \
          -m countquery --query "select count(*) from dm_sysobject"
      COUNTQUERY OK - countquery is 8746
      | countquery=8746;;;0 query_time=7ms;;;0
      
    • workqueue – checks the total number of non-completed auto-activities for the whole repository; in effect it checks whether the configured number of workflow agents is sufficient, and in some cases growth of the workflow queue may indicate issues either with the workflow agent or with JMS, example:
      nagios_check_docbase -H dctms://dmadmin:dmadmin@192.168.13.131:10001 \
          -m workqueue
      WORKQUEUE OK - workqueue is 0
      | workqueue=0;;;0
      
    • serverworkqueue – checks the number of non-completed auto-activities for the current server, i.e. the number of auto-activities acquired by the server’s workflow agent, example:
      nagios_check_docbase -H dctms://dmadmin:dmadmin@192.168.13.131:10001 \
          -m serverworkqueue
      SERVERWORKQUEUE OK - DCTM_DEV is 0
      | DCTM_DEV=0;;;0
      
    • indexqueue – checks the index agent queue size; it is worth combining this check with the “indexagents” check because, again, due to the weird implementation of the index agent it may report a “running” status but not actually process the queue, example:
      nagios_check_docbase -H 192.168.2.56:12000/131031 -l dmadmin -a dmadmin \
           -m indexqueue -w 1000 -c 2000
      INDEXQUEUE CRITICAL - _fulltext_index_user is 4.978e+04 (outside range 0:2000)
      | _fulltext_index_user=49781;1000;2000;0
      
    • ctsqueue – the same as “indexqueue” but for CTS, no example because I do not have CTS installed
    • failedtasks – checks the number of failed auto-activities, example:
      nagios_check_docbase -H 192.168.2.56:12000/131031 -l dmadmin -a dmadmin \
             -m failedtasks
      FAILEDTASKS CRITICAL - 1 task(s): 'Last Performer' (tp002-000_user1)
      
    • login – checks whether a certain user is able to authenticate (I use this to check LDAP availability), example:
      nagios_check_docbase -H dctms://dmadmin:dmadmin@192.168.13.131:10001 -m login
      LOGIN OK - user: dmadmin, connection: 1804ms, authentication: 93ms
      | authentication_time=93ms;;;0 connection_time=1804ms;;;0
      
      # thresholds
      nagios_check_docbase -H dctms://dmadmin:dmadmin@192.168.13.131:10001 \
          -m login --warningtime 500 --criticaltime 1000
      LOGIN WARNING - user: dmadmin, connection: 909ms, authentication: 86ms
      | authentication_time=86ms;500;1000;0 connection_time=909ms;500;1000;0
      
    • jmsstatus – checks availability of JMS, example:
      nagios_check_docbase -H dctms://dmadmin:dmadmin@192.168.13.131:10001 \
         -m jmsstatus
      JMSSTATUS OK - http://docu72dev01:9080/DmMethods/servlet/DoMethod - 60ms, 
                     http://docu72dev01:9080/DmMail/servlet/DoMail - 2ms, 
                     http://docu72dev01:9080/bpm/servlet/DoMethod - 6ms
      | response_time_08024be980000ced_do_bpm=6ms;;;0
      response_time_08024be980000ced_do_mail=2ms;;;0
      response_time_08024be980000ced_do_method=60ms;;;0
      
    • ctsstatus – checks availability of CTS, no example
    • acsstatus – checks availability of ACS, no example
    • xplorestatus – checks availability of xPlore, no example
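
The -w/-c threshold arguments used above (e.g. -c 2 or -c 2:) follow the standard Nagios plugin range format. A minimal sketch of how such a range can be evaluated (a hypothetical helper, not the plugin’s actual code):

```python
def outside_range(value: float, spec: str) -> bool:
    """Return True if value should trigger an alert for the given Nagios
    threshold spec. Standard format is [@]start:end, where a bare number N
    means 0:N, "N:" means N..infinity, and "~" stands for -infinity.
    A leading "@" inverts the check (alert when *inside* the range)."""
    invert = spec.startswith("@")
    if invert:
        spec = spec[1:]
    if ":" in spec:
        start_s, _, end_s = spec.partition(":")
    else:
        start_s, end_s = "0", spec
    start = float("-inf") if start_s == "~" else float(start_s or 0)
    end = float("inf") if end_s == "" else float(end_s)
    inside = start <= value <= end
    return inside if invert else not inside

# "-c 2" with 4 sessions: 4 is outside 0:2 -> CRITICAL
assert outside_range(4, "2") is True
# "-c 2:" with 1 row: 1 is outside 2:infinity -> CRITICAL
assert outside_range(1, "2:") is True
```

This matches the outputs shown above, e.g. “sessioncount is 4 (outside range 0:2)” and “count=1 … (outside range 2:)”.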

Because the host, port, docbaseid, username and password arguments are mandatory, it is hard to create a flexible setup in nagios (for example, opsview allows setting only four arguments per template), so these arguments may be collapsed into a single one (host) using the following convention (see previous examples):

dctm[s]://username:password@host:port/docbaseid
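
This convention fits the standard URL shape, so it can be parsed with stock tools; a sketch of such a parser (a hypothetical helper, the plugin’s actual parsing code may differ):

```python
from urllib.parse import urlparse

def parse_target(target: str) -> dict:
    """Split a dctm[s]://username:password@host:port/docbaseid target
    into its components; docbaseid is optional."""
    parsed = urlparse(target)
    if parsed.scheme not in ("dctm", "dctms"):
        raise ValueError("expected dctm:// or dctms:// scheme")
    return {
        "ssl": parsed.scheme == "dctms",        # dctms means SSL connection
        "username": parsed.username,
        "password": parsed.password,
        "host": parsed.hostname,
        "port": parsed.port,
        "docbaseid": parsed.path.lstrip("/") or None,
    }

cfg = parse_target("dctms://dmadmin:dmadmin@192.168.13.131:10001/131031")
# cfg["ssl"] is True, cfg["port"] == 10001, cfg["docbaseid"] == "131031"
```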

also, the password may be obfuscated using the following approach:

echo -ne "password" | \
  perl -na -F// -e 'print reverse map{sprintf("%02x",(ord$_^0xB6||0xB6))}@F'

for example:

 ~]$ echo -ne dmadmin | \
> perl -na -F// -e 'print reverse map{sprintf("%02x",(ord$_^0xB6||0xB6))}@F'
d8dfdbd2d7dbd2[dmadmin@docu72dev01 ~]$
 ~]$ check_docbase.py -H dctms://dmadmin:d8dfdbd2d7dbd2@192.168.13.131:10001 \
       -m login
LOGIN OK - user: dmadmin, connection: 1805ms, authentication: 93ms
| authentication_time=93ms;;;0 connection_time=1805ms;;;0
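
The Perl one-liner above XORs each byte with 0xB6 (substituting 0xB6 when the XOR yields zero), hex-encodes the bytes and reverses their order. An equivalent sketch in Python (the helper names here are hypothetical):

```python
def obfuscate(password: str) -> str:
    """XOR each byte with 0xB6 (falling back to 0xB6 if the XOR yields 0),
    hex-encode, and emit the bytes in reverse order. Note the scheme is
    lossy for a byte equal to 0xB6 itself, since the fallback is ambiguous."""
    return "".join(
        "%02x" % ((ord(c) ^ 0xB6) or 0xB6) for c in reversed(password)
    )

def deobfuscate(obfuscated: str) -> str:
    """Split into hex pairs, reverse their order, and undo the XOR."""
    pairs = [obfuscated[i:i + 2] for i in range(0, len(obfuscated), 2)]
    return "".join(chr(int(p, 16) ^ 0xB6) for p in reversed(pairs))

print(obfuscate("dmadmin"))  # d8dfdbd2d7dbd2, as in the example above
```

This is obfuscation, not encryption: anyone with the monitoring configuration can trivially recover the password, so it only protects against casual shoulder-surfing.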

My apologies for the last comment

Actually, I always try to double-check everything before posting it in my blog, but today I made a mistake: I decided that the comment from Dearash was a 1st April joke, but after some research I realized it was not a joke – it was a home truth about support. Dearash’s comment had come from EMC’s network:



So, it looks like Dearash wanted to emphasize that EMC support is unable to help customers with their difficulties – the original quote from Sukumar is:

Because in my case i’m facing some aspect related issue which creating the object through webtop and it neither allows me to create nor open the existing document instances of that type..Looks like a bug and still SR is pending with EMC about this issue

and to suggest looking for alternative sources of support.

Sorry for the stupid joke about ECN.