A FATAL error has occurred

Have you ever seen an error like this?

(screenshot: [DM_SESSION_E_AUTH_FAIL]error: "Authentication failed for user")

I believe everybody who has tried to deploy Documentum in a large enterprise has faced such spontaneous errors but never paid much attention to them, because the error message is completely misleading: "[DM_SESSION_E_AUTH_FAIL]error: "Authentication failed for user"" – a dumb user entered an invalid password, so it's not our issue. In reality, this error reveals a lot of problems related to session management in Documentum.

Root cause

If you take a careful look at the stacktrace, it becomes clear that the error originates not from the login page but from somewhere in the guts of WDK (put any other application here):

at com.documentum.web.formext.privilege.PrivilegeService.getUserPrivilege(PrivilegeService.java:57)
at com.documentum.web.formext.config.PrivilegeQualifier.getScopeValue(PrivilegeQualifier.java:69)
at com.documentum.web.formext.config.BoundedContextCache.retrieveQualifierScopeValue(BoundedContextCache.java:153)
at com.documentum.web.formext.config.ScopeKey.<init>(ScopeKey.java:57)
at com.documentum.web.formext.config.ConfigService.makeScopeKey(ConfigService.java:1482)
at com.documentum.web.formext.config.ConfigService.lookupElement(ConfigService.java:527)

This means that the user had already logged in successfully, so this is obviously not the user's mistake. So, what really happens there? The basic description of the problem is: the user's DFC session got either reused by another user or released by the application, and when the application tries to acquire a new DFC session, it fails. The first question, then, is why the application fails to acquire a new DFC session. There are a lot of possible reasons, but the most common one in large enterprises is the following: LDAP authentication in Documentum is unstable – most of the time it works as expected, but sometimes it fails and causes hardly diagnosable issues.

Mitigation options

  1. disable D7 session pooling in DFC (i.e. set dfc.compatibility.useD7SessionPooling to false) – most customers noticed that the error started occurring more frequently after moving to a new version of DFC, and that is actually true, because the new pooling implementation tends to keep the number of DFC sessions as small as possible, so the number of authentication requests increases (see the dfc.properties sketch after this list)
  2. if you use bind_type=bind_search_dn in the LDAP config, switch to bind_by_dn – it decreases the number of LDAP round-trips
  3. use the nearest LDAP servers possible
  4. put the blame on EMC – authentication is not the kind of thing that should occur every 5 seconds due to poor application design
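
For the first option, the change is a one-line setting in dfc.properties – a minimal sketch, assuming your DFC version still honours this flag (the rest of the file stays untouched):

# revert to the pre-D7 session pooling behaviour
dfc.compatibility.useD7SessionPooling=false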

Feedback storm

Last week I received a dozen requests related to one of my recent blogposts, and all of them contained the same two questions:

  • Why is it password protected?
  • When will it be publicly available?

The short story is: after installing the latest Content Server patches I faced severe compatibility issues, and further research revealed that those issues are caused by "security enhancements" which do not really look like security fixes. For example, I spent about 6 hours finding out why my application started failing on the latest patchset and how to disable the new weird behaviour, and just 30 minutes writing a new proof of concept which bypasses the new "security enhancements". After that I performed a deep analysis of the last ten security vulnerabilities which EMC announced as remediated and found the same problem: nothing was fixed; moreover, some of them contain such crude mistakes that they look more like backdoors than mistakes. At the moment I'm trying to bring together all the information I have, and this blogpost will become publicly available soon.

Do you like obsessive advertising?

I have no idea who came up with the dumb idea of displaying a document's content in a browser – I always thought that if you are unable to open a file in a specialized application, it means you are not supposed to see that file. Today I noticed on LinkedIn an advertisement for yet another square wheel and realised that the ARender advertising is really obsessive; some examples:

But being a curious person, I decided to give ARender a chance and "tried" it. The result, as expected, was mediocre – 10MB of network traffic for a small PDF file. Interesting, how can it be fast (quote: "Extremely fast startup time, no application download required at client side.") if it sends a bunch of HTTP requests on every resize? But maybe network traffic is not an issue anymore; after all, it is 2015. OK, let's explore the ARender site.

Oops…

Oops…

Do you have any idea why I like the /etc/passwd file (I believe passing /dev/zero is also a funny option)? It contains information about users' home directories, which in turn contain .bash_history files:

Oh no…

PS. I got a response from the ARender team:

Greetings,

We have read your blogpost thoroughly regarding the problems you raised on our document viewer, ARender.

First of all, many thanks for sending us the potential weaknesses and bugs you could find in order for us to improve and consolidate our solutions.

Regarding your raised issues about ARender’s bandwidth usage, this originates from our backward compatibility with Internet Explorer 6. As the latter does not handle resizing of pictures very well, we had to request pictures with different sizes for each window size change. Now with ARender 3, and the drop of IE6 compatibility, we will soon be able to use a resizing mechanism, with only some key picture sizes requested on demand when the quality starts to be altered by the zoom. This will leverage the number of images requested but also the number of http requests.

For the security issue regarding the access to critical system paths, it is possible in ARender to turn off the filesystem access, and in the future, to restrict specific paths once ARender enters production environment. We also recently integrated ARender in docker, that we will try to promote and push as standard usage. As ARender is then deployed in a minimalistic environment, there will be no services exposure other than ARender itself and no access to the real host filesystem.

Some ideas about organising storage for content files

Memento mori

When planning how you are going to store content files, always think about disaster recovery. The typical case is: storage admins ask you how much disk space you need and then provision one large 10-20TB LUN for Documentum – this is completely wrong, because in case of disaster recovery your primary goal is to decrease RTO and RPO, and restoring "obsolete" files sitting in 10-20TB LUNs won't help you. Business users always have preferences about what needs to be recovered first – it may be content of specific business-critical types or content loaded within the last two days/weeks/months; also keep in mind that Documentum does not work without the content of the /System cabinet.

General considerations are:

  1. always prefer NAS to SAN – in general, NAS appliances are slower than SAN ones, but that is not an issue for Documentum; furthermore, most NAS appliances have built-in capabilities which do not exist in SAN appliances. For example, if you need to scale your repository across multiple servers you have two options: create a cluster filesystem (cluster software costs extra money and requires extra maintenance) or use NAS. A typical NAS appliance is a symbiosis of filesystem, network and disk drivers, so most NAS appliances have built-in replication and snapshot capabilities (SAN appliances may have such capabilities too, but the problem is that a SAN appliance has no idea what is stored in the underlying LUN)
  2. if you have no choice and SAN is the only option, always use a volume manager – never ever create a filesystem on a LUN without a volume manager, otherwise in the future you will be unable to perform extremely simple operations without downtime. For example, if I need to move all data from one storage to another (somebody decided to decommission an old appliance, or I decided to move old data to slower storage), I just add a new physical volume to the existing disk group, remove the old physical volume and wait while the volume manager moves the data between physical volumes online
  3. split content volumes into maintainable pieces – it may be 3-6 months' worth of data or 1-2TB volumes; in my deployments I have found that 2TB is an optimal size (see the DQL sketch after this list)
  4. try to understand the business value of the stored content and design the storage accordingly; the Content Storage Services option is your friend here
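
As a starting point for such planning it helps to see how the existing filestores map to filesystem paths; a quick DQL sketch using the standard dm_filestore/dm_location attributes (run it as a superuser and adjust to your environment):

-- map each filestore to its location and filesystem path
SELECT f.name, f.root, l.file_system_path
  FROM dm_filestore f, dm_location l
 WHERE f.root = l.object_name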

Trusted Content Services

Never ever use the Trusted Content Services option for encrypting content files; the considerations are:

  • it does not bring any value from the security perspective – even stubborn EMC employees have realised that
  • there are different opinions about how to properly use antivirus software in a Documentum environment: some guys think that real-time scanning is good, other guys think that periodic AV scans of content volumes are OK, but what are you going to find if all the content is encrypted? Moreover, viruses have a dumb nature: a file that is considered harmless today may be flagged as harmful tomorrow, so encryption is not an AV friend.
  • it seems that EMC fails to provide backward compatibility for the TCS option across releases and operating systems: "How will content be re-encrypted during TCS 7.2 upgrade?", "Documentum Migration from AIX to Linux"

Fighting with Composer. NLS data

Case: I played with locale data in Composer, installed the project a couple of times, and now I want to perform a "clean" installation without reinstalling the whole repository. The problem is that in Composer locale data is hierarchical, but in the repository it is flat: for example, if I overrode an attribute's label for a subtype and now want to inherit it from the supertype again, I need to keep the overridden label in sync with the supertype – not exactly convenient.

Solution:

-- deleting NLS data for particular types
DELETE dmi_object_type
 WHERE r_object_id IN (SELECT r_object_id
                         FROM dm_nls_dd_info_s
                        WHERE parent_id IN (SELECT r_object_id
                                              FROM dm_aggr_domain_s
                                             WHERE type_name ...));

DELETE dm_nls_dd_info_r
 WHERE r_object_id IN (SELECT r_object_id
                         FROM dm_nls_dd_info_s
                        WHERE parent_id IN (SELECT r_object_id
                                              FROM dm_aggr_domain_s
                                             WHERE type_name ...));

DELETE dm_nls_dd_info_s
 WHERE parent_id IN (SELECT r_object_id
                       FROM dm_aggr_domain_s
                      WHERE type_name ...);

UPDATE dm_domain_r
   SET nls_dd_info = NULL, nls_key = NULL
 WHERE     r_object_id IN (SELECT r_object_id
                             FROM dm_aggr_domain_s
                            WHERE type_name ...)
       AND nls_dd_info IS NOT NULL;

UPDATE dm_domain_r
   SET nls_dd_info = NULL, nls_key = NULL
 WHERE     r_object_id IN (SELECT r_object_id
                             FROM dm_domain_s
                            WHERE parent_id IN (SELECT r_object_id
                                                  FROM dm_aggr_domain_s
                                                 WHERE type_name ...))
       AND nls_dd_info IS NOT NULL;

-- generating API commands to remove dd objects
SELECT 'apply,c,,REMOVE_DD_OBJECT,ID,S,' || r_object_id
  FROM dmi_dd_type_info_sp
 WHERE type_name ...;

SELECT 'apply,c,,REMOVE_DD_OBJECT,ID,S,' || r_object_id
  FROM dmi_dd_attr_info_sp
 WHERE type_name ...;
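
The last two queries only generate IAPI commands – they still have to be executed in an iapi session, roughly like this (a sketch; the object id below is just a placeholder):

API> apply,c,,REMOVE_DD_OBJECT,ID,S,09024be980001234
...
q0
API> next,c,q0
...
OK
API> close,c,q0
...
OK

After that it may be worth republishing the data dictionary (e.g. by running the dm_DataDictionaryPublisher job) so that cached DD information gets refreshed.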

CTS challenge

Previously I wrote that CTS has some serious security issues, but I gained that knowledge while trying to install xCP2 about a year ago, so I didn't pay much attention to CTS functionality. Unfortunately, last month I was involved in a project which was supposed to use real-time requests to CTS, and I ran into the following challenges:

  1. The reference Composer project is completely unusable (it looks like I'm the first person who tried to use it). The problem is the following: if you try to load the reference project into Composer, Composer does not recognise it as a reference project and fails to build it (the transformation project contains types with names starting with the dmc_ prefix). This can be fixed by adding a "com.emc.ide.project.dmCoreProjectNatureId" nature to the .project file; unfortunately, that does not help either, because the dar deployer subsequently fails to resolve dependencies
  2. The SBO API does not work in the JMS context due to jar hell; a DQL sketch for fixing this follows the listings. The default installation of the dmc_java_library and dmc_jar objects is:
    API> ?,c,select r_object_id, sandbox_library, object_name from dmc_java_library 
      where FOLDER('/System/Modules/SBO/com.documentum.services.cts.df.transform.ICTSTransformService',DESCEND)
    r_object_id       sandbox_library  object_name                                                                                                                                                                  
    ----------------  ---------------  -----------
    0b024be9800138ab                0  RealTime
    0b024be9800138ae                0  log4j
    (2 rows affected)
    
    
    API>  ?,c,select r_object_id, jar_type, object_name from dmc_jar 
      where FOLDER('/System/Modules/SBO/com.documentum.services.cts.df.transform.ICTSTransformService',DESCEND)
    r_object_id       jar_type      object_name                                                                                                                                                                     
    ----------------  ------------  ---------------------------
    09024be9800138ba             1  ctsTransform.jar
    09024be9800138df             2  ctsTransformImpl.jar
    09024be9800138b2             2  commons-codec-1.3.jar
    09024be9800138b3             2  commons-fileupload-1.0.jar
    09024be9800138b4             2  commons-httpclient-3.0.jar
    09024be9800138b5             2  commons-io-1.2.jar
    09024be9800138b6             2  commons-jxpath-1.2.jar
    09024be9800138b7             2  commons-lang-2.4.jar
    09024be9800138b8             2  commons-logging.jar
    09024be9800138e3             2  loadbalancer.jar
    09024be9800138e5             3  realtime.jar
    09024be9800138e6             2  commons-cli-1.0.jar
    09024be9800138e4             2  log4j.jar
    (13 rows affected)
    

    the correct one is:

    API> ?,c,select r_object_id, sandbox_library, object_name from dmc_java_library 
      where FOLDER('/System/Modules/SBO/com.documentum.services.cts.df.transform.ICTSTransformService',DESCEND)
    r_object_id       sandbox_library  object_name                                                                                                                                                                  
    ----------------  ---------------  -----------
    0b024be98001baa4                1  RealTime
    0b024be98001baa7                0  log4j
    (2 rows affected)
    
    
    API> ?,c,select r_object_id, jar_type, object_name from dmc_jar 
       where FOLDER('/System/Modules/SBO/com.documentum.services.cts.df.transform.ICTSTransformService',DESCEND)
    r_object_id       jar_type      object_name                                                                                                                                                                     
    ----------------  ------------  --------------------------
    09024be98001bab0             1  ctsTransform.jar
    09024be98001bac3             2  ctsTransformImpl.jar
    09024be98001b9c2             2  commons-codec-1.3.jar
    09024be98001b9c3             2  commons-fileupload-1.0.jar
    09024be98001baaa             2  commons-httpclient-3.0.jar
    09024be98001baab             2  commons-io-1.2.jar
    09024be98001baac             2  commons-jxpath-1.2.jar
    09024be98001baad             2  commons-lang-2.4.jar
    09024be98001bac7             2  loadbalancer.jar
    09024be98001bac9             2  realtime.jar
    09024be98001bad4             2  commons-cli-1.0.jar
    09024be98001bac8             2  log4j.jar
    (12 rows affected)
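
    Judging by the difference between the two listings, the repository can be aligned with the correct configuration by DQL along these lines – an untested sketch which only flips the two flags (removing the redundant commons-logging.jar object is left out):

    -- make the RealTime library a sandboxed one
    UPDATE dmc_java_library OBJECTS
       SET sandbox_library = 1
     WHERE object_name = 'RealTime'
       AND FOLDER('/System/Modules/SBO/com.documentum.services.cts.df.transform.ICTSTransformService', DESCEND)

    -- change the jar_type of realtime.jar from 3 to 2
    UPDATE dmc_jar OBJECTS
       SET jar_type = 2
     WHERE object_name = 'realtime.jar'
       AND FOLDER('/System/Modules/SBO/com.documentum.services.cts.df.transform.ICTSTransformService', DESCEND)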

Encryption madness

My colleagues ran into the ridiculous security settings implemented by EMC, so I decided to write this blogpost. Starting from Documentum 7.2, EMC decided to switch from resolving security issues to treating symptoms (first attempt, second attempt). The security issue is: if an attacker is able to hijack the aek.key file from the Content Server filesystem, he is able to get superuser access to the repository (the basic scenario is: we can decrypt i_ticket_crypto_key from the docbase config using aek.key, and after that we can issue any login ticket). To "mitigate" such an attack EMC proposes two options:

  • protect aek.key with a passphrase
  • store the keys in the lockbox

As we already know, both options have nothing in common with security, but, as it turned out, they cause a lot of pain in the behind. The problem is: if you protect aek.key with a passphrase or use the lockbox, DFC is no longer able to use aek.key to decrypt passwords (the basic example: the file with the password for the LDAP server is stored in the config directory on the Content Server, and the LDAP synchronisation job needs to decrypt this password), so another approach for decrypting passwords on the DFC side had to be invented, and EMC did something weird. Starting from DFC 7.2, the following behaviour takes place:

  • the installer adds the dfc.crypto.repository parameter to dfc.properties (it is worth mentioning that the installer does this in the wrong way: installing a second repository overrides the parameter) – see the dfc.properties sketch after this list
  • when encrypting a text/password, if DFC sees the dfc.crypto.repository parameter in dfc.properties, it tries to establish a trusted session to that repository to call the ENCRYPT_TEXT/ENCRYPT_PASSWORD RPC command and appends the repository name to the end of the encrypted text/password:
    Connected to Documentum Server running Release 7.2.0060.0222  Linux64.Oracle
    Session id is s0
    API> encryptpass,c,mycoolpassword
    ...
    DM_ENCR_PASS_V2=AAAAEKfc...RYjoGt::DCTM_DEV
    API> encrypttext,c,mycoolpassword
    ...
    DM_ENCR_TEXT_V2=AAAAELQ...O+x5g::DCTM_DEV
    
  • when decrypting text, if DFC sees the dfc.crypto.repository parameter in dfc.properties, it tries to establish a trusted session to that repository or, in case of failure, to the repository mentioned at the end of the encrypted text
  • when decrypting a password, DFC uses the old technique based on reading the aek.key file
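
For reference, the parameter in question is just a single line in dfc.properties on the Content Server side (the repository name below matches the transcript above):

dfc.crypto.repository=DCTM_DEV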

I believe the behaviour described above is extremely logical; unfortunately, I'm unable to understand this logic. For example, let's imagine that I have installed two repositories; in this case dfc.crypto.repository points to the second one. Now I'm going to set up LDAP synchronisation: in order to encrypt the LDAP password, Documentum Administrator calls the replicate_setup_methods docbase method, and the password is encrypted using the second repository. Now, what will happen if the second repository goes down? Obviously, LDAP synchronisation will stop working for the first repository – DFC is unable to connect to the second repository to decrypt the password (actually, all DFC encryption/decryption operations on the CS side will fail, because the repository configured in dfc.properties is down). Also, I have noticed that some users (and my colleagues too) have the imprudence to copy the dfc.properties file from the Content Server to the application server, after which some things stop working because the copied dfc.properties file contains the dfc.crypto.repository parameter, but the application server has no trusted access to the repository; below are some examples from ECN: