Tomcat 8 vs webtop

Another Friday challenge was the dumb slowness of webtop under Tomcat 8 – 80 seconds to get the login page, whereas under WebLogic the same page comes back without any such delay.

What is the problem? For some weird reason, when webtop runs under Tomcat 8 the size of the returned content is not in sync with the HTTP headers:

i.e. the server reports that the content size is 20K but the real (compressed) size is about 4K, so the browser gets confused:

bash-3.2$ curl -D /dev/stderr 'http://localhost:8080/webtop/wdk/include/locate.js' \
> -H 'Pragma: no-cache' -H 'Accept-Encoding: gzip, deflate, sdch'  \
> -H 'Connection: keep-alive' -H 'Cache-Control: no-cache' --compressed > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0

HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Cache-Control: max-age=86400
Accept-Ranges: bytes
ETag: W/"20735-1417517700000"
Last-Modified: Tue, 02 Dec 2014 10:55:00 GMT
Content-Encoding: gzip
Content-Type: application/javascript
Content-Length: 20735
Date: Sat, 26 Sep 2015 10:42:51 GMT

23 20735   23  4948    0     0    246      0  0:01:24  0:00:20  0:01:04     0
curl: (18) transfer closed with 15787 bytes remaining to read
bash-3.2$

Who is to blame now? I believe EMC thinks ASF broke something in the new Tomcat, but the real reason is that EMC developers do not follow the documentation – check the example provided by ASF (webapps/examples/WEB-INF/classes/compressionFilters/CompressionResponseStream.java):

} else {
    response.addHeader("Content-Encoding", "gzip");
    response.setContentLength(-1);  // don't use any preset content-length as it will be wrong after gzipping
    response.setBufferSize(compressionBuffer);
    gzipstream = new GZIPOutputStream(output);
}
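
Judging by the ASF example, the key missing call is response.setContentLength(-1). Below is a minimal sketch of what a fixed startDeflating boils down to – my own illustration, with an assumed class name and signature, not the decompiled webtop code:

import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

import javax.servlet.http.HttpServletResponse;

public class DeflateFix {

    // called once the response size crosses the compression threshold
    public static GZIPOutputStream startDeflating(HttpServletResponse response,
            OutputStream rawOutput) throws IOException {
        response.addHeader("Content-Encoding", "gzip");
        // the missing call: drop any preset Content-Length, it reflects the
        // uncompressed size and makes Tomcat 8 cut the response short
        response.setContentLength(-1);
        return new GZIPOutputStream(rawOutput);
    }
}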

Indeed, when I put the missing calls into com.documentum.web.servlet.ThresholdingDeflateOutputStream#startDeflating, everything started working smoothly:

bash-3.2$ curl -D /dev/stderr 'http://localhost:8080/webtop/wdk/include/locate.js' \
> -H 'Pragma: no-cache' -H 'Accept-Encoding: gzip, deflate, sdch'  \
> -H 'Connection: keep-alive' -H 'Cache-Control: no-cache' --compressed > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0

HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Cache-Control: max-age=86400
Accept-Ranges: bytes
ETag: W/"20735-1417517700000"
Last-Modified: Tue, 02 Dec 2014 10:55:00 GMT
Content-Encoding: gzip
Content-Type: application/javascript
Transfer-Encoding: chunked
Date: Sat, 26 Sep 2015 11:09:14 GMT

100  4948    0  4948    0     0   362k      0 --:--:-- --:--:-- --:--:--  371k

Security through guessing

About three weeks ago I decided to stop posting about security vulnerabilities in Documentum, but on Friday I faced such amusing behaviour in webtop that I could not leave it without a blog post. Nine months ago I wrote a post about how EMC fails to read documentation. Actually, I never considered the ability to read webtop’s configuration files through HTTP requests a vulnerability, because I always follow my own best practices and never put environment-specific configuration into the web application; unfortunately, this point is not obvious to some developers.

What caused me to write this post? On Friday I was trying to merge some changes implemented in webtop 6.8 into a customised webtop 6.7 and noticed new weird changes in web.xml:

   <context-param>
      <param-name>StaticPageIncludes</param-name>
      <param-value><![CDATA[(\.bmp|\.css|\.htm|\.html|\.gif|\.jar|\.jpeg|\.jpg|\.js|\.properties|\.xml|\.xml;|\.png)$]]></param-value>
   </context-param>

Note that EMC added “\.xml;” to protect config files from being read.

The problem is that EMC still fails to read the documentation.

Time in Documentum. MSSQL challenge

Interestingly, while writing the post about time in Documentum I noted that storing dates in the UTC timezone would cause difficulties for reporting software, but I couldn’t imagine that things were really that bad. The problem is that among all supported databases only Oracle works correctly with timezones, i.e. I am able to compensate for any timezone offset directly in SQL selects.
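For example, something along these lines – a minimal sketch assuming dates are stored in UTC; the plain JDBC wrapper, the ‘Europe/Moscow’ target region and the class name are my own illustration:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class UtcToLocalReport {

    // FROM_TZ marks the stored value as UTC, AT TIME ZONE shifts it to a named
    // region, so daylight saving is handled by the database itself
    private static final String SQL =
        "SELECT TO_CHAR(FROM_TZ(CAST(s.r_creation_date AS TIMESTAMP), 'UTC')"
      + "               AT TIME ZONE 'Europe/Moscow', 'yyyy-mm-dd hh24:mi:ss')"
      + "  FROM dm_sysobject_s s";

    public static void report(Connection connection) throws Exception {
        try (PreparedStatement statement = connection.prepareStatement(SQL);
             ResultSet resultSet = statement.executeQuery()) {
            while (resultSet.next()) {
                System.out.println(resultSet.getString(1));
            }
        }
    }
}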

All other databases do not support timezone regions out of the box and work only with fixed “(-|+)HH:MM” offsets; for example, in the case of MSSQL, in order to work correctly with UTC time (i.e. properly handle daylight saving) you need to install T-SQL Toolbox, which is slow by design. I believe this is why Documentum’s DATETOSTRING_LOCAL function behaves so weirdly:

i.e. Documentum translates DATETOSTRING(r_creation_date, ‘yyyy-mm-dd’) directly into SQL’s TO_CHAR(dm_sysobject.r_creation_date, ‘yyyy-mm-dd’), but it is unable to do the same for DATETOSTRING_LOCAL(r_creation_date, ‘yyyy-mm-dd’); instead it converts the date to a string preserving the time information (i.e. TO_CHAR(dm_sysobject.r_creation_date, ‘mm/dd/yyyy hh24:mi:ss’)) and then converts the resulting string into the requested format and the server’s timezone. Due to this implementation you are unable (actually you can, but the results are unreliable) to use the DATETOSTRING_LOCAL function in WHERE clauses and subqueries.

Now, I do think that for most installations storing dates in UTC is an evil, and enabling this feature by default was a big mistake on EMC’s part.

Ingestion rates

The main problem with migration projects is that you know the amount of data to be migrated and you have time constraints, but you have no references from similar projects, or the references you have are obscure. Let me provide an example.

Relying on my previous experience, I may say that 20 documents per second is a good rate for a single thread (actually, this rate is a bit overestimated because I didn’t upload content). Is it possible to decrease this rate? Definitely yes. Is it possible to improve this rate? Obviously yes: we have three systems involved in ingestion:

  • feeder – generates commands for Content Server
  • Content Server – parses commands from feeder, stores content on disk, generates commands for database
  • database – parses commands from Content Server, stores data on disk

and when one system is active, the other two are idle, so, to a first approximation, we may get a threefold improvement in ingestion rate if we spawn three threads simultaneously (a minimal sketch of such a multithreaded feeder is given at the end of this post). But writing multithreaded programs sounds complicated, so let’s try to find references on the internet…

The first:

DEMA – all about speed

The core of DEMA is about speed of the migration – examples included:

  • Government Client – 28 million documents – 10 TB – 10 days
  • Cable TV – 75 million documents – 5 TB – 15 days
  • Life Sciences – 1.5 million documents – 24 hours

The second:

Architected for the “fastest migration possible”, DEMA was constructed to avoid the Documentum API and go directly to the underlying database.

Thread communication – to be truly multi-threaded, the migration tool has to make sure that activities are divided up correctly by each thread to avoid conflicts.

It seems that the best reference is 5 million documents per day (57 documents per second), and this result was achieved by avoiding the Documentum API, so our 20 documents per second sounds reasonable because we are using the Documentum API, doesn’t it? Bullshit! The project durations in the first reference tell us nothing: I may spend half a day coding the migration tool and then load 10 million documents during the remaining half of the day – what ingestion rate did I achieve, 230 or 115 documents per second? Moreover, EMC provides completely different and more promising rates:

migrates more than 1.2 million objects in one hour
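
For reference, the arithmetic behind these figures: 5 million documents per day is 5,000,000 / 86,400 ≈ 58 documents per second; 1.2 million objects per hour is 1,200,000 / 3,600 ≈ 333 objects per second; and loading 10 million documents in half a day is 10,000,000 / 43,200 ≈ 231 documents per second, or ≈ 116 per second if the whole day is counted.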

Now, in order to provide a correct reference for ingestion rates, I would like to share some real-world results (4 CPU cores on the Content Server host):

  • single thread performance (no optimisation tricks) – 17 documents per second
  • single thread performance with optimised content transfer – 22 documents per second
  • 4 threads with optimised content transfer (50% CPU utilisation on CS) – 67 documents per second
  • 20 threads with optimised content transfer (80% CPU utilisation on CS) – 130 documents per second
  • 50 threads with optimised content transfer (100% CPU utilisation on CS) – 180 documents per second
  • 50 threads with optimised content transfer and disabled auditing (dm_save and dm_link events) (100% CPU utilisation on CS) – 250 documents per second

“optimised content transfer” means the following: in order to avoid content transfer between the feeder and the Content Server, I made the content files available on the CS host, and instead of doing something like:

object.setContentType("format");
object.setFile("path to file");

I do the following:

object.setContentType("format");
ITypedData extendedData = ((ISysObject) object).getExtendedData();
extendedData.setString("HANDLE_CONTENT", "yes");
extendedData.setBoolean("FIRST_PAGE", true);
extendedData.setString("FILE_PATH", "path on content server");

In order to avoid redundant RPCs related to the calculation of default values, I do the following:

// fake TBO
/**
 * @author Andrey B. Panfilov <andrew@panfilov.tel>
 */
public class FakeSysObject extends DfSysObject {

    public FakeSysObject() {
        super();
    }

    // by default DFC calls
    // com.documentum.fc.client.DfPersistentObject.setDefaultValues
    // which has performance impact, so we override this behaviour
    @Override
    protected void init() throws DfException {
        ReflectionUtils.invokeMethod(this, "setValues");
    }

}

// poisoning DFC by fake TBO
registerTBO("docbase", "type_name", FakeSysObject.class);

protected void registerTBO(String docbaseName, String type, Class<?> clazz)
    throws DfException {
    // poisoning DocbaseRegistry
    ClassCacheManager.getInstance().setIsDownloading(true);
    IntrinsicModuleRegistry moduleRegistry = IntrinsicModuleRegistry
            .getInstance();
    ReflectionUtils.invokeMethod(moduleRegistry, "registerTBO", type,
            clazz.getName());
}

// populating default values in feeder
object.setString(....);
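
And here is the multithreaded feeder skeleton mentioned earlier – a minimal sketch assuming a shared session manager and a plain fixed thread pool; the repository name, credentials, target folder, content format and the list of input files are placeholders:

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import com.documentum.com.DfClientX;
import com.documentum.fc.client.IDfClient;
import com.documentum.fc.client.IDfSession;
import com.documentum.fc.client.IDfSessionManager;
import com.documentum.fc.client.IDfSysObject;
import com.documentum.fc.common.DfLoginInfo;

public class MultiThreadedFeeder {

    public static void main(String[] args) throws Exception {
        IDfClient client = new DfClientX().getLocalClient();
        IDfSessionManager sessionManager = client.newSessionManager();
        sessionManager.setIdentity("docbase", new DfLoginInfo("user", "password"));

        // placeholder input – in a real feeder this comes from a manifest
        List<String> files = Arrays.asList("/data/doc1.pdf", "/data/doc2.pdf");

        ExecutorService pool = Executors.newFixedThreadPool(20);
        for (String file : files) {
            pool.submit(() -> ingest(sessionManager, file));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }

    private static void ingest(IDfSessionManager sessionManager, String file) {
        IDfSession session = null;
        try {
            session = sessionManager.getSession("docbase"); // one session per task
            IDfSysObject object = (IDfSysObject) session.newObject("dm_document");
            object.setObjectName(file);
            object.setContentType("pdf");
            object.setFile(file); // or the HANDLE_CONTENT/FILE_PATH trick shown above
            object.link("/Temp");
            object.save();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (session != null) {
                sessionManager.release(session);
            }
        }
    }
}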

ACL computations

Yesterday a Skype mate of mine complained to me about Scott Roth’s latest post:

Has something changed in the Documentum platform, or is he completely wrong? But he has many years of experience, I cannot believe that. I supposed that version 7.2 changed the most permissive logic, the “standard” most permissive. It seems like a post written by a Documentum newbie. I need to unsubscribe from his blog; there are no useful blogs about Documentum.

The last point, about useful blogs about Documentum, was insulting, so I think it is necessary to clarify how ACL computations work in Documentum.

Three years ago I got engaged in the investigation of the following performance problem:

The customer’s application associates every user with a group and grants privileges to the associated group rather than to a specific user – the idea is quite sound, because when you need to grant a specific user “the same privileges as a certain user has”, you just need to modify group memberships. Unfortunately, Documentum has a couple of performance issues related to such a design:

  • if an ACL contains a lot of group entries, computing the effective privileges of a specific user (i.e. “get,c,l,_permit,user_name”) becomes extremely slow (I have seen cases where an IDfSysObject.getPermitEx(“username”) call took minutes) – for every ACL entry that contains a group, Content Server tries to figure out whether the user belongs to that group or not, i.e. Content Server performs one SQL select per group entry
  • if an ACL contains a lot of entries, saving this ACL becomes extremely slow – when saving an ACL, Content Server tries to figure out whether each ACL entry relates to a user or a group and validates the existence of the corresponding user or group; to do that, Content Server performs selects against the database: one select if the ACL entry relates to a group and three selects if it relates to a user – it’s more useful to put group entries in the ACL, isn’t it?

The first issue seems to have been resolved in recent releases of Content Server – the IDfSysObject.getPermitEx(“username”) call is now backed by the following single SQL query:

SELECT acl.r_accessor_name,
       acl.r_accessor_permit,
       acl.r_permit_type,
       acl.r_is_group
  FROM dm_acl_r acl, dm_group_r gr1, dm_group_r gr2
 WHERE     acl.r_object_id = :acl_id
       AND acl.r_is_group = 1
       AND gr1.users_names = :user_name
       AND gr1.r_object_id = gr2.r_object_id
       AND gr2.i_nondyn_supergroups_names IS NOT NULL
       AND gr2.i_nondyn_supergroups_names = acl.r_accessor_name
UNION
SELECT acl.r_accessor_name,
       acl.r_accessor_permit,
       acl.r_permit_type,
       acl.r_is_group
  FROM dm_acl_r acl
 WHERE     acl.r_object_id = :acl_id
       AND (   acl.r_accessor_name = :user_name
            OR acl.r_accessor_name = 'dm_world')

The second issue is still not resolved and, moreover, there is no ETA for it. This is very embarrassing because there is no need to perform a lot of SQL selects at all – let me clarify. Content Server performs the following SQL selects:

-- determine whether ACL entry relates
-- to user or group
SELECT s.is_dynamic, s.group_class
  FROM dm_group_s s
 WHERE s.group_name = :p0;

-- query to get r_object_id of dm_user
SELECT r_object_id
  FROM dm_user_s
 WHERE user_name = :p0;

-- fetching user
  SELECT *
    FROM DM_USER_RV dm_dbalias_B, DM_USER_SV dm_dbalias_C
   WHERE (    dm_dbalias_C.R_OBJECT_ID = :dmb_handle
          AND dm_dbalias_C.R_OBJECT_ID = dm_dbalias_B.R_OBJECT_ID)
ORDER BY dm_dbalias_B.R_OBJECT_ID, dm_dbalias_B.I_POSITION;

The last two queries are intended to check whether the user exists or not; actually, to achieve the same it is enough to replace the first query with:

SELECT u.r_is_group
  FROM dm_user_s u
 WHERE u.user_name = :p0;

The first query is intended to populate the following attributes in dm_acl:

  • r_is_group – not required at all due to ACL computation algorithm, see explanation below
  • i_has_required_groups – required for MACL entries only
  • i_has_required_group_set – required for MACL entries only

Why is r_is_group in dm_acl not required at all? The explanation is very obvious: when Content Server determines the permissions of the current user, it retrieves from the database information about all groups the user belongs to, so it already knows the user’s groups and it is enough to check whether the r_accessor_name of an ACL entry exists in the user’s group set or not; when Content Server determines the permissions of a non-current user (see the SQL select above), it is enough to use the following SQL select:

SELECT acl.r_accessor_name,
       acl.r_accessor_permit,
       acl.r_permit_type,
       acl.r_is_group
  FROM dm_acl_r acl, dm_group_r gr1, dm_group_r gr2
 WHERE     acl.r_object_id = :acl_id
       -- AND acl.r_is_group = 1
       AND gr1.users_names = :user_name
       AND gr1.r_object_id = gr2.r_object_id
       AND gr2.i_nondyn_supergroups_names IS NOT NULL
       AND gr2.i_nondyn_supergroups_names = acl.r_accessor_name
UNION
SELECT acl.r_accessor_name,
       acl.r_accessor_permit,
       acl.r_permit_type,
       acl.r_is_group
  FROM dm_acl_r acl
 WHERE     acl.r_object_id = :acl_id
       AND (   acl.r_accessor_name = :user_name
            OR acl.r_accessor_name = 'dm_world')

Now about ACL computation logic.

The general logic is:

Sysobject’s owner:

+ implicit default permit (DM_PERMIT_READ)
+ permit granted to dm_owner
+ permits granted to the specific user
+ permits granted to any groups the user belongs to
- restrictions on specific user
- restrictions on any groups the user belongs to
- required group restrictions
- required group set restriction
_________________________________________________
RESULT: MAX(Calculated Permit, DM_PERMIT_BROWSE)

superuser:

+ implicit default permit (DM_PERMIT_READ)
+ permit granted to dm_owner
+ permits granted to the specific user
+ permits granted to any groups the user belongs to
_________________________________________________
RESULT: MAX(Calculated Permit, DM_PERMIT_READ)

regular user:

+ permits granted to the specific user
+ permits granted to any groups the user belongs to
- restrictions on specific user
- restrictions on any groups the user belongs to
- required group restrictions
- required group set restriction
_________________________________________________
RESULT: Calculated Permit

Non-common logic:

  • for regular users dm_escalated_* groups override restrictions on user/group but not required group/group set restrictions
  • dm_read_all, dm_browse_all group membership overrides all MACLs
  • the minimum owner’s permit is actually MAX(dm_docbase_config.minimum_owner_permit, DM_BROWSE_ALL)
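
To make the regular-user branch concrete, here is a minimal sketch of the computation described above – my own illustration, not Content Server code; permit values are the usual integer levels 1=NONE … 7=DELETE, and, as a simplification, a restriction entry is treated as capping the computed permit at the given level:

import java.util.Collections;
import java.util.Map;
import java.util.Set;

public class RegularUserPermit {

    public static int computePermit(String userName,
                                    Set<String> userGroups,            // all groups the user belongs to
                                    Map<String, Integer> grants,       // accessor -> granted permit
                                    Map<String, Integer> restrictions, // accessor -> restricting permit
                                    Set<String> requiredGroups,
                                    Set<Set<String>> requiredGroupSets) {
        int permit = 1; // DM_PERMIT_NONE

        // + permits granted to the specific user and to any groups the user belongs to
        for (Map.Entry<String, Integer> grant : grants.entrySet()) {
            String accessor = grant.getKey();
            if (accessor.equals(userName) || accessor.equals("dm_world")
                    || userGroups.contains(accessor)) {
                permit = Math.max(permit, grant.getValue());
            }
        }

        // - restrictions on the specific user and on any groups the user belongs to
        for (Map.Entry<String, Integer> restriction : restrictions.entrySet()) {
            String accessor = restriction.getKey();
            if (accessor.equals(userName) || userGroups.contains(accessor)) {
                permit = Math.min(permit, restriction.getValue());
            }
        }

        // - required group restrictions: the user must belong to every required group
        for (String required : requiredGroups) {
            if (!userGroups.contains(required)) {
                return 1; // DM_PERMIT_NONE
            }
        }

        // - required group set restriction: the user must belong to at least one group of the set
        for (Set<String> groupSet : requiredGroupSets) {
            if (Collections.disjoint(groupSet, userGroups)) {
                return 1; // DM_PERMIT_NONE
            }
        }

        return permit; // RESULT: Calculated Permit
    }
}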

What is wrong with Scott Roth’s latest post? The implication “I’m getting weird results, so I think the product has certain behaviour” is wrong; the correct one is “I’m getting weird results, here is my proof of concept”.

CVE-2015-4544 fixed. ORLY?

Today EMC announced yet another CVE: CVE-2015-4544 – Unprivileged Content Server users may potentially escalate their privileges to become a superuser by creating and performing malicious operations on dm_job objects. This is due to improper authorization checks being performed on such objects and some of their attributes. The previous fix for CVE-2014-4626 was incomplete.

And, by the “good” tradition, the fixes actually contain no fixes 🙂 The interesting thing is that EMC tried to fix this vulnerability for 10 months (“Customers on EMC Documentum Content Server prior to 7.0 with extended support agreement are requested to raise hotfix requests through EMC Customer Support.” sounds weird, doesn’t it?). Below is the original conversation about CVE-2014-4626:

From: andrew@panfilov.tel
Sent: Friday, November 07, 2014 3:00 AM
To: CERT Coordination Center
Cc: CERT Coordination Center
Subject: Re: EMC Documentum vulnerability reports VU#315340

Could you please forward my response directly to EMC without any
modifications?

=====================8<====================

Fire the security expert – he is an idiot.

=====================>8====================

Now a clarification for CERT (not to be disclosed to EMC).

I’m not going to describe each security issue individually because
“reproducible” means reproducible, but as an example I want to provide
a brief explanation for VRF#HUFU6FNP [VU#315340].

Abstract: “Any user is able to elevate privileges, hijack Content Server
filesystem, execute any commands by creating malicious dm_job objects”

The problem is that a non-privileged user is able to create dm_job
objects and execute the corresponding docbase methods (some examples of
“malicious” methods are given in VRF#HUFU6FNP, also see VRF#HUFV0UZN);
the word “create” here means any sequence of commands that results in
the existence of a dm_job object. The PoC in VRF#HUFU6FNP describes an
attack on the scheduler – the scheduler does not schedule jobs unless
they are owned by a superuser, so the command sequence in that case was
“create dm_job and update dm_job”. EMC thinks they have fixed the
vulnerability, but they just fixed the sequence given in the PoC;
another sequence is “create dm_sysobject, update dm_sysobject & change
dm_sysobject” – see VRF#HUGC34JH. It is an already known attack, so I
suspect a backdoor here. Also, I could provide a third PoC related to
this report, but I do not think that would be useful for EMC.

__
Regards,
Andrey Panfilov.