ACL performance

This blogpost is a follow-up to the WTF??? blogpost, but it describes a similar problem from the Content Server perspective.

Two months ago a large company asked me to help them with a performance issue. At that time I wrote two blogposts about how to diagnose performance issues in a Documentum environment (Diagnostics challenge. Part I and Diagnostics challenge. Part II), but I didn’t shed light on the root cause. The interesting thing here is that I was already familiar with this kind of performance problem in Documentum: we had filed a bug with EMC about four years ago, and it is still not resolved. Initially EMC promised to fix this performance problem in an upcoming 7.x release; later they said something like “it requires a lot of effort, we are not going to fix it”.

Well, I’m not sure about all existing ECM solutions, but I can definitely say this about Documentum: Documentum (as a product) is not an ECM solution. It lacks the concept of business roles (i.e. business users getting capabilities according to the current context), and hence it is not suitable for enterprises. To prove my point, let’s examine the most basic capability: a certain user is able to read a certain document. The question is: under what circumstances does a user get read access to a document? Actually, there are plenty of reasons; some of them are:

  1. the user is somehow involved in a business process related to this document, e.g. our user is an author, a reviewer, an addressee, or somebody else
  2. the document was somehow classified and the user gets access due to this classification; for example, members of the legal department have access to all legal documents
  3. the user is a legal successor of the user from #1
  4. the user is a secretary/assistant of the user from #1
  5. the user is a supervisor of the user from #1
  6. the user is a big boss and wants to see all documents in the enterprise
  7. the user is not a big boss, but wants to see all documents in a branch office
  8. the document somehow relates to another document that is accessible by our user

I do not claim that the list above is complete, but I do believe that it is common to all enterprises. The problem is that Documentum (as a product) does not cover these cases:
- even combining #1 and #2 is already a challenge (check the What is wrong in Documentum. Part III blogpost);
- #8 requires the functionality of dynamic groups (check Dynamic groups. Advances. Part IV), which is not properly documented;
- for #3, #4 and #5 the best approach seems to be not to grant access to users directly, but to create an associated group for each user instead.

All these tricks require additional coding. Yet EMC thinks that companies do not buy their “product” because of the cost of database licenses, rather than because their “product” doesn’t fit customers’ needs. LOL 🙂

So, under such circumstances every customer/developer tends to invent their own square wheel, and in most cases my job is to understand what was done and somehow improve it. The problem was the following: every time a document was saved to the repository, the customer’s application recalculated access permissions and updated the related ACL as well. At first glance such an implementation looks reasonable, but Content Server behaves extremely weirdly in this case: saving an ACL that hasn’t actually changed takes an extremely long time:

-- list of ACLs which contain more than 10000 accessors
API> ?,c,select r_object_id from dm_acl group by r_object_id 
     having count(r_accessor_name)>10000 enable(row_based)
r_object_id     
----------------
45024be98002b10c
45024be98002a74d
45024be98002a645
45024be98002a74c
(4 rows affected)

API> fetch,c,45024be98002a74d
...
OK

API> save,c,45024be98002a74d
... <--- 30 seconds 
OK

API> save,c,45024be98002a74d
... <--- double check: 30 seconds again
OK

--
-- now the Content Server magic: we are adding
-- a new record to the acl
--
API> grant,c,45024be98002a74d,dm_read_all,AccessPermit,,6
...
OK

API> save,c,45024be98002a74d
... <--- less than a second
OK

Let’s investigate the difference between the two cases (saving the ACL as is and adding a new record).

First case:

[DM_SESSION_I_SESSION_START]info:  "Session 01024be98000c241 started for user dmadmin."


Connected to Documentum Server running Release 7.2.0030.0195  Linux64.Oracle
Session id is s0
API> apply,c,,SQL_TRACE,SESSION_ID,S,01024be98000c241,LEVEL,I,10
...
q0
API> next,c,q0
...
OK
API> dump,c,q0
...
USER ATTRIBUTES

  result                          : T

SYSTEM ATTRIBUTES


APPLICATION ATTRIBUTES


INTERNAL ATTRIBUTES


API> save,c,45024be98002a74d
...
OK
API> Bye
[dmadmin@docu72dev01 ord-dars]$ grep -i select \
> /u01/documentum/cs/dba/log/00024be9/dmadmin/01024be98000c241 | wc
  10800  173071 1811563

Second case:

[DM_SESSION_I_SESSION_START]info:  "Session 01024be98000c246 started for user dmadmin."


Connected to Documentum Server running Release 7.2.0030.0195  Linux64.Oracle
Session id is s0
API> apply,c,,SQL_TRACE,SESSION_ID,S,01024be98000c246,LEVEL,I,10
...
q0
API> next,c,q0
...
OK
API> dump,c,q0
...                                
USER ATTRIBUTES

  result                          : T

SYSTEM ATTRIBUTES


APPLICATION ATTRIBUTES


INTERNAL ATTRIBUTES


API> grant,c,45024be98002a74d,dm_browse_all,AccessPermit,,3
...
OK
API> save,c,45024be98002a74d
...
OK
API> Bye
[dmadmin@docu72dev01 ord-dars]$ grep -i select \
> /u01/documentum/cs/dba/log/00024be9/dmadmin/01024be98000c246 | wc
     28     719    9178

Wow, 10800 select statements in the first case and just 28 in the second! Looks like something is wrong, doesn’t it? Fortunately, four years ago I was dealing with another ACL performance issue and, in order to prove their (wrong) opinion, EMC shared the source code related to ACL processing. Below is a code snippet which demonstrates the wrong behaviour:

  // In order to validate the ACE's for this ACL, we will
  // find the first ACE that has been modified (r_accessor_name,
  // r_accessor_permit, r_accessor_xpermit, permit_type and
  // application_permit) and then validate all ACE's from that
  // entry to the end of the ACE list.
  int from = 0;
  if (changed[_pos._accessorName] > 0)
    from = changed[_pos._accessorName];
  if (changed[_pos._accessorPermit] > 0)
    from = min(from, changed[_pos._accessorPermit]);
  if (changed[_pos._accessorXPermit] > 0)
    from = min(from, changed[_pos._accessorXPermit]);
  if (changed[_pos._permitType] > 0)
    from = min(from, changed[_pos._permitType]);
  if (changed[_pos._applicationPermit] > 0)
    from = min(from, changed[_pos._applicationPermit]);

Do you see the mistake? Here it is:

  int from = 0;
  // from is always 0 if we haven't changed accessors
  if (changed[_pos._accessorName] > 0)
    from = changed[_pos._accessorName];

and the correct code is:

  int from = this->GetAttrValueCount(_pos._accessorName);
  if (changed[_pos._accessorName] > 0)
    from = changed[_pos._accessorName];

So, it takes just one line to fix the issue.
