Weird DQL grammar. Part III

Today I noticed another odd behaviour:

API> ?,c,select count(*) from dm_sysobject 
  where a_is_template=FALSE enable(GENERATE_SQL_ONLY)
generate_sql
-------------------------------------------------
select all count(*) from dm_sysobject_sp  dm_sysobject 
  where ((dm_sysobject.a_is_template=0)) 
  and (dm_sysobject.i_has_folder = 1 and dm_sysobject.i_is_deleted = 0) 
(1 row affected)

API> ?,c,select count(*) from dm_sysobject 
  where a_is_template=0 enable(GENERATE_SQL_ONLY)
generate_sql
-------------------------------------------------
select all count(*) from dm_sysobject_sp  dm_sysobject 
  where ((dm_sysobject.a_is_template=0)) 
  and (dm_sysobject.i_has_folder = 1 and dm_sysobject.i_is_deleted = 0) 
(1 row affected)

BUT:

API> ?,c,select count(*) from dm_dbo.dm_sysobject_s 
  where a_is_template=FALSE enable(GENERATE_SQL_ONLY)
generate_sql
-------------------------------------------------
select all count(*) from docbase.dm_sysobject_s 
  where ((a_is_template=0 OR a_is_template IS NULL)) 
(1 row affected)

API> ?,c,select count(*) from dm_dbo.dm_sysobject_s 
  where a_is_template=0 enable(GENERATE_SQL_ONLY)
generate_sql
-------------------------------------------------
select all count(*) from docbase.dm_sysobject_s 
  where ((a_is_template=0 OR a_is_template IS NULL)) 
(1 row affected)

i.e. Content Server thinks it is smart enough, and in the case of registered tables it generates wrong SQL (note the extra “OR a_is_template IS NULL” predicate above). There are two options to “override” such behaviour (a DFC sketch follows the list):

  • write odd DQL, like:
    API> ?,c,select count(*) from dm_dbo.dm_sysobject_s 
      where (a_is_template=FALSE and a_is_template IS NOT NULL) enable(GENERATE_SQL_ONLY)
    generate_sql                                                                                                                             
    -------------------------------------------------
    select all count(*) from bee_ssc_dev.dm_sysobject_s 
      where (((a_is_template=0 OR a_is_template IS NULL) and a_is_template is not null)) 
    (1 row affected)
    
  • write tricky DQL:
    API> ?,c,select count(*) from dm_dbo.dm_sysobject_s 
      where a_is_template IN (FALSE) enable(GENERATE_SQL_ONLY)
    generate_sql                                                                     
    -------------------------------------------------
    select all count(*) from bee_ssc_dev.dm_sysobject_s 
      where (a_is_template in (0)) 
    (1 row affected)
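
If you are issuing such queries from DFC rather than IDQL/IAPI, the same workaround applies. A minimal sketch (the cnt alias and the wrapper method are mine; an open IDfSession is assumed):

import com.documentum.fc.client.DfQuery;
import com.documentum.fc.client.IDfCollection;
import com.documentum.fc.client.IDfQuery;
import com.documentum.fc.client.IDfSession;
import com.documentum.fc.common.DfException;

public class RegisteredTableCount {
    // Count non-template rows via the registered table, using IN (FALSE)
    // to keep Content Server from adding the "OR ... IS NULL" predicate.
    public static int countNonTemplates(IDfSession session) throws DfException {
        IDfQuery query = new DfQuery();
        query.setDQL("select count(*) as cnt from dm_dbo.dm_sysobject_s"
                + " where a_is_template IN (FALSE)");
        IDfCollection coll = query.execute(session, IDfQuery.DF_READ_QUERY);
        try {
            return coll.next() ? coll.getInt("cnt") : 0;
        } finally {
            coll.close();
        }
    }
}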
    

Never happened before and happened again

As I described previously, in 2012 we were implementing a large Documentum project, and at that time we were observing a lot of performance and stability issues (actually, it was a bit weird to observe such issues in a mature product maintained by a talented team). One of the issues was the following: it was not possible to restart JMS during business hours (from my perspective restarting JMS during business hours is not a big deal if your application is properly designed – business users do not consume JMS directly, so bouncing it should not affect them):

  • about 5 of 10 restarts ended with deadlocks in Java
  • about 4 of 10 restarts ended with a crash of the java process with a SIGBUS error in libzip.so, something like:
    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  SIGBUS (0x7) at pc=0x00007f881a262982, pid=19993, tid=140222193350400
    #
    # JRE version: Java(TM) SE Runtime Environment (7.0_72-b14) (build 1.7.0_72-b14)
    # Java VM: Java HotSpot(TM) 64-Bit Server VM (24.72-b04 mixed mode linux-amd64 compressed oops)
    # Problematic frame:
    # C  [libzip.so+0x4982]  newEntry+0x62
    #
    
  • and only 1 of 10 restarts was successful

So, we filed a couple of SRs, and after a while EMC support started claiming that the first two issues had been addressed in recent Documentum patches. Frankly speaking, during the last 4 years I haven’t seen JMS crashes with the SIGBUS error in libzip.so, but now I’m not sure that the talented team had really fixed the issue, because two days ago our DevOps engineer claimed that Documentum 7.2 is still affected – he witnessed three JMS crashes with the SIGBUS error in libzip.so. I have just lost faith in the talented team 😦

Well, let’s discuss this odd error and try to fix it 🙂 If you google for java + sigbus + libzip.so you will find plenty of bugs in the OpenJDK JIRA, and all of them state something like:

  • the JVM maps the central directory of a zip file into memory using the mmap system call for performance reasons
  • it is expected that the application receives a SIGBUS error if the mapped file gets overwritten:
           SIGBUS Attempted access to a portion of the buffer that does not
                  correspond to the file (for example, beyond the end of the
                  file, including the case where another process has truncated
                  the file).
    
  • there is a -Dsun.zip.disableMemoryMapping=true system property which was designed to “prevent crashes”

So, the JDK seems to “work as designed”, nobody is going to fix this behaviour (no longer true for JDK 9, but for current JDK versions we must take it as given), and the real root cause of our problem is that “somebody” changes jar files on JMS while they are in use and thereby causes the crashes – and this “somebody” is an odd TBO implementation. Is it possible to fix it? Actually, I have no idea what the talented team was doing four years ago, but I found a robust solution within four hours (if my memory serves me right, they spent four months trying to fix the issue). The good starting point here is to find or write a good testcase, and I found one in JDK-8017777 SIGBUS in java.util.zip.ZipFile.getEntry:
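
The original testcase is a bit longer, but its essence is the following (a condensed sketch of the same idea, not the verbatim JDK code): one thread keeps rewriting a zip file in place while another keeps opening it and iterating over its entries:

import java.io.File;
import java.io.FileOutputStream;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class ZipCrashSketch {
    public static void main(String[] args) throws Exception {
        final File file = File.createTempFile("zipfile", ".zip");
        file.deleteOnExit();
        writeZip(file, 1001);
        // writer: keeps overwriting the zip file in place
        new Thread(() -> {
            while (true) {
                try {
                    writeZip(file, 1001);
                } catch (Exception e) {
                    e.printStackTrace();
                    return;
                }
            }
        }).start();
        // reader: keeps mapping the central directory and walking the entries
        new Thread(() -> {
            while (true) {
                try (ZipFile zf = new ZipFile(file)) {
                    for (Enumeration<? extends ZipEntry> e = zf.entries(); e.hasMoreElements(); )
                        e.nextElement().getName();
                } catch (Exception ignore) {
                    // a ZipException here just means we caught the file mid-write;
                    // the interesting outcome is the SIGBUS, which is not catchable
                }
            }
        }).start();
    }

    static void writeZip(File f, int entries) throws Exception {
        try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(f))) {
            for (int i = 0; i < entries; i++) {
                out.putNextEntry(new ZipEntry("entry-" + i));
                out.write(new byte[128]);
                out.closeEntry();
            }
        }
    }
}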

$ java ZipFileTest
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0xa) at pc=0x00000001054cfe92, pid=5775, tid=0x0000000000004d03
#
# JRE version: Java(TM) SE Runtime Environment (8.0_111-b14) (build 1.8.0_111-b14)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.111-b14 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# C  [libzip.dylib+0x2e92]  newEntry+0x154
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /Users/apanfilov/hs_err_pid5775.log
[thread 21251 also had an error]
[thread 22275 also had an error]
[thread 23299 also had an error]
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
[thread 24323 also had an error]
Abort trap: 6

Cool! Now let’s check the behaviour of the JVM with the -Dsun.zip.disableMemoryMapping=true system property:

$ java -Dsun.zip.disableMemoryMapping=true ZipFileTest
Exception in thread "Thread-2" java.util.zip.ZipError: jzentry == 0,
 jzfile = 140315557132880,
 total = 1001,
 name = /var/folders/9d/m0q0kvvx3fj54mr_m3qs3twh0000gn/T/zipfile.zip ,
 i = 1,
 message = null
	at java.util.zip.ZipFile$ZipEntryIterator.next(ZipFile.java:519)
	at java.util.zip.ZipFile$ZipEntryIterator.nextElement(ZipFile.java:499)
	at java.util.zip.ZipFile$ZipEntryIterator.nextElement(ZipFile.java:480)
	at ZipFileTest$ZipFileReader.run(ZipFileTest.java:65)
	at java.lang.Thread.run(Thread.java:745)
Exception in thread "Thread-0" java.util.zip.ZipError: jzentry == 0,
 jzfile = 140315557132880,
 total = 1001,
 name = /var/folders/9d/m0q0kvvx3fj54mr_m3qs3twh0000gn/T/zipfile.zip ,
 i = 1,
 message = null
	at java.util.zip.ZipFile$ZipEntryIterator.next(ZipFile.java:519)
	at java.util.zip.ZipFile$ZipEntryIterator.nextElement(ZipFile.java:499)
	at java.util.zip.ZipFile$ZipEntryIterator.nextElement(ZipFile.java:480)
	at ZipFileTest$ZipFileReader.run(ZipFileTest.java:65)
	at java.lang.Thread.run(Thread.java:745)

Well, the JVM didn’t crash, but we got plenty of errors, so the -Dsun.zip.disableMemoryMapping=true system property merely prevents crashes and seems to be unusable 😦 Here we need to stop blaming the JDK and start thinking about the following: what does “mapped file gets overwritten” actually mean? I.e., do the following two scenarios do the same thing or not:

$ file 1.txt
1.txt: empty
$ echo "talented team" > 1.txt
$ file 1.txt
1.txt: ASCII text
$ echo "talented team" > 1.txt.tmp
$ mv 1.txt.tmp 1.txt

Actually, the second scenario (write to a temporary file, then rename it over the old one) is safe from the JVM’s perspective, and a small patch to the initial testcase proves this:

@@ -79,9 +78,12 @@ public ZipFileWriter(File aFile) {
         public void run() {
             int i = 1;
             ZipOutputStream zOut;
+            File tmp;
             try {
-                zOut = new ZipOutputStream(new FileOutputStream(file));
-            } catch (FileNotFoundException e1) {
+                tmp = File.createTempFile(ZipFileTest.class.getSimpleName(),
+                        null, file.getParentFile());
+                zOut = new ZipOutputStream(new FileOutputStream(tmp));
+            } catch (IOException e1) {
                 e1.printStackTrace();
                 return;
             }
@@ -102,6 +104,9 @@ public void run() {
             } finally {
                 try {
                     zOut.close();
+                    if(!tmp.renameTo(file)) {
+                        System.err.println("Can't rename");
+                    }
                 } catch (IOException e) {
                 }
             }

and here we are:

$ java ZipFileTest
$ 
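
In modern code the same fix is tidier with NIO: build the new archive in a temporary file in the same directory (an atomic rename only works within one filesystem) and move it over the old one. A sketch, not part of the original testcase:

import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SafeReplace {
    // Write the new content next to the target (same directory, hence the
    // same filesystem), then atomically rename it over the old file. Readers
    // that mmap'ed the old file keep seeing the old, intact inode.
    static void replace(File target, byte[] newContent) throws Exception {
        Path tmp = Files.createTempFile(target.getParentFile().toPath(), "zipfile", ".tmp");
        Files.write(tmp, newContent);
        Files.move(tmp, target.toPath(), StandardCopyOption.ATOMIC_MOVE);
    }
}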

What about DFC? Javassist rocks!
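
I will not reproduce the whole patch here, but the mechanics are plain Javassist: locate the DFC method that rewrites jars in the BOF cache and wrap it so that it writes to a temporary file and then renames. A toy sketch of the instrumentation side – all DFC class and method names below are made up, the real ones have to be located first:

import javassist.ClassPool;
import javassist.CtClass;
import javassist.CtMethod;
import javassist.CtNewMethod;

public class BofCachePatcher {
    public static void main(String[] args) throws Exception {
        ClassPool pool = ClassPool.getDefault();
        // hypothetical class/method -- find the real jar writer in DFC first
        CtClass cc = pool.get("com.documentum.fc.bof.JarWriter");
        CtMethod impl = cc.getDeclaredMethod("storeJar"); // assume storeJar(java.io.File)
        impl.setName("storeJarImpl");
        // re-create storeJar(File) as a wrapper: let the original code write
        // into a temp file, then atomically rename it over the target
        CtMethod wrapper = CtNewMethod.copy(impl, "storeJar", cc, null);
        wrapper.setBody(
            "{ java.io.File tmp = java.io.File.createTempFile($1.getName(), \".tmp\", $1.getParentFile());"
          + "  storeJarImpl(tmp);"
          + "  if (!tmp.renameTo($1)) throw new java.io.IOException(\"rename failed\"); }");
        cc.addMethod(wrapper);
        cc.writeFile("patched-classes"); // put the patched class ahead of dfc.jar in the classpath
    }
}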

Weird DQL grammar. Part II

Typically Documentum ignores DQL hints that are present in subqueries (though in most cases the return_top hint is useless anyway):

1> select count(*) from
2> (select user_name from dm_user enable(return_top 10))
3> go
count(*)              
----------------------
                    66
(1 row affected)

Documentum also fails to parse the union keyword in a subquery:

1> select count(*) from
2> (select user_name from dm_user union select group_name from dm_group)
3> go
[DM_QUERY_E_SYNTAX]error:  
  "A Parser Error (syntax error) has occurred in the vicinity of:  
   select count(*) from (select user_name from dm_user union"

but:

1> select count(*) from
2> (select user_name from dm_user enable(return_range 1 10 'user_name'))
3> go
count(*)              
----------------------
                    10
(1 row affected)

BugMakers :)

Frankly speaking, when I was writing the previous blogpost I was surprised to discover that the DQL update statement preserves the value of the r_lock_owner attribute:

API> revert,c,09024be98006b104
...
OK
API> get,c,09024be98006b104,r_lock_owner
...
dmadmin
API> ?,c,update dm_document objects set object_name='xxx' where r_object_id='09024be98006b104'
objects_updated
---------------
              1
(1 row affected)
[DM_QUERY_I_NUM_UPDATE]info:  "1 objects were affected by your UPDATE statement."


API> revert,c,09024be98006b104
...
OK
API> get,c,09024be98006b104,r_lock_owner
...
dmadmin
API> 

Unfortunately, this is not true when you update objects whose behaviour is customized via a TBO:

API> retrieve,c,bs_doc_cash
...
09bc2c71806d6ffe
API> checkout,c,l
...
09bc2c71806d6ffe
API> ?,c,update bs_doc_cash objects set object_name='xxx' where r_object_id='09bc2c71806d6ffe'
objects_updated
---------------
              1
(1 row affected)
[DM_QUERY_I_NUM_UPDATE]info:  "1 objects were affected by your UPDATE statement."


API> revert,c,09bc2c71806d6ffe
...
OK
API> get,c,09bc2c71806d6ffe,r_lock_owner
...

API>

but:

API> checkout,c,09bc2c71806d6ffe
...
09bc2c71806d6ffe
API> apply,c,,EXEC,QUERY,S,update bs_doc_cash objects set object_name='xxx' where r_object_id='09bc2c71806d6ffe',BOF_DQL,B,F
...
q0
API> next,c,q0
...
OK
API> dump,c,q0
...
USER ATTRIBUTES

  objects_updated                 : 1

SYSTEM ATTRIBUTES


APPLICATION ATTRIBUTES


INTERNAL ATTRIBUTES


API> revert,c,09bc2c71806d6ffe
...
OK
API> get,c,09bc2c71806d6ffe,r_lock_owner
...
dmadmin
API> 
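
By the way, the same EXEC/QUERY apply call can be issued from DFC via IDfSession.apply. A sketch, assuming an open IDfSession (argument names mirror the API transcript above; I have not verified it against every DFC version):

import com.documentum.fc.client.IDfCollection;
import com.documentum.fc.client.IDfSession;
import com.documentum.fc.common.DfException;
import com.documentum.fc.common.DfList;
import com.documentum.fc.common.IDfList;

public class ExecQueryNoBof {
    // Mirrors: apply,c,,EXEC,QUERY,S,<dql>,BOF_DQL,B,F
    static void updateBypassingTbo(IDfSession session, String dql) throws DfException {
        IDfList names = new DfList();
        IDfList types = new DfList();
        IDfList values = new DfList();
        names.appendString("QUERY");   types.appendString("S"); values.appendString(dql);
        names.appendString("BOF_DQL"); types.appendString("B"); values.appendString("F");
        IDfCollection coll = session.apply(null, "EXEC", names, types, values);
        try {
            while (coll.next()) {
                // the collection returns objects_updated, as in the transcript
            }
        } finally {
            coll.close();
        }
    }
}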

Q & A. XIV

Hi,
I need to link multiple documents to folder in documentum.
those documents are checked out. Can you please advise how can i achieve this?

Thanks
Ram

Unfortunately, this guy didn’t provide a good description of his problem, so let’s pretend that he is trying to do something like:

update dm_document objects
link '/target/folder'
where ...

and getting something like:

API> ?,c,update dm_document objects link '/Temp' where r_object_id='09024be98006ab34'
[DM_QUERY_F_UP_SAVE]fatal:  "UPDATE:  An error has occurred during a save operation."

[DM_SYSOBJECT_E_LOCKED]error:  
  "The operation on  sysobject was unsuccessful because it is locked by user DCTM_DEV."


API> get,c,09024be98006ab34,r_lock_owner
...
DCTM_DEV

Now watch my hands – it’s magic:

API> ?,c,alter group dm_escalated_allow_save_on_lock  add dmadmin

API> ?,c,update dm_document objects link '/Temp' where r_object_id='09024be98006ab34'
objects_updated
---------------
              1
(1 row affected)
[DM_QUERY_I_NUM_UPDATE]info:  "1 objects were affected by your UPDATE statement."


API> revert,c,09024be98006ab34
...
OK
API> get,c,09024be98006ab34,r_lock_owner
...
DCTM_DEV
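
For the record, the same link can be performed from DFC; a minimal sketch (an open IDfSession is assumed, the object id is taken from the transcript above), which works on the foreign-locked object only because the session user has been added to dm_escalated_allow_save_on_lock:

import com.documentum.fc.client.IDfSession;
import com.documentum.fc.client.IDfSysObject;
import com.documentum.fc.common.DfId;

public class LinkLockedDoc {
    // Link a document locked by another user into /Temp.
    static void link(IDfSession session) throws Exception {
        IDfSysObject obj = (IDfSysObject) session.getObject(new DfId("09024be98006ab34"));
        obj.link("/Temp");
        obj.save(); // would fail with DM_SYSOBJECT_E_LOCKED otherwise
    }
}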

Hardware company mantra. Part II


I thought a lot about how to write a continuation to the previous blogpost: I have my own OpenText experience and I’m looking forward to sharing it; on the other hand, OpenText has recently shared their Documentum “roadmap“, and I want to share my thoughts on that as well. Today Alvaro de Andres inspired me to write a blogpost about the second topic, i.e. the roadmap. Actually, I would have completely ignored his blogpost if it had been written by a salesperson hired a couple of months ago with a KPI to sell as many D2 licences as possible, but statements like “migrate to D2 because you can configure it” sound ridiculous coming from an architect with 20 years of experience. Well, let’s discuss this topic more thoroughly, and let’s begin not with Documentum 🙂

Simple question: what is the worst bugtracking software ever?

I do…


Database connections. Part II

Well, previously we came up with an estimate for the number of database connections – about twice the number of concurrent Documentum sessions; maybe less, maybe more, depending on the application. Now the question: how many connections is it possible to create in the database? OpenText seems to think that the number of database connections is unlimited:

D2 runs on content server which has good scalability by adding additional content server nodes. Content server is often the bottleneck of the whole D2 topology when the system is running under a heavy load condition. When content server reaches its bottleneck, we could monitored the CPU usage of content server is up to 90% and the number of active sessions grows very fast. To add one additional content server node to the existing environment could improve system throughput significantly.
Officially we suggests adding one additional content server node on every 300 concurrent users’ growth. The mileage

which is actually not true. On the other hand, if OpenText had written something like “our product fails to take advantage of best practices and does not pool database connections”, it would have looked ridiculous – so instead of improving the product they have preferred to publish another piece of marketing bullshit.

So, why is database connection pooling important? If you ask Google you will find something like: creating a database connection is an expensive and time-consuming operation – the application needs to perform a TCP (or even TLS) handshake and authenticate, the database needs to start a new server process, etc. – so it is recommended to pool database connections. Unfortunately, that is only half of the truth: pools also limit the number of concurrent database connections, and that matters too. Let me quote the best Oracle database expert ever:

In looking at your Automatic Workload Repository report, I see that the longest-running events at the system level are latch-related: cache buffers chains and library cache. Additionally, your CPU time was way up there. Concurrency-based waits are caused by one thing and one thing only: having many concurrently active sessions. If you had fewer concurrently active sessions, you would by definition have fewer concurrency-based waits (fewer people contending with each other). I see that you had 134 sessions in the database running on a total of 4 CPU cores. Because it is not really possible for 4 CPU cores to allow 134 sessions to be concurrently active, I recommend that you decrease the number of concurrent sessions by reducing the size of your connection pool—radically. Cut your connection pool to something like 32. It is just not possible for 4 cores to run 134 things at the same time; 4 cores can run only 4 things at exactly the same time. If you reduce the concurrent user load, you’ll see those waits go down, response time go down, and your transaction rates improve. In short, you’ll find out that you can actually do more work with fewer sessions than you can with more.

I know that this fewer-does-more suggestion sounds counterintuitive, but I encourage you to watch this narrated Real World Performance video.

In this video, you’ll see what happens in a test of a 12-core machine running transactions and decreasing the size of a connection pool from an unreasonable number (in the thousands) to a reasonable number: 96. At 96 connections on this particular machine, the machine was able to do 150 percent the number of transactions per second and took the response time for these transactions from ~100 milliseconds down to ~5 milliseconds.

Short of reducing your connection pool size (and therefore changing the way the application is using the database by queuing in the middle-tier connection pool instead of queuing hundreds of active sessions in the database), you would have to change your queries to make them request cache buffers chains latches less often. In short: tune the queries and the algorithms in the application. There is literally no magic here. Tweaking things at the system level might not be an option. Touching the application might have to be an option.
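
That advice translates directly into middle-tier configuration: cap the pool, and let excess requests queue in the application instead of piling up as active sessions in the database. A minimal sketch with HikariCP (connection details are made up; note that Documentum itself does not do this – which is exactly the problem):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class BoundedPoolSketch {
    public static void main(String[] args) throws Exception {
        HikariConfig cfg = new HikariConfig();
        cfg.setJdbcUrl("jdbc:oracle:thin:@//dbhost:1521/DCTM"); // made-up URL
        cfg.setUsername("docbase_owner");                       // made-up credentials
        cfg.setPassword("secret");
        cfg.setMaximumPoolSize(32); // the hard cap is the whole point: excess
                                    // requests wait in the pool instead of
                                    // becoming active sessions in the database
        try (HikariDataSource ds = new HikariDataSource(cfg);
             Connection c = ds.getConnection();
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery("select 1 from dual")) {
            rs.next();
        }
    }
}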

And from the Documentum perspective, the only option to limit the number of database connections is to use the shared server feature (fuck yeah, Oracle has supported Database Resident Connection Pooling since 11g, but the mature product does not take advantage of it). And do not pay much attention to EMC documents like Optimizing Oracle for EMC Documentum – such documents are wrong from beginning to end.