Bulk fetches

In September 2014 we got a new customer whose EDMS system, based on heavy customization of Webtop, had a whole bunch of performance issues. The main problem was that the application performed a lot of fetches from the docbase; for example, in some cases opening a document's properties page was accompanied by 10,000 fetches (we already know that fetches are slow), so the users' experience was predictable. To eliminate most of the bottlenecks caused by such weird design we performed some optimizations intended to decrease the number of fetches – some things got replaced by DQL queries, other things got cached (here the technique described in Power Pivot helped us a lot – it made it pretty easy to find slow components and components with skewed data). Unfortunately, due to the application's design it was not possible to eliminate all unnecessary fetches – that would have meant completely rewriting the application. So, I decided to implement "bulk fetches", i.e. to try to load objects of the same type by a single RPC command.

Why are fetches slow?

At first, I think it is worth shedding some light on the IDfSession#getObject and IDfPersistentObject#fetch methods. The documentation says the following about these methods:

getObject

IDfPersistentObject getObject(IDfId objectId) throws DfException

Returns a Documentum server object.

fetch

boolean fetch(String typeNameIgnored) throws DfException

Fetches this object from the repository without placing a lock on the object.
Use this method to ensure that you are working with the most recent version of the object. You must have at least BROWSE permission on an object to call the fetch method.
Without a lock, there is no guarantee that you will be able to save any changes you make to the object since another user may checkout the object while you have it fetched. If you fetch an object, you cannot use the checkin method to write the object back to the repository. You must use the save method.

I have no idea who in EMC writes such misleading documentation, but the only correct description for these methods is: "both methods retrieve the current object's snapshot from the repository if the corresponding DFC object is not dirty". To clarify the last statement, let's perform some tests based on the technique described in RPC Commands (draft):

import com.documentum.com.DfClientX;
import com.documentum.fc.client.IDfPersistentObject;
import com.documentum.fc.client.IDfSession;
import com.documentum.fc.common.DfDocbaseConstants;
import com.documentum.fc.common.DfException;
import com.documentum.fc.common.DfId;
import com.documentum.fc.common.DfList;
import com.documentum.fc.common.DfLoginInfo;

public class Test {

    public static void main(String[] args) throws DfException {
        IDfSession session = new DfClientX().getLocalClient().newSession(
                "ssc_dev", new DfLoginInfo("dmadmin", "dmadmin"));
        System.out.println(session.getSessionConfig().dump());
        // cache is empty -- expect a full fetch RPC
        log(session, "FIRST GET OBJECT START");
        doTrace(session, true);
        session.getObject(new DfId("0901ffd7805b03d8"));
        doTrace(session, false);
        log(session, "FIRST GET OBJECT END");
        // cache already contains object -- expect an IsCurrent RPC
        log(session, "SECOND GET OBJECT START");
        doTrace(session, true);
        IDfPersistentObject object = session.getObject(new DfId(
                "0901ffd7805b03d8"));
        doTrace(session, false);
        log(session, "SECOND GET OBJECT END");
        // dirty object -- expect no RPC at all
        object.setString(DfDocbaseConstants.OBJECT_NAME, "");
        log(session, "THIRD GET OBJECT START");
        doTrace(session, true);
        session.getObject(new DfId("0901ffd7805b03d8"));
        log(session, "THIRD GET OBJECT END");
        doTrace(session, false);
        session.disconnect();
    }

    // writes a marker message into the session log via the STAMP_TRACE RPC
    private static void log(IDfSession session, String message)
        throws DfException {
        session.apply(null, "STAMP_TRACE",
                new DfList(new String[] {"MESSAGE" }), new DfList(
                        new String[] {"S" }), new DfList(
                        new String[] {message }));
    }

    // enables/disables detailed RPC tracing in the session log
    private static void doTrace(IDfSession session, boolean enable)
        throws DfException {
        if (enable) {
            session.apply(null, "LOG_ON", new DfList(new String[] {"DETAIL" }),
                    new DfList(new String[] {"B" }), new DfList(
                            new String[] {"T" }));
        } else {
            session.apply(null, "LOG_OFF", new DfList(new String[] {}),
                    new DfList(new String[] {}), new DfList(new String[] {}));
        }
    }

}

The session log is:

// cache is empty
FIRST GET OBJECT START
[DM_SESSION_I_OPERATION_STARTED]info:  "Operation SysObjFullFetch started."
[DM_SESSION_I_OPERATION_ARGUMENTS]info:  "
Object Id: 0901ffd7805b03d8
Argument Object:
2
OBJ NULL 0 0 0 0
2
FOR_REVERT BOOL S 0
F
CACHE_VSTAMP INT S 0
174476
"
[DM_SESSION_I_OPERATION_ENDED]info:  "Operation SysObjFullFetch completed."
FIRST GET OBJECT END
// cache already contains object
SECOND GET OBJECT START
[DM_SESSION_I_OPERATION_STARTED]info:  "Operation IsCurrent started."
[DM_SESSION_I_OPERATION_ARGUMENTS]info:  "
Object Id: 0901ffd7805b03d8
Argument Object:
2
OBJ NULL 0 0 0 0
4
OBJECT_TYPE STRING S 0
A 11 dm_document
i_vstamp INT S 0
0
CACHE_VSTAMP INT S 0
174476
TYPE_CACHE_VSTAMP INT S 0
4707
"
[DM_SESSION_I_OPERATION_ENDED]info:  "Operation IsCurrent completed."
SECOND GET OBJECT END
// dirty object
THIRD GET OBJECT START
THIRD GET OBJECT END

That is, for an uncached object DFC sends the fetch RPC command (SysObjFullFetch for sysobjects), for a cached object DFC sends the IsCurrent RPC command, and for a dirty object DFC sends nothing. The behaviour of the IDfPersistentObject#fetch method is the same: DFC tries to figure out whether the object's snapshot is current by sending the IsCurrent RPC command only if the object is not dirty, and if IsCurrent returns false, DFC sends the fetch RPC. Note, if your coder writes something like:

IDfPersistentObject object = session.getObject("bla-bla-bla");
object.fetch(null);

he is an idiot – object.fetch() just reiterates the same logic that was already performed by session.getObject(). So in the best case IDfSession#getObject sends only one RPC command (IsCurrent or fetch), and in the worst case (the object is present in the DFC cache but stale) it sends two RPC commands (IsCurrent and fetch). In my case the application performs a lot of fetches, hence the cache hit rate tends to be zero; moreover, my situation is complicated by the fact that the SysObjFullFetch RPC command is extremely slow because it submits five SQL queries to the database (IsCurrent submits only one). Those queries are:

-- determine object type by r_object_id
SELECT dm_dbalias_B.I_TYPE
  FROM DMI_OBJECT_TYPE dm_dbalias_B
 WHERE dm_dbalias_B.R_OBJECT_ID = :dmb_handle

-- retrieve data from database
  SELECT *
    FROM TYPE_RV dm_dbalias_B, TYPE_SV dm_dbalias_C
   WHERE (    dm_dbalias_C.R_OBJECT_ID = :dmb_handle
          AND dm_dbalias_C.R_OBJECT_ID = dm_dbalias_B.R_OBJECT_ID)
ORDER BY dm_dbalias_B.R_OBJECT_ID, dm_dbalias_B.I_POSITION

-- get identifier of object's ACL (totally useless)
SELECT dm_dbalias_B.R_OBJECT_ID
  FROM DM_ACL_S dm_dbalias_B
 WHERE (dm_dbalias_B.OWNER_NAME = :p00 
    AND dm_dbalias_B.OBJECT_NAME = :p01)

-- retrieve object's ACL
SELECT *
    FROM DM_ACL_RV dm_dbalias_B, DM_ACL_SV dm_dbalias_C
   WHERE (    dm_dbalias_C.R_OBJECT_ID = :dmb_handle
          AND dm_dbalias_C.R_OBJECT_ID = dm_dbalias_B.R_OBJECT_ID)
ORDER BY dm_dbalias_B.R_OBJECT_ID, dm_dbalias_B.I_POSITION

-- check whether ACL is actual or not (totally useless)
SELECT dm_dbalias_B.R_OBJECT_ID
  FROM DM_ACL_S dm_dbalias_B
 WHERE (    dm_dbalias_B.R_OBJECT_ID = :dmb_objectp
        AND dm_dbalias_B.I_VSTAMP = :dmb_versionp)
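
To summarize, below is an illustrative pseudocode sketch of the decision flow described above; the helper methods (lookupInSessionCache, isCurrent, fullFetch) are hypothetical names that merely stand for the RPCs involved, this is not the actual DFC source:

// illustrative pseudocode only, not DFC internals
IDfPersistentObject getObject(IDfId id) throws DfException {
    IDfPersistentObject cached = lookupInSessionCache(id); // hypothetical
    if (cached == null) {
        return fullFetch(id);   // fetch RPC (SysObjFullFetch): 5 SQL queries
    }
    if (cached.isDirty()) {
        return cached;          // dirty object: no RPC at all
    }
    if (isCurrent(cached)) {    // IsCurrent RPC: 1 SQL query
        return cached;          // cached snapshot is still current
    }
    return fullFetch(id);       // stale snapshot: IsCurrent + fetch, 2 RPCs
}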

Bulk fetches

Surprisingly, DFC has a method that allows constructing persistent objects from a DQL query:

getObjectsByQuery

IDfEnumeration getObjectsByQuery(String dql,
                                 String optionalTypeName)
                                 throws DfException

Returns an enumeration of persistent objects using data obtained through a query.
The query doesn’t need to contain all the object attributes but there are a minimum subset of attributes that must be present. All queries must include r_object_id and i_vstamp attribute. The query must include r_object_type, r_aspect_name, i_is_replica and i_is_reference if the object has these attributes. The optionalTypeName is required if the object does not have the attribute r_object_type, eg: dm_user. When a query is done on dmi_queue_item, the attribute “source_docbase” must be included in the query.
Note that the objects returned by this call may not be fully populated with data. Only data present in the query will exist in the returned objects. As long as your subsequent access to the objects only reference the populated data then no “fetch” of the full object will occur. As soon as you try to access object data that is not present then an internal “fetch” will be triggered to obtain the remaining data. This process is transparent to the object user.
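
For reference, a minimal usage sketch of getObjectsByQuery (the folder path and the object_name attribute are illustrative; the rest of the attribute list is the mandatory subset from the quote above):

// illustrative query; only the mandatory attributes plus object_name
String dql = "SELECT r_object_id, i_vstamp, r_object_type, r_aspect_name,"
        + " i_is_replica, i_is_reference, object_name"
        + " FROM dm_document WHERE FOLDER('/Temp')";
IDfEnumeration enumeration = session.getObjectsByQuery(dql, null);
while (enumeration.hasMoreElements()) {
    IDfPersistentObject object = (IDfPersistentObject) enumeration.nextElement();
    // only the queried attributes are populated; touching anything else
    // triggers a transparent full fetch
    System.out.println(object.getString("object_name"));
}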

Unfortunately, the IDfSession#getObjectsByQuery method didn't fit my needs for the following reasons:

  • it returns a deprecated Enumeration backed by an ArrayList – large result sets consume a lot of memory
  • it would have required rewriting a lot of DQLs to inject objects' attributes – the application was initially designed to retrieve a set of object identifiers using a DQL query and then call session.getObject() to get the persistent objects

So, I started looking for a way to perform "bulk fetches" given a set of object identifiers. After some investigation I discovered an option to construct persistent objects from an IDfCollection:

// ISession and PersistentObjectManager are internal DFC classes
PersistentObjectManager objectManager = ((ISession) session)
        .getObjectManager();
IDfCollection collection = ...;
while (collection.next()) {
    // constructs a persistent object from the current collection row
    IDfPersistentObject object = objectManager
            .getObjectFromQueryRow(collection, objectType);
}

Now the problem had narrowed down to constructing an IDfCollection with the required attributes. I came up with the following options:

  • DQL_MATCH RPC command:
    StringBuilder builder = new StringBuilder();
    builder.append(DfDocbaseConstants.R_OBJECT_ID).append(" IN (");
    for (Iterator<String> iter = ids.iterator(); iter.hasNext();) {
        builder.append('\'').append(iter.next()).append('\'');
        if (iter.hasNext()) {
            builder.append(',');
        }
    }
    builder.append(")");
    PersistentObjectManager manager = ((ISession) session)
            .getObjectManager();
    IDfCollection collection = session.apply(null, "DQL_MATCH", new DfList(
            new String[] {"QUERY_TYPE", "QUERY_PREDICATE" }), new DfList(
            new String[] {"S", "S" }), new DfList(new String[] {
        "dm_document", builder.toString() }));
    while (collection.next()) {
        IDfPersistentObject object = manager.getObjectFromQueryRow(
                collection, "dm_document");
    }
    
  • A regular DQL query with a trick to preserve the ordering of repeating attributes (complete code can be found on github; a reusable wrapper is sketched after this list):
    IDfType type = session.getType("dm_document");
    boolean hasRepeatings = false;
    StringBuilder builder = new StringBuilder("SELECT ");
    for (int i = 0, n = type.getTypeAttrCount(); i < n; i++) {
        IDfAttr attr = type.getTypeAttr(i);
        if (attr.isRepeating()) {
            hasRepeatings = true;
        }
        builder.append(attr.getName());
        builder.append(",");
    }
    builder.append(DfDocbaseConstants.R_OBJECT_ID).append(",");
    builder.append(DfDocbaseConstants.I_VSTAMP);
    if (hasRepeatings) {
        builder.append(',');
        builder.append("i_position");
    }
    builder.append(" WHERE ").append(DfDocbaseConstants.R_OBJECT_ID)
            .append(" IN (");
    for (Iterator<String> iter = ids.iterator(); iter.hasNext();) {
        builder.append('\'').append(iter.next()).append('\'');
        if (iter.hasNext()) {
            builder.append(',');
        }
    }
    builder.append(")");
    builder.append(" ORDER BY ").append(DfDocbaseConstants.R_OBJECT_ID);
    if (hasRepeatings) {
        builder.append("i_position  DESC");
    }
    PersistentObjectManager manager = ((ISession) session)
            .getObjectManager();
    IDfCollection collection = new DfQuery(builder.toString()).execute(
            session, IDfQuery.DF_EXEC_QUERY);
    while (collection.next()) {
        IDfPersistentObject object = manager.getObjectFromQueryRow(
                collection, "dm_document");
    }
    
    
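For convenience, the second option can be wrapped into a reusable helper. Below is a minimal sketch under the assumption that the query-building code above is extracted into a buildQuery(session, typeName, ids) method; bulkFetch and buildQuery are illustrative names of mine, not DFC API (note that, unlike the snippets above, it also closes the collection):

// hypothetical wrapper; bulkFetch and buildQuery are illustrative names
public static List<IDfPersistentObject> bulkFetch(IDfSession session,
        String typeName, Collection<String> ids) throws DfException {
    List<IDfPersistentObject> result = new ArrayList<IDfPersistentObject>();
    if (ids.isEmpty()) {
        return result;
    }
    // internal DFC machinery used to construct objects from query rows
    PersistentObjectManager manager = ((ISession) session).getObjectManager();
    IDfCollection collection = new DfQuery(buildQuery(session, typeName, ids))
            .execute(session, IDfQuery.DF_EXEC_QUERY);
    try {
        while (collection.next()) {
            result.add(manager.getObjectFromQueryRow(collection, typeName));
        }
    } finally {
        collection.close();
    }
    return result;
}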

Later, DQL_MATCH demonstrated poor performance – it gave only about a threefold improvement over regular fetches – and, moreover, it sometimes does not work properly, so DQL_MATCH was not an option anymore.

It seemed that the last challenge was to identify the optimal count of object identifiers in the "IN ()" clause – I already knew that DQL queries have length restrictions and that, in the general case, performance depends on the length of the query. After some microbenchmarks I found that the optimal value is somewhere between 100 and 500.
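
In practice that means splitting the full set of identifiers into fixed-size batches; a minimal sketch, where BATCH_SIZE is an assumption based on the measurements above and bulkFetch is the hypothetical helper sketched earlier:

private static final int BATCH_SIZE = 500; // assumed optimal batch size

public static List<IDfPersistentObject> bulkFetchAll(IDfSession session,
        String typeName, List<String> ids) throws DfException {
    List<IDfPersistentObject> result = new ArrayList<IDfPersistentObject>(
            ids.size());
    // process identifiers in BATCH_SIZE chunks to keep the IN () clause short
    for (int from = 0; from < ids.size(); from += BATCH_SIZE) {
        int to = Math.min(from + BATCH_SIZE, ids.size());
        result.addAll(bulkFetch(session, typeName, ids.subList(from, to)));
    }
    return result;
}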

However, I noticed a flat region on the graphs, which is caused by garbage collection activity, so now the challenge is to find optimal GC settings.
