Ingestion rates

The main problem with migration projects is that you know the amount of data to be migrated and you have time constraints, but you either have no references from similar projects or the references you have are obscure. Let me provide some examples.

Relying on my previous experience, I would say that 20 documents per second is a good rate for a single thread (actually, this rate is a bit overestimated because I didn’t upload content). Is it possible to decrease this rate? Definitely yes. Is it possible to improve this rate? Obviously yes: we have three systems involved in ingestion:

  • feeder – generates commands for Content Server
  • Content Server – parses commands from feeder, stores content on disk, generates commands for database
  • database – parses commands from Content Server, stores data on disk

and when one system is active, the other two are idle, so, as a first approximation, we may get a threefold improvement in ingestion rate if we run three threads simultaneously. But writing multithreaded programs sounds complicated, so let’s try to find references on the internet…
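Before looking at those references, here is a minimal sketch of the "several threads at once" idea in plain Java. ParallelFeeder, ingestAll and the ingestOne callback are hypothetical names used only for illustration; the callback stands in for whatever the feeder does per document, and in a real feeder each worker would presumably need its own repository session.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical helper, not part of DFC: spreads ingestion work over worker threads.
final class ParallelFeeder {

    // Calls ingestOne for every item using the given number of threads and
    // returns the number of items processed. A worker failure is rethrown
    // as an IllegalStateException.
    static <T> int ingestAll(List<T> items, int threads, Consumer<T> ingestOne) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int offset = t;
                // Worker t takes items t, t + threads, t + 2 * threads, ...
                // so no two workers ever touch the same item.
                results.add(pool.submit(() -> {
                    int done = 0;
                    for (int i = offset; i < items.size(); i += threads) {
                        ingestOne.accept(items.get(i));
                        done++;
                    }
                    return done;
                }));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                try {
                    total += result.get(); // blocks until that worker has finished
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for workers", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("a worker failed", e.getCause());
                }
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> paths = List.of("a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf");
        // The empty lambda is a placeholder for the real work
        // (create object, set attributes, save).
        int processed = ingestAll(paths, 3, path -> { });
        System.out.println(processed + " documents processed");
    }
}
```

Splitting the work by index keeps the threads from competing for the same document, which is the "avoid conflicts" concern that comes up in the references below.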

The first:

DEMA – all about speed

The core of DEMA is about speed of the migration – examples included:

  • Government Client – 28 million documents – 10 TB – 10 days
  • Cable TV – 75 million documents – 5 TB – 15 days
  • Life Sciences – 1.5 million documents – 24 hours

The second:

Architected for the “fastest migration possible”, DEMA was constructed to avoid the Documentum API and go directly to the underlying database.

Thread communication – to be truly multi-threaded, the migration tool has to make sure that activities are divided up correctly by each thread to avoid conflicts.

It seems that the best reference is 5 million documents per day (about 58 documents per second), and this result was achieved by avoiding the Documentum API, so our 20 documents per second sounds reasonable because we are using the Documentum API, doesn’t it? Bullshit! The project durations in the first reference do not make sense: I may spend half a day coding a migration tool and then load 10 million documents during the remaining half of the day; what ingestion rate did I achieve? 230 or 115 documents per second? Moreover, EMC provides completely different and more promising rates:

migrates more than 1.2 million objects in one hour
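To make the comparison concrete, the quoted figures can be converted to documents per second with simple arithmetic. The sketch below only restates the numbers quoted above; IngestionRate and docsPerSecond are illustrative names, and a "day" is assumed to mean a full 24 hours of loading, which is exactly the assumption being questioned.

```java
// Hypothetical helper for converting "N documents in D days" into a rate.
final class IngestionRate {

    static final long SECONDS_PER_DAY = 24L * 60 * 60;

    // Average rate, assuming the load ran for the whole period without pauses.
    static double docsPerSecond(long documents, double days) {
        return documents / (days * SECONDS_PER_DAY);
    }

    public static void main(String[] args) {
        System.out.println(docsPerSecond(28_000_000L, 10));     // Government Client, roughly 32
        System.out.println(docsPerSecond(75_000_000L, 15));     // Cable TV, roughly 58
        System.out.println(docsPerSecond(1_500_000L, 1));       // Life Sciences, roughly 17
        System.out.println(docsPerSecond(10_000_000L, 0.5));    // half a day of loading, roughly 231
        System.out.println(docsPerSecond(10_000_000L, 1));      // same load counted as a full day, roughly 116
        System.out.println(docsPerSecond(1_200_000L, 1 / 24.0)); // EMC figure, roughly 333
    }
}
```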

Now, in order to provide a proper reference for ingestion rates, here are my real-world results (4 CPU cores on Content Server):

  • single thread performance (no optimisation tricks) – 17 documents per second
  • single thread performance with optimised content transfer – 22 documents per second
  • 4 threads with optimised content transfer (50% CPU utilisation on CS) – 67 documents per second
  • 20 threads with optimised content transfer (80% CPU utilisation on CS) – 130 documents per second
  • 50 threads with optimised content transfer (100% CPU utilisation on CS) – 180 documents per second
  • 50 threads with optimised content transfer and disabled auditing (dm_save and dm_link events) (100% CPU utilisation on CS) – 250 documents per second
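The figures above also show how far the scaling is from linear. The following sketch computes the speedup over the optimised single-thread run (22 documents per second) and the fraction of ideal linear scaling for each thread count; ScalingReport and its methods are illustrative names, and the 10 million document volume in the last line is only an example.

```java
// Hypothetical helper that summarises the measured rates listed above.
final class ScalingReport {

    // Throughput gain relative to the baseline run.
    static double speedup(double rate, double baselineRate) {
        return rate / baselineRate;
    }

    // Fraction of ideal linear scaling achieved with the given thread count.
    static double efficiency(double rate, double baselineRate, int threads) {
        return rate / (baselineRate * threads);
    }

    // Hours needed to ingest the given number of documents at a sustained rate.
    static double hoursToIngest(long documents, double rate) {
        return documents / rate / 3600.0;
    }

    public static void main(String[] args) {
        double baseline = 22;              // single thread, optimised content transfer
        int[] threads = {4, 20, 50};
        double[] rates = {67, 130, 180};   // documents per second from the list above
        for (int i = 0; i < threads.length; i++) {
            System.out.println(threads[i] + " threads: speedup "
                    + speedup(rates[i], baseline) + ", efficiency "
                    + efficiency(rates[i], baseline, threads[i]));
        }
        // Example volume: 10 million documents at the best measured rate.
        System.out.println(hoursToIngest(10_000_000L, 250) + " hours");
    }
}
```

With these numbers, 4 threads give about a threefold speedup, while going from 20 to 50 threads adds comparatively little because the Content Server CPU is already close to saturation.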

“Optimised content transfer” means the following: in order to avoid content transfer between the feeder and Content Server, I made the content files available on the CS host, and instead of doing something like:

object.setContentType("format");
object.setFile("path to file");

I do the following:

object.setContentType("format");
ITypedData extendedData = ((ISysObject) object).getExtendedData();
extendedData.setString("HANDLE_CONTENT", "yes");
extendedData.setBoolean("FIRST_PAGE", true);
extendedData.setString("FILE_PATH", "path on content server");

In order to avoid redundant RPCs related to the calculation of default values, I do the following:

// fake TBO
import com.documentum.fc.client.DfSysObject;
import com.documentum.fc.common.DfException;

/**
 * @author Andrey B. Panfilov <andrew@panfilov.tel>
 */
public class FakeSysObject extends DfSysObject {

    public FakeSysObject() {
        super();
    }

    // by default DFC calls
    // com.documentum.fc.client.DfPersistentObject.setDefaultValues
    // which has performance impact, so we override this behaviour
    @Override
    protected void init() throws DfException {
        ReflectionUtils.invokeMethod(this, "setValues");
    }

}

// poisoning DFC by fake TBO
registerTBO("docbase", "type_name", FakeSysObject.class);

protected void registerTBO(String docbaseName, String type, Class<?> clazz)
    throws DfException {
    // poisoning DocbaseRegistry
    ClassCacheManager.getInstance().setIsDownloading(true);
    IntrinsicModuleRegistry moduleRegistry = IntrinsicModuleRegistry
            .getInstance();
    ReflectionUtils.invokeMethod(moduleRegistry, "registerTBO", type,
            clazz.getName());
}

// populating default values in feeder
object.setString(....);
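The ReflectionUtils helper used above is not shown in the snippets. A minimal sketch of what such a helper might look like is given below; this is an assumption about its behaviour (find a method by name and argument count anywhere in the class hierarchy, make it accessible, invoke it), not the actual implementation. The nested Demo class exists only to illustrate usage.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

// Assumed shape of the helper: invokes a possibly non-public method by name.
final class ReflectionUtils {

    private ReflectionUtils() {
    }

    // Looks for a method with the given name and number of parameters in the
    // target's class and its superclasses, then invokes it. Overloads with the
    // same parameter count are not distinguished; the first match is used.
    static Object invokeMethod(Object target, String name, Object... args) {
        for (Class<?> c = target.getClass(); c != null; c = c.getSuperclass()) {
            for (Method method : c.getDeclaredMethods()) {
                if (method.getName().equals(name)
                        && method.getParameterCount() == args.length) {
                    try {
                        method.setAccessible(true); // bypass private/protected
                        return method.invoke(target, args);
                    } catch (IllegalAccessException e) {
                        throw new IllegalStateException("cannot access " + name, e);
                    } catch (InvocationTargetException e) {
                        throw new IllegalStateException(name + " failed", e.getCause());
                    }
                }
            }
        }
        throw new IllegalArgumentException(
                "no method " + name + " with " + args.length + " parameter(s)");
    }

    // Illustration only: a class with a private method.
    static final class Demo {
        private String greet(String who) {
            return "hello " + who;
        }
    }

    public static void main(String[] args) {
        System.out.println(invokeMethod(new Demo(), "greet", "CS"));
    }
}
```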