Ingestion rates

The main problem of migration projects is that you know the amount of data to be migrated and you have time constraints, but you have no references for similar projects, or the references you do have are obscure. Let me provide some examples.

Relying on my previous experience, I may say that 20 documents per second is a good rate for a single thread (actually, this rate is a bit overestimated because I didn’t upload content). Is it possible to decrease this rate? Definitely yes. Is it possible to improve this rate? Obviously yes: we have three systems involved in ingestion:

  • feeder – generates commands for Content Server
  • Content Server – parses commands from feeder, stores content on disk, generates commands for database
  • database – parses commands from Content Server, stores data on disk

and while one system is active, the other two are idle, so, to a first approximation, we may get a threefold improvement in ingestion rate if we spawn three threads simultaneously. To make that concrete, below is a minimal sketch of such a multithreaded feeder; ingestOne() and listFilesToMigrate() are hypothetical placeholders, and each worker must use its own session, because DFC sessions are not thread-safe:
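import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class MultiThreadedFeeder {

    public static void main(String[] args) throws InterruptedException {
        // three workers, one per pipeline stage, is only the first
        // approximation; the real-world numbers below show that much
        // higher thread counts still pay off
        ExecutorService pool = Executors.newFixedThreadPool(3);
        for (final String path : listFilesToMigrate()) {
            pool.submit(new Runnable() {
                @Override
                public void run() {
                    ingestOne(path);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }

    // placeholder: enumerate the files to be migrated
    private static List<String> listFilesToMigrate() {
        return Arrays.asList("/data/doc1.pdf", "/data/doc2.pdf");
    }

    // placeholder: acquire a session (one per thread, DFC sessions are
    // not thread-safe) and create a single document via DFC
    private static void ingestOne(String path) {
    }
}

But writing multithreaded programs sounds complicated, so let’s try to find references on the internet…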

The first:

DEMA – all about speed

The core of DEMA is about speed of the migration – examples included:

  • Government Client – 28 million documents – 10 TB – 10 days
  • Cable TV – 75 million documents – 5 TB – 15 days
  • Life Sciences – 1.5 million documents – 24 hours

The second:

Architected for the “fastest migration possible”, DEMA was constructed to avoid the Documentum API and go directly to the underlying database.

Thread communication – to be truly multi-threaded, the migration tool has to make sure that activities are divided up correctly by each thread to avoid conflicts.

It seems that the best reference is 5 million documents per day (57 documents per second), and this result was achieved by avoiding the Documentum API, so our 20 documents per second sounds reasonable because we are using the Documentum API, doesn’t it? Bullshit! The project durations in the first reference tell us nothing: suppose I spend one half of a day coding the migration tool and then load 10 million documents during the remaining half of the day. What ingestion rate did I achieve: 230 documents per second (10 million over the 12 hours of actual loading) or 115 (10 million over the whole day)? Moreover, EMC provides completely different and more promising rates:

migrates more than 1.2 million objects in one hour

That is about 330 documents per second. Now, in order to put a correct reference for ingestion rates on record, I would like to provide real-world results (4 CPU cores on Content Server):

  • single thread performance (no optimisation tricks) – 17 documents per second
  • single thread performance with optimised content transfer – 22 documents per second
  • 4 threads with optimised content transfer (50% CPU utilisation on CS) – 67 documents per second
  • 20 threads with optimised content transfer (80% CPU utilisation on CS) – 130 documents per second
  • 50 threads with optimised content transfer (100% CPU utilisation on CS) – 180 documents per second
  • 50 threads with optimised content transfer and disabled auditing (dm_save and dm_link events) (100% CPU utilisation on CS) – 250 documents per second

“optimised content transfer” means the following: in order to avoid content transfer between the feeder and Content Server, I made the content files available on the CS host, and instead of doing something like:

object.setContentType("format");
object.setFile("path to file"); // DFC reads the file and streams it to CS

I do the following:

object.setContentType("format");
// ISysObject and ITypedData are internal DFC interfaces; FILE_PATH must
// point to a file that is accessible from the Content Server host
ITypedData extendedData = ((ISysObject) object).getExtendedData();
extendedData.setString("HANDLE_CONTENT", "yes");
extendedData.setBoolean("FIRST_PAGE", true);
extendedData.setString("FILE_PATH", "path on content server");

In order to avoid redundant RPCs related to the calculation of default values, I do the following:

// fake TBO
/**
 * @author Andrey B. Panfilov <andrew@panfilov.tel>
 */
public class FakeSysObject extends DfSysObject {

    public FakeSysObject() {
        super();
    }

    // by default DFC calls
    // com.documentum.fc.client.DfPersistentObject.setDefaultValues,
    // which issues extra RPCs to compute default values and hurts
    // performance, so we override this behaviour and skip that step
    @Override
    protected void init() throws DfException {
        ReflectionUtils.invokeMethod(this, "setValues");
    }

}

// poisoning DFC with the fake TBO
registerTBO("docbase", "type_name", FakeSysObject.class);

protected void registerTBO(String docbaseName, String type, Class<?> clazz)
    throws DfException {
    // poisoning DocbaseRegistry: ClassCacheManager and
    // IntrinsicModuleRegistry are internal DFC classes
    ClassCacheManager.getInstance().setIsDownloading(true);
    IntrinsicModuleRegistry moduleRegistry = IntrinsicModuleRegistry
            .getInstance();
    ReflectionUtils.invokeMethod(moduleRegistry, "registerTBO", type,
            clazz.getName());
}

// populating default values in feeder
object.setString(....);
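ReflectionUtils above is not a DFC or standard class; it is just a small helper that invokes a declared (possibly private) method by name. A minimal sketch of such a helper, matching methods by name and argument count only:

import java.lang.reflect.Method;

public final class ReflectionUtils {

    private ReflectionUtils() {
    }

    // walks up the class hierarchy, finds a declared (possibly private)
    // method by name and argument count, and invokes it
    public static Object invokeMethod(Object target, String name,
            Object... args) {
        for (Class<?> clazz = target.getClass(); clazz != null;
                clazz = clazz.getSuperclass()) {
            for (Method method : clazz.getDeclaredMethods()) {
                if (method.getName().equals(name)
                        && method.getParameterTypes().length == args.length) {
                    try {
                        method.setAccessible(true);
                        return method.invoke(target, args);
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                }
            }
        }
        throw new IllegalArgumentException("no such method: " + name);
    }
}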

7 thoughts on “Ingestion rates”

  1. Hi, I read your solution with attention and I found it gorgeous.

    I have two questions.
    The first is about the modification to the content upload:

    object.setContentType("format");
    ITypedData extendedData = ((ISysObject) object).getExtendedData();
    extendedData.setString("HANDLE_CONTENT", "yes");
    extendedData.setBoolean("FIRST_PAGE", true);
    extendedData.setString("FILE_PATH", "path on content server");

    But do you use a standard content store, so the binary is then copied into the file store representation, or will it be accessible via an external file store?

    The second one is about the use of the TBO.

    What are the default values that you manually set?

    Thanks for any help


  2. But do you use a standard content store, so the binary is then copied into the file store representation, or will it be accessible via an external file store?

    It is the DFC analog of DQL’s “CREATE … object … SETFILE 'filepath' WITH CONTENT_FORMAT='format_name'”, i.e. in this case Content Server copies the files from a remote filesystem into local storage. The point was to avoid content transfer between DFC and Content Server.

    What are the default values that you manually set?

    That depends on your data model.


  3. I often read your blog. Very informative and helpful. Thanks for sharing all your findings.

    object.setFile("path to file"); the 'path to file' argument is limited to 255 characters in 6.x; I am not sure about 7.x.

    You might have already noticed it, but I thought of sharing it with you.

    Regards,
    Akbar


