Have never thought that my colleagues may teach me something…
Yesterday I asked my colleague, who is trying to improve his skills in performance optimisation, whether he had any idea how to improve this SQL statement:
SELECT ALL dm_folder.r_object_id FROM dm_folder_sp dm_folder WHERE ( EXISTS (SELECT r_object_id FROM dm_folder_r WHERE dm_folder.r_object_id = r_object_id AND r_folder_path = :"SYS_B_00") AND (dm_folder.object_name = :"SYS_B_01")) AND ( dm_folder.i_has_folder = :"SYS_B_02" AND dm_folder.i_is_deleted = :"SYS_B_03")
and, surprisingly, the answer was: “Yes, I have seen something similar on support site – EMC suggest to set dfc.query.should_include_object_name and dfc.query.should_include_object_name properties, something like:
Well, as was expected both dfc.query.should_include_object_name and dfc.query.should_include_object_name properties are not documented, so let discuss the problem more thoroughly.
Imagine that we are maintaining following folder structure in our docbase:
\_CLIENT_1 \_CLAIMS \_INVOCES \_CLINET_2 \_CLAIMS \_INVOCES ... \_CLIENT_X \_CLAIMS \_INVOCES
i.e. for every client we create the same folder structure and when we want to store invoice for particular client we do something like:
create,c,dm_document set,c,l,object_name xxx link,c,/CLIENTS/CLINET_1/INVOICES save,c,l
the problem is that upon link call DFC calls IDfSession#getFolderByPath method to retrieve folder object with particular path, and inside IDfSession#getFolderByPath method DFC does following: it cuts off object name part from the path (i.e. everything after last ‘/’) and builds following DQL query:
SELECT r_object_id FROM dm_folder WHERE object_name='INVOICES' AND ANY r_folder_path='/CLIENTS/CLINET_1/INVOICES'
such implementation is bit weird for two reasons:
- when I do the same I just write something like “retrieve,c,dm_folder where any r_folder_path=”…” and do not bother myself about object name
- Content Server has a build-in FolderIdFindByPath RPC command:
API> apply,c,,FolderIdFindByPath,_FOLDER_PATH_,S,/dmadmin ... q0 API> next,c,q0 ... OK API> get,c,q0,result ... 0c01d92080000105 API> close,c,q0 ... OK
which generates following effective SQL statement:
select r_object_id from dm_folder_r where r_folder_path = :p0
so, I have no idea why DFC performs extra logic here, moreover, in case of current DFC implementation we are getting overcomplicated SQL query and, sometimes database engine fails to build a good execution plan for this query (this is caused by dumb recommendation to set CURSOR_SHARING database parameter to FORCE and depending on docbase structure execution of such query may take minutes). Below are two possible execution plans for this query:
good (dm_folder_r is a leading table – querying dm_folder_r table by r_folder_path will always return not more than one row):
and bad (dm_folder_r is not a leading table – imagine that we have 1 million clients and hence 1 million INVOICE folders, so querying dm_sysobjec_s table by object_name first will return 1 million records):
in case of “retrieve,c,dm_folder where any r_folder_path=”…” execution plan is always good:
In 2011 (if my memory serves me right), I solved such performance problem by marking index on dm_folder_r(r_folder_path) as unique – in this case database engine always builds the correct execution plan because it knows that querying dm_folder_r table will always return not more than one row, however in recent versions DFC it is possible to disable it’s dumb behaviour by setting dfc.query.should_include_object_name and dfc.query.should_include_object_name properties – can’t understand why this wasn’t enabled by default.