Batch Apex on Force.com And Handling Large Data Volumes

Discover methods to prevent timeouts in the SOQL queries used to select records for processing. Achieving this objective maximizes the potential of the batch Apex framework, specifically tailored for handling substantial data volumes.

Salesforce developers, have you encountered this scenario? Working with extensive data volumes, you’ve opted for batch Apex because of Salesforce’s multi-tenant architecture and governor limits. Yet when you try to process the records, the jobs get aborted.

This blog post delves into solutions to prevent timeouts in the SOQL queries used for record selection, so that you can take full advantage of a framework designed specifically for managing substantial data volumes.

Batch Apex Architecture

Batch Apex allows you to run complex, long-running processes on the Force.com platform, and a batch job typically goes through the following general steps (sketched in the class skeleton after this list).

  • The start method of a batch Apex class collects the records or objects that will be processed. Using well-performing SOQL in the start() method reduces the chances of a job being aborted.
  • The execute method picks up batches of these records and does the bulk of the processing.
  • After the batches are processed, the finish method sends confirmation emails or executes post-processing operations.
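
As a point of reference, a minimal batch Apex class skeleton might look like the following; the class name, query, and processing logic are illustrative only.

global class AccountBatchJob implements Database.Batchable<SObject> {

    global Database.QueryLocator start(Database.BatchableContext bc) {
        // Collect the records to process; a well-performing query here
        // reduces the chance of the job being aborted.
        return Database.getQueryLocator('SELECT Id FROM Account');
    }

    global void execute(Database.BatchableContext bc, List<Account> scope) {
        // Process one batch of records.
        for (Account acc : scope) { /* do something */ }
    }

    global void finish(Database.BatchableContext bc) {
        // Post-processing, such as sending a confirmation email.
    }
}

A job based on this class would be started with Database.executeBatch(new AccountBatchJob()).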

Tuning the SOQL in the start Method

Optimizing the SOQL in the start method is key to avoiding timeouts of a batch Apex job. You might want to process millions of records, but the job gets aborted when the query runs for longer than 2 minutes. How can you avoid this scenario?

Selective Filters
The first step is to ensure that the query is selective and leverages indexes.

Force.com’s query optimizer checks the filters in the query to determine if an index will be used.

An index is used with | When
Standard indexed fields | The filter matches less than 30% of the total records, up to 1 million records
Custom indexed fields | The filter matches less than 10% of the total number of records, up to 333,333 records
The AND clause | At least one filter meets the conditions in this table for the appropriate type of indexed fields*
The OR clause | All fields must be indexed and meet the conditions in this table for the appropriate type of indexed fields*

* For AND and OR clauses, Force.com’s query optimizer applies several algorithms and creates runtime joins to decide if an index is used. The table features baselines which, when met, will result in an index being used.

Consider the following query.

SELECT Id FROM Account WHERE CreatedDate > 2013-01-01T00:00:00Z AND Industry = 'Banking' AND SLA__c IN ('Gold', 'Platinum')

Start by checking the total record count of accounts, and remember that this count includes soft-deleted records, which are records that have been deleted but not yet physically purged from the database. Evaluate each filter individually to determine if it is selective.

SELECT count() FROM Account WHERE CreatedDate  > 2013-01-01T00:00:00Z

SELECT count() FROM Account WHERE Industry = 'Banking'

SELECT count() FROM Account WHERE SLA__c IN ('Gold', 'Platinum')

A standard index already exists on the CreatedDate field, and including this filter to process smaller time ranges is recommended. The other two fields would require salesforce.com Customer Support to create custom indexes. For the query to use an index, either the first filter must return fewer than 1 million records, or one of the two remaining filters must return fewer than 333,333 records.

If a filter is selective and isn’t already indexed, make the field an External Id or contact salesforce.com Customer Support. Customer Support can create single- and two-column custom indexes on most fields, with the exception of multi-picklist fields, some formula fields, and long text area and rich text area fields.

If the query is still not selective after you create custom indexes, consider whether additional criteria can be added. In most cases, adding more selective filters or splitting the processing across multiple jobs allows the start method query to complete. In the previous example, we could create multiple jobs with different values for the SLA__c filter and smaller ranges for the CreatedDate filter, as sketched below.
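
One way to split the work, sketched below with a hypothetical AccountSliceBatch class, is to parameterize the batch class and launch one job per SLA__c value and CreatedDate range; the field names come from the example query above.

global class AccountSliceBatch implements Database.Batchable<SObject> {
    private final String slaValue;
    private final Datetime rangeStart;
    private final Datetime rangeEnd;

    global AccountSliceBatch(String slaValue, Datetime rangeStart, Datetime rangeEnd) {
        this.slaValue = slaValue;
        this.rangeStart = rangeStart;
        this.rangeEnd = rangeEnd;
    }

    global Database.QueryLocator start(Database.BatchableContext bc) {
        // Each job handles a single SLA value and a narrower CreatedDate range,
        // which keeps the filters selective.
        return Database.getQueryLocator([
            SELECT Id FROM Account
            WHERE Industry = 'Banking'
            AND SLA__c = :slaValue
            AND CreatedDate >= :rangeStart AND CreatedDate < :rangeEnd
        ]);
    }

    global void execute(Database.BatchableContext bc, List<Account> scope) {
        for (Account acc : scope) { /* do something */ }
    }

    global void finish(Database.BatchableContext bc) {}
}

// Launch one job per SLA value over a smaller CreatedDate window (illustrative values).
Datetime windowStart = Datetime.newInstanceGmt(2013, 1, 1);
Datetime windowEnd = Datetime.newInstanceGmt(2013, 4, 1);
Database.executeBatch(new AccountSliceBatch('Gold', windowStart, windowEnd));
Database.executeBatch(new AccountSliceBatch('Platinum', windowStart, windowEnd));

Because the platform limits how many batch jobs can run at once, you may need to stagger the slices, for example by chaining the next job from the finish method.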

When Selective Filters Aren’t Enough

When dealing with large data volumes, having selective filters might not be enough to avoid timeouts. Suppose you have 5 million accounts, and you want to process 80% of the records represented by the following query. Assume that additional filters cannot be added, and that attempts to use multiple jobs have not succeeded.

global Database.QueryLocator start(Database.BatchableContext bc) {
    return Database.getQueryLocator(
        'SELECT Id, Name, Description FROM Account ' +
        'WHERE Rating__c IN (\'Excellent\', \'Good\', \'Moderate\')');
}

global void execute(Database.BatchableContext bc, List<Account> scope) {
    for(Account acc : scope) { /* do something */ }
}

Because you are trying to query for 4 million of the 5 million accounts, there is no way to use a selective filter to stay under the 10% of total records (up to 333,333 records) threshold, even if Rating__c is indexed.

As an alternative, you will see improved performance by removing the filters completely from the start method and filtering out the unwanted records in the execute method. Only consider this technique after attempting to tune the query and contacting salesforce.com Customer Support, and only when timeouts are occurring because of large data volumes. Outside of this use case, this technique will likely perform worse than a properly optimized query would.

global Database.QueryLocator start(Database.BatchableContext bc) {
    return Database.getQueryLocator('SELECT Id FROM Account');
}

global void execute(Database.BatchableContext bc, List<Account> scope) {
    List<Account> actualScope = [SELECT Id, Name, Description FROM Account
        WHERE Rating__c IN ('Excellent', 'Good', 'Moderate')
        AND Id IN :scope];

    for(Account acc : actualScope) { /* do something */ }
}

Note that the selection of the Name and Description fields has been moved to the SOQL in the execute method, and that this tactic neither improves nor degrades the performance of the query in the start method. For any job, no matter which fields are in the SELECT clause of the start method's query, the batch Apex framework selects only the Id field before creating batches of records to be processed in the execute method.

The query results in a full-table scan, and although this solution is counterintuitive, it has helped in some scenarios involving timeouts due to large data volumes. With that said, only consider this technique after working with salesforce.com Customer Support to tune the query.
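
For completeness, a job built this way is launched like any other batch job. In the sketch below, AccountRatingBatch is a hypothetical class containing the start and execute methods shown above, and the optional second argument to Database.executeBatch controls the scope size.

// The scope size defaults to 200 and can be raised to a maximum of 2,000 when the
// start method returns a QueryLocator; larger scopes mean fewer execute() calls,
// and therefore fewer of the re-queries shown above.
Id jobId = Database.executeBatch(new AccountRatingBatch(), 2000);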

Summary

The performance of batch Apex jobs depends on factors such as data sharing, large data volumes, the nature of the records being processed, and how records are segmented across jobs. However, no processing can begin until the SOQL in the start() method completes, so it’s critical to write SOQL queries that can handle growing data volumes and avoid timeouts.

If timeouts become a challenge, Salesforce Customer Support can assist by:

  • Optimizing the SOQL within the start method and potentially creating custom indexes.
  • Dividing the job into smaller segments, with segmentation by CreatedDate being a recommended approach.
  • Selecting Ids without filters, but only as a last resort if the previous solutions fail to address the timeouts.