
Overview

This page is intended for CoprHD developers working with new or existing workflows. It aims to provide a reference of Do's and Don'ts for producing a clean workflow.  The material here is based on findings from the Rollback Quality project and, as such, also covers recommended testing strategies using a failure injection framework.

Recommended reading: Workflow Service

Design Patterns (Do's)

This section describes various design patterns to consider for your workflows.

Task Completer Best Practices

The following is a set of recommended patterns to follow when implementing or updating TaskCompleter instances.

Allow early access to TaskCompleters

This is especially important for rollback workflow steps, where the rollback method is a wrapper around one or more existing "happy-path" workflow methods that perform the inverse operation.  For example, "createVolumes" is rolled back by "rollBackCreateVolumes", which in turn delegates to a method for deleting volumes.

/**
 * {@inheritDoc} NOTE NOTE: The arguments here must match deleteVolumesMethod defined above (except opId).
 */
@Override
public void deleteVolumes(URI systemURI, List<URI> volumeURIs, String opId) throws ControllerException {
    MultiVolumeTaskCompleter completer = new MultiVolumeTaskCompleter(volumeURIs, opId);
    deleteVolumesWithCompleter(systemURI, volumeURIs, completer);
}

@Override
public void rollBackCreateVolumes(URI systemURI, List<URI> volumeURIs, String opId) throws ControllerException {
    MultiVolumeTaskCompleter completer = new MultiVolumeTaskCompleter(volumeURIs, opId);
    ... 
    deleteVolumesWithCompleter(systemURI, volumeURIs, completer);
    ...
}

/**
 * Deletes the given volumes with an existing task completer.
 *
 * @param systemURI     Storage system URI
 * @param volumeURIs    List of Volume URIs
 * @param completer     Task completer
 * @throws ControllerException
 */
public void deleteVolumesWithCompleter(URI systemURI, List<URI> volumeURIs, MultiVolumeTaskCompleter completer)
        throws ControllerException {
    ...

Notes on the code above:

  • #deleteVolumes is the "happy-path" workflow step method for a delete request. Because the delete operation is also shared with a rollback operation, we simply create the relevant task completer here and delegate.
  • #rollBackCreateVolumes is the rollback workflow method for a failed create request. It also creates an early instance of the relevant task completer and delegates.
  • Finally, #deleteVolumesWithCompleter no longer has to create its own task completer. The completer passed in may come from either a "happy-path" or a rollback context.


The intended benefits of this pattern are to provide a single task completer instance for all scenarios of an operation, and to help ensure that a task is completed exactly once - or, if an asynchronous thread has assumed responsibility, not completed at all in the current thread (see the Asynchronous Steps section).

Responsibilities of a TaskCompleter

Completing a TaskCompleter instance is an opportune time to perform any database operations that:

  • Establish or destroy relationships between objects.
  • Create, update or deactivate objects.

Rather than having these database operations strewn across the course of a workflow step, the TaskCompleter can serve as a single, convenient location to assume these responsibilities.
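
For illustration, here is a minimal sketch of a completer that owns these database updates. This is not an existing CoprHD class; the class name, constructor and helper calls are assumptions for the example, and the override follows the complete(DbClient, Operation.Status, ServiceCoded) shape seen in the VolumeDeleteCompleter snippet in the next section:

// Hypothetical sketch only - names and constructor are illustrative.
public class ExampleVolumeDeleteCompleter extends TaskCompleter {

    private final URI volumeURI;

    public ExampleVolumeDeleteCompleter(URI volumeURI, String opId) {
        // Assumes the TaskCompleter(Class, URI, String) constructor.
        super(Volume.class, volumeURI, opId);
        this.volumeURI = volumeURI;
    }

    @Override
    protected void complete(DbClient dbClient, Operation.Status status, ServiceCoded coded) {
        Volume volume = dbClient.queryObject(Volume.class, volumeURI);
        switch (status) {
            case ready:
                // Destroy relationships and deactivate the object in one place,
                // rather than scattering these updates across the workflow step.
                volume.setConsistencyGroup(NullColumnValueGetter.getNullURI());
                dbClient.updateObject(volume);
                dbClient.markForDeletion(volume);
                dbClient.ready(Volume.class, volumeURI, getOpId());
                break;
            case error:
                dbClient.error(Volume.class, volumeURI, getOpId(), coded);
                break;
            default:
                break;
        }
    }
}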

Detect and report rollback failures

Failures that occur during rollback have the potential to leave behind provisioned resources and should be reported to the user.  For example, a volume creation rollback may fail to remove a volume, in which case it should complete the task with an error state:

// VolumeDeleteCompleter#complete
...
switch (status) {
    case error:
        if (isRollingBack() && (coded instanceof ServiceError)) {
            ServiceError error = (ServiceError) coded;
            String originalMessage = error.getMessage();
            String additionMessage = "Rollback encountered problems cleaning up " +
                    volume.getNativeGuid() + " and may require manual clean up";
            String updatedMessage = String.format("%s\n%s", originalMessage, additionMessage);
            error.setMessage(updatedMessage);
        }
        dbClient.error(Volume.class, volume.getId(), getOpId(), coded);
        break;
        ...

In order to detect a failure during rollback, we can use TaskCompleter#isRollingBack in the error case. We can then update the service code message accordingly.


Complete only once

A task completer can be asked about its completion state via TaskCompleter#isCompleted.  This property is set to true in TaskCompleter's #error and #ready methods, so it is important to ensure that the chain of "super" calls within the TaskCompleter inheritance hierarchy reaches these implementations if the property is to be relied upon.  Alternatively, a sub-class may set the property directly.

private void handleException(Exception e, TaskCompleter taskCompleter) {
    _log.error("Handling exception with task completer: {}", taskCompleter, e);
    if (taskCompleter != null && (taskCompleter.isCompleted() || taskCompleter.isAsynchronous())) {
        _log.warn("Task has been marked as either asynchronous or completed.  Not performing any error handling.");
        return;
    }
    ...

Note that although it is possible to write a TaskCompleter in an idempotent manner, the ability to query the completed state may still be useful when the TaskCompleter instance is being passed into other methods with uncertain exception handling.
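
To make the "chain of super calls" point concrete, here is a minimal sketch of a sub-class overriding the public entry points. The error/ready signatures are assumed from the usages shown elsewhere on this page:

// Hypothetical sketch only - the point is to delegate upwards so that
// TaskCompleter marks the task as completed (isCompleted() == true).
@Override
public void error(DbClient dbClient, ServiceCoded serviceCoded) {
    // Sub-class specific bookkeeping can go here...
    // ...but always delegate so the completed flag is set exactly once.
    super.error(dbClient, serviceCoded);
}

@Override
public void ready(DbClient dbClient) {
    super.ready(dbClient);
}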

Asynchronous Steps

Relinquishing TaskCompleter responsibilities

Workflow steps that hand off their task completer to an asynchronous job should immediately relinquish the responsibility of completing the task completer, once the job has been successfully queued.

After an asynchronous job has been queued, there may be subsequent code to be executed that is subject to failure.  In the event of an exception being thrown, an exception handler must be able to recognize if an asynchronous job has assumed task responsibility, otherwise the task may be completed whilst the asynchronous job is still queued or executing.  This can result in a running command from the CLI exiting prematurely, or an order page in the UI incorrectly informing the user a workflow has completed.

A task completer can be asked about its asynchronous state via TaskCompleter#isAsynchronous.  This property is automatically set to true when an associated job is placed into the Dispatcher queue.  Surrounding code should then complete the task conditionally, based on the result of TaskCompleter#isAsynchronous.

Example

The following example is taken from an SMI-S delete volume scenario, where an asynchronous job tracks the running operation on the SMI-S provider.

// SmisStorageDevice#doDeleteVolumes
_helper.invokeMethod(storageSystem, configSvcPath,
        returnElementsMethod, inArgs, outArgs);
CIMObjectPath job = _cimPath.getCimObjectPathFromOutputArgs(outArgs,
        SmisConstants.JOB);
if (job != null) {
    ControllerServiceImpl.enqueueJob(new QueueJob(new SmisDeleteVolumeJob(job,
            storageSystem.getId(), taskCompleter)));
}

An SmisDeleteVolumeJob instance is created, and taskCompleter is passed into the constructor. The job is then queued.

// ControllerServiceImpl.java
public static void enqueueJob(QueueJob job) throws Exception {
    _jobQueue.put(job);
    job.getJob().getTaskCompleter().setAsynchronous(true);
}

After successfully queuing without any exceptions, the job's associated task completer is automatically marked as asynchronous for convenience.

// SmisStorageDevice#doDeleteVolumes
} catch (Exception e) {
    _log.error("Problem in doDeleteVolume: ", e);
    // Check to see if an Asynchronous job will now handle the task status.
    if (!taskCompleter.isAsynchronous()) {
        ServiceError error = DeviceControllerErrors.smis.methodFailed("doDeleteVolume",
                e.getMessage());
        taskCompleter.error(_dbClient, error);
    }
}

In the exception catch block, we can now conditionally complete the task completer. If it has been marked as asynchronous, it is the asynchronous job's responsibility to handle the task status.


Rollback Context

During workflow execution, each step has the option to create a context object associated with itself.  The context object is an arbitrary object that implements the Serializable interface and is persisted in Zookeeper.
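
As a minimal sketch, a step can persist such a context keyed by its step/task ID, and the rollback step can load it later. WorkflowService#loadStepData appears in the snippets below; the matching storeStepData call and the local variables here (taskCompleter, existingInitiatorURI, candidateInitiatorURI) are assumptions for the example:

// Forward workflow step: record what was (or was not) actually done by ViPR.
// An ArrayList is used because the context must be Serializable.
List<URI> initiatorsNotCreatedByViPR = new ArrayList<URI>();
initiatorsNotCreatedByViPR.add(existingInitiatorURI);
WorkflowService.getInstance().storeStepData(taskCompleter.getOpId(), initiatorsNotCreatedByViPR);

// Rollback step: load the context and use it to decide whether the
// inverse operation is safe to perform.
@SuppressWarnings("unchecked")
List<URI> doNotRemove = (List<URI>) WorkflowService.getInstance().loadStepData(taskCompleter.getOpId());
if (doNotRemove != null && doNotRemove.contains(candidateInitiatorURI)) {
    // Leave this initiator alone - ViPR did not create it.
}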

A context object is useful when a rollback needs more data about what it should or should not do in order to prevent data unavailability (DU).  For example, consider part of a scenario where a request is made to export a Volume to a Host:

  1. An Initiator is to be added to an initiator group, but it is found to already exist there, unbeknownst to the database.
  2. An exception occurs in the last step of the workflow, triggering a rollback.
  3. A rollback step associated with adding the Initiator performs the inverse remove operation.

Here, the rollback step needed more data about the Initiator before it went ahead and removed it; without that data it removed an Initiator that ViPR did not add, causing a DU.  The following code snippets demonstrate how a context object could have prevented this:

/**
 * Check if any of the requested Initiators have sneaked into the masking view from outside of ViPR, if so we'll add them
 * to the task completer (for the database update) and record an operation stating that we found it
 * already existed (in case a rollback attempts to try and remove it).
 *
 * @param storage       StorageSystem
 * @param mask          ExportMask
 * @param initiatorList List of requested Initiator
 * @param taskCompleter Task completer
 * @throws Exception
 */
private void handleExistingInitiators(StorageSystem storage, ExportMask mask, List<Initiator> initiatorList,
                                      TaskCompleter taskCompleter) throws Exception {
    ...
    // For each matching initiator, we can simply record that it was already found in the mask.
    for (String preExistingInit : preExistingInits) {
        Initiator initiator = idToInit.get(preExistingInit);
        ((ExportMaskInitiatorCompleter) taskCompleter).addInitiator(initiator.getId());
        // In case of rollback, ensure removeInitiators does not remove this initiator since ViPR did not add it.
        ExportOperationContext.insertContextOperation(taskCompleter,
                VmaxExportOperationContext.OPERATION_ADD_EXISTING_INITIATOR_TO_EXPORT_GROUP, initiator.getId());
    }
}

As the method name suggests, this handles existing Initiators in the given ExportMask as part of an export operation. When an Initiator is found to already exist, a context object is added to the step data stating exactly that.

Now, when the related rollback step executes, it can retrieve this context and decide to leave the Initiator alone:

// Get the context from the task completer, in case this is a rollback.
ExportOperationContext context = (ExportOperationContext) WorkflowService.getInstance().loadStepData(
        taskCompleter.getOpId());


if (context.getOperations() != null) {
    WBEMClient client = _helper.getConnection(storage).getCimClient();
    ListIterator li = context.getOperations().listIterator(context.getOperations().size());
    while (li.hasPrevious()) {
        ExportOperationContextOperation operation = (ExportOperationContextOperation) li.previous();
        switch (operation.getOperation()) {
            case VmaxExportOperationContext.OPERATION_ADD_EXISTING_INITIATOR_TO_EXPORT_GROUP:
               URI initiator = (URI) operation.getArgs().get(0);
               _log.info("Not removing initiator: {} because it already existed in the masking view",
                  initiator);
               // Ensure the task completer does not remove it from ViPR ExportMask/Group.
               ((ExportMaskRemoveInitiatorCompleter) taskCompleter).removeInitiator(initiator);
               break;
               ...

Notes on the code above:

  • Load the step data using the task ID.
  • Iterate over the recorded operations (in reverse order), casting each to the appropriate class.
  • The context contains an operation stating that the Initiator already existed, so the decision is made to leave it alone.
  • In this particular example, we also remove the Initiator from the task completer to prevent it from being marked as inactive.


Anti-Patterns (Don'ts)

A standard list of bad practices based on our findings.

Testing

How to use the failure injection framework and where failures should be invoked.


2 Comments

  1. This is a start. I've put some specific thoughts about the scenarios you added above. I'm going to think about and poke you with potential new Patterns/Anti-Patterns ... this may be an iterative process as I think about them.

    1. One pattern Bill established which should be in here concerns the creation of sub-workflows. A sub-workflow is created from inside a parent workflow step as it executes. Because the workflow completer of the sub-workflow matches (i.e. is exactly equal to) the parent workflow's step id, when the sub-workflow step completes execution, the parent step is automatically marked as completed. This leads to a few necessary patterns:
      1. A step creating sub-workflows should only create one sub-workflow. Because the sub-workflow's task id is the parent's step id, if there were multiple sub-workflows, the first one to finish would trigger the parent to continue, even though the other sub-workflows had not completed.
      2. If vipr crashes, both the parent step, and potentially steps in the workflow will be restarted by the Dispatcher. In order to prevent creation of a duplicate workflow, Bill has introduced a pattern of calling WorkflowService.hasWorkflowBeenCreated to determine this. The code creating the workflow should call WorkflowService.markWorkflowBeenCreated immediately after the return from executing the workflow.
    2. One anti-pattern I would like to see described is persisting something before it is provisioned on the hardware. Here are some things that come to mind:
      1. Volumes are pre-created in apisvc. We need an explanation about the right way to clean these up if volume creation fails.
      2. Export Masks are persisted as part of refresh. Sometimes we see duplicate copies of the same export mask in a database. How do we absolutely prevent this?
      3. Something I thought of over the weekend - the zoningMap historically has been persisted just after its creation in the port allocator/assigner. However, its persistence doesn't mean that it was actually provisioned on the hardware, for example. I think we need to talk about what the correct pattern is for this.
      4. The VPLEX back end manager will create multiple Export Groups / Export Masks up front for a VPLEX/VMAX combination. We should make a plan for evolving this.
    3. Does it even make sense for deprovisioning workflows to have rollback? Examples would be an order to remove volumes or remove exports. Is rollback necessary / desirable for these, and how should it be constructed?
    4. WorkflowService getNewWorkflow() has a boolean "rollbackContOnError". Should this ever be set to true? Under what circumstances? If it is set that way, a failure in the rollback workflow will not cause the workflow to terminate; rather, it will go on to the next steps until it has at last tried to roll back all the steps.
    5. It seems to me that deprovisioning operations need to be idempotent. For example, if I'm deleting a vplex volume on top of a vmax volume, and the vmax volume deletion fails, I should be able to retry the order, and even though the vplex volume is no longer present, it should not cause an error, rather just success for that step. Then the vmax volume deletion step would be retried. I'm not sure Bill agrees, as he indicated manual cleanup could be necessary in this scenario. 

  2. TSC recommends scheduling a lunch-n-learn on this topic at a time when most developers can attend live.

    Please discuss with Bill/Tom