How do I detect and diagnose Amazon EMR cluster issues?


I want to detect and resolve the most frequent errors or exceptions in my Amazon EMR clusters. I also want to use EMR logs that are on an Amazon Simple Storage Service (Amazon S3) location to further diagnose and troubleshoot my cluster.

Short description

To analyze Amazon EMR logs that are on an Amazon S3 location, use the AWSSupport-DiagnoseEMRLogsWithAthena AWS Systems Manager automation runbook with Amazon Athena. Use the runbook to identify what's causing the issue in your EMR cluster and resolve it.

Resolution

Run the automation runbook

Before you start the AWSSupport-DiagnoseEMRLogsWithAthena runbook, make sure that your AWS Identity and Access Management (IAM) user or role has the required permissions. To view the actions that the AutomationAssumeRole parameter requires to use the runbook, see AWSSupport-DiagnoseEMRLogsWithAthena.

To start the runbook, complete the following steps:

  1. Open AWSSupport-DiagnoseEMRLogsWithAthena in the Systems Manager console.
  2. Choose Execute automation.
  3. Enter the following values:
    AutomationAssumeRole: The Amazon Resource Name (ARN) of the IAM role that allows Systems Manager Automation to perform the actions on your behalf. If you don't specify a role, then Systems Manager Automation uses the permissions of the user that starts this runbook.
    ClusterID: Your Amazon EMR cluster ID.
    (Optional) S3LogLocation: The Amazon S3 location of the EMR logs. If the cluster has been shut down for more than 60 days, then provide this parameter.
    S3BucketName: The name of the Amazon S3 bucket where you receive the output of Athena queries. The bucket must have Block Public Access turned on. The bucket must be in the same AWS Region and AWS account as the cluster.
    Approvers: A list of AWS authenticated principals who can approve or reject the action.
    (Optional) FetchNodeLogsOnly: The default value is false. To automate diagnosis of the Amazon EMR application container logs, set the value to true.
    (Optional) FetchContainersLogsOnly: The default value is false. To automate diagnosis of the Amazon EMR container logs, set the value to true.
    (Optional) EndSearchDate: The end date for log searches.
    (Optional) DaysToCheck: If you set the EndSearchDate value, then DaysToCheck is required to determine the number of days to retrospectively search for logs. The maximum value is 30 days. 
    (Optional) SearchKeywords: A list of keywords to search in the logs, separated by commas.
    Note: Don't include single or double quotes in the keywords.
  4. Choose Execute.
  5. Review the detailed results in the Outputs section.
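
The console steps above can also be sketched with the AWS SDK for Python (Boto3). This is a minimal sketch under stated assumptions, not an official procedure: the helper names build_parameters and start_diagnosis are hypothetical, the example values are placeholders, and passing SearchKeywords as one comma-separated string is an assumption based on the parameter description. This article doesn't state the expected EndSearchDate format, so check the runbook reference before you set it.

```python
from typing import Dict, List, Optional

RUNBOOK_NAME = "AWSSupport-DiagnoseEMRLogsWithAthena"
MAX_DAYS_TO_CHECK = 30  # maximum stated in the parameter list above


def build_parameters(
    cluster_id: str,
    s3_bucket_name: str,
    approvers: List[str],
    assume_role_arn: Optional[str] = None,
    s3_log_location: Optional[str] = None,
    end_search_date: Optional[str] = None,
    days_to_check: Optional[int] = None,
    search_keywords: Optional[List[str]] = None,
) -> Dict[str, List[str]]:
    """Hypothetical helper: build the Parameters map for the runbook.

    Systems Manager Automation expects every parameter value as a list of strings.
    """
    if end_search_date is not None and days_to_check is None:
        raise ValueError("DaysToCheck is required when EndSearchDate is set")
    if days_to_check is not None and not 1 <= days_to_check <= MAX_DAYS_TO_CHECK:
        raise ValueError("DaysToCheck must be between 1 and 30")
    keywords = search_keywords or []
    if any("'" in k or '"' in k for k in keywords):
        raise ValueError("SearchKeywords must not contain single or double quotes")

    params: Dict[str, List[str]] = {
        "ClusterID": [cluster_id],
        "S3BucketName": [s3_bucket_name],
        "Approvers": list(approvers),
    }
    if assume_role_arn:
        params["AutomationAssumeRole"] = [assume_role_arn]
    if s3_log_location:
        params["S3LogLocation"] = [s3_log_location]
    if end_search_date:
        params["EndSearchDate"] = [end_search_date]
    if days_to_check is not None:
        params["DaysToCheck"] = [str(days_to_check)]
    if keywords:
        # Assumption: keywords are sent as a single comma-separated string.
        params["SearchKeywords"] = [",".join(keywords)]
    return params


def start_diagnosis(parameters: Dict[str, List[str]], region_name: str) -> str:
    """Start the runbook and return the automation execution ID."""
    import boto3  # imported here so build_parameters works without the SDK

    ssm = boto3.client("ssm", region_name=region_name)
    response = ssm.start_automation_execution(
        DocumentName=RUNBOOK_NAME,
        Parameters=parameters,
    )
    return response["AutomationExecutionId"]
```

For example, build_parameters("j-EXAMPLE", "my-athena-results", ["arn:aws:iam::111122223333:role/Approver"]) returns a map that you can pass to start_diagnosis together with the cluster's Region. The validation happens before any AWS call, so an invalid combination such as EndSearchDate without DaysToCheck fails locally.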

The output provides links to the following Athena Data Manipulation Language (DML) query results:

  • All errors and exceptions in the Amazon EMR cluster logs, along with the corresponding log locations.
  • A summary of unique known exceptions that are matched in the Amazon EMR logs.
  • Where specific errors and exceptions appear in the Amazon S3 log paths.
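
If you started the runbook programmatically, then you can read the same outputs with the Boto3 get_automation_execution call. The following is a minimal sketch: the helper names collect_output_values and fetch_outputs are hypothetical, and this article doesn't list the runbook's output key names, so the code returns whatever non-empty outputs the execution reports.

```python
from typing import Dict, List


def collect_output_values(execution: dict) -> Dict[str, List[str]]:
    """Hypothetical helper: keep only the non-empty outputs of an execution.

    `execution` is a get_automation_execution response, where Outputs maps
    each output name to a list of strings.
    """
    outputs = execution.get("AutomationExecution", {}).get("Outputs", {})
    return {name: values for name, values in outputs.items() if values}


def fetch_outputs(execution_id: str, region_name: str) -> Dict[str, List[str]]:
    """Fetch the outputs of a finished automation execution."""
    import boto3  # imported here so collect_output_values works without the SDK

    ssm = boto3.client("ssm", region_name=region_name)
    response = ssm.get_automation_execution(AutomationExecutionId=execution_id)
    status = response["AutomationExecution"]["AutomationExecutionStatus"]
    if status != "Success":
        # The runbook includes an approval step, so the execution can stay
        # in a waiting state until an approver responds.
        raise RuntimeError(f"Automation is not complete yet (status: {status})")
    return collect_output_values(response)
```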

Troubleshoot automation runbook issues

Take the following actions:

  • If the underlying Athena DML query times out because the cluster log size is larger than the default setting, then the automation might fail. To resolve this issue, increase the DML query timeout on the Athena Service Quotas console. Then, run the automation again.
  • If you terminated the cluster more than 60 days ago, then the runbook can't describe the cluster or fetch the Amazon S3 log location. To resolve this issue, confirm that you entered both the ClusterID and S3LogLocation parameters for the cluster.
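
To avoid the second failure before you start the runbook, you can check the 60-day condition from your own records. The following is a minimal sketch: the function name requires_s3_log_location is hypothetical, and it assumes that you already know the termination time as a timezone-aware datetime, because the cluster can no longer be described after that period.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Period after which the runbook can't look up the cluster, per this article.
LOOKUP_WINDOW = timedelta(days=60)


def requires_s3_log_location(
    terminated_at: Optional[datetime],
    now: Optional[datetime] = None,
) -> bool:
    """Return True when S3LogLocation must be supplied explicitly.

    `terminated_at` must be timezone-aware; pass None for a running cluster.
    """
    if terminated_at is None:
        return False  # a cluster that is still running can be described
    current = now or datetime.now(timezone.utc)
    return current - terminated_at > LOOKUP_WINDOW
```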

Related information

AWS Support Automation Workflows (SAW)

Run an automation

Setting up Automation

AWS OFFICIAL Updated 23 days ago