  Samza / SAMZA-2491

AM should log uncaught exceptions and System.exit to ensure that the process dies on errors

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5
    • Component/s: None
    • Labels: None

    Description

      From: pmaheshw

      Symptom: A job deployment timed out waiting for the application attempt to transition from New to Running.

      Cause: ClusterBasedJobCoordinator threw an exception during startup due to a misconfiguration, but did not kill the AM process (likely due to non-daemon threads).
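      To check whether lingering non-daemon threads are what keeps a JVM alive after main has failed, one option is to list them just before main returns. The following is a minimal sketch in plain Java; the class name NonDaemonThreads is hypothetical and is not part of Samza.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical diagnostic helper; not part of Samza. */
final class NonDaemonThreads {
  private NonDaemonThreads() { }

  /** Returns the sorted names of live non-daemon threads other than the calling thread. */
  static List<String> names() {
    Thread current = Thread.currentThread();
    // Thread.getAllStackTraces() is keyed by every live thread in the JVM.
    return Thread.getAllStackTraces().keySet().stream()
        .filter(t -> t.isAlive() && !t.isDaemon() && t != current)
        .map(Thread::getName)
        .sorted()
        .collect(Collectors.toList());
  }
}
```

      Any thread reported by names() will keep the process running after main returns unless it is stopped or the process exits explicitly.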

      Suggested fixes:
      1. ClusterBasedJobCoordinator#main doesn't install an uncaught exception handler, and doesn't catch and log exceptions thrown from the ClusterBasedJobCoordinator constructor or from run(). We should fix this. Uncaught exceptions go to stderr instead of the logs and have no timestamp, which makes debugging difficult. E.g.:

      Exception in thread "main" org.apache.samza.SamzaException: Cannot get systemAdmin for system aggregate-tracking
      at org.apache.samza.system.SystemAdmins.getSystemAdmin(SystemAdmins.java:63)
      at org.apache.samza.system.StreamMetadataCache$$anonfun$3.apply(StreamMetadataCache.scala:66)
      at org.apache.samza.system.StreamMetadataCache$$anonfun$3.apply(StreamMetadataCache.scala:64)
      at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
      at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
      at scala.collection.immutable.Map$Map2.foreach(Map.scala:137)
      at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
      at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
      at org.apache.samza.system.StreamMetadataCache.getStreamMetadata(StreamMetadataCache.scala:64)
      at org.apache.samza.coordinator.StreamPartitionCountMonitor.getMetadata(StreamPartitionCountMonitor.java:92)
      at org.apache.samza.coordinator.StreamPartitionCountMonitor.<init>(StreamPartitionCountMonitor.java:113)
      at org.apache.samza.clustermanager.ClusterBasedJobCoordinator.getPartitionCountMonitor(ClusterBasedJobCoordinator.java:343)
      at org.apache.samza.clustermanager.ClusterBasedJobCoordinator.<init>(ClusterBasedJobCoordinator.java:207)
      at org.apache.samza.clustermanager.ClusterBasedJobCoordinator.main(ClusterBasedJobCoordinator.java:441)

      2. The JC (job coordinator) should call System.exit on returning from main (cleanly or on exception) and from the uncaught exception handler, to ensure that the AM process dies on these errors and does not leave the deployment hanging. We've also seen this issue due to client libraries (datavault, brooklin, kafka etc.) creating non-daemon threads and not stopping them cleanly. See LocalContainerRunner for reference, which does kill the process on returning from the main thread. E.g., in this case it's threads like this:

      "AsyncHttpClient-27-1" #134 prio=5 os_prio=0 tid=0x00007faead675000 nid=0x4151 runnable [0x00007fae9c9da000]
      java.lang.Thread.State: RUNNABLE
      at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
      at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
      at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
      at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
      - locked <0x00000000fe6a2f40> (a com.linkedin.mario.shaded.io.netty.channel.nio.SelectedSelectionKeySet)
      - locked <0x00000000fe6fe9c0> (a java.util.Collections$UnmodifiableSet)
      - locked <0x00000000fe6a3f68> (a sun.nio.ch.EPollSelectorImpl)
      at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
      at com.linkedin.mario.shaded.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
      at com.linkedin.mario.shaded.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:824)
      at com.linkedin.mario.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:457)
      at com.linkedin.mario.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
      at com.linkedin.mario.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
      at com.linkedin.mario.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
      at java.lang.Thread.run(Thread.java:748)
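
      Taken together, the two suggested fixes could look roughly like the sketch below. It is not the actual patch: CoordinatorLauncher, its Coordinator interface, and the injectable exitFn parameter are hypothetical names introduced for illustration, and java.util.logging stands in for whatever logger the job coordinator really uses so that the example is self-contained.

```java
import java.util.function.IntConsumer;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical launcher illustrating the two suggested fixes; not Samza's actual code. */
final class CoordinatorLauncher {
  private static final Logger LOG = Logger.getLogger(CoordinatorLauncher.class.getName());

  private CoordinatorLauncher() { }

  /** Stand-in for constructing the coordinator and calling run(). */
  interface Coordinator {
    void run() throws Exception;
  }

  /**
   * Runs the coordinator and always reports an exit code through exitFn,
   * so that lingering non-daemon threads cannot keep the process alive.
   */
  static void launch(Coordinator coordinator, IntConsumer exitFn) {
    // Fix 1: failures on any other thread are logged (with a timestamp) instead of
    // only reaching stderr. Fix 2: they also terminate the process.
    Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
      LOG.log(Level.SEVERE, "Uncaught exception in thread " + thread.getName(), error);
      exitFn.accept(1);
    });

    int exitCode = 0;
    try {
      coordinator.run();
    } catch (Throwable t) {
      // Fix 1: catch and log anything thrown during startup or run().
      LOG.log(Level.SEVERE, "Job coordinator failed; exiting", t);
      exitCode = 1;
    }
    // Fix 2: exit explicitly on both the clean and the failing path.
    exitFn.accept(exitCode);
  }
}
```

      A real entry point would pass System::exit as exitFn, for example launch(() -> coordinator.run(), System::exit). Taking the exit function as a parameter only makes the behavior easy to exercise without killing the JVM.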

            People

              Assignee: Hai Lu (lhaiesp)
              Reporter: Hai Lu (lhaiesp)
                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 0.5h