Details
Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Description
From: pmaheshw
Symptom: A job deployment timed out waiting for the application attempt to transition from New to Running.
Cause: ClusterBasedJobCoordinator threw an exception during startup due to a misconfiguration, but did not kill the AM process (likely due to non-daemon threads).
Suggested fixes:
1. ClusterBasedJobCoordinator#main doesn't install an uncaught exception handler, and doesn't catch and log exceptions thrown from the ClusterBasedJobCoordinator constructor or from run(). We should fix this. Uncaught exceptions go to stderr instead of the logs and carry no timestamp, which makes debugging difficult. E.g.:
Exception in thread "main" org.apache.samza.SamzaException: Cannot get systemAdmin for system aggregate-tracking
at org.apache.samza.system.SystemAdmins.getSystemAdmin(SystemAdmins.java:63)
at org.apache.samza.system.StreamMetadataCache$$anonfun$3.apply(StreamMetadataCache.scala:66)
at org.apache.samza.system.StreamMetadataCache$$anonfun$3.apply(StreamMetadataCache.scala:64)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.Map$Map2.foreach(Map.scala:137)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at org.apache.samza.system.StreamMetadataCache.getStreamMetadata(StreamMetadataCache.scala:64)
at org.apache.samza.coordinator.StreamPartitionCountMonitor.getMetadata(StreamPartitionCountMonitor.java:92)
at org.apache.samza.coordinator.StreamPartitionCountMonitor.<init>(StreamPartitionCountMonitor.java:113)
at org.apache.samza.clustermanager.ClusterBasedJobCoordinator.getPartitionCountMonitor(ClusterBasedJobCoordinator.java:343)
at org.apache.samza.clustermanager.ClusterBasedJobCoordinator.<init>(ClusterBasedJobCoordinator.java:207)
at org.apache.samza.clustermanager.ClusterBasedJobCoordinator.main(ClusterBasedJobCoordinator.java:441)
2. The JC should call System.exit when main returns (cleanly or on an exception) and from the uncaught exception handler, so that the AM process dies on these errors and does not leave the deployment hanging. We've also seen this issue when client libraries (datavault, brooklin, kafka etc.) create non-daemon threads and don't stop them cleanly. See LocalContainerRunner for reference, which does kill the process on returning from the main thread. E.g., in this case it's threads like this:
"AsyncHttpClient-27-1" #134 prio=5 os_prio=0 tid=0x00007faead675000 nid=0x4151 runnable [0x00007fae9c9da000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0x00000000fe6a2f40> (a com.linkedin.mario.shaded.io.netty.channel.nio.SelectedSelectionKeySet)
- locked <0x00000000fe6fe9c0> (a java.util.Collections$UnmodifiableSet)
- locked <0x00000000fe6a3f68> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at com.linkedin.mario.shaded.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
at com.linkedin.mario.shaded.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:824)
at com.linkedin.mario.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:457)
at com.linkedin.mario.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
at com.linkedin.mario.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at com.linkedin.mario.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
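For illustration, the two suggested fixes could be combined roughly as follows. This is a minimal sketch, not the actual patch: CoordinatorLauncherSketch, Coordinator and runToExitCode are hypothetical names standing in for ClusterBasedJobCoordinator's main, and java.util.logging is used only to keep the example self-contained (the real code would presumably use the logger the JC already has).

```java
import java.util.concurrent.Callable;
import java.util.logging.Level;
import java.util.logging.Logger;

public final class CoordinatorLauncherSketch {
  private static final Logger LOG =
      Logger.getLogger(CoordinatorLauncherSketch.class.getName());

  /** Hypothetical stand-in for the job coordinator (constructor + run()). */
  public interface Coordinator {
    void run() throws Exception;
  }

  /**
   * Builds and runs the coordinator, logging any failure, and maps the
   * outcome to a process exit code. Never throws, so the caller can always
   * reach System.exit.
   */
  public static int runToExitCode(Callable<Coordinator> factory) {
    try {
      // Failures in the constructor (e.g. a misconfiguration) are caught too.
      Coordinator coordinator = factory.call();
      coordinator.run();
      return 0;
    } catch (Throwable t) {
      // Logged through the logger rather than left to the default stderr dump.
      LOG.log(Level.SEVERE, "Job coordinator failed; exiting", t);
      return 1;
    }
  }

  public static void main(String[] args) {
    // Covers exceptions escaping from any other thread.
    Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
      LOG.log(Level.SEVERE, "Uncaught exception in thread " + thread.getName(), error);
      System.exit(1);
    });

    // Placeholder coordinator that does nothing; a real launcher would
    // construct the actual coordinator here.
    int exitCode = runToExitCode(() -> () -> { });

    // Exit explicitly so lingering non-daemon threads cannot keep the JVM alive.
    System.exit(exitCode);
  }
}
```

Keeping the exit-code decision in a separate method makes it testable without killing the JVM. Because System.exit is reached on every path out of main, leftover non-daemon threads such as the AsyncHttpClient one above should no longer be able to keep the AM process alive.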