[SAMZA-2491] AM should log uncaught exceptions and System.exit to ensure that the process dies on errors - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.5
Component/s: None
Labels:
None

Description

From: pmaheshw

Symptom: A job deployment timed out waiting for the application attempt to transition from New to Running.

Cause: ClusterBasedJobCoordinator threw an exception during startup due to a misconfiguration, but did not kill the AM process (likely due to non-daemon threads).

Suggested fixes:
1. ClusterBasedJobCoordinator#main does not install an uncaught exception handler, and does not catch and log exceptions thrown from the ClusterBasedJobCoordinator constructor or from run(). We should fix this. Uncaught exceptions go to stderr instead of the logs and carry no timestamp, which makes debugging difficult. E.g.:

Exception in thread "main" org.apache.samza.SamzaException: Cannot get systemAdmin for system aggregate-tracking
at org.apache.samza.system.SystemAdmins.getSystemAdmin(SystemAdmins.java:63)
at org.apache.samza.system.StreamMetadataCache$$anonfun$3.apply(StreamMetadataCache.scala:66)
at org.apache.samza.system.StreamMetadataCache$$anonfun$3.apply(StreamMetadataCache.scala:64)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.Map$Map2.foreach(Map.scala:137)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at org.apache.samza.system.StreamMetadataCache.getStreamMetadata(StreamMetadataCache.scala:64)
at org.apache.samza.coordinator.StreamPartitionCountMonitor.getMetadata(StreamPartitionCountMonitor.java:92)
at org.apache.samza.coordinator.StreamPartitionCountMonitor.<init>(StreamPartitionCountMonitor.java:113)
at org.apache.samza.clustermanager.ClusterBasedJobCoordinator.getPartitionCountMonitor(ClusterBasedJobCoordinator.java:343)
at org.apache.samza.clustermanager.ClusterBasedJobCoordinator.<init>(ClusterBasedJobCoordinator.java:207)
at org.apache.samza.clustermanager.ClusterBasedJobCoordinator.main(ClusterBasedJobCoordinator.java:441)
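
A minimal sketch of what fix 1 could look like, not the actual patch. It assumes java.util.logging as a stand-in for the project's logger, and CoordinatorMainSketch and runAndLog are hypothetical names. The idea is that both the main-thread path and any other thread route their failures through a logger, so the stack trace lands in the logs with a timestamp instead of on bare stderr.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical class for illustration; not part of Samza.
class CoordinatorMainSketch {
  private static final Logger LOG = Logger.getLogger(CoordinatorMainSketch.class.getName());

  /**
   * Runs the startup logic and logs any Throwable through the logger.
   * Returns 0 on a clean run and 1 on failure.
   */
  static int runAndLog(Runnable startup) {
    try {
      startup.run();
      return 0;
    } catch (Throwable t) {
      LOG.log(Level.SEVERE, "Job coordinator failed during startup or run", t);
      return 1;
    }
  }

  public static void main(String[] args) {
    // Covers exceptions that escape on threads other than main.
    Thread.setDefaultUncaughtExceptionHandler((thread, error) ->
        LOG.log(Level.SEVERE, "Uncaught exception in thread " + thread.getName(), error));

    // Stand-in for constructing and running the coordinator (assumption).
    int exitCode = runAndLog(() -> {
      throw new IllegalStateException("Cannot get systemAdmin for system aggregate-tracking");
    });
    LOG.info("Coordinator finished with exit code " + exitCode);
  }
}
```

Returning an exit code from runAndLog keeps the logging concern separate from the decision about how to terminate the process.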

2. The JC should call System.exit when returning from main (cleanly or on exception) and from the uncaught exception handler, to ensure that the AM process dies on these errors and does not leave the deployment hanging. We've also seen this issue when client libraries (datavault, brooklin, kafka, etc.) create non-daemon threads and do not stop them cleanly. See LocalContainerRunner for reference, which does kill the process on returning from the main thread. E.g., in this case it's threads like this:
"AsyncHttpClient-27-1" #134 prio=5 os_prio=0 tid=0x00007faead675000 nid=0x4151 runnable [0x00007fae9c9da000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0x00000000fe6a2f40> (a com.linkedin.mario.shaded.io.netty.channel.nio.SelectedSelectionKeySet)
- locked <0x00000000fe6fe9c0> (a java.util.Collections$UnmodifiableSet)
- locked <0x00000000fe6a3f68> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at com.linkedin.mario.shaded.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
at com.linkedin.mario.shaded.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:824)
at com.linkedin.mario.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:457)
at com.linkedin.mario.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
at com.linkedin.mario.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at com.linkedin.mario.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
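
A rough illustration of why fix 2 matters, under stated assumptions: ExitOnReturnSketch, startLingeringThread and exitCodeFor are hypothetical names, and the blocked thread only imitates a client-library event loop that was never shut down. A JVM does not exit while any non-daemon thread is alive, so returning from main (or throwing from it) is not enough. An explicit System.exit is what ends the process.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical class for illustration; not part of Samza.
class ExitOnReturnSketch {

  /** Starts a thread that blocks until interrupted, like an unclosed client event loop. */
  static Thread startLingeringThread(boolean daemon) {
    Thread t = new Thread(() -> {
      try {
        new CountDownLatch(1).await(); // never counted down, so this blocks
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "lingering-client-thread");
    t.setDaemon(daemon);
    t.start();
    return t;
  }

  /** Maps the outcome of the coordinator logic to a process exit code. */
  static int exitCodeFor(Runnable coordinatorLogic) {
    try {
      coordinatorLogic.run();
      return 0;
    } catch (Throwable t) {
      t.printStackTrace(); // a real implementation would use the logger
      return 1;
    }
  }

  public static void main(String[] args) {
    // Non-daemon thread left behind by a (simulated) client library.
    startLingeringThread(false);

    int code = exitCodeFor(() -> {
      throw new IllegalStateException("simulated misconfiguration");
    });

    // Without this call the JVM keeps running because of the lingering thread.
    System.exit(code);
  }
}
```

Commenting out the System.exit call in main should leave the process running after the failure, which matches the hanging AM described above.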

Issue Links

links to

GitHub Pull Request #1325

People

Assignee: Hai Lu

Reporter: Hai Lu

Votes: 0

Watchers: 1

Dates

Created: 23/Mar/20 16:44

Updated: 27/May/20 03:37

Resolved: 28/Mar/20 05:36

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

0.5h