
[Bug] [Master] Dependent nodes failed when the upstream rerun without the task relied on #16285

Open · 3 tasks done
starrysxy opened this issue Jul 7, 2024 · 11 comments

Labels: bug, discussion, priority:middle

starrysxy commented Jul 7, 2024

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

First of all, I'm not sure whether this is a code bug. I checked the code and the behavior is consistent with the current logic, but from a business point of view I don't think it is very reasonable. If necessary, please help change the label.

Suppose there are two workflows, A and B. Workflow A has tasks A-1, A-2, and A-3, and workflow B depends on task A-3 in workflow A. After workflow A finishes, if task A-1 is re-executed on its own and workflow B's instance for the same cycle has not yet run, the dependent node in workflow B fails, which causes the B workflow instance to fail.

The code looks for the workflow instance with the latest endTime in each cycle, so it finds the instance created when task A-1 was re-executed alone. That instance, however, does not contain the A-3 task that the downstream B workflow depends on, so the A-3 task instance cannot be found when traversing the task instances of the fetched A workflow instance. Since that A workflow instance is in a completed state, the dependent node is marked as failed, and workflow B is then marked as failed.
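
For illustration, here is a simplified Java sketch of the behavior described above. The DAO calls and helper names are hypothetical and condensed, not the actual DolphinScheduler API; the real logic lives in org.apache.dolphinscheduler.server.master.utils.DependentExecute:

```java
// Simplified, illustrative sketch of the dependency check described above.
// All DAO and helper names here are hypothetical, not the actual DolphinScheduler API.
DependResult getDependTaskResult(long dependedTaskCode, Date cycleStart, Date cycleEnd) {
    // 1. Pick the upstream (A) workflow instance with the latest endTime in the cycle.
    //    After re-running only A-1, this is the instance that contains just A-1.
    ProcessInstance latest =
            processInstanceDao.findLastInstanceInCycle(upstreamWorkflowCode, cycleStart, cycleEnd);

    // 2. Look for the depended task (A-3) among that instance's task instances.
    TaskInstance dependedTask =
            taskInstanceDao.findByProcessInstanceIdAndTaskCode(latest.getId(), dependedTaskCode);

    if (dependedTask == null) {
        // 3. A-3 never ran in this re-run instance. Because the instance itself is already
        //    finished, the dependent node is marked FAILED instead of falling back to the
        //    earlier instance in which A-3 succeeded.
        return latest.getState().isFinished() ? DependResult.FAILED : DependResult.WAITING;
    }
    return dependedTask.getState().isSuccess() ? DependResult.SUCCESS : DependResult.FAILED;
}
```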

In my opinion, this part of the code is written to make the 'ALL' option for dependent tasks convenient: instead of checking the status of each task in the upstream workflow, it checks the status of the whole upstream workflow instance directly. In practice, however, it is inevitable to modify a workflow and to re-execute individual tasks after modifying it, and re-executing the entire workflow can delay the output of results and waste machine resources. So the logic here could be optimized.

What you expected to happen

If a task that a downstream workflow depends on has ever succeeded in the upstream workflow, the dependent node in the downstream workflow of the same cycle should be marked as successful.

How to reproduce

Suppose there are two workflows, A and B. Workflow A has tasks A-1, A-2, and A-3, and workflow B depends on task A-3 in workflow A. After workflow A finishes, re-execute task A-1 on its own while workflow B's instance for the same cycle has not yet run. The dependent node in workflow B fails, which causes the B workflow instance to fail.

Anything else

No response

Version

3.1.x

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@starrysxy starrysxy added the bug and Waiting for reply labels on Jul 7, 2024
@starrysxy (Contributor Author)

I saw you commented on another issue (#15807 (comment)); please take some time to look at this one, thanks @SbloodyS

@SbloodyS SbloodyS added the question label and removed the bug and Waiting for reply labels on Jul 8, 2024
@SbloodyS SbloodyS changed the title [Bug] [Master] Dependent nodes failed when the upstream rerun without the task relied on Jul 8, 2024

SbloodyS commented Jul 8, 2024

The ALL in a dependency means that all tasks in the workflow need to be executed successfully. If you want to re-execute a node in the workflow without affecting the results of the dependent nodes, you should choose tasks in the dependency workflow or reorganize your workflow. @starrysxy

starrysxy commented Jul 8, 2024

> The ALL in a dependency means that all tasks in the workflow need to be executed successfully. If you want to re-execute a node in the workflow without affecting the results of the dependent nodes, you should choose tasks in the dependency workflow or reorganize your workflow. @starrysxy

Thanks for your comment. I know the ALL in a dependency means that all tasks in the workflow need to be executed successfully, but my downstream only depends on one task in the upstream, not all tasks.

When I re-execute a node in the upstream, the dependent node in the downstream will fail if the downstream is executed after the upstream is successfully re-executed.

e.g.: The upstream is dispatched at 1 a.m. and the downstream at 10 a.m. At 7 a.m. I re-execute one task in the upstream that the downstream's dependent node does not rely on, and it succeeds within an hour. The downstream then fails within a few seconds when it is dispatched at 10 a.m.

I hope I made myself clear. This may not be a code bug, but it is a very common requirement. Is there any good way to avoid this situation?

SbloodyS commented Jul 9, 2024

> The ALL in a dependency means that all tasks in the workflow need to be executed successfully. If you want to re-execute a node in the workflow without affecting the results of the dependent nodes, you should choose tasks in the dependency workflow or reorganize your workflow. @starrysxy

Like I said, the ALL in a dependency means that all tasks in the workflow need to be executed successfully. In particular, it prevents you from manually re-executing a node like this. If downstream execution should continue only after all upstream nodes have executed, manually executing a single node is an illegal operation, which may cause downstream data exceptions.

starrysxy commented Jul 9, 2024

> The ALL in a dependency means that all tasks in the workflow need to be executed successfully. If you want to re-execute a node in the workflow without affecting the results of the dependent nodes, you should choose tasks in the dependency workflow or reorganize your workflow. @starrysxy

> Like I said, the ALL in a dependency means that all tasks in the workflow need to be executed successfully. In particular, it prevents you from manually re-executing a node like this. If downstream execution should continue only after all upstream nodes have executed, manually executing a single node is an illegal operation, which may cause downstream data exceptions.

I'm sure I understand you, but my downstream doesn't depend on all tasks in the upstream, only on one task. I'll show you some screenshots.

e.g.: The upstream is dispatched at 1 a.m. and the downstream at 10 a.m. At 7 a.m. I re-execute one task in the upstream that the downstream's dependent node does not rely on, and it succeeds within an hour. The downstream then fails within a few seconds when it is dispatched at 10 a.m.

This is my upstream: [screenshot]

This is my downstream: [screenshot]

This is the dependent node: [screenshot]

This is how I re-execute only one task in the upstream (right-click the task node and click the 'Execute' button): [screenshot]

@SbloodyS (Member)

I think I know what your problem is. The current dependent-node check fetches the latest workflow instance and verifies that the task instances in it meet the configured conditions. However, this check does not take into account the case of manually starting individual tasks, which executes only some of the nodes. This scenario calls for a solution discussion.

cc @caishunfeng @ruanwenjun @EricGao888 @rickchengx

@SbloodyS SbloodyS added the bug, discussion, and priority:middle labels and removed the question label on Jul 10, 2024
@SbloodyS SbloodyS changed the title [Question] [Master] Dependent nodes failed when the upstream rerun without the task relied on Jul 10, 2024
@starrysxy (Contributor Author)

> I think I know what your problem is. The current dependent-node check fetches the latest workflow instance and verifies that the task instances in it meet the configured conditions. However, this check does not take into account the case of manually starting individual tasks, which executes only some of the nodes. This scenario calls for a solution discussion.
>
> cc @caishunfeng @ruanwenjun @EricGao888 @rickchengx

Yes, that's what I mean. Btw, my version is 3.1.8 (the logic of this part is still the same in later versions). I have checked the code, and I think this scenario is common, especially when you have just switched to DolphinScheduler from another scheduling system: tasks are not migrated all at once but one by one, which means the process definition is modified frequently.

Maybe we can start with these two places in the code:

  1. org.apache.dolphinscheduler.server.master.utils.DependentExecute#calculateResultForTasks [screenshot]

  2. org.apache.dolphinscheduler.server.master.utils.DependentExecute#getDependTaskResult [screenshot]
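
For discussion, here is a rough sketch of one possible direction (illustrative only; the DAO method below is hypothetical and this is not a patch against the real DependentExecute code): resolve the depended task from the latest task instance of that task within the cycle, instead of from the single workflow instance with the latest endTime.

```java
// Illustrative sketch only -- the DAO method below is hypothetical.
// Idea: look up the latest task instance of the depended task code within the cycle,
// regardless of which workflow instance produced it, so a partial re-run of the
// upstream does not hide an earlier successful run of the depended task.
DependResult getDependTaskResult(long dependedTaskCode, Date cycleStart, Date cycleEnd) {
    TaskInstance latestTask =
            taskInstanceDao.findLatestByTaskCodeAndTimeRange(dependedTaskCode, cycleStart, cycleEnd);

    if (latestTask == null) {
        // The depended task has not run at all in this cycle yet: keep waiting
        // rather than failing the dependent node immediately.
        return DependResult.WAITING;
    }
    return latestTask.getState().isSuccess() ? DependResult.SUCCESS : DependResult.FAILED;
}
```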

@ruanwenjun (Member)

> I think I know what your problem is. The current dependent-node check fetches the latest workflow instance and verifies that the task instances in it meet the configured conditions. However, this check does not take into account the case of manually starting individual tasks, which executes only some of the nodes. This scenario calls for a solution discussion.
>
> cc @caishunfeng @ruanwenjun @EricGao888 @rickchengx

In my opinion, this is a bug; the task instance status should not be bound to the workflow instance status.

@starrysxy (Contributor Author)

> I think I know what your problem is. The current dependent-node check fetches the latest workflow instance and verifies that the task instances in it meet the configured conditions. However, this check does not take into account the case of manually starting individual tasks, which executes only some of the nodes. This scenario calls for a solution discussion.
>
> cc @caishunfeng @ruanwenjun @EricGao888 @rickchengx

Looking forward to others' opinions.

@SbloodyS (Member)

> In my opinion, this is a bug; the task instance status should not be bound to the workflow instance status.

+1

@starrysxy (Contributor Author)

@SbloodyS When I tried to fix this bug on the dev branch, I found it has already been fixed in #15795. But the bug is still there in version 3.1.x. I'm not sure whether I should submit a fix for version 3.1.x; if yes, which branch should I use for the PR?
