
[Bug] [Master] Dependent nodes failed when the upstream rerun without the task relied on #16285

Open · 3 tasks done
starrysxy opened this issue Jul 7, 2024 · 11 comments

Labels: bug, discussion, priority:middle

starrysxy commented Jul 7, 2024

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

First of all, I'm not sure whether this is a code bug. I checked the code and the behavior is consistent with the current logic, but from a business point of view I don't think it is very reasonable. If necessary, please help change the label.

Suppose there are two workflows, A and B. Workflow A has tasks A-1, A-2, and A-3, and workflow B depends on task A-3 in workflow A. After workflow A finishes, if task A-1 is re-executed on its own and workflow B's instance for the same cycle has not yet run, the dependent node in workflow B fails, which causes the B workflow instance to fail.

The code looks for the workflow instance with the latest endTime in each cycle, so it finds the instance created when task A-1 was re-executed alone. That instance, however, does not contain the A-3 task that the downstream B workflow depends on, so the A-3 task instance cannot be found when traversing the task instances of the fetched A workflow instance. Since that A workflow instance is in a completed state, the dependent node is marked as failed, and workflow B is then marked as failed.
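
For illustration, here is a simplified Java sketch of the behavior described above. The DAO calls and helper names are hypothetical and condensed, not the actual DolphinScheduler API; the real logic lives in org.apache.dolphinscheduler.server.master.utils.DependentExecute:

```java
// Simplified, illustrative sketch of the dependency check described above.
// All DAO and helper names here are hypothetical, not the actual DolphinScheduler API.
DependResult getDependTaskResult(long dependedTaskCode, Date cycleStart, Date cycleEnd) {
    // 1. Pick the upstream (A) workflow instance with the latest endTime in the cycle.
    //    After re-running only A-1, this is the instance that contains just A-1.
    ProcessInstance latest =
            processInstanceDao.findLastInstanceInCycle(upstreamWorkflowCode, cycleStart, cycleEnd);

    // 2. Look for the depended task (A-3) among that instance's task instances.
    TaskInstance dependedTask =
            taskInstanceDao.findByProcessInstanceIdAndTaskCode(latest.getId(), dependedTaskCode);

    if (dependedTask == null) {
        // 3. A-3 never ran in this re-run instance. Because the instance itself is already
        //    finished, the dependent node is marked FAILED instead of falling back to the
        //    earlier instance in which A-3 succeeded.
        return latest.getState().isFinished() ? DependResult.FAILED : DependResult.WAITING;
    }
    return dependedTask.getState().isSuccess() ? DependResult.SUCCESS : DependResult.FAILED;
}
```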

In my opinion, this part of the code is written to make the 'ALL' option for dependent tasks convenient: instead of checking the status of each task in the upstream workflow, it checks the status of the whole upstream workflow instance directly. In practice, however, it is inevitable to modify a workflow and to re-execute individual tasks after modifying it, and re-executing the entire workflow can delay the output of results and waste machine resources. So the logic here could be optimized.

What you expected to happen

If a task that a downstream workflow depends on has ever succeeded in the upstream workflow, the dependent node in the downstream workflow of the same cycle should be marked as successful.

How to reproduce

Suppose there are two workflows, A and B. Workflow A has tasks A-1, A-2, and A-3, and workflow B depends on task A-3 in workflow A. After workflow A finishes, re-execute task A-1 on its own while workflow B's instance for the same cycle has not yet run. The dependent node in workflow B fails, which causes the B workflow instance to fail.

Anything else

No response

Version

3.1.x

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@starrysxy starrysxy added the bug and Waiting for reply labels on Jul 7, 2024
@starrysxy (Contributor Author)

I saw you commented on another issue (#15807 (comment)); please take some time to look at this one, thanks @SbloodyS

@SbloodyS SbloodyS added the question label and removed the bug and Waiting for reply labels on Jul 8, 2024
@SbloodyS SbloodyS changed the title [Bug] [Master] Dependent nodes failed when the upstream rerun without the task relied on Jul 8, 2024

SbloodyS commented Jul 8, 2024

The ALL in a dependency means that all tasks in the workflow need to be executed successfully. If you want to re-execute a node in the workflow without affecting the results of the dependent nodes, you should choose tasks in the dependency workflow or reorganize your workflow. @starrysxy

starrysxy commented Jul 8, 2024

> The ALL in a dependency means that all tasks in the workflow need to be executed successfully. If you want to re-execute a node in the workflow without affecting the results of the dependent nodes, you should choose tasks in the dependency workflow or reorganize your workflow. @starrysxy

Thanks for your comment. I know the ALL in a dependency means that all tasks in the workflow need to be executed successfully, but my downstream only depends on one task in the upstream, not all tasks.

When I re-execute a node in the upstream, the dependent node in the downstream will fail if the downstream is executed after the upstream is successfully re-executed.

e.g.: The upstream is dispatched at 1 a.m. and the downstream at 10 a.m. At 7 a.m. I re-execute one task in the upstream that the downstream's dependent node does not rely on, and it succeeds within an hour. The downstream then fails within a few seconds when it is dispatched at 10 a.m.

I hope I made myself clear. This may not be a code bug, but it is a very common requirement. Is there any good way to avoid this situation?

SbloodyS commented Jul 9, 2024

> The ALL in a dependency means that all tasks in the workflow need to be executed successfully. If you want to re-execute a node in the workflow without affecting the results of the dependent nodes, you should choose tasks in the dependency workflow or reorganize your workflow. @starrysxy

Like I said, the ALL in a dependency means that all tasks in the workflow need to be executed successfully. In particular, it prevents you from manually re-executing a node like this. If downstream execution should continue only after all upstream nodes have executed, manually executing a single node is an illegal operation, which may cause downstream data exceptions.

starrysxy commented Jul 9, 2024

> The ALL in a dependency means that all tasks in the workflow need to be executed successfully. If you want to re-execute a node in the workflow without affecting the results of the dependent nodes, you should choose tasks in the dependency workflow or reorganize your workflow. @starrysxy

> Like I said, the ALL in a dependency means that all tasks in the workflow need to be executed successfully. In particular, it prevents you from manually re-executing a node like this. If downstream execution should continue only after all upstream nodes have executed, manually executing a single node is an illegal operation, which may cause downstream data exceptions.

I'm sure I understand you, but my downstream doesn't depend on all tasks in the upstream, only on one task. I'll show you some screenshots.

e.g.: The upstream is dispatched at 1 a.m. and the downstream at 10 a.m. At 7 a.m. I re-execute one task in the upstream that the downstream's dependent node does not rely on, and it succeeds within an hour. The downstream then fails within a few seconds when it is dispatched at 10 a.m.

This is my upstream: [screenshot]

This is my downstream: [screenshot]

This is the dependent node: [screenshot]

This is how I re-execute only one task in the upstream (right-click the task node and click the 'Execute' button): [screenshot]

@SbloodyS (Member)

I think I know what your problem is. The current dependent-node check fetches the latest workflow instance and verifies that the task instances in it meet the configured conditions. However, this check does not take into account the case of manually starting individual tasks, which executes only some of the nodes. This scenario calls for a solution discussion.

cc @caishunfeng @ruanwenjun @EricGao888 @rickchengx

@SbloodyS SbloodyS added the bug, discussion, and priority:middle labels and removed the question label on Jul 10, 2024
@SbloodyS SbloodyS changed the title [Question] [Master] Dependent nodes failed when the upstream rerun without the task relied on Jul 10, 2024
@starrysxy (Contributor Author)

> I think I know what your problem is. The current dependent-node check fetches the latest workflow instance and verifies that the task instances in it meet the configured conditions. However, this check does not take into account the case of manually starting individual tasks, which executes only some of the nodes. This scenario calls for a solution discussion.
>
> cc @caishunfeng @ruanwenjun @EricGao888 @rickchengx

Yes, that's what I mean. Btw, my version is 3.1.8 (the logic of this part is still the same in later versions). I have checked the code, and I think this scenario is common, especially when you have just switched to DolphinScheduler from another scheduling system: tasks are not migrated all at once but one by one, which means the process definition is modified frequently.

Maybe we can start with these two places in the code:

  1. org.apache.dolphinscheduler.server.master.utils.DependentExecute#calculateResultForTasks [screenshot]

  2. org.apache.dolphinscheduler.server.master.utils.DependentExecute#getDependTaskResult [screenshot]
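
For discussion, here is a rough sketch of one possible direction (illustrative only; the DAO method below is hypothetical and this is not a patch against the real DependentExecute code): resolve the depended task from the latest task instance of that task within the cycle, instead of from the single workflow instance with the latest endTime.

```java
// Illustrative sketch only -- the DAO method below is hypothetical.
// Idea: look up the latest task instance of the depended task code within the cycle,
// regardless of which workflow instance produced it, so a partial re-run of the
// upstream does not hide an earlier successful run of the depended task.
DependResult getDependTaskResult(long dependedTaskCode, Date cycleStart, Date cycleEnd) {
    TaskInstance latestTask =
            taskInstanceDao.findLatestByTaskCodeAndTimeRange(dependedTaskCode, cycleStart, cycleEnd);

    if (latestTask == null) {
        // The depended task has not run at all in this cycle yet: keep waiting
        // rather than failing the dependent node immediately.
        return DependResult.WAITING;
    }
    return latestTask.getState().isSuccess() ? DependResult.SUCCESS : DependResult.FAILED;
}
```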

@ruanwenjun (Member)

> I think I know what your problem is. The current dependent-node check fetches the latest workflow instance and verifies that the task instances in it meet the configured conditions. However, this check does not take into account the case of manually starting individual tasks, which executes only some of the nodes. This scenario calls for a solution discussion.
>
> cc @caishunfeng @ruanwenjun @EricGao888 @rickchengx

In my opinion, this is a bug; the task instance status should not be bound to the workflow instance status.

@starrysxy (Contributor Author)

> I think I know what your problem is. The current dependent-node check fetches the latest workflow instance and verifies that the task instances in it meet the configured conditions. However, this check does not take into account the case of manually starting individual tasks, which executes only some of the nodes. This scenario calls for a solution discussion.
>
> cc @caishunfeng @ruanwenjun @EricGao888 @rickchengx

Looking forward to others' opinions.

@SbloodyS (Member)

> In my opinion, this is a bug; the task instance status should not be bound to the workflow instance status.

+1

@starrysxy (Contributor Author)

@SbloodyS When I tried to fix this bug on the dev branch, I found it has already been fixed in #15795. But the bug is still there in version 3.1.x. I'm not sure whether I should submit a fix for version 3.1.x; if yes, which branch should I use for the PR?
