Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[@parallel][jobsets] bug fixes and refactor #1900

Closed
wants to merge 2 commits into from

Conversation

valayDave
Copy link
Collaborator

  • Abstraction over Jobset Spec
  • Create general abstraction and copy style of jobspec for native k8s jobs for jobsets
  • simplified the implementation.
  • similar pattern/trend to follow for argo
  • removed older implementation
  • Superseds Bug fixes for @parallel --with kubernetes #1854

Review commits individually to see core changes.

  • One commit has the entire refactor into a more decoupled implementation.
  • One commit has fixes to the core such as ID minting for control jobs.
@valayDave valayDave force-pushed the valay/jobset-refactor-oss branch 3 times, most recently from 626b61e to a406b64 Compare July 1, 2024 20:42
metaflow/runtime.py Outdated Show resolved Hide resolved
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: make separate PRs for anything in the core for examples files in parallel_decorator and runtime.py

Copy link
Collaborator Author

@valayDave valayDave Jul 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixme: ensure that @Secrets can work with control tasks too ; Since @parallel has a slightly different control flow than normal tasks, we need to ensure that implementation has parity across local/remote executions. [Can be done at a later date. Add a comment in the code for posterity ]

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: add a property to current for signifying if it's inside a @parallel step.

Copy link
Collaborator Author

@valayDave valayDave Jul 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixme: remove the current.parallel property from the metaflow_current.py file and instead dynamically inject the property inside the @parallel decorator. [Can be done at a later date. At least add a comment in the code. ] [DONE}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: validate this with @Batch.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: make a PR that has all k8s changes. and another PR which has all the core changes.

run_id=None,
step_name=None,
task_id=None,
flow_name,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: at some point re-unify the code bases for Jobset and native jobs.

Comment on lines +398 to +401
step_cli.replace(
METAFLOW_PARALLEL_STEP_CLI_OPTIONS_TEMPLATE,
"--ubf-context $UBF_CONTEXT --split-index %s --task-id %s"
% (index, _tskid),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: try of .format instead of .replace

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@savingoyal :

.format doesnt work as stated earlier for the following reason:

The point in time we are running the .format, we already have constructed the step_cli with more format strings; For example at this point the step_cli looks like :

python -u fail_flow.py --quiet --metadata service --environment local --datastore s3 --event-logger nullSidecarLogger --monitor nullSidecarMonitor --datastore-root s3://mybcket/metaflow/ --package-suffixes .py,.R,.RDS --pylint step train --run-id 5075 --input-paths ${METAFLOW_INPUT_PATHS_0} --retry-count 0 --max-user-code-retries 2 --namespace valay {METAFLOW_PARALLEL_STEP_CLI_OPTIONS_TEMPLATE}

At this point the step_cli already contains the ${METAFLOW_INPUT_PATHS_0} which is a shell variable that will be replaced at runtime. At this point if we use .format then python will crash complaining that we haven't given any value for format string METAFLOW_INPUT_PATHS_0. So we use replace instead of format.

@valayDave valayDave force-pushed the valay/jobset-refactor-oss branch 3 times, most recently from c57d925 to 0ff7f9a Compare July 4, 2024 02:07
- fix tests for interplay of @Secrets and @parallel
- Local runtime will allow secrets only on worker tasks
- Secrets will be set on all kinds of tasks when run remotely (control/worker)
- fix tests for ubf based on new changes to core.
- fix tests for tag-catch for ubf based on new changes to core.
- internal ubf decorator has a internal_task_type set to it.
- [feedback] register metadata in parallel decorator
- [feedback]@parallel inject in current:
    - move `current.parallel` from `metaflow_current` to `parallel_decorator`
- [feedback] appropariately setting task-metadata for parallel stuff
    - The 'world size' metadata will be set in the @parallel decorator.
    - The 'node-index' metadata, however, varies depending on the type of computing environment executing the task so it will be set in the appropriate compute decorators.
    - One reason to specify 'node-index' within compute decorators is that the parallel implementation in the compute decorator might not directly set the required environment variables (`MF_PARALLEL_*`). Instead, these values may be established during the `task_pre_step` phase of the compute decorator using other environment variables set during the implementation.
    - adding some aws batch changes
- [feedback] safety check for _parallel_buf_iter in task_pre_step for @parallel
- [feedback] set `is_parallel` in current to denote a step is running under an `@parallel` decorator.
- Some changes stemming from Netflix#1854
- reorder imports
- changes to env var prefixes + id generation
- Abstraction over Jobset Spec
- Create general abstraction and copy style of how we create jobspec for native k8s jobs.
- simplified the implementation.
- similar pattern/trend to follow for argo
- remove older implementation
- @parallel metadata from k8s decorator
    - added attempt to MetaDatum tags
    - We didn't have this earlier
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
1 participant