Added new init action for Datasketches (#1153)
* Added new init action for Datasketches

* Removed my username from the example output

* Applied code review suggestions

* Applied suggested code review changes
kuldeepkk-dev committed Apr 15, 2024
1 parent 2d10d8e commit 7da5214
Showing 3 changed files with 233 additions and 0 deletions.
109 changes: 109 additions & 0 deletions datasketches/README.md
@@ -0,0 +1,109 @@
# Apache Datasketches

**:warning: NOTICE:** This init action is supported only on Dataproc image versions 2.1 and above.

This initialization action installs libraries required to run [Apache Datasketches](https://datasketches.apache.org/) on a
[Google Cloud Dataproc](https://cloud.google.com/dataproc) cluster.
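
The version requirement in the notice above can be checked on a cluster node before relying on the libraries. The following is a minimal sketch, not part of the init action: `version_lt` is a hypothetical helper, it assumes GNU `sort -V` is available, and the `2.1` default is only there so the snippet runs outside a cluster. `DATAPROC_IMAGE_VERSION` is the variable the init script itself reads.

```bash
#!/bin/bash
# Hypothetical helper: succeeds when version $1 is strictly lower than version $2.
# Uses GNU `sort -V` (version sort) rather than string or floating-point comparison.
version_lt() {
  [[ "$1" != "$2" ]] &&
    [[ "$(printf '%s\n' "$1" "$2" | sort -V | head -n1)" == "$1" ]]
}

# Fall back to 2.1 for illustration when the variable is not set.
image_version="${DATAPROC_IMAGE_VERSION:-2.1}"

if version_lt "${image_version}" "2.1"; then
  echo "Image ${image_version} is not supported by this init action"
else
  echo "Image ${image_version} is supported"
fi
```

Version sort treats `3.10` as newer than `3.5`, which a plain string comparison would get wrong.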

## Using this initialization action

**:warning: NOTICE:** See
[best practices](/README.md#how-initialization-actions-are-used) for using
initialization actions in production.

This initialization action installs the Datasketches libraries on the Dataproc cluster under `/usr/lib/datasketches`. The following jars are deployed:

```
datasketches-memory-2.0.0.jar
datasketches-java-3.1.0.jar
datasketches-pig-1.1.0.jar
datasketches-hive-1.2.0.jar
spark-java-thetasketches-1.0-SNAPSHOT.jar [ Only if Spark version < 3.5.0 ]
```
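
Once the cluster is up, the deployment can be verified from a node. The sketch below is an assumption-laden illustration rather than part of the init action: `check_datasketches_jars` is a hypothetical helper that only checks for the four library jars listed above (the Spark example jar is conditional, so it is left out).

```bash
#!/bin/bash
# Hypothetical check: report which of the expected jars are present in a directory.
# Returns 0 when all four are found, 1 otherwise.
check_datasketches_jars() {
  local dir="${1:-/usr/lib/datasketches}"
  local missing=0
  local jar
  for jar in \
    datasketches-memory-2.0.0.jar \
    datasketches-java-3.1.0.jar \
    datasketches-pig-1.1.0.jar \
    datasketches-hive-1.2.0.jar; do
    if [[ -f "${dir}/${jar}" ]]; then
      echo "found:   ${jar}"
    else
      echo "missing: ${jar}"
      missing=1
    fi
  done
  return "${missing}"
}

check_datasketches_jars /usr/lib/datasketches || echo "Some jars are missing"
```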

1. Use the `gcloud` command to create a new cluster with this initialization
action. The following command creates a new standard cluster named
`${CLUSTER_NAME}`.

```bash
REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
--region ${REGION} \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/datasketches/datasketches.sh
```
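
Because the init action needs image version 2.1 or later, it can help to pin the image explicitly. The sketch below prints such a command for review instead of running it. `init_action_uri` is a hypothetical helper, the region and cluster name are example values, and `2.1-debian11` is just one possible image choice; the script name follows the file added in this commit (`datasketches/datasketches.sh`).

```bash
#!/bin/bash
# Hypothetical helper: build the regional bucket URI for this init action.
init_action_uri() {
  local region="$1"
  echo "gs://goog-dataproc-initialization-actions-${region}/datasketches/datasketches.sh"
}

REGION="${REGION:-us-central1}"             # example value, replace with your region
CLUSTER_NAME="${CLUSTER_NAME:-my-cluster}"  # example value, replace with your cluster name

# Print the command rather than executing it; remove `echo` to actually create the cluster.
echo gcloud dataproc clusters create "${CLUSTER_NAME}" \
  --region "${REGION}" \
  --image-version 2.1-debian11 \
  --initialization-actions "$(init_action_uri "${REGION}")"
```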

## Apache Datasketches Examples:

### Spark:

Note: Starting with Apache Spark 3.5.0, the Datasketches libraries are already integrated into Spark. Follow this [example](https://www.databricks.com/blog/apache-spark-3-apache-datasketches-new-sketch-based-approximate-distinct-counting).
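
As a rough illustration of the built-in integration, Spark 3.5.0 and later expose HLL sketch functions such as `hll_sketch_agg` and `hll_sketch_estimate` in Spark SQL. The sketch below assumes the `spark-sql` CLI is on the `PATH` of a node running Spark 3.5.0 or later; the inline `VALUES` table is made-up sample data.

```bash
#!/bin/bash
# Approximate distinct count using the HLL sketch functions built into Spark 3.5.0+.
QUERY='SELECT hll_sketch_estimate(hll_sketch_agg(col)) AS approx_distinct
FROM VALUES (1), (1), (2), (2), (3) AS tab(col)'

# Run the query only where the spark-sql CLI exists; otherwise just print it.
if command -v spark-sql >/dev/null 2>&1; then
  spark-sql -e "${QUERY}"
else
  echo "${QUERY}"
fi
```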

1. For older Spark 3.x versions, follow the [Thetasketches example](https://datasketches.apache.org/docs/Theta/ThetaSparkExample.html) from the Datasketches documentation.

Note: This init action places the `spark-java-thetasketches` example jar under `/usr/lib/datasketches`. Run `spark-submit` with `spark-java-thetasketches-1.0-SNAPSHOT.jar` to try the Thetasketches example:

```
spark-submit --jars /usr/lib/datasketches/datasketches-java-3.1.0.jar,/usr/lib/datasketches/datasketches-memory-2.0.0.jar --class Aggregate /usr/lib/datasketches/spark-java-thetasketches-1.0-SNAPSHOT.jar
```
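
The `--jars` option takes a comma-separated list, which gets hard to read on one line. A minimal sketch of assembling it from variables follows; `join_jars` is a hypothetical helper, and the command is printed rather than executed so it can be inspected first.

```bash
#!/bin/bash
DS_LIBPATH=/usr/lib/datasketches

# Hypothetical helper: join all arguments with commas, as --jars expects.
join_jars() {
  local IFS=,
  echo "$*"
}

JARS="$(join_jars \
  "${DS_LIBPATH}/datasketches-java-3.1.0.jar" \
  "${DS_LIBPATH}/datasketches-memory-2.0.0.jar")"

# Remove `echo` to submit the job for real.
echo spark-submit --jars "${JARS}" --class Aggregate \
  "${DS_LIBPATH}/spark-java-thetasketches-1.0-SNAPSHOT.jar"
```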

If you modify the [Java code](https://datasketches.apache.org/docs/Theta/ThetaSparkExample.html), use the instructions below to rebuild the jar.

1. Generate artifacts with Maven:

```
mvn archetype:generate -DgroupId=org.apache.datasketches -DartifactId=spark-java-thetasketches -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
```

1. Replace the generated `pom.xml` with https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/datasketches/pom.xml

1. Add the modified code from https://datasketches.apache.org/docs/Theta/ThetaSparkExample.html under the `$local_path/spark-java-thetasketches/src/main/java/org/apache/datasketches` directory and remove the sample `App.java` file.

Example:

```
root@cluster-$hostname-m:$local_path/spark-java-thetasketches/src/main/java/org/apache/datasketches# ls -lrt
total 20
-rw-r--r-- 1 root root 1920 Feb 21 17:03 ThetaSketchJavaSerializable.java
-rw-r--r-- 1 root root 2459 Feb 21 17:03 Spark2DatasetMapPartitionsReduceJavaSerialization.java
-rw-r--r-- 1 root root 3654 Feb 21 17:03 MapPartitionsToPairReduceByKey.java
-rw-r--r-- 1 root root 3142 Feb 21 17:03 AggregateByKey2.java
-rw-r--r-- 1 root root 2123 Feb 21 17:03 Aggregate.java
```

1. Compile the code and package a jar:

```
mvn package
```

1. Verify that the jar was created under `target/`:

```
root@cluster-$hostname-m:$local_path/spark-java-thetasketches# ls -lrt target/
total 48
drwxr-xr-x 3 root root 4096 Feb 29 18:36 maven-status
drwxr-xr-x 3 root root 4096 Feb 29 18:36 generated-sources
drwxr-xr-x 2 root root 4096 Feb 29 18:36 classes
drwxr-xr-x 3 root root 4096 Feb 29 18:36 generated-test-sources
drwxr-xr-x 3 root root 4096 Feb 29 18:36 test-classes
drwxr-xr-x 2 root root 4096 Feb 29 18:36 surefire-reports
drwxr-xr-x 2 root root 4096 Feb 29 18:36 maven-archiver
-rw-r--r-- 1 root root 17542 Feb 29 18:36 spark-java-thetasketches-1.0-SNAPSHOT.jar
```

1. Run `spark-submit` with the newly generated jar from the previous step:

```
root@cluster-$hostname-m:$local_path/spark-java-thetasketches# spark-submit --jars /usr/lib/datasketches/datasketches-java-3.1.0.jar,/usr/lib/datasketches/datasketches-memory-2.0.0.jar --class Aggregate target/spark-java-thetasketches-1.0-SNAPSHOT.jar
```
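
The rebuild steps above can be strung together in one script. The following is a sketch under stated assumptions, not a tested build pipeline: it assumes `pom.xml` from this repository has already been downloaded into the current directory, `SRC_DIR` is a hypothetical folder holding your modified `.java` files, and the `run` wrapper only prints each command unless `DRY_RUN=0` is set.

```bash
#!/bin/bash
# Print each command; execute it only when DRY_RUN is set to 0.
DRY_RUN="${DRY_RUN:-1}"
run() {
  echo "+ $*"
  if [[ "${DRY_RUN}" == "0" ]]; then
    "$@"
  fi
}

ARTIFACT=spark-java-thetasketches
PKG_DIR="${ARTIFACT}/src/main/java/org/apache/datasketches"
SRC_DIR="${SRC_DIR:-./modified-sources}"  # hypothetical location of your edited sources

rebuild_example() {
  # Generate the project skeleton.
  run mvn archetype:generate -DgroupId=org.apache.datasketches \
    -DartifactId="${ARTIFACT}" \
    -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
  # Swap in the pom.xml from this repository and the modified sources.
  run cp pom.xml "${ARTIFACT}/pom.xml"
  run rm -f "${PKG_DIR}/App.java"
  run cp "${SRC_DIR}"/*.java "${PKG_DIR}/"
  # Build and confirm that the jar exists.
  run mvn -f "${ARTIFACT}/pom.xml" package
  run ls -l "${ARTIFACT}/target/${ARTIFACT}-1.0-SNAPSHOT.jar"
}

rebuild_example
```

Running it once in the default dry-run mode shows the exact commands before anything is changed on disk.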

### Hive:

1. Change to the `/usr/lib/datasketches` directory and follow the [Datasketches Hive examples](https://datasketches.apache.org/docs/SystemIntegrations/ApacheHiveIntegration.html).
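
As a starting point, the sketch below writes a small HiveQL file that registers the deployed jars and two theta sketch functions. Treat it as an illustration only: the UDF class names are taken from the Datasketches Hive documentation as best understood here and should be checked against the linked page, registering all three jars separately is an assumption, and `my_table` and `user_id` are placeholders.

```bash
#!/bin/bash
DS_LIBPATH=/usr/lib/datasketches
HQL_FILE="$(mktemp)"

# Write a HiveQL script that adds the jars and defines theta sketch functions.
cat > "${HQL_FILE}" <<EOF
ADD JAR ${DS_LIBPATH}/datasketches-memory-2.0.0.jar;
ADD JAR ${DS_LIBPATH}/datasketches-java-3.1.0.jar;
ADD JAR ${DS_LIBPATH}/datasketches-hive-1.2.0.jar;
CREATE TEMPORARY FUNCTION data2sketch AS 'org.apache.datasketches.hive.theta.DataToSketchUDAF';
CREATE TEMPORARY FUNCTION estimate AS 'org.apache.datasketches.hive.theta.EstimateSketchUDF';
-- my_table and user_id are placeholders for your own table and column.
SELECT estimate(data2sketch(user_id)) FROM my_table;
EOF

# Show the generated script and how to run it on the cluster.
cat "${HQL_FILE}"
echo "Run with: hive -f ${HQL_FILE}"
```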

### Pig:

1. Change to the `/usr/lib/datasketches` directory and follow the [Datasketches Pig examples](https://datasketches.apache.org/docs/SystemIntegrations/ApachePigIntegration.html).

78 changes: 78 additions & 0 deletions datasketches/datasketches.sh
@@ -0,0 +1,78 @@
#!/bin/bash
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS-IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This script installs the following Datasketches libraries on Dataproc image versions 2.1 and above:
# datasketches-java - https://github.com/apache/datasketches-java
# datasketches-memory - https://github.com/apache/datasketches-memory
# datasketches-hive - https://github.com/apache/datasketches-hive
# datasketches-pig - https://github.com/apache/datasketches-pig
# Official documentation link - https://datasketches.apache.org/
set -euxo pipefail

# Detect dataproc image version
if (! test -v DATAPROC_IMAGE_VERSION) && test -v DATAPROC_VERSION; then
  DATAPROC_IMAGE_VERSION="${DATAPROC_VERSION}"
fi

if [[ $(echo "${DATAPROC_IMAGE_VERSION} < 2.1" | bc -l) == 1 ]]; then
  echo "Datasketches integration is not supported on Dataproc image versions < 2.1"
  exit 0
fi

readonly MAVEN_CENTRAL_URI=https://maven-central.storage-download.googleapis.com/maven2
readonly DS_LIBPATH="/usr/lib/datasketches"
readonly SPARK_VERSION=$(spark-submit --version 2>&1 | sed -n 's/.*version[[:blank:]]\+\([0-9]\+\.[0-9]\).*/\1/p' | head -n1)
readonly SPARK_JAVA_EXAMPLE_JAR="gs://spark-lib/datasketches/spark-java-thetasketches-1.0-SNAPSHOT.jar"

function download_libraries()
{
  mkdir -p "${DS_LIBPATH}"
  declare -A all_components=( [java]="3.1.0" [hive]="1.2.0" [memory]="2.0.0" [pig]="1.1.0" )
  local component version jar

  for component in "${!all_components[@]}"
  do
    version="${all_components[${component}]}"
    jar="datasketches-${component}-${version}.jar"
    # Test wget directly: with `set -e`, a separate `$?` check would never run on failure.
    if wget -P "${DS_LIBPATH}" "${MAVEN_CENTRAL_URI}/org/apache/datasketches/datasketches-${component}/${version}/${jar}"; then
      echo "Downloaded ${jar} successfully"
    else
      echo "Problem downloading ${jar} from ${MAVEN_CENTRAL_URI}, exiting!"
      exit 1
    fi
  done
}

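function download_example_jar()
{
  # Compare with sort -V; a plain string comparison would rank "3.10" below "3.5".
  local lowest
  lowest=$(printf '%s\n' "${SPARK_VERSION}" "3.5" | sort -V | head -n1)
  if [[ "${SPARK_VERSION}" != "3.5" && "${lowest}" == "${SPARK_VERSION}" ]]; then
    # Test gsutil directly so the failure branch is reachable under `set -e`.
    if gsutil cp "${SPARK_JAVA_EXAMPLE_JAR}" "${DS_LIBPATH}"; then
      echo "Downloaded ${SPARK_JAVA_EXAMPLE_JAR} successfully"
    else
      echo "Problem downloading ${SPARK_JAVA_EXAMPLE_JAR} from GCS, exiting!"
      exit 1
    fi
  else
    echo "Datasketches libraries are already included in Spark version 3.5.0 and onwards! Follow README for examples"
  fi
}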

function main()
{
  download_libraries
  download_example_jar
}

main
46 changes: 46 additions & 0 deletions datasketches/pom.xml
@@ -0,0 +1,46 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.apache.datasketches</groupId>
  <artifactId>spark-java-thetasketches</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>spark-java-thetasketches</name>
  <url>http://maven.apache.org</url>
  <properties>
    <maven.compiler.source>11</maven.compiler.source>
    <maven.compiler.target>11</maven.compiler.target>
  </properties>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.datasketches</groupId>
      <artifactId>datasketches-memory</artifactId>
      <version>2.0.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.13</artifactId>
      <version>3.3.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.datasketches/datasketches-java -->
    <dependency>
      <groupId>org.apache.datasketches</groupId>
      <artifactId>datasketches-java</artifactId>
      <version>3.1.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.13</artifactId>
      <version>3.3.0</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>
</project>
