A curated list of awesome Apache Spark packages and resources.
The goal of this project is to build a Docker cluster that provides access to Hadoop, HDFS, Hive, PySpark, Sqoop, Airflow, Kafka, Flume, Postgres, Cassandra, Hue, Zeppelin, Kadmin, Kafka Control Center and pgAdmin. The cluster is intended solely for use in a development environment; do not run any production workloads on it.
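A quick sanity check for a cluster like this is a PySpark session with Hive support that confirms the metastore and HDFS are reachable. The sketch below is illustrative only: the service names `hive-metastore` and `namenode`, the ports, and the sample file path are assumptions, not taken from the project.

```python
from pyspark.sql import SparkSession

# Assumed service endpoints exposed by the dev cluster (not part of the project description).
spark = (
    SparkSession.builder
    .appName("dev-cluster-smoke-test")
    .config("hive.metastore.uris", "thrift://hive-metastore:9083")  # hypothetical metastore service
    .config("spark.hadoop.fs.defaultFS", "hdfs://namenode:9000")    # hypothetical HDFS namenode
    .enableHiveSupport()
    .getOrCreate()
)

# List Hive databases and read a small file from HDFS to confirm connectivity.
spark.sql("SHOW DATABASES").show()
spark.read.text("hdfs://namenode:9000/tmp/hello.txt").show(truncate=False)

spark.stop()
```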
Hands-on workshop with Apache Iceberg
Hands-on workshop with Iceberg, Redpanda, Debezium and Kafka-Connect
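A typical starting point for an Iceberg workshop is a Spark session wired to an Iceberg catalog. The sketch below uses a local Hadoop catalog, an assumed runtime version, and made-up table names; the workshop's actual catalog type, warehouse location and Redpanda/Debezium wiring are not specified here.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-workshop-sketch")
    # The runtime coordinate and version are assumptions; match them to your Spark/Scala build.
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Create an Iceberg table in the assumed "demo" catalog and append a few rows.
spark.sql("CREATE TABLE IF NOT EXISTS demo.db.events (id BIGINT, payload STRING) USING iceberg")
spark.sql("INSERT INTO demo.db.events VALUES (1, 'hello'), (2, 'world')")
spark.sql("SELECT * FROM demo.db.events").show()

spark.stop()
```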
GCP_Data_Enginner
Driver/Executor images for spark-operator
Backbone for the MorphL-Community-Edition platform.
Vagrant Box with Python 3.6.1, Apache Spark 2.1.1 with Scala 2.11.8 and PySpark (2.1.1).
Data Warehouse Project - TPC-DS benchmarking on Spark SQL 👨🏻‍💻
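Benchmarking along these lines usually comes down to timing individual queries against pre-generated tables. The sketch below assumes the TPC-DS tables already exist as Parquet under `/data/tpcds/` and uses a simplified TPC-DS-style query; the real benchmark queries, scale factor and data-generation step belong to the project itself.

```python
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tpcds-query-timing").getOrCreate()

# Register the pre-generated Parquet tables as temporary views (paths are assumptions).
for table in ["store_sales", "date_dim", "item"]:
    spark.read.parquet(f"/data/tpcds/{table}").createOrReplaceTempView(table)

# Simplified TPC-DS-style aggregation, not one of the official 99 queries.
query = """
SELECT d_year, i_brand, SUM(ss_ext_sales_price) AS revenue
FROM store_sales
JOIN date_dim ON ss_sold_date_sk = d_date_sk
JOIN item ON ss_item_sk = i_item_sk
GROUP BY d_year, i_brand
ORDER BY d_year, revenue DESC
"""

start = time.time()
spark.sql(query).collect()
print(f"Query finished in {time.time() - start:.1f}s")

spark.stop()
```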
Scalable Spark Docker image that works on Docker Compose and Kubernetes
Proof-of-concept (PoC) for Spark on Kubernetes
Guide to installing Hadoop and Spark on an Oracle virtual machine.
Local integration-test setup for PySpark with AWS via LocalStack
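The usual approach for such a setup is to point Spark's S3A filesystem at LocalStack's edge endpoint. The sketch below assumes LocalStack is running on http://localhost:4566 with a pre-created `test-bucket`, and that the hadoop-aws jars matching your Spark build are on the classpath; none of these names come from the project itself.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("localstack-integration-test")
    # S3A settings for the local emulator; LocalStack accepts arbitrary credentials.
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:4566")
    .config("spark.hadoop.fs.s3a.access.key", "test")
    .config("spark.hadoop.fs.s3a.secret.key", "test")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    .getOrCreate()
)

# Write a small DataFrame to the emulated bucket and read it back to verify the round trip.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.mode("overwrite").parquet("s3a://test-bucket/integration/")
spark.read.parquet("s3a://test-bucket/integration/").show()

spark.stop()
```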
A Forex currency-rates pipeline that fetches rates from an external API and loads the data into HDFS, where a PySpark job transforms it and inserts it into a Hive table. The objective of the pipeline is to make the data ready for any downstream machine learning pipeline.
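The Hive-loading step of a pipeline like this typically boils down to a small PySpark job. The sketch below assumes the ingestion task has already landed JSON rate files under `hdfs:///forex/rates/` with `base`, `date` and nested `rates` fields; the actual API schema, paths and table names in the project may differ.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("forex-rates-to-hive")
    .enableHiveSupport()
    .getOrCreate()
)

# Read the raw rate files dropped into HDFS by the ingestion task (path is an assumption).
raw = spark.read.json("hdfs:///forex/rates/")

# Flatten the payload into a tidy table; the selected currencies are illustrative.
rates = raw.select(
    F.col("base").alias("base_currency"),
    F.to_date("date").alias("rate_date"),
    F.col("rates.USD").alias("usd_rate"),
    F.col("rates.EUR").alias("eur_rate"),
)

# Write into a Hive table partitioned by date (database/table names are assumptions).
spark.sql("CREATE DATABASE IF NOT EXISTS forex")
rates.write.mode("overwrite").partitionBy("rate_date").saveAsTable("forex.daily_rates")

spark.stop()
```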