SOLVED: Parquet with Athena VS Redshift

Louis Wong:

I hope someone out there can help me with this issue. I am currently working on a data pipeline project, and my dilemma is whether to query Parquet with Athena or to store the data in Redshift.

Two scenarios. First,

EVENTS --> STORE IT IN S3 AS JSON.GZ --> USE SPARK(EMR) TO CONVERT TO PARQUET --> STORE PARQUET BACK INTO S3 --> ATHENA FOR QUERY --> VIZ
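
For the Athena step in this scenario, the Parquet files in S3 need an external table defined over them before they can be queried. A minimal sketch that builds such a DDL statement is below; the table name, column names and types, the `dt` partition column, and the bucket path are all placeholders assumed for illustration, not taken from the actual pipeline.

```python
def athena_parquet_ddl(table, columns, partition_cols, location):
    """Build a CREATE EXTERNAL TABLE statement for Parquet files stored in S3.

    columns and partition_cols are lists of (name, type) pairs.
    """
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    parts = ", ".join(f"{name} {dtype}" for name, dtype in partition_cols)
    lines = [f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (", f"  {cols}", ")"]
    if parts:
        # Partition columns are declared separately from the data columns.
        lines.append(f"PARTITIONED BY ({parts})")
    lines.append("STORED AS PARQUET")
    lines.append(f"LOCATION '{location}'")
    return "\n".join(lines)


# Hypothetical schema and bucket, for illustration only.
ddl = athena_parquet_ddl(
    table="events",
    columns=[("event_id", "string"), ("event_type", "string"), ("ts", "timestamp")],
    partition_cols=[("dt", "string")],
    location="s3://example-bucket/events/parquet/",
)
print(ddl)
```

If the table is partitioned and Spark wrote the files in Hive-style `dt=.../` folders, the partitions still have to be registered (for example with `MSCK REPAIR TABLE events`) before Athena returns rows from them.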

Second,

EVENTS --> STORE IT IN S3 --> USE SPARK(EMR) TO STORE DATA INTO REDSHIFT

Issues with this scenario:

  1. Spark JDBC with Redshift is slow
  2. The Spark-Redshift repo by Databricks has a failing build and was last updated 2 years ago

I am unable to find useful information on which method is better. Should I even use Redshift, or is Parquet good enough?

Also, it would be great if someone could tell me whether there are any other methods for connecting Spark with Redshift, because I only found two solutions online: JDBC and Spark-Redshift (Databricks).
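
One route that may be worth considering, besides the two connectors above, is to skip a direct Spark connection: let Spark write Parquet to S3 as in the first scenario, then load those files with Redshift's `COPY` command, which accepts `FORMAT AS PARQUET`. The sketch below only builds the statement; the table name, S3 prefix, and IAM role ARN are hypothetical placeholders, and the statement would still have to be run through whatever SQL client is already connected to the cluster.

```python
def redshift_copy_parquet(table, s3_prefix, iam_role_arn):
    """Build a Redshift COPY statement that loads Parquet files from an S3 prefix."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


# Placeholder identifiers, for illustration only.
copy_sql = redshift_copy_parquet(
    table="analytics.events",
    s3_prefix="s3://example-bucket/events/parquet/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
)
print(copy_sql)
```

With this layout the same Parquet output can serve both Athena and Redshift, so the choice between them does not have to be made at write time.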

P.S. The pricing model is not a concern for me. Also, I'm dealing with millions of events.



Posted in S.E.F
via StackOverflow & StackExchange Atomic Web Robots