spark-redshift: Integrating Spark and Redshift
spark-redshift is a library that loads data from Amazon Redshift into Spark SQL DataFrames and can also write DataFrames back to Redshift tables. It uses Amazon S3 to move data efficiently into and out of Redshift, automatically issuing the corresponding Redshift COPY and UNLOAD commands.
Example code:
import org.apache.spark.sql._

val sc = // existing SparkContext
val sqlContext = new SQLContext(sc)

// Get some data from a Redshift table
val df: DataFrame = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://path/for/temp/data")
  .load()

// Can also load data from a Redshift query
val queryDf: DataFrame = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
  .option("query", "select x, count(*) from my_table group by x")
  .option("tempdir", "s3n://path/for/temp/data")
  .load()

// Apply some transformations to the data as per normal, then use the
// Data Source API to write the data back to another table
df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
  .option("dbtable", "my_table_copy")
  .option("tempdir", "s3n://path/for/temp/data")
  .mode("error")
  .save()
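Besides the programmatic Data Source API shown above, the same options can also be supplied declaratively by registering the Redshift table in Spark SQL. A minimal sketch, assuming the same hypothetical host, table, and S3 path as above:

val sqlContext2 = new SQLContext(sc)

// Register the Redshift table as a Spark SQL table; reads through it
// will unload via the S3 tempdir just like the read path above
sqlContext2.sql("""
  CREATE TEMPORARY TABLE my_table
  USING com.databricks.spark.redshift
  OPTIONS (
    dbtable 'my_table',
    tempdir 's3n://path/for/temp/data',
    url 'jdbc:redshift://redshifthost:5439/database?user=username&password=pass'
  )
""")

// Query it like any other Spark SQL table
val counts = sqlContext2.sql("select x, count(*) from my_table group by x")

This style is convenient when the consumers of the data only speak SQL; the option keys are the same as those passed to .option() in the Scala API.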