Spark Kafka: Managing Offsets Yourself with the Direct Approach

程序源代码

2020-08-05 19:55

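1. Managing offsets yourself with Kafka's createDirectStream in Spark Streaming

In Spark Streaming, createDirectStream is currently the officially recommended way to consume from Kafka, but with it, committing consumer offsets (here, to ZooKeeper) is left to us. Most of the available material implements this in Scala, and the implementations all follow the same pattern. I ported the Scala implementation to Java; both versions are given below.
The Direct Approach fits Spark's model better. Recall that an RDD is an immutable, partitioned collection of data. Here the Kafka source is wrapped in a KafkaRDD, and each partition of the RDD corresponds to a Kafka partition. The only difference is that the data stays in Kafka instead of being loaded into Spark memory beforehand. FileInputDStream works in a similar way: it maps each file to an RDD.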

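2. DirectKafkaInputDStream

The entry point for receiving data through the Direct Approach is KafkaUtils.createDirectStream. When this method is called, it first creates
val kc = new KafkaCluster(kafkaParams)
KafkaCluster is the class that actually talks to Kafka; it fetches the partition information for the requested topics. Next, a DirectKafkaInputDStream is created for those topics, and the starting offset of every partition of every topic is looked up. If auto.offset.reset is set to smallest, the earliest offsets are used; otherwise the latest offsets are used.
Each DirectKafkaInputDStream also holds its own KafkaCluster instance.
When a batch interval comes due, the compute method of the DirectKafkaInputDStream is called and does the following:
  1. Get the untilOffset of each Kafka partition. This fixes the offset range to read, and therefore how much data the batch has to process.
  2. Build a KafkaRDD instance. In each batch interval, one DirectKafkaInputDStream produces exactly one KafkaRDD.
  3. Report the offset information to InputInfoTracker.
  4. Return the RDD.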
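
The first step above, choosing the offset range for a batch, can be sketched without Spark or Kafka on the classpath. This is a minimal model under stated assumptions, not the DirectKafkaInputDStream source: `BatchPlanner`, `Range`, and the `maxPerPartition` cap (standing in for the effect of a per-partition rate limit such as `spark.streaming.kafka.maxRatePerPartition`) are hypothetical names introduced only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical model of one batch: for each partition, read [fromOffset, untilOffset). */
final class BatchPlanner {

    /** Half-open offset interval for one partition (illustrative, not Spark's OffsetRange). */
    static final class Range {
        final long fromOffset;
        final long untilOffset;

        Range(long fromOffset, long untilOffset) {
            this.fromOffset = fromOffset;
            this.untilOffset = untilOffset;
        }

        long count() {
            return untilOffset - fromOffset;
        }
    }

    /**
     * Chooses untilOffset per partition: the latest available offset,
     * capped at maxPerPartition messages beyond the current position.
     */
    static Map<String, Range> plan(Map<String, Long> current,
                                   Map<String, Long> latest,
                                   long maxPerPartition) {
        Map<String, Range> batch = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : current.entrySet()) {
            long from = e.getValue();
            // If the latest offset is unknown, assume nothing new is available.
            long available = latest.getOrDefault(e.getKey(), from);
            long until = Math.min(available, from + maxPerPartition);
            batch.put(e.getKey(), new Range(from, Math.max(from, until)));
        }
        return batch;
    }

    public static void main(String[] args) {
        Map<String, Long> current = new LinkedHashMap<>();
        current.put("finance_test2-0", 100L);
        current.put("finance_test2-1", 40L);

        Map<String, Long> latest = new LinkedHashMap<>();
        latest.put("finance_test2-0", 180L);
        latest.put("finance_test2-1", 45L);

        // 25 messages per partition per batch is an assumed value.
        for (Map.Entry<String, Range> e : plan(current, latest, 25L).entrySet()) {
            Range r = e.getValue();
            System.out.println(e.getKey() + " [" + r.fromOffset + ", " + r.untilOffset
                    + ") -> " + r.count() + " messages");
        }
    }
}
```

Once the ranges are fixed, the size of the batch is known before any message has been read, which is what lets the stream report it up front.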

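3. Structure of KafkaRDD

A KafkaRDD contains N KafkaRDDPartition objects, where N is the number of Kafka partitions. Each KafkaRDDPartition only holds metadata such as the topic and offsets; the data itself is pulled through a KafkaRDDIterator, and there is one KafkaRDDIterator per KafkaRDDPartition.
The whole process is lazy: the data stays in Kafka until an action is triggered, and only then is it actively pulled from Kafka.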
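
The split between a metadata-only partition and an iterator that does the fetching can be illustrated with plain Java. This is only a sketch of the idea, not the KafkaRDD implementation: `LazyPartition` and its `fetch` function (which simulates a read from a broker) are hypothetical.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.LongFunction;

/** Hypothetical partition descriptor: holds metadata only, no records. */
final class LazyPartition implements Iterable<String> {
    final String topic;
    final int partition;
    final long fromOffset;
    final long untilOffset;
    private final LongFunction<String> fetch; // simulates fetching one record by offset
    int fetchCalls = 0;                       // counts how often data was actually pulled

    LazyPartition(String topic, int partition, long fromOffset, long untilOffset,
                  LongFunction<String> fetch) {
        this.topic = topic;
        this.partition = partition;
        this.fromOffset = fromOffset;
        this.untilOffset = untilOffset;
        this.fetch = fetch;
    }

    /** One iterator per partition; records are fetched only when next() is called. */
    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            private long cursor = fromOffset;

            @Override
            public boolean hasNext() {
                return cursor < untilOffset;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                fetchCalls++;
                return fetch.apply(cursor++);
            }
        };
    }

    public static void main(String[] args) {
        LazyPartition p = new LazyPartition("finance_test2", 0, 10L, 13L, off -> "msg-" + off);
        System.out.println("fetches before iteration: " + p.fetchCalls);
        for (String record : p) {
            System.out.println(record);
        }
        System.out.println("fetches after iteration: " + p.fetchCalls);
    }
}
```

Creating the descriptor costs nothing; the simulated fetch only runs once something iterates, which mirrors how an action triggers the pull from Kafka.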

4. Managing offsets in Java

    // Note: this class must live in the package below, because it uses
    // KafkaCluster, which is not publicly accessible in the Spark 1.x Kafka integration.
    package org.apache.spark.streaming.kafka;

    import kafka.common.TopicAndPartition;
    import kafka.message.MessageAndMetadata;
    import kafka.serializer.StringDecoder;
    import org.apache.spark.SparkException;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.streaming.api.java.JavaInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import scala.Tuple2;
    import scala.collection.JavaConversions;
    import scala.collection.mutable.ArrayBuffer;
    import scala.util.Either;

    import java.io.Serializable;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class JavaKafkaManager implements Serializable {

        private scala.collection.immutable.Map<String, String> kafkaParams;
        private KafkaCluster kafkaCluster;

        public JavaKafkaManager(Map<String, String> kafkaParams) {
            this.kafkaParams = toScalaImmutableMap(kafkaParams);
            kafkaCluster = new KafkaCluster(this.kafkaParams);
        }

        public JavaInputDStream<String> createDirectStream(
                JavaStreamingContext jssc,
                Map<String, String> kafkaParams,
                Set<String> topics) throws SparkException {

            String groupId = kafkaParams.get("group.id");

            // Before reading offsets from ZooKeeper, bring them up to date
            setOrUpdateOffsets(topics, groupId);

            // Read the offsets from ZooKeeper and start consuming from there
            scala.collection.immutable.Set<String> immutableTopics =
                    JavaConversions.asScalaSet(topics).toSet();
            Either<ArrayBuffer<Throwable>, scala.collection.immutable.Set<TopicAndPartition>> partitionsE =
                    kafkaCluster.getPartitions(immutableTopics);

            if (partitionsE.isLeft()) {
                throw new SparkException("get kafka partition failed: " + partitionsE.left().get());
            }
            Either.RightProjection<ArrayBuffer<Throwable>, scala.collection.immutable.Set<TopicAndPartition>> partitions =
                    partitionsE.right();
            Either<ArrayBuffer<Throwable>, scala.collection.immutable.Map<TopicAndPartition, Object>> consumerOffsetsE =
                    kafkaCluster.getConsumerOffsets(groupId, partitions.get());

            if (consumerOffsetsE.isLeft()) {
                throw new SparkException("get kafka consumer offsets failed: " + consumerOffsetsE.left().get());
            }
            scala.collection.immutable.Map<TopicAndPartition, Object> consumerOffsetsTemp =
                    consumerOffsetsE.right().get();
            Map<TopicAndPartition, Object> consumerOffsets = JavaConversions.mapAsJavaMap(consumerOffsetsTemp);

            // Scala's Long shows up as Object from Java; the Java API wants java.lang.Long
            Map<TopicAndPartition, Long> consumerOffsetsLong = new HashMap<>();
            for (TopicAndPartition key : consumerOffsets.keySet()) {
                consumerOffsetsLong.put(key, (Long) consumerOffsets.get(key));
            }

            JavaInputDStream<String> message = KafkaUtils.createDirectStream(
                    jssc,
                    String.class,
                    String.class,
                    StringDecoder.class,
                    StringDecoder.class,
                    String.class,
                    kafkaParams,
                    consumerOffsetsLong,
                    new Function<MessageAndMetadata<String, String>, String>() {
                        @Override
                        public String call(MessageAndMetadata<String, String> v) throws Exception {
                            return v.message();
                        }
                    });

            return message;
        }

        /**
         * Before creating the stream, update the stored offsets according to
         * what has actually been consumed.
         * @param topics
         * @param groupId
         */
        private void setOrUpdateOffsets(Set<String> topics, String groupId) throws SparkException {
            for (String topic : topics) {
                boolean hasConsumed = true;
                HashSet<String> topicSet = new HashSet<>();
                topicSet.add(topic);
                scala.collection.immutable.Set<String> immutableTopic =
                        JavaConversions.asScalaSet(topicSet).toSet();
                Either<ArrayBuffer<Throwable>, scala.collection.immutable.Set<TopicAndPartition>> partitionsE =
                        kafkaCluster.getPartitions(immutableTopic);

                if (partitionsE.isLeft()) {
                    throw new SparkException("get kafka partition failed: " + partitionsE.left().get());
                }
                scala.collection.immutable.Set<TopicAndPartition> partitions = partitionsE.right().get();
                Either<ArrayBuffer<Throwable>, scala.collection.immutable.Map<TopicAndPartition, Object>> consumerOffsetsE =
                        kafkaCluster.getConsumerOffsets(groupId, partitions);

                if (consumerOffsetsE.isLeft()) {
                    hasConsumed = false;
                }

                if (hasConsumed) { // this group has consumed before
                    /*
                     * If the streaming job fails with kafka.common.OffsetOutOfRangeException,
                     * the offsets stored in ZooKeeper are out of date: Kafka's retention
                     * cleanup has already deleted the log files that contained them.
                     * To handle this, compare the consumerOffsets in ZooKeeper with the
                     * earliestLeaderOffsets. If a consumer offset is smaller than the
                     * earliest leader offset, it is stale and is reset to the earliest
                     * leader offset.
                     */
                    Either<ArrayBuffer<Throwable>, scala.collection.immutable.Map<TopicAndPartition, KafkaCluster.LeaderOffset>> earliestLeaderOffsetsE =
                            kafkaCluster.getEarliestLeaderOffsets(partitions);
                    if (earliestLeaderOffsetsE.isLeft()) {
                        throw new SparkException("get earliest leader offsets failed: " + earliestLeaderOffsetsE.left().get());
                    }

                    scala.collection.immutable.Map<TopicAndPartition, KafkaCluster.LeaderOffset> earliestLeaderOffsets =
                            earliestLeaderOffsetsE.right().get();
                    scala.collection.immutable.Map<TopicAndPartition, Object> consumerOffsets =
                            consumerOffsetsE.right().get();

                    // Only some partitions may be stale, so reset only those to the earliest leader offset
                    HashMap<TopicAndPartition, Object> offsets = new HashMap<>();
                    Map<TopicAndPartition, Object> topicAndPartitionObjectMap =
                            JavaConversions.mapAsJavaMap(consumerOffsets);
                    for (TopicAndPartition key : topicAndPartitionObjectMap.keySet()) {
                        Long n = (Long) topicAndPartitionObjectMap.get(key);
                        long earliestLeaderOffset = earliestLeaderOffsets.get(key).get().offset();
                        if (n < earliestLeaderOffset) {
                            System.out.println("consumer group:" + groupId + ",topic:" + key.topic()
                                    + ",partition:" + key.partition()
                                    + " offsets are out of date, updating to " + earliestLeaderOffset);
                            offsets.put(key, earliestLeaderOffset);
                        }
                    }
                    if (!offsets.isEmpty()) {
                        scala.collection.immutable.Map<TopicAndPartition, Object> topicAndPartitionLongMap =
                                toScalaImmutableMap(offsets);
                        kafkaCluster.setConsumerOffsets(groupId, topicAndPartitionLongMap);
                    }
                } else { // this group has never consumed
                    // Fall back to the latest offsets when auto.offset.reset is not set
                    scala.Option<String> resetOpt = kafkaParams.get("auto.offset.reset");
                    String offsetReset = resetOpt.isDefined() ? resetOpt.get().toLowerCase() : "";
                    scala.collection.immutable.Map<TopicAndPartition, KafkaCluster.LeaderOffset> leaderOffsets = null;
                    if ("smallest".equals(offsetReset)) {
                        Either<ArrayBuffer<Throwable>, scala.collection.immutable.Map<TopicAndPartition, KafkaCluster.LeaderOffset>> leaderOffsetsE =
                                kafkaCluster.getEarliestLeaderOffsets(partitions);
                        if (leaderOffsetsE.isLeft()) {
                            throw new SparkException("get earliest leader offsets failed: " + leaderOffsetsE.left().get());
                        }
                        leaderOffsets = leaderOffsetsE.right().get();
                    } else {
                        Either<ArrayBuffer<Throwable>, scala.collection.immutable.Map<TopicAndPartition, KafkaCluster.LeaderOffset>> latestLeaderOffsetsE =
                                kafkaCluster.getLatestLeaderOffsets(partitions);
                        if (latestLeaderOffsetsE.isLeft()) {
                            throw new SparkException("get latest leader offsets failed: " + latestLeaderOffsetsE.left().get());
                        }
                        leaderOffsets = latestLeaderOffsetsE.right().get();
                    }
                    Map<TopicAndPartition, KafkaCluster.LeaderOffset> topicAndPartitionLeaderOffsetMap =
                            JavaConversions.mapAsJavaMap(leaderOffsets);
                    Map<TopicAndPartition, Object> offsets = new HashMap<>();
                    for (TopicAndPartition key : topicAndPartitionLeaderOffsetMap.keySet()) {
                        KafkaCluster.LeaderOffset offset = topicAndPartitionLeaderOffsetMap.get(key);
                        long offset1 = offset.offset();
                        offsets.put(key, offset1);
                    }

                    scala.collection.immutable.Map<TopicAndPartition, Object> immutableOffsets =
                            toScalaImmutableMap(offsets);
                    kafkaCluster.setConsumerOffsets(groupId, immutableOffsets);
                }
            }
        }

        /**
         * Write the consumed offsets back to ZooKeeper.
         * @param rdd
         */
        public void updateZKOffsets(JavaRDD<String> rdd) {
            String groupId = kafkaParams.get("group.id").get();

            OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
            for (OffsetRange offset : offsetRanges) {
                TopicAndPartition topicAndPartition = new TopicAndPartition(offset.topic(), offset.partition());
                Map<TopicAndPartition, Object> offsets = new HashMap<>();
                offsets.put(topicAndPartition, offset.untilOffset());
                Either<ArrayBuffer<Throwable>, scala.collection.immutable.Map<TopicAndPartition, Object>> o =
                        kafkaCluster.setConsumerOffsets(groupId, toScalaImmutableMap(offsets));
                if (o.isLeft()) {
                    System.out.println("Error updating the offset to Kafka cluster: " + o.left().get());
                }
            }
        }

        /**
         * Convert a java.util.Map to a scala.collection.immutable.Map.
         * @param javaMap
         * @param <K>
         * @param <V>
         * @return
         */
        @SuppressWarnings("unchecked")
        private static <K, V> scala.collection.immutable.Map<K, V> toScalaImmutableMap(java.util.Map<K, V> javaMap) {
            final java.util.List<scala.Tuple2<K, V>> list = new java.util.ArrayList<>(javaMap.size());
            for (final java.util.Map.Entry<K, V> entry : javaMap.entrySet()) {
                list.add(scala.Tuple2.apply(entry.getKey(), entry.getValue()));
            }
            final scala.collection.Seq<Tuple2<K, V>> seq =
                    scala.collection.JavaConverters.asScalaBufferConverter(list).asScala().toSeq();
            return (scala.collection.immutable.Map<K, V>) scala.collection.immutable.Map$.MODULE$.apply(seq);
        }
    }

    import org.apache.spark.SparkConf;
    import org.apache.spark.SparkException;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.api.java.function.VoidFunction;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka.JavaKafkaManager;

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;

    public class KafkaManagerDemo {

        public static void main(String[] args) throws SparkException, InterruptedException {

            SparkConf sparkConf = new SparkConf().setAppName(KafkaManagerDemo.class.getName());
            sparkConf.setMaster("local[3]");
            sparkConf.set("spark.streaming.kafka.maxRatePerPartition", "5");
            sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");

            JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
            JavaStreamingContext javaStreamingContext =
                    new JavaStreamingContext(javaSparkContext, Durations.seconds(5));
            javaStreamingContext.sparkContext().setLogLevel("WARN");

            String brokers = "localhost:9092";
            String topics = "finance_test2";
            String groupId = "test22";

            HashSet<String> topicsSet = new HashSet<>();
            topicsSet.add(topics);

            Map<String, String> kafkaParams = new HashMap<>();
            kafkaParams.put("metadata.broker.list", brokers);
            kafkaParams.put("group.id", groupId);
            kafkaParams.put("auto.offset.reset", "smallest");

            // final so that the anonymous classes below can capture it
            final JavaKafkaManager javaKafkaManager = new JavaKafkaManager(kafkaParams);
            JavaInputDStream<String> message =
                    javaKafkaManager.createDirectStream(javaStreamingContext, kafkaParams, topicsSet);

            message.transform(new Function<JavaRDD<String>, JavaRDD<String>>() {
                @Override
                public JavaRDD<String> call(JavaRDD<String> v1) throws Exception {
                    return v1;
                }
            }).foreachRDD(new VoidFunction<JavaRDD<String>>() {
                @Override
                public void call(JavaRDD<String> rdd) throws Exception {
                    System.out.println(rdd);
                    if (!rdd.isEmpty()) {
                        // Process the messages first
                        rdd.foreach(new VoidFunction<String>() {
                            @Override
                            public void call(String r) throws Exception {
                                System.out.println(r);
                            }
                        });
                        // Then update the offsets
                        javaKafkaManager.updateZKOffsets(rdd);
                    }
                }
            });

            javaStreamingContext.start();
            javaStreamingContext.awaitTermination();
        }
    }
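
The decision logic inside setOrUpdateOffsets is easier to follow once the Scala/Java interop is stripped away. The sketch below restates it over plain `java.util.Map` values keyed by partition number. It is an illustration under assumptions rather than a drop-in replacement: `OffsetReconciler` and `startingOffsets` are hypothetical names, and a `null` map for `committed` stands in for "the group has never consumed".

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical, dependency-free restatement of the offset reconciliation rules. */
final class OffsetReconciler {

    /**
     * @param committed       offsets stored for the group, or null if it never consumed
     * @param earliest        earliest offset still available per partition
     * @param latest          latest offset per partition
     * @param autoOffsetReset "smallest" starts from earliest; anything else from latest
     * @return the offsets to start consuming from
     */
    static Map<Integer, Long> startingOffsets(Map<Integer, Long> committed,
                                              Map<Integer, Long> earliest,
                                              Map<Integer, Long> latest,
                                              String autoOffsetReset) {
        Map<Integer, Long> result = new TreeMap<>();
        if (committed == null) {
            // Never consumed: pick the starting point from auto.offset.reset.
            result.putAll("smallest".equalsIgnoreCase(autoOffsetReset) ? earliest : latest);
            return result;
        }
        for (Map.Entry<Integer, Long> e : committed.entrySet()) {
            // A committed offset below the earliest available one is stale
            // (the log segment was deleted), so move it up to the earliest offset.
            long floor = earliest.getOrDefault(e.getKey(), e.getValue());
            result.put(e.getKey(), Math.max(e.getValue(), floor));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Long> committed = new TreeMap<>();
        committed.put(0, 5L);    // stale: earliest is 50
        committed.put(1, 120L);  // still valid

        Map<Integer, Long> earliest = new TreeMap<>();
        earliest.put(0, 50L);
        earliest.put(1, 100L);

        Map<Integer, Long> latest = new TreeMap<>();
        latest.put(0, 200L);
        latest.put(1, 300L);

        System.out.println(startingOffsets(committed, earliest, latest, "smallest"));
        System.out.println(startingOffsets(null, earliest, latest, "smallest"));
        System.out.println(startingOffsets(null, earliest, latest, "largest"));
    }
}
```

Only partition 0 is moved in the first call; partition 1 keeps its committed offset, which matches the "only reset the stale partitions" comment in the listing above.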

5. Managing offsets in Scala

    package org.apache.spark.streaming.kafka

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.Decoder
    import org.apache.spark.SparkException
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.dstream.InputDStream
    import org.apache.spark.streaming.kafka.KafkaCluster.LeaderOffset

    import scala.reflect.ClassTag

    /**
      * Manages offsets manually
      */
    class KafkaManager(val kafkaParams: Map[String, String]) extends Serializable {

      private val kc = new KafkaCluster(kafkaParams)

      /**
        * Create the direct stream
        */
      def createDirectStream[K: ClassTag, V: ClassTag, KD <: Decoder[K]: ClassTag, VD <: Decoder[V]: ClassTag](
          ssc: StreamingContext,
          kafkaParams: Map[String, String],
          topics: Set[String]): InputDStream[(K, V)] = {
        val groupId = kafkaParams.get("group.id").get
        // Before reading offsets from ZooKeeper, bring them up to date
        setOrUpdateOffsets(topics, groupId)

        // Read the offsets from ZooKeeper and start consuming from there
        val messages = {
          val partitionsE = kc.getPartitions(topics)
          if (partitionsE.isLeft)
            throw new SparkException(s"get kafka partition failed: ${partitionsE.left.get}")
          val partitions = partitionsE.right.get
          val consumerOffsetsE = kc.getConsumerOffsets(groupId, partitions)
          if (consumerOffsetsE.isLeft)
            throw new SparkException(s"get kafka consumer offsets failed: ${consumerOffsetsE.left.get}")
          val consumerOffsets = consumerOffsetsE.right.get
          KafkaUtils.createDirectStream[K, V, KD, VD, (K, V)](
            ssc, kafkaParams, consumerOffsets, (mmd: MessageAndMetadata[K, V]) => (mmd.key, mmd.message))
        }
        messages
      }

      /**
        * Before creating the stream, update the stored offsets according to
        * what has actually been consumed.
        * @param topics
        * @param groupId
        */
      private def setOrUpdateOffsets(topics: Set[String], groupId: String): Unit = {
        topics.foreach(topic => {
          var hasConsumed = true
          val partitionsE = kc.getPartitions(Set(topic))
          if (partitionsE.isLeft)
            throw new SparkException(s"get kafka partition failed: ${partitionsE.left.get}")
          val partitions = partitionsE.right.get
          val consumerOffsetsE = kc.getConsumerOffsets(groupId, partitions)
          if (consumerOffsetsE.isLeft) hasConsumed = false
          if (hasConsumed) { // this group has consumed before
            /*
             * If the streaming job fails with kafka.common.OffsetOutOfRangeException,
             * the offsets stored in ZooKeeper are out of date: Kafka's retention
             * cleanup has already deleted the log files that contained them.
             * To handle this, compare the consumerOffsets in ZooKeeper with the
             * earliestLeaderOffsets. If a consumer offset is smaller than the
             * earliest leader offset, it is stale and is reset to the earliest
             * leader offset.
             */
            val earliestLeaderOffsetsE = kc.getEarliestLeaderOffsets(partitions)
            if (earliestLeaderOffsetsE.isLeft)
              throw new SparkException(s"get earliest leader offsets failed: ${earliestLeaderOffsetsE.left.get}")
            val earliestLeaderOffsets = earliestLeaderOffsetsE.right.get
            val consumerOffsets = consumerOffsetsE.right.get

            // Only some partitions may be stale, so reset only those to the earliest leader offset
            var offsets: Map[TopicAndPartition, Long] = Map()
            consumerOffsets.foreach({ case (tp, n) =>
              val earliestLeaderOffset = earliestLeaderOffsets(tp).offset
              if (n < earliestLeaderOffset) {
                println("consumer group:" + groupId + ",topic:" + tp.topic + ",partition:" + tp.partition +
                  " offsets are out of date, updating to " + earliestLeaderOffset)
                offsets += (tp -> earliestLeaderOffset)
              }
            })
            if (!offsets.isEmpty) {
              kc.setConsumerOffsets(groupId, offsets)
            }
          } else { // this group has never consumed
            val reset = kafkaParams.get("auto.offset.reset").map(_.toLowerCase)
            var leaderOffsets: Map[TopicAndPartition, LeaderOffset] = null
            if (reset == Some("smallest")) {
              val leaderOffsetsE = kc.getEarliestLeaderOffsets(partitions)
              if (leaderOffsetsE.isLeft)
                throw new SparkException(s"get earliest leader offsets failed: ${leaderOffsetsE.left.get}")
              leaderOffsets = leaderOffsetsE.right.get
            } else {
              val leaderOffsetsE = kc.getLatestLeaderOffsets(partitions)
              if (leaderOffsetsE.isLeft)
                throw new SparkException(s"get latest leader offsets failed: ${leaderOffsetsE.left.get}")
              leaderOffsets = leaderOffsetsE.right.get
            }
            val offsets = leaderOffsets.map {
              case (tp, offset) => (tp, offset.offset)
            }
            kc.setConsumerOffsets(groupId, offsets)
          }
        })
      }

      /**
        * Write the consumed offsets back to ZooKeeper
        * @param rdd
        */
      def updateZKOffsets(rdd: RDD[(String, String)]): Unit = {
        val groupId = kafkaParams.get("group.id").get
        val offsetsList = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

        for (offsets <- offsetsList) {
          val topicAndPartition = TopicAndPartition(offsets.topic, offsets.partition)
          val o = kc.setConsumerOffsets(groupId, Map((topicAndPartition, offsets.untilOffset)))
          if (o.isLeft) {
            println(s"Error updating the offset to Kafka cluster: ${o.left.get}")
          }
        }
      }
    }

    import kafka.serializer.StringDecoder
    import org.apache.spark.rdd.RDD
    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.kafka.KafkaManager
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object SparkKafkaStreaming {

      /* def dealLine(line: String): String = {
        val list = line.split(',').toList
        // val list = AnalysisUtil.dealString(line, ',', '"') // treat dealString as a plain split
        list.get(0).substring(0, 10) + "-" + list.get(26)
      }*/

      def processRdd(rdd: RDD[(String, String)]): Unit = {
        // Counts the messages in the batch by mapping every value to the same key
        val lines = rdd.map(_._2).map(x => (1, 1)).reduceByKey(_ + _)
        /*val words = lines.map(_.split(" "))
        val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)*/
        lines.foreach(println)
      }

      def main(args: Array[String]): Unit = {
        if (args.length < 3) {
          System.err.println(
            s"""
               |Usage: DirectKafkaWordCount <brokers> <topics> <groupId>
               |  <brokers> is a list of one or more Kafka brokers
               |  <topics> is a list of one or more kafka topics to consume from
               |  <groupId> is a consume group
               |
            """.stripMargin)
          System.exit(1)
        }

        Logger.getLogger("org").setLevel(Level.WARN)

        val Array(brokers, topics, groupId) = args

        // Create context with 5 second batch interval
        val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
        sparkConf.setMaster("local[3]")
        sparkConf.set("spark.streaming.kafka.maxRatePerPartition", "5")
        sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

        val ssc = new StreamingContext(sparkConf, Seconds(5))
        ssc.sparkContext.setLogLevel("WARN")

        // Create direct kafka stream with brokers and topics
        val topicsSet = topics.split(",").toSet
        val kafkaParams = Map[String, String](
          "metadata.broker.list" -> brokers,
          "group.id" -> groupId,
          "auto.offset.reset" -> "smallest"
        )

        val km = new KafkaManager(kafkaParams)

        val messages = km.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, topicsSet)

        messages.foreachRDD(rdd => {
          if (!rdd.isEmpty()) {
            // Process the messages first
            processRdd(rdd)
            // Then update the offsets
            km.updateZKOffsets(rdd)
          }
        })

        ssc.start()
        ssc.awaitTermination()
      }
    }
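
Both demos process a batch first and update the offsets afterwards. A consequence of that ordering is that a failure between the two steps causes the batch to be read again, so processing should tolerate duplicates. The following self-contained Java sketch simulates this; `AtLeastOnceDemo`, `runBatch`, and the injected failure are hypothetical and exist only to make the replay visible.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical simulation of "process first, commit afterwards". */
final class AtLeastOnceDemo {
    private long committed = 0;                  // stands in for the offset stored in ZooKeeper
    final List<Long> processed = new ArrayList<>();
    private boolean failNext;                    // inject one failure to show the replay

    AtLeastOnceDemo(boolean failFirstBatch) {
        this.failNext = failFirstBatch;
    }

    /** Processes offsets [committed, until) and commits only if processing succeeded. */
    boolean runBatch(long until) {
        try {
            for (long off = committed; off < until; off++) {
                processed.add(off);
                if (failNext && off == committed + 1) {
                    failNext = false;
                    throw new IllegalStateException("simulated failure at offset " + off);
                }
            }
        } catch (IllegalStateException e) {
            // The offset was not advanced, so the next attempt replays the whole batch.
            return false;
        }
        committed = until;
        return true;
    }

    long committedOffset() {
        return committed;
    }

    public static void main(String[] args) {
        AtLeastOnceDemo demo = new AtLeastOnceDemo(true);
        System.out.println("first attempt ok: " + demo.runBatch(3));
        System.out.println("retry ok: " + demo.runBatch(3));
        System.out.println("processed offsets: " + demo.processed);
        System.out.println("committed offset: " + demo.committedOffset());
    }
}
```

Committing before processing would avoid the duplicates but could lose messages on failure, which is why the demos keep the process-then-commit order.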