This article collects a number of code examples for the Java method org.apache.spark.api.java.JavaRDD.zipWithUniqueId() and shows how JavaRDD.zipWithUniqueId() is used in practice. The examples are extracted from selected open-source projects on platforms such as GitHub, Stack Overflow, and Maven, so they serve as useful references. Details of the JavaRDD.zipWithUniqueId() method:
Package: org.apache.spark.api.java
Class: JavaRDD
Method: zipWithUniqueId
Description: Zips this RDD with generated unique Long ids. With n partitions, items in the k-th partition receive the ids k, n+k, 2n+k, and so on, so the ids are unique but not contiguous. Unlike zipWithIndex, this method does not trigger a Spark job to compute partition sizes.
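The snippet below is a minimal, self-contained sketch of the id assignment (the class name and local master setting are illustrative, not taken from the examples that follow): with n partitions, the j-th item of partition k receives the id j*n + k.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ZipWithUniqueIdDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ZipWithUniqueIdDemo").setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Two partitions: partition 0 holds {a, b}, partition 1 holds {c, d}.
            JavaPairRDD<String, Long> withIds =
                    sc.parallelize(Arrays.asList("a", "b", "c", "d"), 2).zipWithUniqueId();
            // With n = 2 partitions the j-th item of partition k gets id j*n + k,
            // so the pairs are (a,0), (b,2), (c,1), (d,3): unique, not contiguous.
            for (scala.Tuple2<String, Long> pair : withIds.collect()) {
                System.out.println(pair._1() + " -> " + pair._2());
            }
        }
    }
}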
Code example source: org.apache.spark/spark-core (the identical test also appears in org.apache.spark/spark-core_2.10 and org.apache.spark/spark-core_2.11)
@Test
public void zipWithUniqueId() {
    List<Integer> dataArray = Arrays.asList(1, 2, 3, 4);
    JavaPairRDD<Integer, Long> zip = sc.parallelize(dataArray).zipWithUniqueId();
    JavaRDD<Long> indexes = zip.values();
    assertEquals(4, new HashSet<>(indexes.collect()).size());
}
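Collecting the generated ids into a HashSet and checking that its size equals the number of input elements verifies uniqueness without asserting specific id values. That is deliberate: the exact ids depend on how sc.parallelize distributes the data across partitions.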
Code example source: org.datavec/datavec-spark_2.11 (the same method appears verbatim in org.datavec/datavec-spark)
/**
 * Save a {@code JavaRDD<List<Writable>>} to a Hadoop {@link org.apache.hadoop.io.SequenceFile}. Each record is given
 * a unique (but noncontiguous) {@link LongWritable} key, and values are stored as {@link RecordWritable} instances.
 * <p>
 * Use {@link #restoreSequenceFile(String, JavaSparkContext)} to restore values saved with this method.
 *
 * @param path           Path to save the sequence file
 * @param rdd            RDD to save
 * @param maxOutputFiles Nullable. If non-null: first coalesce the RDD to the specified size (number of partitions)
 *                       to limit the maximum number of output sequence files
 * @see #saveSequenceFileSequences(String, JavaRDD)
 * @see #saveMapFile(String, JavaRDD)
 */
public static void saveSequenceFile(String path, JavaRDD<List<Writable>> rdd, @Nullable Integer maxOutputFiles) {
    path = FilenameUtils.normalize(path, true);
    if (maxOutputFiles != null) {
        rdd = rdd.coalesce(maxOutputFiles);
    }
    //Note: Long values are unique + NOT contiguous; more efficient than zipWithIndex
    JavaPairRDD<List<Writable>, Long> dataIndexPairs = rdd.zipWithUniqueId();
    JavaPairRDD<LongWritable, RecordWritable> keyedByIndex =
            dataIndexPairs.mapToPair(new RecordSavePrepPairFunction());
    keyedByIndex.saveAsNewAPIHadoopFile(path, LongWritable.class, RecordWritable.class,
            SequenceFileOutputFormat.class);
}
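The inline note above is worth spelling out. The following sketch (the surrounding class and the JavaRDD<String> parameter are illustrative) contrasts the two indexing methods:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

class IndexingChoice {
    static void indexBothWays(JavaRDD<String> rdd) {
        // zipWithIndex yields contiguous ids 0..N-1, but when the RDD has more
        // than one partition Spark must first run a job to learn each partition's size.
        JavaPairRDD<String, Long> contiguous = rdd.zipWithIndex();

        // zipWithUniqueId assigns the j-th element of partition k the id j*n + k
        // (n = number of partitions). No extra job is needed, at the cost of gaps
        // in the id sequence -- acceptable here, since the SequenceFile keys only
        // need to be unique.
        JavaPairRDD<String, Long> unique = rdd.zipWithUniqueId();
    }
}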
Code example source: org.datavec/datavec-spark (the same method appears verbatim in org.datavec/datavec-spark_2.11)
/**
 * Save a {@code JavaRDD<List<List<Writable>>>} to a Hadoop {@link org.apache.hadoop.io.SequenceFile}. Each record
 * is given a unique (but noncontiguous) {@link LongWritable} key, and values are stored as {@link SequenceRecordWritable} instances.
 * <p>
 * Use {@link #restoreSequenceFileSequences(String, JavaSparkContext)} to restore values saved with this method.
 *
 * @param path           Path to save the sequence file
 * @param rdd            RDD to save
 * @param maxOutputFiles Nullable. If non-null: first coalesce the RDD to the specified size (number of partitions)
 *                       to limit the maximum number of output sequence files
 * @see #saveSequenceFile(String, JavaRDD)
 * @see #saveMapFileSequences(String, JavaRDD)
 */
public static void saveSequenceFileSequences(String path, JavaRDD<List<List<Writable>>> rdd,
                                             @Nullable Integer maxOutputFiles) {
    path = FilenameUtils.normalize(path, true);
    if (maxOutputFiles != null) {
        rdd = rdd.coalesce(maxOutputFiles);
    }
    //Note: Long values are unique + NOT contiguous; more efficient than zipWithIndex
    JavaPairRDD<List<List<Writable>>, Long> dataIndexPairs = rdd.zipWithUniqueId();
    JavaPairRDD<LongWritable, SequenceRecordWritable> keyedByIndex =
            dataIndexPairs.mapToPair(new SequenceRecordSavePrepPairFunction());
    keyedByIndex.saveAsNewAPIHadoopFile(path, LongWritable.class, SequenceRecordWritable.class,
            SequenceFileOutputFormat.class);
}
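A sketch of the round trip these utilities support. Two assumptions, inferred from the javadoc above rather than confirmed by the source: the methods live in DataVec's SparkStorageUtils class, and restoreSequenceFileSequences returns a JavaRDD<List<List<Writable>>>.

import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.datavec.api.writable.Writable;
import org.datavec.spark.storage.SparkStorageUtils; // assumed location of the methods above

class SequenceRoundTrip {
    static JavaRDD<List<List<Writable>>> roundTrip(JavaSparkContext sc,
            JavaRDD<List<List<Writable>>> sequences, String path) {
        // Coalesce to at most 8 output files; pass null to keep the current partitioning.
        SparkStorageUtils.saveSequenceFileSequences(path, sequences, 8);
        // Only the values are restored; the generated unique-id keys are dropped.
        return SparkStorageUtils.restoreSequenceFileSequences(path, sc);
    }
}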
Code example source: org.qcri.rheem/rheem-spark
@Override
public Tuple<Collection<ExecutionLineageNode>, Collection<ChannelInstance>> evaluate(
        ChannelInstance[] inputs,
        ChannelInstance[] outputs,
        SparkExecutor sparkExecutor,
        OptimizationContext.OperatorContext operatorContext) {
    assert inputs.length == this.getNumInputs();
    assert outputs.length == this.getNumOutputs();

    RddChannel.Instance input = (RddChannel.Instance) inputs[0];
    RddChannel.Instance output = (RddChannel.Instance) outputs[0];

    final JavaRDD<InputType> inputRdd = input.provideRdd();
    final JavaPairRDD<InputType, Long> zippedRdd = inputRdd.zipWithUniqueId();
    this.name(zippedRdd);
    final JavaRDD<Tuple2<Long, InputType>> outputRdd = zippedRdd.map(pair -> new Tuple2<>(pair._2, pair._1));
    this.name(outputRdd);

    output.accept(outputRdd, sparkExecutor);

    return ExecutionOperator.modelLazyExecution(inputs, outputs, operatorContext);
}
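Note the swap in the map step: zipWithUniqueId emits (element, id) pairs, while this operator's output channel carries (id, element) tuples, hence the new Tuple2<>(pair._2, pair._1).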