Usage of the org.apache.spark.api.java.JavaRDD.zipWithUniqueId() method, with code examples


This article compiles a selection of code examples for the Java method org.apache.spark.api.java.JavaRDD.zipWithUniqueId(), showing how the method is used in practice. The examples are drawn from selected open-source projects on GitHub, Stack Overflow, Maven, and similar platforms, and should be useful as references. Details of the JavaRDD.zipWithUniqueId() method are as follows:
Package path: org.apache.spark.api.java.JavaRDD
Class name: JavaRDD
Method name: zipWithUniqueId

Overview of JavaRDD.zipWithUniqueId

zipWithUniqueId() pairs every element of the RDD with a generated unique Long id and returns a JavaPairRDD<T, Long>. Elements in the k-th of n partitions receive the ids k, k + n, k + 2n, ..., so the ids are unique but generally not contiguous. Unlike zipWithIndex(), this method does not trigger an extra Spark job to count the elements of each partition, which makes it the cheaper choice when only distinct keys are needed.
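
The id-assignment scheme is easiest to see on a small local RDD. Below is a minimal, self-contained sketch, not taken from any of the projects cited later (the local[2] master, app name, and two-partition split are illustrative assumptions), that prints the pairs produced by zipWithUniqueId() alongside those produced by zipWithIndex():

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ZipWithUniqueIdDemo {
 public static void main(String[] args) {
  SparkConf conf = new SparkConf().setAppName("zipWithUniqueId-demo").setMaster("local[2]");
  try (JavaSparkContext sc = new JavaSparkContext(conf)) {
   // Two partitions: typically partition 0 holds {1, 2} and partition 1 holds {3, 4}.
   JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4), 2);

   // Unique but non-contiguous ids: partition 0 gets 0, 2, ...; partition 1 gets 1, 3, ...
   System.out.println("zipWithUniqueId: " + rdd.zipWithUniqueId().collect());

   // Contiguous ids 0..n-1, but computing them requires an extra Spark job.
   System.out.println("zipWithIndex:    " + rdd.zipWithIndex().collect());
  }
 }
}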

Code examples

Code example source: org.apache.spark/spark-core (the same test appears verbatim in spark-core_2.10 and spark-core_2.11)

@Test
public void zipWithUniqueId() {
 List<Integer> dataArray = Arrays.asList(1, 2, 3, 4);
 JavaPairRDD<Integer, Long> zip = sc.parallelize(dataArray).zipWithUniqueId();
 JavaRDD<Long> indexes = zip.values();
 assertEquals(4, new HashSet<>(indexes.collect()).size());
}

Code example source: org.datavec/datavec-spark_2.11 (the same code appears verbatim in org.datavec/datavec-spark)

/**
 * Save a {@code JavaRDD<List<Writable>>} to a Hadoop {@link org.apache.hadoop.io.SequenceFile}. Each record is given
 * a unique (but noncontiguous) {@link LongWritable} key, and values are stored as {@link RecordWritable} instances.
 * <p>
 * Use {@link #restoreSequenceFile(String, JavaSparkContext)} to restore values saved with this method.
 *
 * @param path           Path to save the sequence file
 * @param rdd            RDD to save
 * @param maxOutputFiles Nullable. If non-null: first coalesce the RDD to the specified size (number of partitions)
 *                       to limit the maximum number of output sequence files
 * @see #saveSequenceFileSequences(String, JavaRDD)
 * @see #saveMapFile(String, JavaRDD)
 */
public static void saveSequenceFile(String path, JavaRDD<List<Writable>> rdd, @Nullable Integer maxOutputFiles) {
  path = FilenameUtils.normalize(path, true);
  if (maxOutputFiles != null) {
    rdd = rdd.coalesce(maxOutputFiles);
  }
  JavaPairRDD<List<Writable>, Long> dataIndexPairs = rdd.zipWithUniqueId(); //Note: Long values are unique + NOT contiguous; more efficient than zipWithIndex
  JavaPairRDD<LongWritable, RecordWritable> keyedByIndex =
          dataIndexPairs.mapToPair(new RecordSavePrepPairFunction());
  keyedByIndex.saveAsNewAPIHadoopFile(path, LongWritable.class, RecordWritable.class,
          SequenceFileOutputFormat.class);
}
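
For context, the save method above is meant to be paired with the restoreSequenceFile(String, JavaSparkContext) counterpart referenced in its javadoc. The following is a rough usage sketch rather than code from the project: it assumes both methods live on DataVec's org.datavec.spark.storage.SparkStorageUtils helper class, and the records, path, and local master are illustrative.

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.datavec.api.writable.IntWritable;
import org.datavec.api.writable.Writable;
import org.datavec.spark.storage.SparkStorageUtils;  // assumed location of save/restore helpers

public class SequenceFileRoundTrip {
 public static void main(String[] args) {
  SparkConf conf = new SparkConf().setAppName("save-restore-demo").setMaster("local[2]");
  try (JavaSparkContext sc = new JavaSparkContext(conf)) {
   // Two toy records, each a list of DataVec Writables.
   List<List<Writable>> records = Arrays.asList(
       Arrays.<Writable>asList(new IntWritable(1), new IntWritable(2)),
       Arrays.<Writable>asList(new IntWritable(3), new IntWritable(4)));
   JavaRDD<List<Writable>> rdd = sc.parallelize(records);

   // Save as a Hadoop SequenceFile; keys are the unique ids from zipWithUniqueId().
   String path = "/tmp/datavec-seqfile";  // illustrative path
   SparkStorageUtils.saveSequenceFile(path, rdd, null);

   // Restore what was saved (element order is not guaranteed to match the original RDD).
   System.out.println(SparkStorageUtils.restoreSequenceFile(path, sc).collect());
  }
 }
}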

Code example source: org.datavec/datavec-spark (the same code appears verbatim in org.datavec/datavec-spark_2.11)

/**
 * Save a {@code JavaRDD<List<List<Writable>>>} to a Hadoop {@link org.apache.hadoop.io.SequenceFile}. Each record
 * is given a unique (but noncontiguous) {@link LongWritable} key, and values are stored as {@link SequenceRecordWritable} instances.
 * <p>
 * Use {@link #restoreSequenceFileSequences(String, JavaSparkContext)} to restore values saved with this method.
 *
 * @param path           Path to save the sequence file
 * @param rdd            RDD to save
 * @param maxOutputFiles Nullable. If non-null: first coalesce the RDD to the specified size (number of partitions)
 *                       to limit the maximum number of output sequence files
 * @see #saveSequenceFile(String, JavaRDD)
 * @see #saveMapFileSequences(String, JavaRDD)
 */
public static void saveSequenceFileSequences(String path, JavaRDD<List<List<Writable>>> rdd,
        @Nullable Integer maxOutputFiles) {
  path = FilenameUtils.normalize(path, true);
  if (maxOutputFiles != null) {
    rdd = rdd.coalesce(maxOutputFiles);
  }
  JavaPairRDD<List<List<Writable>>, Long> dataIndexPairs = rdd.zipWithUniqueId(); //Note: Long values are unique + NOT contiguous; more efficient than zipWithIndex
  JavaPairRDD<LongWritable, SequenceRecordWritable> keyedByIndex =
          dataIndexPairs.mapToPair(new SequenceRecordSavePrepPairFunction());
  keyedByIndex.saveAsNewAPIHadoopFile(path, LongWritable.class, SequenceRecordWritable.class,
          SequenceFileOutputFormat.class);
}

Code example source: org.qcri.rheem/rheem-spark

@Override
public Tuple<Collection<ExecutionLineageNode>, Collection<ChannelInstance>> evaluate(
    ChannelInstance[] inputs,
    ChannelInstance[] outputs,
    SparkExecutor sparkExecutor,
    OptimizationContext.OperatorContext operatorContext) {
  assert inputs.length == this.getNumInputs();
  assert outputs.length == this.getNumOutputs();
  RddChannel.Instance input = (RddChannel.Instance) inputs[0];
  RddChannel.Instance output = (RddChannel.Instance) outputs[0];
  final JavaRDD<InputType> inputRdd = input.provideRdd();
  // Attach a unique (non-contiguous) Long id to every element of the input RDD.
  final JavaPairRDD<InputType, Long> zippedRdd = inputRdd.zipWithUniqueId();
  this.name(zippedRdd);
  // Swap each (element, id) pair into a Tuple2<Long, InputType> keyed by the generated id.
  final JavaRDD<Tuple2<Long, InputType>> outputRdd = zippedRdd.map(pair -> new Tuple2<>(pair._2, pair._1));
  this.name(outputRdd);
  output.accept(outputRdd, sparkExecutor);
  return ExecutionOperator.modelLazyExecution(inputs, outputs, operatorContext);
}
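
Outside of Rheem's operator scaffolding, the same pattern of zipping and then keying by the generated id can be expressed directly on a JavaRDD. Below is a minimal sketch (the data, app name, and local master are illustrative; mapToPair is used instead of map so the result is a JavaPairRDD keyed by the id):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class KeyByUniqueId {
 public static void main(String[] args) {
  SparkConf conf = new SparkConf().setAppName("key-by-unique-id").setMaster("local[2]");
  try (JavaSparkContext sc = new JavaSparkContext(conf)) {
   JavaRDD<String> words = sc.parallelize(Arrays.asList("a", "b", "c"));

   // zipWithUniqueId() yields (element, id); swap the pair so the id becomes the key.
   JavaPairRDD<Long, String> keyedById = words.zipWithUniqueId()
       .mapToPair(pair -> new Tuple2<>(pair._2(), pair._1()));

   // Exact ids depend on how the elements are partitioned, e.g. [(0,a), (1,b), (2,c)].
   System.out.println(keyedById.collect());
  }
 }
}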
