Usage of the org.apache.spark.rdd.RDD.map() Method, with Code Examples


This article collects Java code examples of the org.apache.spark.rdd.RDD.map method, showing how RDD.map is used in practice. The examples are drawn from selected open-source projects hosted on platforms such as GitHub, Stack Overflow, and Maven, and should make a useful reference. Details of the RDD.map method:
Package: org.apache.spark.rdd
Class: RDD
Method: map

About RDD.map

RDD.map returns a new RDD formed by applying a function to every element of this RDD. It is a lazy transformation: nothing is computed until an action is invoked. In Scala the signature is def map[U: ClassTag](f: T => U): RDD[U]; the ClassTag is implicit in Scala but must be supplied explicitly when calling from Java, which is why every example below passes a manifest/class tag as the second argument.
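
Since calling the Scala RDD.map from Java is the common thread in all of the excerpts, here is a minimal, self-contained sketch of that call (an illustration written for this article, not taken from any of the projects below; the app name and master are placeholder values):

import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.rdd.RDD;

import scala.reflect.ClassTag;
import scala.reflect.ClassTag$;
import scala.runtime.AbstractFunction1;

// Minimal sketch: the function must be a serializable scala.Function1, and a
// ClassTag for the result type must be passed explicitly -- exactly what
// helpers such as SparkUtil.getManifest(...) supply in the examples below.
public class RddMapSketch {

  // A named, Serializable Function1; an anonymous AbstractFunction1 would not
  // be Serializable and would fail Spark's closure serializability check.
  static class Doubler extends AbstractFunction1<Integer, Integer>
      implements Serializable {
    @Override
    public Integer apply(Integer x) {
      return x * 2;
    }
  }

  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext(
        new SparkConf().setAppName("rdd-map-sketch").setMaster("local[*]"));
    RDD<Integer> rdd = jsc.parallelize(Arrays.asList(1, 2, 3)).rdd();
    ClassTag<Integer> intTag = ClassTag$.MODULE$.apply(Integer.class);
    RDD<Integer> doubled = rdd.map(new Doubler(), intTag);  // ClassTag is required here
    System.out.println(doubled.count());  // 3
    jsc.stop();
  }
}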

Code Examples

Example source: org.apache.pig/pig

@Override
public RDD<Tuple> convert(List<RDD<Tuple>> predecessors,
    POPackage physicalOperator) throws IOException {
  SparkUtil.assertPredecessorSize(predecessors, physicalOperator, 1);
  RDD<Tuple> rdd = predecessors.get(0);
  // package will generate the group from the result of the local
  // rearrange
  return rdd.map(new PackageFunction(physicalOperator),
      SparkUtil.getManifest(Tuple.class));
}

Example source: org.apache.pig/pig

@Override
public RDD<Tuple> convert(List<RDD<Tuple>> predecessors,
    PhysicalOperator physicalOperator) throws IOException {
  SparkUtil.assertPredecessorSize(predecessors, physicalOperator, 1);
  RDD<Tuple> rdd = predecessors.get(0);
  // call local rearrange to get key and value
  return rdd.map(new LocalRearrangeFunction(physicalOperator),
      SparkUtil.getManifest(Tuple.class));
}

Example source: com.datastax.spark/spark-cassandra-connector-unshaded (identical in spark-cassandra-connector, spark-cassandra-connector_2.10, spark-cassandra-connector-java, and spark-cassandra-connector-java_2.10)

/**
 * Groups items with the same key, assuming that items with the same key are next to each other in
 * the collection. It does not perform a shuffle, so it is much faster than the far more general
 * Spark RDD `groupByKey`. For this method to be useful with Cassandra tables, the key must
 * represent a prefix of the primary key, containing at least the partition key of the Cassandra
 * table.
 */
public JavaPairRDD<K, Collection<V>> spanByKey(ClassTag<K> keyClassTag) {
  ClassTag<Tuple2<K, Collection<V>>> tupleClassTag = classTag(Tuple2.class);
  ClassTag<Collection<V>> vClassTag = classTag(Collection.class);
  RDD<Tuple2<K, Collection<V>>> newRDD = pairRDDFunctions.spanByKey()
      .map(JavaApiHelper.<K, V, Seq<V>>valuesAsJavaCollection(), tupleClassTag);

  return new JavaPairRDD<>(newRDD, keyClassTag, vClassTag);
}
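
To make the "adjacent keys only" contract concrete, here is a small plain-Java analogy of spanByKey's grouping semantics (illustrative only; SpanByKeyDemo and its spanByKey helper are written for this article and are not connector API):

import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Illustrative only: mimics spanByKey's contract on a local list of
// (key, value) pairs -- values are grouped only while their keys are adjacent.
public class SpanByKeyDemo {
  public static <K, V> List<Map.Entry<K, List<V>>> spanByKey(List<Map.Entry<K, V>> items) {
    List<Map.Entry<K, List<V>>> out = new ArrayList<>();
    for (Map.Entry<K, V> e : items) {
      if (!out.isEmpty() && out.get(out.size() - 1).getKey().equals(e.getKey())) {
        out.get(out.size() - 1).getValue().add(e.getValue());  // extend the current run
      } else {
        List<V> group = new ArrayList<>();
        group.add(e.getValue());
        out.add(new AbstractMap.SimpleEntry<>(e.getKey(), group));  // start a new run
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<Map.Entry<String, Integer>> in = Arrays.<Map.Entry<String, Integer>>asList(
        new AbstractMap.SimpleEntry<>("a", 1), new AbstractMap.SimpleEntry<>("a", 2),
        new AbstractMap.SimpleEntry<>("b", 3), new AbstractMap.SimpleEntry<>("a", 4));
    System.out.println(spanByKey(in));
    // [a=[1, 2], b=[3], a=[4]] -- the non-adjacent "a" is NOT merged
  }
}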

Example source: com.datastax.spark/spark-cassandra-connector (identical in spark-cassandra-connector-unshaded, spark-cassandra-connector_2.10, spark-cassandra-connector-java, and spark-cassandra-connector-java_2.10)

/**
 * Applies a function to each item and groups consecutive items having the same value together.
 * Contrary to {@code groupBy}, items from the same group must already be next to each other in the
 * original collection. Works locally on each partition, so items from different partitions will
 * never be placed in the same group.
 */
public <U> JavaPairRDD<U, Iterable<T>> spanBy(final Function<T, U> f, ClassTag<U> keyClassTag) {
  ClassTag<Tuple2<U, Iterable<T>>> tupleClassTag = classTag(Tuple2.class);
  ClassTag<Iterable<T>> iterableClassTag = CassandraJavaUtil.classTag(Iterable.class);
  RDD<Tuple2<U, Iterable<T>>> newRDD = rddFunctions.spanBy(toScalaFunction1(f))
      .map(JavaApiHelper.<U, T, scala.collection.Iterable<T>>valuesAsJavaIterable(), tupleClassTag);
  return new JavaPairRDD<>(newRDD, keyClassTag, iterableClassTag);
}
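
The same contract, with the key derived by the function f as spanBy does (again a plain-Java illustration written for this article, not connector API):

import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Illustrative only: spanBy applies f to each item and groups consecutive items
// with equal f-values; non-adjacent equal values (and, in Spark, items in
// different partitions) end up in different groups.
public class SpanByDemo {
  public static <T, U> List<Map.Entry<U, List<T>>> spanBy(List<T> items, Function<T, U> f) {
    List<Map.Entry<U, List<T>>> out = new ArrayList<>();
    for (T item : items) {
      U key = f.apply(item);
      if (!out.isEmpty() && out.get(out.size() - 1).getKey().equals(key)) {
        out.get(out.size() - 1).getValue().add(item);
      } else {
        List<T> group = new ArrayList<>();
        group.add(item);
        out.add(new AbstractMap.SimpleEntry<>(key, group));
      }
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(spanBy(
        Arrays.asList("apple", "avocado", "banana", "apricot"),
        s -> s.substring(0, 1)));  // key = first letter
    // [a=[apple, avocado], b=[banana], a=[apricot]] -- the trailing "a" starts a new group
  }
}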

Example source: org.apache.pig/pig

@Override
public RDD<Tuple> convert(List<RDD<Tuple>> predecessors,
    PODistinct op) throws IOException {
  SparkUtil.assertPredecessorSize(predecessors, op, 1);
  RDD<Tuple> rdd = predecessors.get(0);
  // In DISTINCT operation, the key is the entire tuple.
  // RDD<Tuple> -> RDD<Tuple2<Tuple, null>>
  RDD<Tuple2<Tuple, Object>> keyValRDD = rdd.map(new ToKeyValueFunction(),
      SparkUtil.<Tuple, Object> getTuple2Manifest());
PairRDDFunctions<Tuple, Object> pairRDDFunctions =
    new PairRDDFunctions<Tuple, Object>(keyValRDD,
        SparkUtil.getManifest(Tuple.class),
        SparkUtil.getManifest(Object.class), null);
  int parallelism = SparkPigContext.get().getParallelism(predecessors, op);
  return pairRDDFunctions.reduceByKey(
      SparkUtil.getPartitioner(op.getCustomPartitioner(), parallelism),
      new MergeValuesFunction())
      .map(new ToValueFunction(), SparkUtil.getManifest(Tuple.class));
}
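
The excerpt implements DISTINCT as map-to-pair, reduceByKey, map-back. A hedged sketch of the same strategy with the plain Spark Java API (DistinctViaReduceByKey is an illustration written for this article; the Pig code uses null as the dummy value, while a Boolean is used here):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

// Hedged sketch: key on the whole record, collapse duplicate keys with
// reduceByKey, then map back to plain values.
public class DistinctViaReduceByKey {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext(
        new SparkConf().setAppName("distinct-sketch").setMaster("local[*]"));
    JavaRDD<String> records = jsc.parallelize(Arrays.asList("a", "b", "a"));
    JavaRDD<String> distinct = records
        .mapToPair(r -> new Tuple2<>(r, Boolean.TRUE))  // RDD<T> -> RDD<(T, dummy)>
        .reduceByKey((x, y) -> x)                       // one entry per distinct key
        .map(Tuple2::_1);                               // drop the dummy value
    System.out.println(distinct.collect());             // [a, b] (order not guaranteed)
    jsc.stop();
  }
}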

Example source: uk.gov.gchq.gaffer/spark-accumulo-library

private RDD<Element> doOperationUsingElementInputFormat(final GetRDDOfAllElements operation,
                            final Context context,
                            final AccumuloStore accumuloStore)
    throws OperationException {
  final Configuration conf = getConfiguration(operation);
  addIterators(accumuloStore, conf, context.getUser(), operation);
  final String useBatchScannerRDD = operation.getOption(USE_BATCH_SCANNER_RDD);
  if (Boolean.parseBoolean(useBatchScannerRDD)) {
    InputConfigurator.setBatchScan(AccumuloInputFormat.class, conf, true);
  }
  final RDD<Tuple2<Element, NullWritable>> pairRDD = SparkContextUtil.getSparkSession(context, accumuloStore.getProperties()).sparkContext().newAPIHadoopRDD(conf,
      ElementInputFormat.class,
      Element.class,
      NullWritable.class);
  return pairRDD.map(new FirstElement(), ELEMENT_CLASS_TAG);
}
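
Judging by its name and the return type, FirstElement presumably projects the Element out of each (Element, NullWritable) pair. A hedged sketch of that pattern with a stock Hadoop input format and the plain Java API (the input path is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Hedged sketch: a Hadoop input format yields key/value pairs, and a map
// projecting one side of the Tuple2 discards the payload you don't need.
public class HadoopPairToValues {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext(
        new SparkConf().setAppName("hadoop-pair-sketch").setMaster("local[*]"));
    JavaPairRDD<LongWritable, Text> pairs = jsc.newAPIHadoopFile(
        "/tmp/input.txt",  // placeholder path
        TextInputFormat.class, LongWritable.class, Text.class,
        new Configuration());
    JavaRDD<String> lines = pairs.map(t -> t._2().toString());  // keep only the value
    System.out.println(lines.count());
    jsc.stop();
  }
}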

Example source: uk.gov.gchq.gaffer/spark-accumulo-library

private RDD<Element> doOperation(final GetRDDOfElements operation,
                 final Context context,
                 final AccumuloStore accumuloStore)
    throws OperationException {
  final Configuration conf = getConfiguration(operation);
  final SparkContext sparkContext = SparkContextUtil.getSparkSession(context, accumuloStore.getProperties()).sparkContext();
  sparkContext.hadoopConfiguration().addResource(conf);
  // Use batch scan option when performing seeded operation
  InputConfigurator.setBatchScan(AccumuloInputFormat.class, conf, true);
  addIterators(accumuloStore, conf, context.getUser(), operation);
  addRanges(accumuloStore, conf, operation);
  final RDD<Tuple2<Element, NullWritable>> pairRDD = sparkContext.newAPIHadoopRDD(conf,
      ElementInputFormat.class,
      Element.class,
      NullWritable.class);
  return pairRDD.map(new FirstElement(), ClassTagConstants.ELEMENT_CLASS_TAG);
}

Example source: org.apache.pig/pig

@Override
public RDD<Tuple> convert(List<RDD<Tuple>> predecessors, POSort sortOperator)
    throws IOException {
  SparkUtil.assertPredecessorSize(predecessors, sortOperator, 1);
  RDD<Tuple> rdd = predecessors.get(0);
  int parallelism = SparkPigContext.get().getParallelism(predecessors, sortOperator);
  RDD<Tuple2<Tuple, Object>> rddPair = rdd.map(new ToKeyValueFunction(),
      SparkUtil.<Tuple, Object> getTuple2Manifest());
  JavaPairRDD<Tuple, Object> r = new JavaPairRDD<Tuple, Object>(rddPair,
      SparkUtil.getManifest(Tuple.class),
      SparkUtil.getManifest(Object.class));
  JavaPairRDD<Tuple, Object> sorted = r.sortByKey(
      sortOperator.getMComparator(), true, parallelism);
  JavaRDD<Tuple> mapped = sorted.mapPartitions(TO_VALUE_FUNCTION);
  return mapped.rdd();
}
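
The sort is expressed as: lift tuples into (key, value) pairs with rdd.map, then sortByKey with a custom comparator and explicit parallelism. A hedged sketch of the same shape with the plain Java API (the types, comparator, and data are illustrative):

import java.io.Serializable;
import java.util.Arrays;
import java.util.Comparator;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

// Hedged sketch: sortByKey with a custom, Serializable comparator and an
// explicit number of partitions, mirroring the structure of the Pig converter.
public class SortByKeySketch {
  static class DescendingOrder implements Comparator<Integer>, Serializable {
    @Override
    public int compare(Integer a, Integer b) {
      return b.compareTo(a);
    }
  }

  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext(
        new SparkConf().setAppName("sort-sketch").setMaster("local[*]"));
    JavaPairRDD<Integer, String> pairs = jsc.parallelizePairs(Arrays.asList(
        new Tuple2<Integer, String>(2, "b"),
        new Tuple2<Integer, String>(1, "a"),
        new Tuple2<Integer, String>(3, "c")));
    JavaPairRDD<Integer, String> sorted =
        pairs.sortByKey(new DescendingOrder(), true, 2);  // comparator, ascending, parallelism
    System.out.println(sorted.collect());  // [(3,c), (2,b), (1,a)]
    jsc.stop();
  }
}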

Example source: uk.gov.gchq.gaffer/spark-accumulo-library

private RDD<Element> doOperation(final GetRDDOfElementsInRanges operation,
                 final Context context,
                 final AccumuloStore accumuloStore)
    throws OperationException {
  final Configuration conf = getConfiguration(operation);
  final SparkContext sparkContext = SparkContextUtil.getSparkSession(context, accumuloStore.getProperties()).sparkContext();
  sparkContext.hadoopConfiguration().addResource(conf);
  // Use batch scan option when performing seeded operation
  InputConfigurator.setBatchScan(AccumuloInputFormat.class, conf, true);
  addIterators(accumuloStore, conf, context.getUser(), operation);
  addRangesFromPairs(accumuloStore, conf, operation);
  final RDD<Tuple2<Element, NullWritable>> pairRDD = sparkContext.newAPIHadoopRDD(conf,
      ElementInputFormat.class,
      Element.class,
      NullWritable.class);
  return pairRDD.map(new FirstElement(), ClassTagConstants.ELEMENT_CLASS_TAG);
}

Example source: org.apache.pig/pig

@Override
public RDD<Tuple> convert(List<RDD<Tuple>> predecessors, POSampleSortSpark sortOperator)
    throws IOException {
  SparkUtil.assertPredecessorSize(predecessors, sortOperator, 1);
  RDD<Tuple> rdd = predecessors.get(0);
  RDD<Tuple2<Tuple, Object>> rddPair = rdd.map(new ToKeyValueFunction(),
      SparkUtil.<Tuple, Object> getTuple2Manifest());
  JavaPairRDD<Tuple, Object> r = new JavaPairRDD<Tuple, Object>(rddPair,
      SparkUtil.getManifest(Tuple.class),
      SparkUtil.getManifest(Object.class));
// sort the sample data
JavaPairRDD<Tuple, Object> sorted = r.sortByKey(true);
// convert every element of the sample data from (element) to (all, element) format
JavaPairRDD<String, Tuple> mapped = sorted.mapPartitionsToPair(new AggregateFunction());
// use groupByKey to aggregate all values; the result has the form ((all), {(sampleEle1), (sampleEle2), ...})
JavaRDD<Tuple> groupByKey = mapped.groupByKey().map(new ToValueFunction());
return groupByKey.rdd();
}
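
The final steps above funnel every sampled element under a single (all) key so that one group holds the whole sample. A hedged sketch of that "collapse under one key" move with the plain Java API (GroupAllSketch is illustrative, written for this article):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

// Hedged sketch: tagging every record with the same constant key makes
// groupByKey return a single (key, all-records) pair, which one last map
// can reduce to a single output record.
public class GroupAllSketch {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext(
        new SparkConf().setAppName("group-all-sketch").setMaster("local[*]"));
    JavaRDD<Integer> samples = jsc.parallelize(Arrays.asList(5, 1, 3));
    JavaRDD<String> summary = samples
        .mapToPair(x -> new Tuple2<>("all", x))  // ((all), element)
        .groupByKey()                            // ((all), {e1, e2, ...})
        .map(kv -> {
          int n = 0;
          for (Integer ignored : kv._2()) {
            n++;
          }
          return "all -> " + n + " sampled elements";
        });
    System.out.println(summary.collect());
    jsc.stop();
  }
}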

Example source: org.apache.pig/pig

private JavaRDD<Tuple2<IndexedKey, Tuple>> handleSecondarySort(
    RDD<Tuple> rdd, POReduceBySpark op, int parallelism) {
  RDD<Tuple2<Tuple, Object>> rddPair = rdd.map(new ToKeyNullValueFunction(),
      SparkUtil.<Tuple, Object>getTuple2Manifest());
  JavaPairRDD<Tuple, Object> pairRDD = new JavaPairRDD<Tuple, Object>(rddPair,
      SparkUtil.getManifest(Tuple.class),
      SparkUtil.getManifest(Object.class));
// first sort the tuples by the secondary key if useSecondaryKey sort is enabled
  JavaPairRDD<Tuple, Object> sorted = pairRDD.repartitionAndSortWithinPartitions(
      new HashPartitioner(parallelism),
      new PigSecondaryKeyComparatorSpark(op.getSecondarySortOrder()));
  JavaRDD<Tuple> jrdd = sorted.keys();
  JavaRDD<Tuple2<IndexedKey, Tuple>> jrddPair = jrdd.map(new ToKeyValueFunction(op));
  return jrddPair;
}

Example source: org.apache.pig/pig

private JavaRDD<Tuple2<IndexedKey,Tuple>> handleSecondarySort(
    RDD<Tuple> rdd, POGlobalRearrangeSpark op, int parallelism) {
  RDD<Tuple2<Tuple, Object>> rddPair = rdd.map(new ToKeyNullValueFunction(),
      SparkUtil.<Tuple, Object>getTuple2Manifest());
  JavaPairRDD<Tuple, Object> pairRDD = new JavaPairRDD<Tuple, Object>(rddPair,
      SparkUtil.getManifest(Tuple.class),
      SparkUtil.getManifest(Object.class));
// first sort the tuples by the secondary key if useSecondaryKey sort is enabled
  JavaPairRDD<Tuple, Object> sorted = pairRDD.repartitionAndSortWithinPartitions(
      new HashPartitioner(parallelism),
      new PigSecondaryKeyComparatorSpark(op.getSecondarySortOrder()));
  JavaRDD<Tuple> jrdd = sorted.keys();
  JavaRDD<Tuple2<IndexedKey,Tuple>> jrddPair = jrdd.map(new ToKeyValueFunction(op));
  return jrddPair;
}
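
Both secondary-sort helpers lean on repartitionAndSortWithinPartitions, which shuffles records by the partitioner and sorts keys within each partition in a single pass. A hedged sketch of that building block (ByKey and the sample data are illustrative):

import java.io.Serializable;
import java.util.Arrays;
import java.util.Comparator;

import org.apache.spark.HashPartitioner;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

// Hedged sketch: repartitionAndSortWithinPartitions combines the shuffle and
// the per-partition sort, so no extra sort stage is needed afterwards.
public class SecondarySortSketch {
  static class ByKey implements Comparator<Integer>, Serializable {
    @Override
    public int compare(Integer a, Integer b) {
      return a.compareTo(b);
    }
  }

  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext(
        new SparkConf().setAppName("secondary-sort-sketch").setMaster("local[*]"));
    JavaPairRDD<Integer, String> pairs = jsc.parallelizePairs(Arrays.asList(
        new Tuple2<Integer, String>(3, "c"),
        new Tuple2<Integer, String>(1, "a"),
        new Tuple2<Integer, String>(2, "b")));
    JavaPairRDD<Integer, String> sorted = pairs.repartitionAndSortWithinPartitions(
        new HashPartitioner(2), new ByKey());
    System.out.println(sorted.glom().collect());  // each partition's list is key-sorted
    jsc.stop();
  }
}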
