要连接到Beam现有的I/O连接器不支持的数据存储,您必须创建一个通常由source和sink组成的自定义I/O连接器。所有的Beam source和sink都是复合变换;但是,自定义I/O的实现取决于您的使用场景。在开始之前,请阅读新的I/O连接器概述,了解开发新的I/O连接器的概述、可用的实现选项以及如何为您的使用场景选择正确的选项。

  1. To connect to a data store that isnt supported by Beams existing I/O connectors, you must create a custom I/O connector that usually consist of a source and a sink. All Beam sources and sinks are composite transforms; however, the implementation of your custom I/O depends on your use case. Before you start, read the new I/O connector overview for an overview of developing a new I/O connector, the available implementation options, and how to choose the right option for your use case.

本指南介绍了使用Java使用源和FileBasedSink接口。Python SDK提供了相同的功能,但使用的API略有不同。有关特定于Python SDK的信息,请参阅为Python开发I/O连接器。

  1. This guide covers using the Source and FileBasedSink interfaces using Java. The Python SDK offers the same functionality, but uses a slightly different API. See Developing I/O connectors for Python for information specific to the Python SDK.

Basic code requirements

Beam runner使用您提供的类来并行地使用多个worker实例读写数据。因此,您为Source和FileBasedSink子类提供的代码必须满足一些基本要求:

  1. Beam runners use the classes you provide to read and/or write data using multiple worker instances in parallel. As such, the code you provide for Source and FileBasedSink subclasses must meet some basic requirements:
  • 可序列化性:您的Source或FileBasedSink子类,无论是否有界,都必须是可序列化的。runner可能会创建Source或FileBasedSink子类的多个实例,并将其发送给多个远程workers,以促进并行读写。

    1. Serializability: Your Source or FileBasedSink subclass, whether bounded or unbounded, must be Serializable. A runner might create multiple instances of your Source or FileBasedSink subclass to be sent to multiple remote workers to facilitate reading or writing in parallel.
  • 不变性:您的源或FileBasedSink子类必须是有效不可变的。所有私有字段必须声明为final,集合类型的所有私有变量必须是有效不可变的。如果您的类有setter方法,那么这些方法必须返回修改了相关字段的对象的独立副本。

    1. Immutability: Your Source or FileBasedSink subclass must be effectively immutable. All private fields must be declared final, and all private variables of collection type must be effectively immutable. If your class has setter methods, those methods must return an independent copy of the object with the relevant field modified.

    如果您正在对需要实现源或接收器的昂贵计算进行延迟计算,则应该只在源或FileBasedSink子类中使用可变状态;在这种情况下,必须声明所有可变实例变量为transient。

    1. You should only use mutable state in your Source or FileBasedSink subclass if you are using lazy evaluation of expensive computations that you need to implement the source or sink; in that case, you must declare all mutable instance variables transient.
  • 线程安全:您的代码必须是线程安全的。如果您构建源代码来处理动态工作平衡,那么使代码线程安全是至关重要的。Beam SDK提供了一个帮助类来简化这一过程。有关更多细节,请参见 Using Your BoundedSource with dynamic work rebalancing。
    1. Thread-Safety: Your code must be thread-safe. If you build your source to work with dynamic work rebalancing, it is critical that you make your code thread-safe. The Beam SDK provides a helper class to make this easier. See Using Your BoundedSource with dynamic work rebalancing for more details.
  • 可测试性:对所有的源类和FileBasedSink子类进行详尽的单元测试是非常重要的,特别是当您构建类来使用诸如动态工作再平衡等高级特性时。一个很小的实现错误可能导致难以检测的数据损坏或数据丢失(例如跳过或复制记录)。
    1. Testability: It is critical to exhaustively unit test all of your Source and FileBasedSink subclasses, especially if you build your classes to work with advanced features such as dynamic work rebalancing. A minor implementation error can lead to data corruption or data loss (such as skipping or duplicating records) that can be hard to detect.
    为了帮助测试受限制的源代码实现,您可以使用SourceTestUtils类。SourceTestUtils包含用于自动验证您的BoundedSource实现的某些属性的实用程序。您可以使用SourceTestUtils来增加您的实现的测试覆盖率,使用范围广泛的输入和相对较少的代码行。有关使用SourceTestUtils的示例,请参阅AvroSourceTest和TextIOReadTest源代码。
    1. To assist in testing BoundedSource implementations, you can use the SourceTestUtils class. SourceTestUtils contains utilities for automatically verifying some of the properties of your BoundedSource implementation. You can use SourceTestUtils to increase your implementations test coverage using a wide range of inputs with relatively few lines of code. For examples that use SourceTestUtils, see the AvroSourceTest and TextIOReadTest source code.
    此外,有关Beam的转换样式指南,请参阅 PTransform style guide
    1. In addition, see the PTransform style guide for Beams transform style guidance.
文档更新时间: 2020-01-16 08:46   作者:admin