Avro Cookbook : Part I

Avro is a data serialization framework. It is an Apache project led by Doug Cutting, who is also the author of several other open source projects such as Hadoop and Lucene. Recently I needed to use Avro to serialize and deserialize some data; however, I found its documentation too thin, at least for newbies like me who don't have much experience with data exchange frameworks.

In fact, it is very easy to understand what Avro can do: it converts Java objects into bytes and vice versa. The key piece of information the framework needs is the format of the data, called the 'Schema' in Avro. Beyond that, I won't spend any more time in this article explaining what Avro is.
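Still, to make the idea concrete before the recipes start, here is a minimal in-memory round trip using Avro's generic API. This is my own sketch rather than part of the recipes; the Ping schema and its single msg field are made up for illustration:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroRoundTrip {
   public static void main(String[] args) throws IOException {
      // An Avro schema is just JSON; this one declares a record with one string field
      Schema schema = new Schema.Parser().parse(
            "{\"type\": \"record\", \"name\": \"Ping\", \"fields\": [{\"name\": \"msg\", \"type\": \"string\"}]}");

      GenericRecord record = new GenericData.Record(schema);
      record.put("msg", "hello");

      // Object -> bytes
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
      BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
      writer.write(record, encoder);
      encoder.flush();

      // bytes -> object
      DatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
      GenericRecord copy = reader.read(null,
            DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
      System.out.println(copy.get("msg")); // prints: hello
   }
}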

Recipe 1: Create a Maven Avro Project

IntelliJ IDEA is my favorite Java IDE. The free Community edition has fewer features than the commercial Ultimate edition, but it works beautifully with Maven; the two complement each other. So the examples in this article will use Maven for the build and IntelliJ IDEA as the IDE. Besides, TestNG instead of JUnit will be used as the test framework.

Initialize the project structure

  • Create a project with the quickstart archetype:
mvn archetype:generate -DgroupId=me.jeffli -DartifactId=avrosamples -Dversion=0.01 -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

Tweak the pom.xml

  • Add the Avro dependency:
<dependency>
   <groupId>org.apache.avro</groupId>
   <artifactId>avro</artifactId>
   <version>1.7.5</version>
</dependency>
  • Use the Avro Maven plugin:
<plugin>
   <groupId>org.apache.avro</groupId>
   <artifactId>avro-maven-plugin</artifactId>
   <version>1.7.5</version>
   <executions>
      <execution>
         <phase>generate-sources</phase>
         <goals>
            <goal>schema</goal>
         </goals>
         <configuration>
            <!-- make sure the directory is created -->
            <sourceDirectory>${project.basedir}/src/main/avro/</sourceDirectory>
            <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
         </configuration>
      </execution>
   </executions>
</plugin>

It should be noted that the directory ${project.basedir}/src/main/avro/ must be created even if it is empty at first; it is where the Avro schema files will be placed. The whole pom.xml has been posted as a GitHub gist.

Import the project into Intellij IDEA

IDEA provides full support for Maven, so it is very easy to import a Maven project as an IDEA project: click "Import Project" in the 'Quick Start' panel. I suggest enabling the Maven Auto-Import feature of IDEA before completing the import process.

Recipe 2: Define a Schema

Assume that you want to log every access to your server. To keep things simple, we define only three attributes in a log entry: the user name, the resource, and the IP address. So the schema can be defined as:

{
 "namespace": "me.jeffli.avrosamples.model",
 "type": "record",
 "name": "LogEntry",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "resource",  "type": ["string", "null"]},
     {"name": "ip", "type": ["string", "null"]}
 ]
}

Save the content as ${project.basedir}/src/main/avro/LogEntry.avsc. After running mvn compile, a Java class me.jeffli.avrosamples.model.LogEntry will be generated automatically, thanks to the Avro Maven plugin. Note that the union types ["string", "null"] in the schema make the resource and ip fields optional, i.e. nullable.
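Besides the getters and setters used in the next recipe, classes generated by Avro 1.6 and later also come with a builder, so a record can be created fluently. A quick sketch, assuming the LogEntry class generated above:

LogEntry entry = LogEntry.newBuilder()
   .setName("Jeff")
   .setResource("readme.txt")
   .setIp("192.168.1.1")
   .build(); // build() fails fast if a field without a default (here: name) was never set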

Recipe 3: Serialize the Log Data to a Disk File

Assume we want to store the log data in a disk file, /tmp/log. The code snippet would look like this:

import java.io.File;
import java.io.IOException;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumWriter;
import org.testng.annotations.Test;
import me.jeffli.avrosamples.model.LogEntry;

@Test
public void testSerializeLogEntries() throws IOException {
   LogEntry entry1 = new LogEntry();
   entry1.setName("Jeff");
   entry1.setResource("readme.txt");
   entry1.setIp("192.168.1.1");

   LogEntry entry2 = new LogEntry();
   entry2.setName("John");
   entry2.setResource("readme.md");
   entry2.setIp("192.168.1.2");

   // The specific writer knows how to encode generated LogEntry records
   DatumWriter<LogEntry> logEntryDatumWriter = new SpecificDatumWriter<>(LogEntry.class);
   DataFileWriter<LogEntry> dataFileWriter = new DataFileWriter<>(logEntryDatumWriter);
   File file = new File("/tmp/log");
   // create() writes the schema into the file header before any records are appended
   dataFileWriter.create(entry1.getSchema(), file);

   dataFileWriter.append(entry1);
   dataFileWriter.append(entry2);

   dataFileWriter.close();
}
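Note that create() starts a fresh file, so running the test again overwrites /tmp/log. To add records to an existing log instead, DataFileWriter offers appendTo(), which reopens the container file and reuses the schema stored in its header. A sketch under that assumption (the third entry is made up; same imports as above):

DatumWriter<LogEntry> logEntryDatumWriter = new SpecificDatumWriter<>(LogEntry.class);
DataFileWriter<LogEntry> dataFileWriter = new DataFileWriter<>(logEntryDatumWriter);
// appendTo() requires that /tmp/log already exists and was written with the same schema
dataFileWriter.appendTo(new File("/tmp/log"));

LogEntry entry3 = new LogEntry();
entry3.setName("Jane");
entry3.setResource("readme.rst");
entry3.setIp("192.168.1.3");

dataFileWriter.append(entry3);
dataFileWriter.close();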

Recipe 4: Deserialize the Log Data from a Disk File

Assume you need to read the log data back from the disk file /tmp/log. Then the code snippet would be:

import java.io.File;
import java.io.IOException;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.io.DatumReader;
import org.apache.avro.specific.SpecificDatumReader;
import org.testng.annotations.Test;
import me.jeffli.avrosamples.model.LogEntry;

@Test(dependsOnMethods = "testSerializeLogEntries")
public void testDeSerializeLogEntries() throws IOException {
   DatumReader<LogEntry> logEntryDatumReader = new SpecificDatumReader<>(LogEntry.class);
   File file = new File("/tmp/log");
   DataFileReader<LogEntry> dataFileReader = new DataFileReader<>(file, logEntryDatumReader);
   LogEntry entry = null;
   while (dataFileReader.hasNext()) {
      // Passing the previous instance to next() lets Avro reuse it instead of allocating per entry
      entry = dataFileReader.next(entry);
      System.out.println(entry);
   }
   dataFileReader.close();
}
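Because the container file stores its schema in the header, the data can also be read back without the generated LogEntry class at all, via the generic API. A minimal sketch (not from the original recipes):

import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;

// No schema or generated class needed; the reader takes the schema from the file header
DatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(new File("/tmp/log"), datumReader);
for (GenericRecord record : dataFileReader) {
   System.out.println(record.get("name") + " accessed " + record.get("resource") + " from " + record.get("ip"));
}
dataFileReader.close();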
