Customizable record export

For this new version of refinery, we set out to improve some of the existing parts of the application and to create connections to existing labeling tools. Outside of the application, we worked on reducing the overall size of refinery: all the Docker images combined are now just 5.2 GB. In the application, you can now select data types for calculated attributes to better reflect how that data should be treated. The global comment system we introduced in v1.4.0 now looks a lot better and slides in smoothly from the side of the screen. If you want to export your records, you can now precisely select the format, attributes, labeling tasks, and much more.

Reduction of refinery's total size

Every time you start refinery, you start more than 20 separate services that are containerized with Docker. Those services use heavyweight libraries, e.g. the npm modules for our frontend or PyTorch for transformers. Before this change, every service was containerized separately, so if two different services required PyTorch, each of them installed it in its own virtual environment. This resulted in a total size of 10.96 GB (arm) or 15.32 GB (amd) for refinery v1.4.0, which was quite a lot to pull for every new update. Thanks to some excellent engineering, refinery now only requires you to download 5.2 GB, a size reduction of more than 50 percent. Updates also make full use of the layer structure, so oftentimes only the last layer of an image needs to be pulled. Depending on the container, that is anywhere from a few KB up to ~500 MB.

This was achieved by two optimizations: choosing smaller parent images and sharing layers between different images. If you're interested in a more elaborate explanation, we will probably do a blog post about that soon.
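To put the size figures quoted above into perspective, here is a small worked calculation. It is only a sketch: it assumes the 5.2 GB figure applies to both architectures, which the numbers above do not state explicitly, and the names `OLD_SIZES_GB`, `NEW_SIZE_GB` and `reduction_percent` are made up for this example.

```python
# Total image sizes in GB as quoted for refinery v1.4.0.
OLD_SIZES_GB = {"arm": 10.96, "amd": 15.32}

# Assumption: the new 5.2 GB total applies to both architectures.
NEW_SIZE_GB = 5.2


def reduction_percent(old_gb: float, new_gb: float) -> float:
    """Return the relative size reduction in percent."""
    return (old_gb - new_gb) / old_gb * 100


reductions = {
    arch: reduction_percent(old_gb, NEW_SIZE_GB)
    for arch, old_gb in OLD_SIZES_GB.items()
}

for arch, pct in reductions.items():
    print(f"{arch}: {pct:.1f}% smaller")
```

Under that assumption, the reduction comes out at roughly 53 percent on arm and 66 percent on amd.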

While we're on the topic, we want to take this opportunity to remind you to remove outdated versions of refinery from Docker that you don't need anymore, as this is not part of our update procedure! If you have questions about this, take a look at Docker's prune commands (e.g. docker image prune) or reach out to us during our office hours.

Customizable record export - Supporting Label Studio import format

We are excited to bring you a new record export functionality that lets you customize the export to your needs. You can choose from different presets or make all the decisions yourself. This is great news if you need a specific format or just a selection of attributes. If you're exporting to Label Studio, there is also a neat little feature that prepares your labeling interface based on your refinery project settings.
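To give a rough idea of what "a selection of attributes in a specific format" can mean, here is a minimal sketch that wraps records as Label Studio style tasks, where each task carries its fields under a "data" key. This is not refinery's export code: the function `to_label_studio_tasks`, the sample records and the attribute names are all hypothetical.

```python
import json


def to_label_studio_tasks(records, attributes):
    """Keep only the selected attributes and wrap each record as a task."""
    return [
        {"data": {name: record[name] for name in attributes}}
        for record in records
    ]


# Hypothetical records; the attribute names are made up for this example.
records = [
    {"running_id": 1, "headline": "Café opens in Köln", "body": "Long text"},
    {"running_id": 2, "headline": "New release", "body": "More text"},
]

# Export only two of the three attributes.
tasks = to_label_studio_tasks(records, attributes=["running_id", "headline"])

# ensure_ascii=False keeps non-ASCII characters readable in the JSON output.
export = json.dumps(tasks, ensure_ascii=False, indent=2)
print(export)
```

In refinery itself you make these choices in the export dialog, so no code is needed. The sketch only illustrates the shape of the result.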

Specify data types in your attribute calculation

We introduced attribute calculation in v1.3.0, which allows you to create new attributes programmatically directly in refinery. As the potential of this is huge, we wanted to further enhance the usability. That is why you can now specify the attribute type at creation time.

There are two main advantages of specifying the data type of your calculated attribute:

  • better documentation
  • reliability through type safety

The type of the returned values from your function is checked against the type that you specified and if something does not match, you will get a ValueError stating what went wrong. This is especially useful with the "Run on 10" functionality, which can be used to check the correctness of your function on ten randomly sampled records.
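The check described above can be pictured roughly like this. It is a sketch under assumptions, not refinery's implementation: the data type names in `EXPECTED_TYPES`, the helper `check_return_type`, the attribute function `word_count` and the error message wording are all hypothetical.

```python
# Hypothetical mapping from a selected data type to the Python type it expects.
EXPECTED_TYPES = {
    "text": str,
    "category": str,
    "integer": int,
    "float": float,
    "boolean": bool,
}


def check_return_type(value, data_type):
    """Return value unchanged if it matches data_type, else raise ValueError."""
    expected = EXPECTED_TYPES[data_type]
    # bool is a subclass of int in Python, so reject it for non-boolean types.
    is_stray_bool = isinstance(value, bool) and expected is not bool
    if is_stray_bool or not isinstance(value, expected):
        raise ValueError(
            f"Expected {data_type} ({expected.__name__}), "
            f"got {type(value).__name__}: {value!r}"
        )
    return value


def word_count(record):
    """Hypothetical attribute function: number of words in the headline."""
    return len(record["headline"].split())


record = {"headline": "refinery gets smaller images"}
print(check_return_type(word_count(record), "integer"))

try:
    check_return_type(str(word_count(record)), "integer")
except ValueError as error:
    print(f"ValueError: {error}")
```

Running such a check on a handful of records first, as "Run on 10" does, surfaces a mismatch before the function is applied to the whole project.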

We recommend utilizing this typing functionality as future versions of refinery will be able to use this information for much more, e.g. better filtering in the data browser.

To get you started, the selected data type also determines the example code that is generated, so you get a sense of what your function could look like.

Minor changes

  • Bugfix import Primary-Key: Issue#105
  • Bugfix save slice on low confidence: Issue#99
  • Bugfix empty columns display in data browser & labeling: Issue#144
  • Bugfix admin dashboard user deletion (managed version): Issue#132
  • Bugfix utf-8 encoding on json export: Issue#89
  • Color coding for attribute types in labeling function execution environment and attribute calculation