Biotech projects create several different types of intellectual property (IP), which provide opportunities for value extraction but also pose commercial risks. To address both, it is important to conduct an IP stocktake that lists and maps all IP as individual components. This collection can then be checked against the business model to identify the IP components that are most valuable to the business. These components should then be the focus of IP protection, risk assessment and value creation. This process can improve overall business performance through better visibility and more effective IP management.

Additional value can be found in ‘unconventional’ IP

Biotech projects increasingly create value in software and data products in addition to traditional outputs, such as diagnostic and treatment methods. For example, many research groups operate machines such as whole genome sequencers and protein mass spectrometers. These machines generate a large amount of data. From an IP perspective it is almost irrelevant what type of data this is, so the data could also be images, for example. The data is then annotated with observations for each sample, such as cancer type, diagnosed disease, plant performance or animal trait.

Data scientists typically download third party software algorithms and write further software code to make the third party software algorithms usable for the particular problem and data structure at hand. Once the software is tailored to the available input data and desired output, the data scientists run that software to train a machine learning model, such as a neural network.

The trained model can then evaluate the sequencing data from a new sample without observations to predict an observation for that sample. For example, the trained model evaluates sequencing data from a new sample and predicts a cancer type or a disease. A ‘domain expert’ (e.g. clinician, farmer) can then use this prediction to provide a diagnosis to a patient, administer personalised treatment, select animals or plants for breeding, etc.

While there is a high level of expertise in how to protect traditional IP, such as biomarkers and treatment, the ‘unconventional’ IP (residing in software code, database content, trained models, algorithms and trade secrets) can be easily lost because the available options for value creation are less clear.

So how can we make sure that no valuable IP ends up in lost property? And what are the risks that could become a deal breaker during commercialisation, such as during due diligence in an investment round or an acquisition?

In this article, we will first describe what big data, software and machine learning look like. We will then provide some insights from our experience in the field on how to extract value and navigate risks around these issues.

What does big data look like?

Big data is typically generated by different types of machines, simply because machine learning requires very large datasets which are impossible to create manually. Having said that, adding the observation to the dataset is often a manual process.

In essence, these machines generate a large amount of data that does not explicitly include the information of interest. So if a particular gene mutation is of interest, it would be possible to detect that with a directed gene test without the need for machine learning (ML). But instead, whole genome sequencers sequence the entire genome including any genes of unknown relevance. For the purpose of devising an IP strategy, it is not important what type of data is being collected. It is sufficient to know that the machines generate some sort of ‘biological’ data. This biological data can be sequencing data or could also be image data from a camera, e.g. inside a stomach, or other sensor data, e.g. step counter. What is important is that the data includes many measurements that are annotated with observations.

Often, the ‘raw data’ is processed to extract ‘features’ such as gene variants or image features, e.g. amount of red in the image.

The features are then stored together with the observations in a data format that looks very similar to a giant spreadsheet with millions of columns and thousands of rows, which often requires specific storage technology. This dataset may also be available publicly, so that some research groups can rely on external data instead of operating their own machines.

The important aspect is that the data may become useful for further projects later. For example, in a cancer project, the annotations may also include diabetes observations, which can be used in a future diabetes project. For that reason, there is often significant value in the dataset itself, which also reflects the high cost of generating these datasets. What this also makes clear is that accuracy and structure of annotations significantly affect the value of a dataset.
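The spreadsheet-like layout described above, and the reuse of one dataset for a later project, can be sketched in a few lines of Python. This is a minimal illustration only: the sample identifiers, the feature names (`variant_A`, `variant_B`) and the helper `training_set` are hypothetical, and a real dataset would have vastly more columns and rows.

```python
# Hypothetical feature table: one row per sample, one column per feature,
# plus annotation columns that hold the observations.
samples = [
    {"sample_id": "S1", "variant_A": 1, "variant_B": 0, "cancer": True, "diabetes": False},
    {"sample_id": "S2", "variant_A": 0, "variant_B": 1, "cancer": False, "diabetes": True},
    {"sample_id": "S3", "variant_A": 1, "variant_B": 1, "cancer": True, "diabetes": True},
]

FEATURES = ["variant_A", "variant_B"]


def training_set(rows, observation):
    """Return (feature_vectors, labels) for one chosen observation column."""
    feature_vectors = [[row[name] for name in FEATURES] for row in rows]
    labels = [row[observation] for row in rows]
    return feature_vectors, labels


# The same features serve the original cancer project ...
cancer_X, cancer_y = training_set(samples, "cancer")
# ... and a later diabetes project, because that annotation was recorded too.
diabetes_X, diabetes_y = training_set(samples, "diabetes")
```

The sketch also shows why annotation quality matters: the second project is only possible because the diabetes column was recorded consistently for every sample.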

What does bioinformatics software look like?

Most bioinformatics software is a sequence of programs that process the dataset in the sense that the output of one program is used by the next program as an input. Therefore, the entire product is called a ‘pipeline’. Most of the actual program code, especially the algorithms, is downloaded from public sources. The data scientists create the connectors and data converters so that the available algorithms can be used. Sometimes, new algorithms are invented or existing ones modified to deal with the immense size of the data.
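To make the 'pipeline' idea concrete, the following Python sketch chains three steps so that each step receives the output of the previous one. All function names, the watch list and the input data are hypothetical stand-ins; in a real pipeline most steps would be third party programs and the in-house code would mainly be the connectors between them.

```python
from functools import reduce


def extract_features(raw_reads):
    """Hypothetical converter: turn raw reads into a set of variant labels."""
    return {read.upper() for read in raw_reads if read}


def filter_known(variants):
    """Hypothetical step: keep only variants on an illustrative watch list."""
    watch_list = {"MUTATIONX", "MUTATIONY"}
    return sorted(variants & watch_list)


def report(variants):
    """Hypothetical final step: format the result for a domain expert."""
    return "variants found: " + (", ".join(variants) or "none")


PIPELINE = [extract_features, filter_known, report]


def run_pipeline(data, steps=PIPELINE):
    """Feed the output of each step into the next one."""
    return reduce(lambda value, step: step(value), steps, data)


result = run_pipeline(["mutationX", "", "mutationZ"])
```

From an IP perspective, the structure makes it easier to see which steps are downloaded and which are written in-house.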

It is not unusual for a pipeline to take hours or days, so the analysis can quickly become impractical due to long wait times. For example, a diagnosis that takes a year is not very useful. For that reason, the pipelines run on super-computer clusters or cloud computing platforms, such as Amazon Web Services (AWS), keeping in mind that the costs also increase with the amount of computational power that is required.

What does machine learning look like?

In its simplest form, ML looks for feature values that predict a specific trait. So for example, ML analyses the genomes of 1,000 cancer patients and a group of healthy subjects and finds the gene mutation that predicts cancer. In a simplistic way, the ML algorithm counts, for each mutation, how many cancer patients have this mutation and how many healthy subjects have this mutation. The algorithm then chooses the mutation with the largest difference. The ‘trained ML Model’ would then look like: “IF mutationX THEN cancer”. A new patient can then be diagnosed by looking for this mutation.
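The simplistic counting procedure described above can be written out as a short Python sketch. The data, the mutation names and the functions `train` and `predict` are illustrative assumptions, and the sketch deliberately ignores everything that makes real-world ML difficult (noise, group sizes, many interacting features).

```python
def train(samples):
    """Pick the mutation whose count differs most between the two groups.

    samples: non-empty list of (set_of_mutations, has_cancer) pairs.
    """
    cancer = [mutations for mutations, has_cancer in samples if has_cancer]
    healthy = [mutations for mutations, has_cancer in samples if not has_cancer]
    all_mutations = set().union(*(mutations for mutations, _ in samples))

    def difference(mutation):
        in_cancer = sum(mutation in mutations for mutations in cancer)
        in_healthy = sum(mutation in mutations for mutations in healthy)
        return in_cancer - in_healthy

    # Sorting first makes the choice deterministic if two mutations tie.
    return max(sorted(all_mutations), key=difference)


def predict(model, patient_mutations):
    """Apply the trained rule: IF mutation present THEN cancer."""
    return model in patient_mutations


training_data = [
    ({"mutationX", "mutationY"}, True),
    ({"mutationX"}, True),
    ({"mutationX", "mutationZ"}, True),
    ({"mutationY"}, False),
    ({"mutationZ", "mutationY"}, False),
]

model = train(training_data)
new_patient_has_cancer = predict(model, {"mutationX", "mutationZ"})
```

Here the 'trained model' is nothing more than the name of one mutation, which illustrates how compact, and how easy to copy, the valuable result of training can be.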

This shows that the concept of ML is in itself straightforward. But it is the mathematical detail in real-world applications that can create problems. In particular, realistic datasets are large, complex to process and often inaccurate. Further, most algorithms perform poorly if the number of features is high but the number of learning samples is low. In those cases, the algorithm cannot find a good predictor. A large amount of data engineering is also necessary to find the most suitable algorithm for a particular problem. On top of that, most algorithms, in particular neural networks, are configurable across a wide gamut of options, which makes it very difficult to find the optimal configuration.

Types of intellectual property that should be considered in an IP strategy

In this section we provide four types of IP that should be considered in an IP strategy: copyright, patents, trade secrets and data.

Copyright in source code

Copyright protects creative works, which generally includes software code. This right does not need to be registered; it exists upon creation of the code. The right’s owner is originally the author, that is, the programmer of the code. As with other property types, most employment contracts transfer the copyright to the programmer’s employer. However, for less mature businesses, the programmers may have written the code outside their employment contract, which means the copyright needs to be assigned.

Since software projects often include a large amount of code, the code is rarely written by a single person but by a team of programmers. As a result, each line of code could potentially be owned by a different person, which is a cause of concern for many IP managers. However, the tools that programmers use to write their code inherently log all contributions to the code, so that the information about who has written what is readily available. In particular, the tools can normally produce a list of authors that have contributed over a given time period and that list can then be checked against employment contracts.
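As a rough sketch of such an author check, the Python below assumes the code is kept in a git repository (the article does not name a specific tool) and compares the commit authors for a period against a list of people with a signed IP assignment. The function names, the sample log and the names in it are hypothetical.

```python
import subprocess


def parse_authors(git_log_output):
    """Return the distinct author lines from `git log --format='%an <%ae>'`."""
    return {line.strip() for line in git_log_output.splitlines() if line.strip()}


def authors_between(repo_path, since, until):
    """Ask git for everyone who authored a commit in the given period."""
    output = subprocess.run(
        ["git", "-C", repo_path, "log",
         f"--since={since}", f"--until={until}", "--format=%an <%ae>"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_authors(output)


def authors_without_agreement(authors, covered):
    """Authors who are not on the list of people with an IP assignment."""
    return sorted(set(authors) - set(covered))


# Illustrative log text; in practice this would come from authors_between().
sample_log = (
    "Ada Example <ada@example.org>\n"
    "Bob Example <bob@example.org>\n"
    "Ada Example <ada@example.org>\n"
)
covered_by_contract = ["Ada Example <ada@example.org>"]
to_follow_up = authors_without_agreement(parse_authors(sample_log), covered_by_contract)
```

The result is only a starting point for review: a commit author is the person who recorded the change, which is not necessarily the person who originally wrote the code.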

As well as ensuring proper ownership in the software code, it is also important to check that third party code is identified and considered for potential freedom to operate issues. This is especially important in biotech because most software used is third-party code. More particularly, most third party code is subject to a software licence that is implicitly accepted by downloading and/or using the code (known as a ‘click-wrap licence’). It is therefore important during due diligence to prepare a list of third party software programs and associated licences. The licences should then be reviewed to ensure that the terms and conditions are met.

Two common problems can arise. First, if code is in-licensed under the GNU General Public Licence, this licence can force the release of the entire software code as open source, which can significantly impact its value. Second, the licence may restrict the use of the code to non-commercial activities, such as research, and a commercial licence must be purchased if the plan is to monetise the software. If no suitable licence can be obtained from the copyright owner, it is often possible to use alternatives or re-implement the functionality outside copyright restrictions.
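A first-pass screen for these two problems can be sketched in Python as follows. The inventory entries, the component names and the simple keyword matching are all illustrative assumptions; a keyword hit only marks a licence for closer reading and does not replace a review of the actual licence terms.

```python
# Hypothetical inventory of third party components and their licence labels.
INVENTORY = [
    {"name": "aligner-lib", "licence": "GPL-3.0-only"},
    {"name": "stats-toolkit", "licence": "MIT"},
    {"name": "variant-caller", "licence": "Academic non-commercial"},
]


def review(inventory):
    """Return {component name: [issues]} for licences that need a closer look."""
    flagged = {}
    for item in inventory:
        label = item["licence"].lower()
        issues = []
        # Assumption: labels starting with "gpl" or "agpl" indicate strong copyleft.
        if label.startswith(("gpl", "agpl")):
            issues.append("copyleft: check whether own code must be released")
        # Assumption: these keywords indicate a restriction to research use.
        if any(word in label for word in ("non-commercial", "noncommercial", "academic")):
            issues.append("non-commercial: a commercial licence may be needed")
        if issues:
            flagged[item["name"]] = issues
    return flagged


flags = review(INVENTORY)
```

Keeping such an inventory up to date as components are added means the list is already available when due diligence begins.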


Patents

Since copyright only protects the specific ‘expression’ of an idea, it is typically possible to re-write an algorithm so that it does exactly the same thing but is expressed in a different form. This would avoid the copyright in third party code, but it also means that others can work around your own copyright in the same way. In that case, patents provide broader protection that covers all different expressions of the patented algorithm (or invention in general). On the other hand, patents need to be registered in each country of interest, which can be costly. Each country also assesses whether the algorithm is new, inventive and not abstract, i.e. technical.

The concept of ‘technicality’ is elusive and changes continuously. The question is: When does an abstract mathematical algorithm become a technical solution? It is impossible to answer the question in a general sense, so instead it may be instructive to list inventions that should be considered ‘technical’:

  • Efficient computer science algorithms (reduced CPU time, less memory)
  • Efficient data structures (graph databases, convert data to image for ML)
  • User interfaces (interactive, faster)
  • Equipment (hyperspectral cameras)
  • Clinical outcomes (methods of treatment)
  • Some diagnostics (biomarkers)

On the other hand, the following inventions may be difficult to patent:

  • Mere presentation of data (genome browser)
  • Data itself
  • Association itself (gene X causes cancer Y)
  • Algorithm to find associations unless more efficient computationally
  • Machine learning as such (concept of neural networks) except specific adaptations to technical problems
  • Discoveries of naturally occurring phenomena

Trade secrets/know-how

An alternative option to a patent is a trade secret. A trade secret is a particular insight or piece of information that is important to a business. Since patents inevitably get published, it is not possible to keep a patented invention as a trade secret.

There may be a view that trade secrets exist automatically simply through the absence of intentional publication. However, it is easy to lose control of trade secrets unless stringent processes are put in place to keep the information secret. So it is vital that specific pieces of information are identified as important to the business and appropriately labelled as trade secret, so that all employees are aware and access to that information is controlled. This is key because once the secrecy is lost, there is no way to regain it.

The decision is often whether to apply to patent an invention or keep it secret. One consideration may be whether the invention is detectable in the product. If it is not, it may be better to keep it secret. If it is detectable, patenting may be the better option because the invention may become visible to the public anyway.

Examples of potential trade secrets include calculated parameters of a trained ML model, achieved accuracies, and tuning parameters of the algorithms. For model selection and configuration, patents and trade secrets can be chosen depending on the business model.

Know-how is often combined with trade secrets and relates to expertise that is held by individuals of the company and may be publicly known but only to a small number of people in the industry. As such, know-how of individuals can be crucial to a company and should be managed.

Database rights

In most countries (except those in Europe) there are no specific database rights that could be comparable to copyright. This is potentially problematic given that bioinformatics projects typically require access to and transfer of large amounts of data. Importantly, however, the data can be protected provided that access to it is associated with terms and conditions through a contractual arrangement. For example, Twitter provides a programming interface that enables the analysis of tweets, but its terms and conditions set out clearly what is not allowed, such as identification of individuals.

So when a business model of a company relies on granting access to its data, it is important that this access is controlled contractually with suitable terms and conditions so that the data itself and the commercial value associated with it does not evaporate.

Commercialisation strategies for your IP

Almost every biotech project that involves the use of software generates IP in all of the categories above. But how can commercial value be extracted from that IP? And how can the key risks for the operation of the business be identified? The risks mostly arise from licensing and contractual issues, which need to be reviewed as set out above.

IP mapping and fact finding

It should be possible to list all IP components in one document, and this will be the first step in working up an IP strategy. This can be achieved through multiple meetings between an IP professional (such as a patent attorney) and the developers of the company to discuss what they are working on. This enables the identification of IP components. The emphasis here is on the plural, ‘components’: the IP is then no longer an amorphous mass but a well-defined list of items. This list is easy to update and missing items can be identified over time. Of course, there is some skill in getting the granularity right, as there is a trade-off between an overwhelming amount of detail and an overgeneralisation.

A useful tool can be a map of IP, which shows all identified IP components on a single page. Each piece can be arranged on the page in two dimensions that are specific to the project. For example, the x-axis may represent different industry verticals or simply hardware, software and data. The y-axis can represent a progression of know-how – research – algorithms – results – products – business opportunities.
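For teams that prefer to keep the IP map in a machine-readable form, the two-dimensional arrangement can be sketched in Python as below. The axis values follow the example given above, while the class `IPComponent`, the function `build_map` and the listed components are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

# Axis values taken from the example above; a real project would choose its own.
X_AXIS = ["hardware", "software", "data"]
Y_AXIS = ["know-how", "research", "algorithms", "results", "products",
          "business opportunities"]


@dataclass(frozen=True)
class IPComponent:
    name: str
    category: str  # position on the x-axis
    stage: str     # position on the y-axis


def build_map(components):
    """Group component names by their (stage, category) cell on the map."""
    grid = defaultdict(list)
    for component in components:
        if component.category not in X_AXIS or component.stage not in Y_AXIS:
            raise ValueError(f"{component.name}: unknown axis value")
        grid[(component.stage, component.category)].append(component.name)
    return dict(grid)


# Hypothetical components from an IP stocktake.
ip_map = build_map([
    IPComponent("variant pipeline", "software", "algorithms"),
    IPComponent("trained cancer model", "software", "results"),
    IPComponent("annotated genome dataset", "data", "results"),
])
```

Each cell of the resulting dictionary corresponds to one position on the single-page map, so the list and the map stay consistent when items are added.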

Fact based decision making

The next step would then be to review the map with the management team and identify the IP components that are crucial to the business model (the ‘crown jewels’). This review should be conducted by an experienced IP professional in close collaboration with the management team. The identified components then need attention in the form of clarification of copyright and licensing issues, patenting, formal protection of secrets, and data access control. While it is not listed above, trade marks can also be a useful tool, for example to protect a business reputation for prediction accuracy achieved through a carefully managed training dataset. Overall, like any other business decision, this is a question of where to invest to achieve an optimal return in terms of business value. An IP map will provide clear facts on which to base this decision.


One important risk factor is a potential breach of contractual agreements. These agreements are typically entered into implicitly, such as by downloading software code or by accessing public data. Therefore, it is highly recommended to compile a list of licences and other agreements that the company is subject to.

A further risk is a lack of ownership of the program code that has been created. To mitigate this risk, a list of software authors needs to be created and employment contracts or other IP transfer agreements need to be reviewed and maintained.

Potential investors or collaboration partners will see significant value in opportunities where these risks are addressed along the way and information is readily available for due diligence purposes.

Examples of business models and associated IP classes

Business model | Typical IP classes
Sequence in-house, sell sequencing data, income finances research | Data access and future trade secrets and patents
Offer clinical services | Know-how and patents on quality of service (accreditation, often public)
Software as a service (pay per use) where customers send their sequencing data for analysis | Trade secrets in algorithms and ML, copyright in code
Deploy and out-licence software (for data privacy) | Copyright and patents for technical inventions
Receive test samples, train model, provide result | Data access, trade secrets for algorithms, patents for efficient calculations
Provide access to trained model | Data access
Develop expertise, sell-off to large corporate | Patents as company assets
Provide access to database | Data access

Key takeaways for IP opportunities and risks in software and big data

  • Data technology: Many biotech projects generate and process big data. Typically, this big data is a list of features annotated with observations which can be used in bioinformatics pipelines. These pipelines typically include a considerable proportion of third party code, including machine learning algorithms. The algorithms train models on the big data, which can then be applied to a test sample to predict an observation for that test sample, such as a disease diagnosis.
  • IP types: In this process, there are various different types of IP at play including copyright, patents, trade secrets/know-how and database access.
  • Strategy: Every project has a different distribution of IP across these types and a useful strategy is to identify all IP components and arrange them on an IP map. This supports informed decision making on which IP components are business critical and should therefore be allocated additional investment for protection.