Data mining still plays a minor role in the research-based pharmaceutical industry, but this is likely to change. Mathematician Hans-Jürgen Lomp predicts that exploratory data analysis will be used to a greater extent in the future. Lomp is the Global Head of Statistics in Boehringer Ingelheim Pharma GmbH & Co KG’s Department of Medical Data Service and Biostatistics in Biberach.
The confirmatory analysis of data obtained in clinical phase III trials remains the core business of any research-based pharmaceutical company and involves what is known as the learn and confirm strategy. This approach is essentially used in clinical trials to verify one or several hypotheses and provide data-generated proof of the efficacy and safety of a test substance. Ideally, the results obtained in phase III trials confirm the phase II results on drug efficacy and safety.
Confirmatory clinical data analysis is the confirmation or rejection of something for which there are a priori expectations, i.e. a hypothesis formulated before the data are examined. Exploratory analysis is the exact opposite. As Lomp points out, exploratory analysis uses a variety of well-structured statistical methods that allow the data to “speak for themselves and tell a story”, which, in contrast to confirmatory analysis, is not yet known. “However, data mining will never be able to do more than generate hypotheses,” says Lomp. When hypotheses generated by data mining appear plausible and robust, the pharmaceutical manufacturer then verifies them by carrying out further (confirmatory) trials.
Drug developers can never guarantee that there will be no surprises. In principle, however, the clinical development of a pharmaceutical compound is a strictly sequential and rational process that progresses from one evidential step to the next. Lomp believes that this is one of the reasons why any additional information that data mining produces usually has little influence on the process. In clinical drug development, data mining is used for affirming hypotheses put forward either by drug producers themselves or by interested bodies (regulatory authorities and academic researchers).
Lomp predicts that the mining of data in clinical settings will become really interesting once all phase III data obtained for all drugs of a certain substance class have been pooled and analysed. He is convinced that this will lead to a much broader database that can then be used for network meta-analyses, individual patient data meta-analyses and analyses using data-mining methods, and that these analyses will potentially uncover previously undetected, rare, but serious adverse drug effects. Moreover, if a specific drug led to unexpected results, the investigators would be able to carry out similar investigations on drugs of the same substance class in order to find out whether the unexpected finding also holds true for these other drugs. Last but not least, data mining also enables the accurate examination of the positive and negative effects of the drug under investigation in rare but vulnerable patient groups (e.g. elderly patients or people with kidney or liver disease). However, it remains to be seen who will have the capacity to mine such large amounts of data. Lomp believes that it is the government’s responsibility to create the required capacity, at university institutes for example.
In future, clinical trial information will need to be made more accessible to relevant audiences, and pharmaceutical producers would therefore be well advised to have their data systematically analysed with data-mining techniques in order to avoid oversights. This is essential in an industry where anybody can patent new, i.e. previously unknown drug characteristics as long as they can prove that the discovery was made on the basis of data evidence that they themselves have produced. Lomp therefore identifies this as a new field of activity for pharmaceutical statisticians.
Back to the principle of confirmatory analysis: Clinical phase III trial data are used to test the hypotheses formulated in phase I and II trials on a large representative patient collective. Such tests follow a carefully planned analysis strategy to validate the efficacy and safety of the substance under investigation. A phase III trial report, including tables and statistical evaluations, can be over 15,000 pages long, without taking into account other documents that have to be submitted alongside the main dossier, including the protocol, the analysis plan, the kineticists’ report, validated analysis methods and qualified laboratory parameters.
In general, evidence from at least two independent controlled clinical phase III trials is required for drug approval. Any pharmaceutical company that submits an application for a marketing authorization must also append the results of all preclinical and clinical trials (phase I to III) as individual reports and a structured summary of all phase II and III trials with regard to the efficacy and safety of the drug. When they collate this document, Lomp and his colleagues need to ensure that they are providing a comprehensive overview of the data from all of the trials.
Only this overall approach potentially allows for the discovery of rare adverse events and statements to be made on the occurrence of rare but clinically important consequences of chronic diseases such as heart attack or stroke.
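The arithmetic behind this point can be sketched in a few lines of Python. The event rate and patient numbers below are hypothetical, chosen only for illustration, and the calculation assumes that patients experience the event independently of one another. Under these assumptions, a single trial with 3,000 patients has roughly a one-in-four chance of observing even one case of an adverse event that occurs in 1 of 10,000 patients, whereas a pooled database of 30,000 patients would observe at least one case with a probability of about 95 percent.

```python
def prob_at_least_one_event(event_rate: float, n_patients: int) -> float:
    """Probability of seeing at least one event among n independent patients."""
    return 1.0 - (1.0 - event_rate) ** n_patients


# Hypothetical values, not taken from any real trial.
ASSUMED_EVENT_RATE = 1 / 10_000   # true frequency of a rare adverse event
SINGLE_TRIAL_SIZE = 3_000         # patients in one phase III trial
POOLED_SIZE = 30_000              # patients after pooling several trials

for label, n in [("single trial", SINGLE_TRIAL_SIZE), ("pooled data", POOLED_SIZE)]:
    p = prob_at_least_one_event(ASSUMED_EVENT_RATE, n)
    print(f"{label:>12}: n = {n:>6}, P(at least one event) = {p:.2f}")
```

Observing a single case is of course not the same as demonstrating a causal link; the sketch only shows why small individual trials can easily miss rare events altogether.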
Amongst other things, data-mining methods are used when the regulatory authorities require information on the efficacy and safety of a drug under investigation in a particular ethnic population in North America, Europe or Asia. In such cases, the generation of evidence is more difficult because the population in question is comparatively small and also because the large number of data-driven questions makes it extremely difficult to differentiate between a random finding and a “real signal”. In order to obtain reliable results, pharmaceutical statisticians like Lomp tend to analyse the data obtained in a large number of studies to see whether individual trials have produced the same results. However, pharmaceutical statisticians are well aware that the smaller the target group, the more uncertain the merits of a drug under investigation are.
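Why a large number of data-driven questions makes random findings so likely can be illustrated with a short Python sketch. It assumes, purely for illustration, that the questions are statistically independent, that each is tested at the conventional 5 percent significance level, and that no real effect exists. The Bonferroni correction shown alongside is one common, conservative remedy; it is used here as an example and is not described in the interview as Boehringer Ingelheim’s method.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across n independent tests
    when none of the tested effects is real."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance level that keeps the overall error rate at or below alpha."""
    return alpha / n_tests


for n_tests in (1, 10, 50):
    print(
        f"{n_tests:>3} tests: "
        f"P(at least one chance finding) = {familywise_error_rate(n_tests):.2f}, "
        f"Bonferroni per-test level = {bonferroni_threshold(n_tests):.4f}"
    )
```

With ten independent questions the chance of at least one spurious “signal” is already about 40 percent, which is why findings in small subgroups are treated as hypotheses rather than as evidence.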
Regulatory authorities and drug developers are all interested in the exploratory aspects of the data. In fact, drug developers have a particular interest in ensuring that a drug is only given to patients for whom a positive benefit-risk ratio can be demonstrated, i.e. for whom the benefit clearly outweighs the potential risk. When applying for marketing authorization, a drug producer is required to demonstrate a favourable benefit-risk ratio for drugs that are given to particularly vulnerable patients. If this proves impossible, the drug must carry a health warning or a different dose recommendation.
Developing a drug takes many years and the unexpected can sometimes occur, such as situations where new scientific findings either require previous hypotheses to be reviewed or render them obsolete. A clinical project team therefore always has people in the non-clinical drug development area working on the comprehensive characterisation of the drug under investigation. “The constant feedback between clinical and non-clinical teams is extremely important and on-going data exchange is essential for the comprehensive characterisation of the drug”, says Lomp.
Clinical trials may be extremely comprehensive and are, of course, essential for the approval of a drug, but it needs to be taken into account that the data are open to interpretation and something might be missed. Once a medicine has entered the market, statisticians will focus on identifying additional information from user data, as well as carrying out what is known as secondary research, i.e. the analysis of data from groups that were not prespecified. This targeted search for additional signals tends to generate a large number of hypotheses, and always takes place alongside and following clinical phase III trials. Information from long-term observational studies is therefore released over a period of many years after drug approval in order to identify potential new safety findings (for example, the RE-LY study assessing the safety of Pradaxa® and the ONTARGET study assessing the safety of Micardis®, both drugs developed by Boehringer Ingelheim).
Boehringer Ingelheim has recently extended the principle of data transparency (i.e. publishing the scientific results of all its studies in peer-reviewed journals and at scientific meetings, regardless of study outcome as stipulated by the regulatory authorities on www.clinicaltrials.gov) to giving qualified independent researchers on-request access to de-identified original clinical trial data. The company’s transparency commitments apply to studies initiated after January 1st 1998 (trials.boehringer-ingelheim.com/trial_results.html).
Researchers from academic institutions such as the Cochrane Collaboration and the Wellcome Trust can apply for access to Boehringer Ingelheim’s clinical trial data. The decision as to whether to allow access is made by an independent five-member panel of renowned experts.
Authorization for independent researchers to mine clinical trial data goes back to the international AllTrials campaign in 2013, which called for all past and present clinical trials to be registered and their full methods and summary results reported (www.alltrials.net).
Multivariate (logistic) regression is the method most frequently used for the exploratory analysis of clinical data. Generally speaking, regression analysis takes into account explanatory variables (age, weight, lung function, treatment) and their effect on disease progression. Based on one or several endpoints (e.g. lowering of the blood pressure or a targeted blood glucose level after 12 months of treatment), the approach assesses how and to what extent a large number of variables affect this endpoint. An important issue is whether the variables affect, i.e. interact with, each other. In clinical drug trials, it is also important to distinguish between general risk factors for diseases (e.g. smoking has been shown to increase the risk of diabetes complications) and specific effects on a drug’s efficacy (mutual drug interactions). There are a number of other data-mining techniques that can also be used for regression and classification. Most of these, including methods such as Random Forest, Support Vector Machines and k-Nearest Neighbour, have been developed in the field of machine learning research.
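What a logistic regression with an interaction term looks like in practice can be sketched in Python using only the standard library. Everything in this example is an assumption made for illustration: the patients are simulated, the “true” effect sizes are invented, and the simple gradient-ascent fit stands in for the statistical software a pharmaceutical statistician would actually use. The model has a binary endpoint (reached or not reached), two binary explanatory variables (treatment and smoking) and their product as the interaction term.

```python
import math
import random
from collections import Counter


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def simulate(n: int, seed: int = 0):
    """Synthetic patients: (treated, smoker, endpoint reached), all 0 or 1."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        treated = rng.randint(0, 1)
        smoker = rng.randint(0, 1)
        # Invented "true" model: treatment raises the odds of reaching the
        # endpoint, smoking lowers them, and smoking also weakens the
        # treatment effect (negative interaction).
        logit = -0.5 + 1.5 * treated - 0.8 * smoker - 1.0 * treated * smoker
        reached = 1 if rng.random() < sigmoid(logit) else 0
        data.append((treated, smoker, reached))
    return data


def fit_logistic(data, learning_rate: float = 1.0, epochs: int = 5000):
    """Estimate intercept, two main effects and their interaction
    by gradient ascent on the log-likelihood."""
    # Identical (covariates, outcome) rows are collapsed into weighted cells
    # so that the pure-Python loop stays fast.
    cells = Counter(((1.0, t, s, t * s), y) for t, s, y in data)
    n = len(data)
    beta = [0.0, 0.0, 0.0, 0.0]
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0, 0.0]
        for (x, y), count in cells.items():
            p = sigmoid(sum(b * xi for b, xi in zip(beta, x)))
            for j, xj in enumerate(x):
                grad[j] += count * (y - p) * xj
        beta = [b + learning_rate * g / n for b, g in zip(beta, grad)]
    return beta


names = ["intercept", "treatment", "smoker", "treatment x smoker"]
coefficients = fit_logistic(simulate(4000))
for name, value in zip(names, coefficients):
    # exp(coefficient) is the odds ratio associated with that term.
    print(f"{name:>20}: beta = {value:+.2f}, odds ratio = {math.exp(value):.2f}")
```

In this sketch the “smoker” coefficient plays the role of a general risk factor, while the “treatment x smoker” coefficient captures a specific effect on the drug’s efficacy: a clearly negative estimate would suggest that the treatment works less well in smokers. As with any exploratory result, such an estimate would only be a hypothesis to be confirmed in a further trial.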