DESIGN OF LOW CYTOTOXICITY DIARYLANILINE DERIVATIVES BASED ON QSAR RESULTS: AN APPLICATION OF ARTIFICIAL NEURAL NETWORK MODELLING

A quantitative structure-activity relationship (QSAR) study of the cytotoxicity of diarylaniline derivatives has been carried out. The structures and cytotoxicities of the diarylaniline derivatives were obtained from the literature. Molecular and electronic parameters were calculated using the Austin Model 1 (AM1), Parameterized Model 3 (PM3), Hartree-Fock (HF), and density functional theory (DFT) methods. Artificial neural network (ANN) analysis produced the best model, with a configuration of input data-hidden node-output data = 5-8-1, r² = 0.913, and PRESS = 0.069. The best model was used to design new diarylaniline derivatives and to predict their cytotoxicities. The results show that the compound N1-(4′-Cyanophenyl)-5-(4′′-cyanovinyl-2′′,6′′dimethyl-phenoxy)-4-dimethylether benzene-1,2-diamine is the best proposed compound, with a predicted cytotoxicity value (CC50) of 93.037 μM.


INTRODUCTION
Toxicity remains a major safety concern behind drug withdrawals, 'black box' warnings, and the discontinuation of clinical trials (such as the withdrawal of Pfizer's hypercholesterolemia drug torcetrapib from Phase III). An analysis of first-in-human registrations for ten large pharmaceutical companies showed an overall success rate of only 10% in reaching final FDA approval. The failure rate is even higher when all drug candidates in preclinical research are included in the statistics. Traditional drug safety testing approaches include in vivo animal models and in vitro cell-based assays; more recently, in silico assessment has also been introduced. QSAR-based expert systems are mainly used in early drug discovery to predict toxicological endpoints (Sun and Scott, 2010).
Molecular modeling and computational chemistry are now an indispensable part of drug discovery and design. Computational methods can save time and money in finding new drugs. Among the computational methods in drug design, QSAR is the most widely used. The QSAR method studies the relationship between molecular and electronic parameters and the activity or toxicity of a series of analogous compounds. These parameters are obtained from calculations using established quantum mechanical methods (Hemmateenejad et al., 2009). In some earlier studies, multiple linear regression methods were unable to provide a good model, so the artificial neural network method was used to produce non-linear models.
Ekins and Williams (2012) noted that predicting human toxicity directly from molecular structure is feasible. By using the experimental properties of known compounds as the basis of predictive models, it is possible to develop structure-activity relationships and resulting algorithms related to toxicity. Gacche and Jadhav (2012) reported a model relating the toxicity of coumarin derivatives to their molecular parameters. Hosseini et al. (2013) modeled the cytotoxicities of substituted amides of pyrazineβ-carboxylic acids against their molecular parameters; the best model had r² = 0.922. Low et al. (2011) reported a study on predicting drug-induced hepatotoxicity using QSAR and toxicogenomics approaches, with an external predictivity of 76%. Ruiz et al. (2012) obtained a relatively good model of acute mammalian toxicity using the QSAR method; the best model had an r² value of 0.929 (T.E.S.T model). Sun et al. (2012) synthesized twenty derivatives of diarylaniline (DAAN), whose parent structure is shown in Figure 1, and tested their anti-HIV activities and cytotoxicities. Their results were promising: the derivatives showed lower EC50 and higher CC50 values than the control drug (rilpivirine). A QSAR modeling study of the anti-HIV activity of the DAAN derivatives was carried out in our previous research (Arief et al., 2013). In this study, we therefore performed QSAR modeling of the cytotoxicities of DAAN derivatives in order to design new compounds with lower cytotoxicity.

COMPUTATIONAL METHODS
Data Set
The total set of compounds (Table 1) was divided into a training set (15 compounds) for generating the QSAR models and a test set (5 compounds) for validating their quality. The selection of molecules for the training and test sets is a key feature of any QSAR model. Care was therefore taken that the biological activities of all compounds in the test set lie within the range between the minimum and maximum biological activities of the training set.
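As an illustration, the range criterion described above, together with simple per-column (uni-column) statistics, could be checked with a short script. This is only a minimal sketch: the function names and the example activity values are hypothetical and are not taken from Table 1.

```python
from statistics import mean, stdev

def column_stats(values):
    """Uni-column statistics (mean, sample standard deviation, min, max)."""
    return {"mean": mean(values), "std": stdev(values),
            "min": min(values), "max": max(values)}

def within_training_range(train_values, test_values):
    """True if every test-set activity lies inside the training-set range."""
    lo, hi = min(train_values), max(train_values)
    return all(lo <= v <= hi for v in test_values)

# Hypothetical log CC50 values, for illustration only.
train_log_cc50 = [0.8, 1.2, 1.5, 1.9, 2.1]
test_log_cc50 = [1.0, 1.7]

print(column_stats(train_log_cc50))
print(column_stats(test_log_cc50))
print(within_training_range(train_log_cc50, test_log_cc50))
```

Comparing the two `column_stats` results side by side gives the kind of check that a uni-column statistics table summarizes.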

Descriptor Calculation
The basis of energy minimization is that a drug binds to its effector/receptor in its most stable form, i.e., the minimum-energy form. A QSAR study requires the calculation of molecular descriptors. In this study, the structural geometries of the data set were optimized using the Austin Model 1 (AM1), Parameterized Model 3 (PM3), Hartree-Fock (HF), and density functional theory (DFT) methods in the Gaussian 09W package (Frisch et al., 2009).

Model Development
The QSAR model was first generated by multiple linear regression (MLR) with the backward method using the SPSS Release 19.0.0 package (IBM, 2010). MLR relates the dependent variable ŷ (biological activity) to a number of independent variables xi (molecular descriptors) through a linear equation. The regression coefficients are estimated by least-squares curve fitting. MLR is the traditional and standard approach for multivariate data analysis. The best model was chosen based on statistical parameters such as r², the standard error of estimate (SEE), the F-ratio between the variance of predicted and observed activity, and the predicted residual sum of squares (PRESS) (Podunavac-Kuzmanović et al., 2009).
Because the linear analysis produced models that did not pass the validation test, the analysis proceeded with a non-linear method, artificial neural networks (ANN), as performed by Deeb and Jawabreh (2012), using the MATLAB package. Before the ANN analysis, the data were examined for outliers. The examination was carried out by plotting the first principal component against the second principal component and observing the distribution of the data. Points lying apart from most of the others were considered outliers and were not included in the ANN analysis. For the ANN analysis, the parameter data were normalized and then used as the input data. The number of hidden nodes was varied in the range of 3-15, with the sigmoid function as the activation function. The mean square error (MSE) value was set to 1 × 10⁻⁶.
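The statistical parameters used to select the MLR model can be sketched as follows. The definitions below are the ones commonly used in the QSAR literature and are assumed here, since the exact expressions are not reproduced in this text; the function name and the example numbers are hypothetical. In this sketch PRESS is taken as the plain sum of squared residuals.

```python
import numpy as np

def regression_statistics(y_obs, y_pred, n_descriptors):
    """r2, SEE, F-ratio and PRESS for a fitted linear model.

    Assumes y_pred are fitted values of a least-squares model with an
    intercept, so that SS_total = SS_regression + SS_residual.
    """
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n, k = y_obs.size, n_descriptors
    dof = n - k - 1                                   # residual degrees of freedom
    ss_res = float(np.sum((y_obs - y_pred) ** 2))     # residual sum of squares
    ss_tot = float(np.sum((y_obs - y_obs.mean()) ** 2))
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "SEE": float(np.sqrt(ss_res / dof)),
        "F": ((ss_tot - ss_res) / k) / (ss_res / dof),
        "PRESS": ss_res,
    }

# Hypothetical observed and fitted values, one descriptor.
stats = regression_statistics([1.0, 2.0, 3.0, 4.0, 5.0],
                              [1.1, 1.9, 3.0, 4.1, 4.9],
                              n_descriptors=1)
print(stats)
```

The calculated F value would then be divided by the tabulated F value for the same degrees of freedom to obtain the Fcal/Ftab ratio.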

Model Validation
The best model chosen was then used to predict the log CC50 values of the test set. A model is considered validated if it passes criteria such as r²pred > 0.5 and r²m > 0.5 (Hu et al., 2009). The quantities r² and r₀² between the observed and predicted values are calculated from the test set with and without intercept, respectively.
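A minimal sketch of these two validation metrics is given below. The expressions are the forms commonly used in the QSAR literature and are an assumption here, because the equations themselves are not reproduced in this text; the function names are hypothetical.

```python
import numpy as np

def r2_pred(y_train_obs, y_test_obs, y_test_pred):
    """Predictive r2: test-set residuals relative to the training-set mean."""
    y_test_obs = np.asarray(y_test_obs, dtype=float)
    y_test_pred = np.asarray(y_test_pred, dtype=float)
    train_mean = float(np.mean(y_train_obs))
    ss_res = np.sum((y_test_obs - y_test_pred) ** 2)
    ss_ref = np.sum((y_test_obs - train_mean) ** 2)
    return float(1.0 - ss_res / ss_ref)

def r2_m(y_obs, y_pred):
    """r2_m = r2 * (1 - sqrt(|r2 - r0_2|)).

    r2 is the squared correlation (regression with intercept);
    r0_2 comes from the regression through the origin (no intercept).
    """
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    r2 = np.corrcoef(y_obs, y_pred)[0, 1] ** 2
    slope = np.sum(y_obs * y_pred) / np.sum(y_pred ** 2)   # line through origin
    r0_2 = 1.0 - (np.sum((y_obs - slope * y_pred) ** 2)
                  / np.sum((y_obs - y_obs.mean()) ** 2))
    return float(r2 * (1.0 - np.sqrt(abs(r2 - r0_2))))

def passes_validation(r2_pred_value, r2_m_value, threshold=0.5):
    """Apply the r2_pred > 0.5 and r2_m > 0.5 criteria."""
    return r2_pred_value > threshold and r2_m_value > threshold
```

With these helpers, a model would be accepted only when `passes_validation(r2_pred(...), r2_m(...))` is true for the test set.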

Design and Toxicities Prediction of New Compounds
Based on the validated model, new diarylaniline derivatives were designed by replacing the substituents. The new compounds were then optimized and their descriptors calculated. The calculated descriptors were used to predict the CC50 values of the new compounds with the validated model. The best new compound was chosen based on its CC50 value, which should be higher than that of rilpivirine.

Model Development
The model was developed by statistical analysis using the backward method. In the first step of this method, all descriptors are included in the model. In each subsequent step, non-significant descriptors are excluded from the model and the regression parameters are recalculated. This procedure is repeated until a simpler model (with fewer descriptors) is obtained in which the remaining descriptors are still significant (significance less than 0.05).
Based on the values of r² (> 0.6) and Fcal/Ftab (> 1) of each model, only the AM1 and PM3 models were eligible to proceed to model validation.

Model Validation
The models that passed these requirements were then used to predict the log CC50 values of the test set. The results are shown in Table 4. The data in Table 4 were then plotted to check another parameter, r²pred. Unfortunately, the AM1 and PM3 models have very low r²pred values of 0.059 and 0.003, respectively. These values do not meet the criterion for a valid QSAR model (r²pred > 0.5). This shows that neither model obtained by the linear method can predict compounds outside the training set well, so another analysis was needed to obtain a better model. In this research, the toxicity study was therefore continued with a non-linear analysis using artificial neural networks (ANN).
The ANN analysis was applied to the independent and dependent variables obtained from the previous MLR analysis. ANN analysis searches for the values of the "weights" that describe the relationship between layers so as to produce an output value. The backpropagation process then iteratively reduces the error value.
Before the ANN analysis, the twenty data points were checked by principal component analysis (PCA) for outliers that should be excluded, as done by Deeb and Drabh (2010) and Deeb and Jawabreh (2012). As an example, the result of the outlier analysis for the AM1 data is shown in Figure 2. Compounds 6, 11, and 18 in the order of the total series are regarded as outliers, so they were not included in the subsequent ANN analysis. The outlier examination was also performed on the three other models. Outliers may interfere with the modeling results because they follow a different trend from most of the other data. Excluding outliers can improve the resulting model: Eroglu et al. (2007) reported MLR models whose r² value rose from 0.837 (using 18 data) to 0.943 (using 16 data). Similarly, in the study of Wang et al. (2012), the value of q² increased from 0.258 (using 17 data) to 0.582 (using 15 data).
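The PCA-based outlier check can be sketched in a few lines of NumPy. The projection onto the first two principal components follows the description above. The automatic flagging rule (distance from the centre larger than three times the median distance) is purely an assumption for illustration; in this study the outliers were identified by inspecting the score plot. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Scores of the standardized descriptors on the leading principal components."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # autoscale each descriptor
    # Rows of Vt are the principal axes, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:n_components].T

def flag_outliers(scores, factor=3.0):
    """Indices whose distance from the origin of the score plot exceeds
    factor * median distance (an illustrative rule, not the authors')."""
    dist = np.linalg.norm(scores, axis=1)
    return np.flatnonzero(dist > factor * np.median(dist))

# Synthetic descriptor matrix: 20 compounds x 5 descriptors, one shifted row.
rng = np.random.default_rng(0)
X_demo = rng.normal(0.0, 0.01, size=(20, 5))
X_demo[6] += 5.0
scores_demo = pca_scores(X_demo)
print(flag_outliers(scores_demo))
```

Plotting the first column of the scores against the second gives the kind of PC1 versus PC2 plot used for visual inspection.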
In the ANN analysis, the number of nodes in the hidden layer was varied in the range of 3-15 to find the best model. The r²pred values of the test set as a function of the number of hidden nodes are compared in Figure 3. Figure 3 shows that each model reaches its highest r²pred value at a different configuration. Only three models (AM1, PM3, and DFT) have r²pred values that meet the validity criterion (> 0.5); the HF model does not, because its r²pred value is only 0.445. The highest r²pred value was obtained with the AM1 model (0.985) at the configuration I-H-O = 5-8-1. Its 8 hidden nodes are the smallest number (the simplest configuration) among the models; the others require 13 (DFT model), 9 (PM3 model), and 12 (HF model) hidden nodes. Therefore, the AM1 model is regarded as the best model for predicting the toxicity of diarylaniline derivatives.
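The hidden-node scan can be illustrated with a small feed-forward network written in NumPy. This is a sketch under stated assumptions and not the MATLAB procedure used in this study: the linear output node, the plain gradient-descent backpropagation, the learning rate, the weight initialization, and all function names are choices made here for illustration.

```python
import numpy as np

def minmax_fit(X):
    """Column-wise minimum and maximum of the training descriptors."""
    X = np.asarray(X, dtype=float)
    return X.min(axis=0), X.max(axis=0)

def minmax_apply(X, lo, hi):
    """Normalize descriptors to [0, 1] using the training-set limits."""
    return (np.asarray(X, dtype=float) - lo) / (hi - lo)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_network(n_inputs, n_hidden, seed=0):
    """I-H-O network with one output node, e.g. 5-8-1."""
    rng = np.random.default_rng(seed)
    return {"W1": rng.normal(0.0, 0.5, (n_inputs, n_hidden)),
            "b1": np.zeros(n_hidden),
            "W2": rng.normal(0.0, 0.5, (n_hidden, 1)),
            "b2": np.zeros(1)}

def forward(net, X):
    hidden = sigmoid(X @ net["W1"] + net["b1"])          # sigmoid hidden layer
    output = (hidden @ net["W2"] + net["b2"]).ravel()    # linear output (assumption)
    return hidden, output

def train(net, X, y, lr=0.1, epochs=2000, mse_goal=1e-6):
    """Full-batch backpropagation; stops early once the MSE goal is reached."""
    n = len(y)
    for _ in range(epochs):
        hidden, output = forward(net, X)
        error = output - y
        if np.mean(error ** 2) <= mse_goal:
            break
        d_out = (2.0 / n) * error[:, None]               # dMSE/d(output)
        d_hid = (d_out @ net["W2"].T) * hidden * (1.0 - hidden)
        net["W2"] -= lr * (hidden.T @ d_out)
        net["b2"] -= lr * d_out.sum(axis=0)
        net["W1"] -= lr * (X.T @ d_hid)
        net["b1"] -= lr * d_hid.sum(axis=0)
    return float(np.mean((forward(net, X)[1] - y) ** 2))

def scan_hidden_nodes(X_train, y_train, X_test, y_test,
                      hidden_range=range(3, 16), **train_kwargs):
    """Train one network per hidden-layer size and return its test-set r2_pred."""
    ss_ref = np.sum((y_test - y_train.mean()) ** 2)
    scores = {}
    for n_hidden in hidden_range:
        net = init_network(X_train.shape[1], n_hidden)
        train(net, X_train, y_train, **train_kwargs)
        y_pred = forward(net, X_test)[1]
        scores[n_hidden] = float(1.0 - np.sum((y_test - y_pred) ** 2) / ss_ref)
    return scores
```

Calling `scan_hidden_nodes` on normalized training and test descriptors returns one r²pred value per hidden-layer size, from which the best configuration can be picked, for example with `max(scores, key=scores.get)`.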
As an additional validity test, the r²m and PRESS values were determined for each model. The results of the ANN analysis for the four best models are given in Table 5. Table 5 shows that the r²m parameter of the PM3 model does not meet the minimum threshold (> 0.5). Of the AM1 and DFT models, which both meet the minimum, the DFT model has the higher r²m value (0.701) but a lower r²pred and a higher PRESS value than the AM1 model.
Therefore, the AM1 model was concluded to be the best model. This is also supported by the fact that geometry optimization with the AM1 method requires less time than with DFT. That the AM1 model is the best suggests that, in toxicity studies, energy calculations involving only the valence electrons can describe the relationship between the parameters and the toxicities of diarylaniline derivatives. In contrast, the HF method, which involves all electrons (core and valence), and the DFT method, which is based on a functional of the electron density, describe the relationship less well. The plot of observed log CC50 values against the values predicted by the AM1 model is shown in Figure 4.

Design and Activities Prediction of New Compounds
In the design of new compounds, the substituents were replaced with other groups or chemical species. These species or groups were selected based on synthetic feasibility and the availability of materials. The synthesis was also expected to take only one or two steps, to keep the yield acceptable. The log CC50 values of the compounds were determined by the proposed AM1 model with the atomic charges qC2, qC4, qC14, qC23, and qN33 as input data (5-8-1 configuration). Table 6 shows the 10 designed compounds with their predicted log CC50 and CC50 values. Table 6 shows that, in general, the best new compounds have substituent groups capable of hydrogen bonding. In addition, most of the best compounds have R2 groups containing a double bond. This indicates that the double bond in R2 also has a significant effect on the cytotoxicity mechanism of DAAN derivatives toward human liver microsomes. Compound 1 has a predicted CC50 value of 93.037 µM, which is higher than that of rilpivirine (24.4 nM). This compound is therefore assumed to be safer as a drug candidate and is proposed for synthesis and subsequent in vitro and in vivo testing.
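The conversion from a predicted log CC50 to CC50 and the comparison with rilpivirine can be written out explicitly. The sketch assumes that the logarithm is base 10 and that the model's CC50 values are expressed in µM; neither is stated explicitly in the text, and the function names are hypothetical.

```python
def cc50_from_log(log_cc50):
    """Convert a predicted log CC50 (base 10 assumed) back to CC50."""
    return 10.0 ** log_cc50

def micromolar_to_nanomolar(concentration_um):
    """1 µM = 1000 nM."""
    return concentration_um * 1000.0

# Values quoted in the text: proposed compound 1 (µM) and rilpivirine (nM).
proposed_cc50_nm = micromolar_to_nanomolar(93.037)
rilpivirine_cc50_nm = 24.4

# A higher CC50 means a lower cytotoxicity.
is_less_cytotoxic = proposed_cc50_nm > rilpivirine_cc50_nm
print(proposed_cc50_nm, rilpivirine_cc50_nm, is_less_cytotoxic)
```

Bringing both values to the same unit before comparing them avoids mixing µM and nM figures.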

Figure 1. Parent structure of diarylaniline derivative

Figure 2. The result of the outlier check using principal component analysis

Figure 3. Result of artificial neural network modeling

Figure 4. Plot of predicted and observed log CC50 values from the AM1 model

Uni-column statistics for the training and test sets were generated to check the correctness of the selection criteria for training and test set molecules (Table 2). Table 2 shows that the average and standard deviation values of the training and test sets do not differ significantly, indicating a similar data distribution in both.

Table 2. Uni-column statistics of the training and test sets for QSAR models

Table 3. Statistical parameters of the 4 QSAR models of diarylaniline derivatives

Table 4. Comparison of observed and predicted values of log CC50

Table 5. Additional statistical parameters from the ANN models

Table 6. Newly designed diarylaniline derivatives and their predicted log CC50 and CC50 values using the AM1 model