Machine Learning Prediction

At Aropha, a significant portion of our effort has been dedicated to developing machine learning models that predict biodegradability.



Classification models are the most common in the literature. For ready biodegradation, each test method usually sets a pass level, commonly 60% or 70%, that determines whether a material counts as readily biodegradable. Predicting this class has therefore been the main focus of the current literature.
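The pass-level rule above can be sketched in a few lines of Python. This is a minimal illustration, not the labeling procedure of any specific study: the function name, the substance names, and the percentages are all hypothetical, and treating a value exactly at the pass level as a pass is an assumption.

```python
def label_ready_biodegradable(percent_degraded, pass_level=60.0):
    """Return 1 (readily biodegradable) if the measured percentage
    reaches the pass level, otherwise 0.

    Assumption: a value exactly equal to the pass level counts as a pass.
    """
    return 1 if percent_degraded >= pass_level else 0


# Hypothetical test results (percent biodegradation), for illustration only.
measured = {"substance_A": 82.0, "substance_B": 64.5, "substance_C": 12.0}

# The same measurements labeled under the two common pass levels.
labels_60 = {name: label_ready_biodegradable(p, 60.0) for name, p in measured.items()}
labels_70 = {name: label_ready_biodegradable(p, 70.0) for name, p in measured.items()}

print(labels_60)
print(labels_70)
```

Note that substance_B changes class when the pass level moves from 60% to 70%, which is why the pass level has to be fixed before a classification dataset is assembled.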


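In contrast to classification, regression models predict the exact biodegradation percentage of a test substance. Such predictions are usually more informative, because a predicted percentage can be compared against any pass level, whereas a class label is tied to a single one.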

ML algorithms

A wide range of algorithms has been used for model development, including linear regression, ridge, lasso, k-nearest neighbors, support vector machine, decision tree, random forest, extremely randomized trees (extra trees), bagging, adaptive boosting, gradient boosting, XGBoost, artificial neural network, deep neural network, and convolutional neural network.
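To make one of the listed algorithms concrete, here is a minimal sketch of k-nearest neighbors used as a regressor, written with the Python standard library only. It is not the implementation behind any published model: the function name `knn_predict` and the toy descriptor values are invented for illustration, and a real workflow would normally scale the descriptors and rely on an established library.

```python
import math


def knn_predict(train_X, train_y, query, k=3):
    """Predict a biodegradation percentage as the mean of the k nearest
    training substances (Euclidean distance in descriptor space)."""
    if not 1 <= k <= len(train_X):
        raise ValueError("k must be between 1 and the number of training rows")
    # Pair each training substance's distance to the query with its target.
    distances = sorted(
        (math.dist(row, query), target) for row, target in zip(train_X, train_y)
    )
    nearest = [target for _, target in distances[:k]]
    return sum(nearest) / k


# Made-up training data: two descriptors per substance and a measured
# biodegradation percentage. The values carry no chemical meaning.
train_X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
train_y = [80.0, 70.0, 90.0, 10.0]

prediction = knn_predict(train_X, train_y, query=[0.1, 0.1], k=3)
print(prediction)
```

The same function becomes a classifier if the targets are class labels and the mean is replaced by a majority vote.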

Chemical representations

Molecular descriptors

Molecular descriptors have been the primary chemical representation in machine learning for aerobic biodegradation, because the modeling results are easy to interpret: the structural and physicochemical properties behind the descriptors can be correlated directly with biodegradability.

These studies generally start with a few hundred to more than two thousand descriptors and then narrow them down to fewer than fifty, or even fewer than ten, by scoring their importance. Different studies usually arrive at different sets of descriptors, and even different models within one study can yield very different descriptor combinations.
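The narrowing step can be sketched as "score every descriptor, keep the top k". The example below uses the absolute Pearson correlation with the biodegradation percentage as the score. That is a deliberately simple stand-in, not the importance measure used in the cited studies, which often comes from the model itself. The helper names and all numbers are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def top_descriptors(table, target, k):
    """Rank descriptor columns by |correlation| with the target, keep the top k."""
    scores = {name: abs(pearson(column, target)) for name, column in table.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Invented descriptor values for five substances (illustration only).
table = {
    "MW": [100.0, 180.0, 120.0, 200.0, 150.0],
    "nRing": [1, 1, 1, 1, 1],  # constant column, carries no information
    "nX": [0, 1, 2, 3, 4],
}
biodegradation_percent = [90.0, 70.0, 50.0, 30.0, 10.0]

selected = top_descriptors(table, biodegradation_percent, k=2)
print(selected)
```

Because the ranking depends on the scoring method and on the data, a different score or dataset can return a different subset, which is consistent with the variation between studies described above.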

However, all these descriptors directly represent structural or physicochemical properties that are closely related to biodegradability, such as the presence of halogens, chain branching, nitro groups, rings, and certain functional groups.[1]

For example, one study initially incorporated 500 descriptors but eventually found that 8 were the most important: LogKow (AlogP), XlogP, molecular weight (MW), topological polar surface area (TopoPSA), number of rings (nRing), nlow highest polarizability weighted BCUTS (BCUTp-1h), number of hydrogen bond acceptors (nHBAcc), and number of hydrogen bond donors (nHBDon).[6] Other studies also reported the numbers of nitrogen, sulfur, oxygen, and halogen atoms (nN, nS, nO, nX), autocorrelation of a topological structure (ATS3p), number of aromatic esters (nArCOOR), number of aromatic nitro groups (nArNO2), number of substituted benzene C(sp2), and many other descriptors to be directly related to biodegradation behavior.[1-5]


At Aropha, we have built two biodegradation prediction models, covering both regression and classification tasks. Please visit our site, ArophaAI, for more details about these predictors.


  1. Mansouri, K.; Ringsted, T.; Ballabio, D.; Todeschini, R.; Consonni, V. Quantitative structure-activity relationship models for ready biodegradability of chemicals. J. Chem. Inf. Model. 2013, 53 (4), 867-78.
  2. Cheng, F.; Ikenaga, Y.; Zhou, Y.; Yu, Y.; Li, W.; Shen, J.; Du, Z.; Chen, L.; Xu, C.; Liu, G.; Lee, P. W.; Tang, Y. In silico assessment of chemical biodegradability. J. Chem. Inf. Model. 2012, 52 (3), 655-69.
  3. Putra, R. I. D.; Maulana, A. L.; Saputro, A. G. Study on building machine learning model to predict biodegradable-ready materials. AIP Conference Proceedings. 2019, 2088 (1), 060003.
  4. Chen, G.; Li, X.; Chen, J.; Zhang, Y. N.; Peijnenburg, W. J. Comparative study of biodegradability prediction of chemicals using decision trees, functional trees, and logistic regression. Environ. Toxicol. Chem. 2014, 33 (12), 2688-93.
  5. Acharya, K.; Werner, D.; Dolfing, J.; Barycki, M.; Meynet, P.; Mrozik, W.; Komolafe, O.; Puzyn, T.; Davenport, R. J. A quantitative structure-biodegradation relationship (QSBR) approach to predict biodegradation rates of aromatic chemicals. Water Res. 2019, 157, 181-190.