hku_diabetes package
Submodules
hku_diabetes.analytics module
Core data analytics logic.
class hku_diabetes.analytics.Analyser(*, config: Type[hku_diabetes.config.DefaultConfig] = <class 'hku_diabetes.config.DefaultConfig'>)
Bases: object
Execute core analytics logic.
This class implements the main execution sequence of the HKU diabetes regression analysis. It saves the results of the regression and CKD thresholds as csv, and all other intermediate steps as pickle.
Parameters: config – Configuration class, defaults to DefaultConfig.
Attributes:
- patient_ids – A list of valid patient IDs analysed.
- intermediate – A dictionary of all objects in intermediate steps.
- results – A dictionary containing regression results and ckd values.
load() → Dict[str, pandas.core.frame.DataFrame]
Load analytics results from file.
Call this method to load the previous analytics results. The calling script should catch FileNotFoundError and fall back to the run method.
Raises: FileNotFoundError – No results files are found in config.results_path.
Returns: A dictionary containing results for regression and ckd as DataFrames.
Example
>>> from hku_diabetes.analytics import Analyser
>>> from hku_diabetes.importer import import_all
>>> analyser = Analyser()
>>> try:
...     results = analyser.load()
... except FileNotFoundError:
...     data = import_all()
...     results = analyser.run(data)
run(data: Dict[str, pandas.core.frame.DataFrame]) → Dict[str, pandas.core.frame.DataFrame]
Execute the main data analytics sequence.
Call this method to execute the actual data analytics. All results are saved in the path specified by config.results_path.
Parameters: data – A dictionary containing at least Creatinine, Hba1C, and Demographics as DataFrames.
Returns: A dictionary containing results for regression and ckd as DataFrames.
Example
>>> from hku_diabetes.analytics import Analyser
>>> from hku_diabetes.importer import import_all
>>> analyser = Analyser()
>>> data = import_all()
>>> results = analyser.run(data)
hku_diabetes.analytics.analyse_subject(data: Dict[str, pandas.core.frame.DataFrame], patient_id: int, config: Type[hku_diabetes.config.DefaultConfig] = <class 'hku_diabetes.config.DefaultConfig'>) → Union[None, dict]
Compute the regression result and ckd values for one subject.
This function takes the data of one subject and computes the corresponding regression results and ckd values. It is called by Analyser.run via a ProcessPoolExecutor. It checks that Creatinine and Hba1C each have the minimum number of rows required by config.min_analysis_samples, and returns None if the check fails.
Parameters: - data – A dictionary containing at least Creatinine, Hba1C, and Demographics as DataFrames, with rows for one subject only.
- patient_id – ID of the patient as int.
- config – Configuration class, defaults to DefaultConfig.
Returns: Either None or a dictionary of results including regression and ckd, as well as intermediate steps including patient_id, Creatinine, Hba1C, regression, ckd, Creatinine_LP, and cumulative_Hba1C.
Example
>>> from hku_diabetes import analytics
>>> from hku_diabetes.importer import import_all
>>> data = import_all()
>>> patient_id = 802
>>> intermediate = analytics.analyse_subject(data, patient_id)
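The minimum-sample check that makes analyse_subject return None can be illustrated without the package. The following is a minimal sketch, not the package's implementation: the name analyse_subject_sketch, the plain-list data format, and the returned keys are all hypothetical, and subjects are processed serially here rather than through a ProcessPoolExecutor.

```python
from typing import Dict, List, Optional

MIN_ANALYSIS_SAMPLES = 5  # same value as DefaultConfig.min_analysis_samples


def analyse_subject_sketch(
    data: Dict[str, List[float]],
    patient_id: int,
    min_samples: int = MIN_ANALYSIS_SAMPLES,
) -> Optional[dict]:
    """Return None when either resource has too few measurements."""
    for resource in ("Creatinine", "Hba1C"):
        if len(data.get(resource, [])) < min_samples:
            return None  # the caller skips this subject
    # Placeholder for the real regression and CKD computation.
    return {"patient_id": patient_id, "n_Creatinine": len(data["Creatinine"])}


subjects = {
    1: {"Creatinine": [80.0] * 6, "Hba1C": [7.1] * 5},
    2: {"Creatinine": [95.0] * 2, "Hba1C": [6.8] * 9},  # too few Creatinine rows
}
outcomes = [analyse_subject_sketch(rows, pid) for pid, rows in subjects.items()]
# Drop the None entries, as a caller collecting results would need to do.
analysed = [outcome for outcome in outcomes if outcome is not None]
print([outcome["patient_id"] for outcome in analysed])
```

A caller that collects results from many subjects needs to filter out the None values in the same way before building the results tables.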
hku_diabetes.analytics.dropna(data: Dict[str, pandas.core.frame.DataFrame])
Calls dropna on all DataFrames in the data dictionary.
Parameters: data – A dictionary containing at least Creatinine, Hba1C, and Demographics as DataFrames.
Example
>>> from hku_diabetes import analytics
>>> from hku_diabetes.importer import import_all
>>> data = import_all()
>>> analytics.dropna(data)
hku_diabetes.analytics.evaluate_eGFR(data: Dict[str, pandas.core.frame.DataFrame])
Evaluates the eGFR value for each row of the Creatinine DataFrame.
This function takes the Sex and DOB from the Demographics DataFrame for each patient, and computes the Age of the patient at the time of each Creatinine measurement. It uses the referenced eGFR formula, assuming that no subject is African. The computed eGFR values are inserted for all rows of the Creatinine DataFrame.
Reference: http://www.sydpath.stvincents.com.au/tests/ChemFrames/MDRDBody.htm
Parameters: data – A dictionary containing at least Creatinine, Hba1C, and Demographics as DataFrames.
Example
>>> from hku_diabetes import analytics
>>> from hku_diabetes.importer import import_all
>>> data = import_all()
>>> analytics.evaluate_eGFR(data)
>>> print(data['Creatinine']['eGFR'])
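The per-row computation can be sketched in plain Python. This is an illustration under stated assumptions, not the package's code: it uses the four-variable MDRD equation with the IDMS-traceable coefficient 175 (the earlier form of the equation uses 186, so the referenced page should be checked for the constant intended), assumes creatinine is recorded in µmol/L and converts it to mg/dL by dividing by 88.4, and omits the ethnicity factor. The names age_at and mdrd_egfr are hypothetical.

```python
from datetime import date


def age_at(dob: date, measured: date) -> float:
    """Age in years at the time of a measurement."""
    return (measured - dob).days / 365.25


def mdrd_egfr(creatinine_umol_per_l: float, age_years: float, is_female: bool) -> float:
    """Four-variable MDRD estimate in mL/min/1.73 m^2, without the ethnicity factor."""
    creatinine_mg_per_dl = creatinine_umol_per_l / 88.4  # assumed input unit: µmol/L
    egfr = 175.0 * creatinine_mg_per_dl ** -1.154 * age_years ** -0.203
    if is_female:
        egfr *= 0.742  # sex correction factor of the MDRD equation
    return egfr


age = age_at(date(1960, 1, 1), date(2010, 1, 1))
male = mdrd_egfr(88.4, age, is_female=False)
female = mdrd_egfr(88.4, age, is_female=True)
print(round(male, 1), round(female, 1))
```

Because age enters the formula, the same creatinine value gives a different eGFR at different measurement dates, which is why the age is evaluated per row rather than once per patient.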
hku_diabetes.analytics.find_time_range(Creatinine_time: numpy.ndarray, Hba1C_time: numpy.ndarray, config: Type[hku_diabetes.config.DefaultConfig] = <class 'hku_diabetes.config.DefaultConfig'>) → numpy.ndarray
Finds the longest possible overlapping time range between Creatinine and Hba1C.
Parameters: - Creatinine_time – Array of Creatinine datetime as Matplotlib dates.
- Hba1C_time – Array of Hba1C datetime as Matplotlib dates.
- config – Configuration class, defaults to DefaultConfig.
Returns: An array of the longest possible overlapping datetime as Matplotlib dates.
Example
>>> from matplotlib.dates import date2num
>>> from hku_diabetes import analytics
>>> from hku_diabetes.importer import import_all
>>> data = import_all()
>>> patient_id = 802
>>> Creatinine = data['Creatinine'].loc[[patient_id]]
>>> Hba1C = data['Hba1C'].loc[[patient_id]]
>>> Creatinine_time = date2num(Creatinine['Datetime'])
>>> Hba1C_time = date2num(Hba1C['Datetime'])
>>> time_range = analytics.find_time_range(Creatinine_time, Hba1C_time)
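One way to picture the overlap is as the interval from the later of the two first measurements to the earlier of the two last measurements. The sketch below works on plain floats (Matplotlib dates are floating-point day counts) and is only a guess at the behaviour: the name overlap_range is hypothetical, and returning evenly spaced points, with the count taken from a value like config.interpolation_samples, is an assumption rather than something the documentation states.

```python
from typing import List, Sequence


def overlap_range(
    a_time: Sequence[float], b_time: Sequence[float], samples: int = 100
) -> List[float]:
    """Evenly spaced points across the time span covered by both series."""
    if samples < 2:
        raise ValueError("samples must be at least 2")
    start = max(min(a_time), min(b_time))  # later of the two first measurements
    end = min(max(a_time), max(b_time))    # earlier of the two last measurements
    if start > end:
        raise ValueError("the two series do not overlap in time")
    step = (end - start) / (samples - 1)
    return [start + i * step for i in range(samples)]


creatinine_days = [10, 40, 100, 200]
hba1c_days = [30, 90, 150]
time_range = overlap_range(creatinine_days, hba1c_days, samples=5)
print(time_range)  # [30.0, 60.0, 90.0, 120.0, 150.0]
```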
hku_diabetes.analytics.intersect(data: Dict[str, pandas.core.frame.DataFrame])
Finds the intersection of unique patients across all DataFrames.
Parameters: data – A dictionary containing at least Creatinine, Hba1C, and Demographics as DataFrames.
Example
>>> from hku_diabetes import analytics
>>> from hku_diabetes.importer import import_all
>>> data = import_all()
>>> analytics.intersect(data)
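The idea can be shown with plain Python sets. This is a sketch only: intersect_sketch and the (patient_id, value) row format are hypothetical stand-ins for the DataFrames, and updating the dictionary in place is an assumption suggested by the example above, which does not use a return value.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, float]  # (patient_id, value), a stand-in for a DataFrame row


def intersect_sketch(data: Dict[str, List[Row]]) -> None:
    """Keep, in place, only rows whose patient_id appears in every resource."""
    id_sets = [{patient_id for patient_id, _ in rows} for rows in data.values()]
    common = set.intersection(*id_sets)
    for name, rows in data.items():
        # Reassigning an existing key does not change the dictionary's size,
        # so it is safe to do while iterating over items().
        data[name] = [row for row in rows if row[0] in common]


data = {
    "Creatinine": [(1, 80.0), (2, 95.0), (2, 97.0), (3, 70.0)],
    "Hba1C": [(2, 6.8), (3, 7.2), (4, 8.0)],
    "Demographics": [(1, 1950.0), (2, 1948.0), (3, 1961.0)],
}
intersect_sketch(data)
print(sorted({patient_id for patient_id, _ in data["Creatinine"]}))  # [2, 3]
```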
hku_diabetes.analytics.remove_duplicate(resource: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame
Removes duplicate measurements taken at the same datetime.
For some reason, more than one entry can be recorded at the same date and time but with different values. This was observed for both Creatinine and Hba1C. This function finds such entries and keeps only the first record.
Parameters: resource – A DataFrame of the resource from which duplicates are removed.
Returns: A DataFrame with duplicates removed.
Example
>>> from hku_diabetes import analytics
>>> from hku_diabetes.importer import import_all
>>> data = import_all()
>>> patient_id = 802
>>> Creatinine = data['Creatinine'].loc[[patient_id]]
>>> Creatinine = analytics.remove_duplicate(Creatinine)
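The keep-first rule can be sketched on plain tuples. This is an illustration rather than the package's code: remove_duplicate_sketch and the (datetime string, value) row format are hypothetical. On a real DataFrame, pandas offers DataFrame.drop_duplicates with keep='first' for the same purpose, though whether the package uses it is not stated here.

```python
from typing import List, Tuple

Measurement = Tuple[str, float]  # (ISO datetime string, measured value)


def remove_duplicate_sketch(rows: List[Measurement]) -> List[Measurement]:
    """Keep only the first record for each datetime, preserving order."""
    seen = set()
    kept = []
    for when, value in rows:
        if when in seen:
            continue  # a later record at an already-seen datetime is dropped
        seen.add(when)
        kept.append((when, value))
    return kept


rows = [
    ("2010-03-01T09:30", 82.0),
    ("2010-03-01T09:30", 85.0),  # same datetime, different value
    ("2010-06-12T14:00", 90.0),
]
print(remove_duplicate_sketch(rows))
```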
hku_diabetes.config module
Configuration classes controlling module behaviours.
class hku_diabetes.config.DefaultConfig
Bases: object
Default configuration used by all module classes and functions.
This is the default configuration class defining all default parameters. All classes and functions of the module default to this class whenever they accept a config keyword parameter. Extend from this class to create your own configuration class.
- Hba1C_color = 'tab:blue' – The colour of the Hba1C axis and line.
- ckd_thresholds = (15, 30, 45, 60, 90) – The eGFR threshold values of CKD classifications.
- data_file_extensions = ('LIS.xls', 'DRG.xls', 'DX.xls', 'PX.xls') – The file name endings and extensions of data files that contain actual data.
- eGFR_color = 'tab:red' – The colour of the eGFR axis and line.
- eGFR_low_pass = '90d' – The period of the eGFR low-pass filter. All measurements within the same period are averaged to one measurement.
- interpolation_samples = 100 – The number of samples to be interpolated in interpolated plots.
- min_analysis_samples = 5 – The minimum number of Creatinine and Hba1C measurements required for each patient. A patient is skipped if the number of measurements is less than this.
- plot_modes = ['regression_distributions', 'regression', 'cumulative', 'low_pass', 'interpolated', 'raw'] – The types of plots to be created.
- plot_path = 'output/plots' – The path for exporting plot PDFs.
- plot_samples = 1000 – The number of patients to be plotted for each plot mode.
- processed_data_path = 'processed_data' – The path for storing processed data.
- raw_data_path = 'raw_data' – The path for importing raw data.
- required_resources = ['Creatinine', 'Hba1C', 'Medication', 'Diagnosis', 'Procedure', 'HDL', 'LDL'] – The resources to be loaded by importer.
- results_path = 'output/results' – The path for exporting results CSV and intermediate pickles.
- t_test_mean = {'intercept': 100, 'pvalue': 0.5, 'rvalue': 0, 'slope': 0, 'stderr': 0} – The Gaussian mean of the null hypothesis of the one-sample t-test.
- test_samples = 10 – The number of samples analysed by the analytics module.
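Extending the default configuration amounts to subclassing it and overriding only the attributes that differ. The sketch below uses a stand-in DefaultConfig with a few values copied from the listing above so that it runs without the package; MyConfig and describe are hypothetical names. With the package installed, the subclass would instead derive from hku_diabetes.config.DefaultConfig and be passed by keyword, for example Analyser(config=MyConfig).

```python
from typing import Tuple, Type


class DefaultConfig:
    """Stand-in holding a few attributes copied from the documented defaults."""
    plot_modes = ['regression_distributions', 'regression', 'cumulative',
                  'low_pass', 'interpolated', 'raw']
    min_analysis_samples = 5
    results_path = 'output/results'


class MyConfig(DefaultConfig):
    """Hypothetical custom configuration: override only what differs."""
    min_analysis_samples = 10
    results_path = 'my_output/results'


def describe(config: Type[DefaultConfig] = DefaultConfig) -> Tuple[int, str, int]:
    # The class itself is passed, not an instance, matching the
    # Type[DefaultConfig] annotations in the documented signatures.
    return config.min_analysis_samples, config.results_path, len(config.plot_modes)


print(describe())          # (5, 'output/results', 6)
print(describe(MyConfig))  # (10, 'my_output/results', 6)
```

Attributes that are not overridden, such as plot_modes here, are inherited unchanged from the base class.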
class hku_diabetes.config.RunConfig
Bases: hku_diabetes.config.DefaultConfig
Configuration used for running the full data analytics.
- plot_modes = ['regression_distributions'] – The types of plots to be created. As it takes a lot of time to generate all the raw plots, only the regression distributions are plotted.
- required_resources = ['Creatinine', 'Hba1C'] – The resources to be loaded by importer. As the current analytics only supports Creatinine and Hba1C, there is no need to load the other resources.
class hku_diabetes.config.TestConfig
Bases: hku_diabetes.config.DefaultConfig
Configuration used for development and testing.
- plot_path = 'test_output/plots' – The path for exporting plot PDFs.
- plot_samples = 5 – The number of patients to be plotted for each plot mode. Plotting fewer patients speeds up testing.
- processed_data_path = 'test_processed_data' – The path for storing processed data.
- required_resources = ['Creatinine', 'Hba1C'] – The resources to be loaded by importer. As the current analytics only supports Creatinine and Hba1C, there is no need to load the other resources.
- results_path = 'test_output/results' – The path for exporting results CSV and intermediate pickles.
- test_samples = 10 – The number of samples analysed by the analytics module. This allows faster testing, as there is no need to analyse all the data.
hku_diabetes.importer module
Importer for importing resources from the data directory.
hku_diabetes.importer.import_all(config: Type[hku_diabetes.config.DefaultConfig] = <class 'hku_diabetes.config.DefaultConfig'>) → Dict[str, pandas.core.frame.DataFrame]
Imports all resources and returns a dictionary of resources.
It searches for sub-directories in config.raw_data_path and checks them against config.required_resources. If a resource is required, it first tries to import the CSV file with the same name in config.processed_data_path. If this fails, it searches all files within the resource directory and checks the file name ending and extension against config.data_file_extensions. After loading the data files, it saves all the data as a CSV file to be imported directly next time. It also calls data cleaning logic to convert the column names of the resources to cleaner names.
Parameters: config – Configuration class, defaults to DefaultConfig.
Returns: A dictionary containing all required resources as DataFrames.
Example
>>> from hku_diabetes.importer import import_all
>>> data = import_all()
hku_diabetes.importer.import_resource(resource_name: str, config: Type[hku_diabetes.config.DefaultConfig] = <class 'hku_diabetes.config.DefaultConfig'>) → pandas.core.frame.DataFrame
Imports one particular resource.
This function is a sub-routine called by import_all to import one particular resource. It first tries to import the CSV file with the same name in config.processed_data_path. If this fails, it searches all files within the resource directory and checks the file name ending and extension against config.data_file_extensions. After loading the data files, it saves all the data as a CSV file to be imported directly next time.
Parameters: - resource_name – The name of the resource to be loaded.
- config – Configuration class, defaults to DefaultConfig.
Returns: A DataFrame of the resource, with patient_id as index, and column names matching the raw data file.
Example
>>> from hku_diabetes.importer import import_resource
>>> resource = import_resource('Creatinine')
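The cache-then-fallback sequence described above can be sketched with the standard library alone. This is a simplified illustration, not the package's importer: import_resource_sketch and read_text_rows are hypothetical, rows are plain lists of strings instead of DataFrames, and the fake raw file holds comma-separated text rather than a real spreadsheet.

```python
import csv
import tempfile
from pathlib import Path
from typing import Callable, List

DATA_FILE_EXTENSIONS = ('LIS.xls', 'DRG.xls', 'DX.xls', 'PX.xls')
Rows = List[List[str]]


def import_resource_sketch(name: str, raw_dir: Path, processed_dir: Path,
                           read_raw: Callable[[Path], Rows]) -> Rows:
    cache = processed_dir / f"{name}.csv"
    try:
        # Fast path: a CSV saved by an earlier import.
        with cache.open(newline="") as handle:
            return list(csv.reader(handle))
    except FileNotFoundError:
        pass
    # Slow path: read every raw file whose name ends with a known extension.
    rows: Rows = []
    for path in sorted((raw_dir / name).iterdir()):
        if path.name.endswith(DATA_FILE_EXTENSIONS):
            rows.extend(read_raw(path))
    # Save the combined rows so that the next call takes the fast path.
    processed_dir.mkdir(parents=True, exist_ok=True)
    with cache.open("w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return rows


calls: List[str] = []


def read_text_rows(path: Path) -> Rows:
    """Hypothetical raw-file parser; records which files it was asked to read."""
    calls.append(path.name)
    return [line.split(",") for line in path.read_text().splitlines()]


with tempfile.TemporaryDirectory() as tmp:
    raw, processed = Path(tmp) / "raw_data", Path(tmp) / "processed_data"
    (raw / "Creatinine").mkdir(parents=True)
    (raw / "Creatinine" / "batch1_LIS.xls").write_text("1,80\n2,95\n")
    (raw / "Creatinine" / "notes.txt").write_text("ignored\n")
    first = import_resource_sketch("Creatinine", raw, processed, read_text_rows)
    second = import_resource_sketch("Creatinine", raw, processed, read_text_rows)

print(first == second, calls)
```

The second call never touches the raw directory, which is the point of saving the processed CSV: the raw files are parsed once and the cached copy is reused afterwards.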
hku_diabetes.plot module
Data and results visualisation.
hku_diabetes.plot.plot_all(analyser: hku_diabetes.analytics.Analyser)
Plots all required PDFs.
This calls the plot_one function and plots all the required PDFs specified by analyser.config.plot_modes.
Parameters: analyser – An instance of the Analyser class with intermediate data available.
Example
>>> from hku_diabetes.analytics import Analyser
>>> from hku_diabetes.plot import plot_all
>>> analyser = Analyser()
>>> results = analyser.load()
>>> plot_all(analyser)
hku_diabetes.plot.plot_one(analyser: hku_diabetes.analytics.Analyser, mode: str)
Plots one PDF according to the required mode.
This calls the corresponding private plot function and plots the required PDF. A dot is printed to the terminal every 10 figures to indicate that something is happening.
Parameters: - analyser – An instance of the Analyser class with intermediate data available.
- mode – The plot mode required.
Example
>>> from hku_diabetes.analytics import Analyser
>>> from hku_diabetes.plot import plot_one
>>> analyser = Analyser()
>>> results = analyser.load()
>>> plot_one(analyser, 'raw')
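The dispatch-by-mode and progress-dot behaviour can be sketched as follows. Everything here is hypothetical and stands in for the package's private plot functions: the PLOTTERS table, the _plot_raw and _plot_regression helpers, and plot_one_sketch itself, which returns strings instead of drawing figures.

```python
import io
import sys
from typing import Callable, Dict, Iterable, List, TextIO


def _plot_raw(item: int) -> str:
    return f"raw:{item}"


def _plot_regression(item: int) -> str:
    return f"regression:{item}"


# Maps each plot mode to the private function that draws it.
PLOTTERS: Dict[str, Callable[[int], str]] = {
    "raw": _plot_raw,
    "regression": _plot_regression,
}


def plot_one_sketch(items: Iterable[int], mode: str, out: TextIO = sys.stdout) -> List[str]:
    try:
        plotter = PLOTTERS[mode]
    except KeyError:
        raise ValueError(f"unknown plot mode: {mode!r}") from None
    figures = []
    for index, item in enumerate(items, start=1):
        figures.append(plotter(item))
        if index % 10 == 0:
            out.write(".")  # one dot per 10 figures as a sign of progress
            out.flush()
    out.write("\n")
    return figures


buffer = io.StringIO()
figures = plot_one_sketch(range(25), "raw", out=buffer)
progress = buffer.getvalue()
print(repr(progress), len(figures))
```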