#3 Feature Extraction: Dates From Data.

The following is part of my text mining project log, completed during the Applied Text Mining in Python course on Coursera.


Premise

This time around, I was given messy (i.e., unstructured) medical data and was tasked with extracting relevant information from it using Python’s regular expression library. The “relevant information” here refers to the many date formats mentioned in each medical note (there are 500 notes in total).

Extracting just one or two formats would’ve been pretty easy. But since the data comes from the real world, there are more than just a “few” date formats.

Here’s a list of the date format variants used:

  • 04/20/2009; 04/20/09; 4/20/09; 4/3/09
  • Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
  • 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
  • Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
  • Feb 2009; Sep 2009; Oct 2010
  • 6/2008; 12/2009
  • 2009; 2010

Because of all this variation, the medical notes are out of chronological order (i.e., a note from 1998 may be placed after one from 2008). To fix this, the first part of the task is to identify and standardize all dates into an interoperable form. The assignment outlined some sensible ground rules:

  • Assume all dates in xx/xx/xx format are mm/dd/yy
  • Assume all dates where the year is encoded in only two digits are years from the 1900s (e.g. 1/5/89 is January 5th, 1989)
  • If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009).
  • If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010).
  • Watch out for potential typos as this is a raw, real-life derived dataset.

The second part, after I’ve successfully interpreted all 500 records, is to sort them in chronological order and return a pandas Series containing the new order of indices.

For example, if the original series was this:

0    1999
1    2010
2    1978
3    2015
4    1985

The output should be:

0    2
1    4
2    0
3    1
4    3

Sounds simple enough: my function should return a pandas Series of length 500, consisting of index numbers and of data type int.


GOAL

Identify All Of The Different Date Variants Encoded, Properly Normalize And Sort The Dates Chronologically


Data Sources Used

Dates.txt: This contains a list of 500 unstructured medical notes. Each note contains patient info and dates, in the many forms discussed above, that need to be extracted (it’s referred to as docs in the code).


Methodology

Part 1: Importing Libraries
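The original import cell isn’t reproduced in this log, but given what the rest of the code relies on, it would have looked something like this (a reconstruction, not the exact original):

```python
# Imports assumed from the rest of the log: regex matching, month names
# for building the pattern, datetime/dateutil for parsing and sorting,
# and pandas for the final Series.
import re
from calendar import month_name
from datetime import datetime

import pandas as pd
from dateutil import parser
```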

Part 2: Date Extraction

For this feature extraction, I came up with the following regular expression.

pattern_dates = r'\d{1,2}\/\d{1,2}\/\d{2,4}|\d{1,2}\-\d{1,2}\-\d{2,4}|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\-\d{1,2}\-\d{4}|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*[,.]? \d{2}[a-z]*,? \d{4}|\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z,.]* \d{4}|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]{2}[,.]* \d{4}|'+"[,.]? \d{4}|".join(month_name[1:])+"[,.]? \d{4}"+r'|\d{1,2}\/\d{4}|\d{4}'
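As a quick sanity check, the pattern can be exercised on a few made-up strings (my own examples, not from the dataset) covering the main variants:

```python
import re
from calendar import month_name

# The pattern from above, reproduced verbatim.
pattern_dates = r'\d{1,2}\/\d{1,2}\/\d{2,4}|\d{1,2}\-\d{1,2}\-\d{2,4}|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\-\d{1,2}\-\d{4}|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*[,.]? \d{2}[a-z]*,? \d{4}|\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z,.]* \d{4}|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]{2}[,.]* \d{4}|'+"[,.]? \d{4}|".join(month_name[1:])+"[,.]? \d{4}"+r'|\d{1,2}\/\d{4}|\d{4}'

samples = [
    "Lab results from 04/20/2009 were reviewed",
    "Discharged 20 Mar 2009 in stable condition",
    "Follow-up scheduled Mar-20-2009",
]
for s in samples:
    print(re.findall(pattern_dates, s))
# ['04/20/2009']
# ['20 Mar 2009']
# ['Mar-20-2009']
```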

Though it worked well for 99% of the entries, it did have its own Achilles’ heel. Consider the following entry (no. 271):

.Spoke to sister Naomi Ely 708-810-7787 who reports he has been doing much better since he went to Dysart Clinic (he was drinking for a month leading up to this, his ammonia was high, and physicians were worried about early). She feels his cognition is back to baseline, "100% better". She says he has been successful in abstaining from substances as far as she knows, thinks a schedule is useful to him, doctor's appts etc. Notes that he returned from LA in August 2008, gets bouts of "exhaustion" even in sobriety. She denies ever witnessing any periods of manic behavior from patient. Their father has dementia that started at age 84. Notes patient is living with uncle in Black River Falls (uncle is 89), lived with sister 3 months who also takes care of her own father in Talladega. She knows he is working on getting social security, subsidizing housing. Stable situation with patient's girlfriend Nutt.Suicidal Behavior Hx of Suicidal Behavior: No

Although it correctly detected August 2008 as a date, it also recognized 7787 (part of the phone number) as a valid year, which it isn’t. Since this was a one-off occurrence (an outlier, if you will), I decided to fix it manually with a single line of code.
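The exact line from my notebook isn’t reproduced here, but it amounted to something like the following (the variable name and structure are assumptions for illustration, not the original code):

```python
# Hypothetical stand-in for what the regex returned for note no. 271:
# the phone-number fragment '7787' plus the genuine date.
matches_271 = ['7787', 'August 2008']

# The one-off manual fix: drop the spurious phone-number fragment.
matches_271.remove('7787')
print(matches_271)  # ['August 2008']
```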

Now, I know. This hack-and-slash approach is highly discouraged, and the preferred way would’ve been to come up with a more generalized algorithm. But cut me some slack, okay? I genuinely tried, but I wasn’t able to come up with any regular expression that would specifically weed this out without touching the other entries.

For the record, I did try to only accept years that start with 1 or 2 (i.e., 1967 or 2007, but not 7787), but that had its own unintended consequence. Consider this:

7-8-77

Here, the number 77 refers to the year 1977, a valid date. The notes are filled with offhand references like these, so I cannot weed those out either.

Part 3: Date Normalization

This part was simple enough, especially since the assignment specifically outlined the rules for normalization. All I did here was convert them to Python code.
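My actual normalization code isn’t shown in this log, but the rules translate fairly directly. Here’s a minimal sketch using dateutil’s parser (the helper name and the two-digit-year fix-up are my own, not the original code):

```python
import re
from datetime import datetime
from dateutil import parser

# default supplies whatever the string omits: day -> 1st of the month,
# month -> January, matching the assignment's fill-in rules.
DEFAULT = datetime(1900, 1, 1)

def normalize(raw):
    """Hypothetical helper: parse one extracted date string per the rules."""
    dt = parser.parse(raw, default=DEFAULT)
    # Rule: two-digit years belong to the 1900s. dateutil can map a date
    # like '04/20/09' into the 2000s, so pull those back a century.
    if re.search(r'\d{1,2}/\d{1,2}/\d{2}$', raw) and dt.year >= 2000:
        dt = dt.replace(year=dt.year - 100)
    return dt

print(normalize('9/2009'))   # 2009-09-01 00:00:00 (missing day -> 1st)
print(normalize('2010'))     # 2010-01-01 00:00:00 (missing month -> January)
print(normalize('1/5/89'))   # 1989-01-05 00:00:00 (two-digit year -> 1900s)
```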

Part 4: Date Sorting

Once I obtained the normalized date_list, it was only a matter of chronologically arranging the entries (and with them, their indices) using Python’s sorted() function, with the datetime object as the key.
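Using the five-entry example from the premise above, the sorting step can be sketched like this (variable names are my own):

```python
from datetime import datetime

import pandas as pd

# Toy stand-in for the normalized date_list, in original note order
# (years 1999, 2010, 1978, 2015, 1985, as in the example above).
date_list = [datetime(1999, 1, 1), datetime(2010, 1, 1),
             datetime(1978, 1, 1), datetime(2015, 1, 1),
             datetime(1985, 1, 1)]

# Sort the original indices by their corresponding datetime.
order = sorted(range(len(date_list)), key=lambda i: date_list[i])
result = pd.Series(order, dtype=int)
print(result.tolist())  # [2, 4, 0, 1, 3]
```

This reproduces the expected output shown earlier: the oldest note (index 2, from 1978) comes first, the newest (index 3, from 2015) last.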

The full code in its entirety can be found here.


What I’ve Learned

Feature extraction, especially in Python, was easier than I thought. Of course, the initial learning curve of regular expressions is pretty steep. I failed on that front more times than I’d like to admit, and I still feel like I’ve barely understood them. But hey, it’s a start.

The calendar, datetime, and dateutil libraries were very handy in this assignment, and I’m glad I stumbled upon them; the dateutil.parser.parse() function was particularly useful.

The crown jewel of this project has to be the myriad date formats, and the endless loops I went through to come up with an (almost) valid regex filter. Of course, as I stated above, the best approach would’ve been to avoid manual filtering in favor of a more general algorithm, and that’s probably what I’ll work on in my upcoming project logs.


About Me!

An aspiring data scientist with a great interest in machine learning and its applications. I post my work here in the hope of improving over time.