I have a column of semi-structured text (datatype varchar(7000)). The text contains recurring patterns, and I want to split each value wherever the pattern occurs. For instance:
Row 1
Age:
78
Height:
178cm
Comments:
Likes the following: Ice cream, pickles, artichokes.
Row 2
Age:
12
E-mail:
123234345@mildew.com
Comments:
Visited the monkeys at the zoo this week.
Row 3
Height:
173cm
Weight:
85kg
I want the data (initially at least) to be split into two columns, [question] and [answer], which would look like this:
row_id  question  answer
1       Age       78
1       Height    178cm
1       Comments  Likes the following: Ice cream, pickles, artichokes.
2       Age       12
2       E-mail    123234345@mildew.com
2       Comments  Visited the monkeys at the zoo this week.
3       Height    173cm
3       Weight    85kg
So, as I see it, the delimiter also becomes a value in the question column. The pattern for the delimiter is:
char(10) + [any string of characters excluding char(10)] + ':' + char(10)
One thing I'm struggling with is how to express [any string of characters excluding char(10)] in T-SQL.
Notes:
CLR/regex are not an option.
The same questions don't necessarily repeat in each row, and order of questions may not be predictable.
If it works, it may be applied to millions of rows.
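To make the delimiter pattern concrete, here is a minimal sketch of one possible set-based approach, not a tested solution. T-SQL's LIKE can match a single non-line-feed character with '[^' + CHAR(10) + ']', but it has no repetition operator, so the sketch avoids pattern matching altogether: it cuts each value into lines at CHAR(10) and treats any line ending in ':' as a question. The table variable @t, its columns row_id and txt, and the CTE names are hypothetical. It also assumes SQL Server 2012 or later (for LEAD), line breaks that are a bare CHAR(10) with no CHAR(13), and answers that fit on one line.

```sql
-- Hypothetical sample data; @t, row_id and txt are illustrative names only.
DECLARE @t TABLE (row_id int, txt varchar(7000));
INSERT INTO @t (row_id, txt) VALUES
 (1, 'Age:' + CHAR(10) + '78' + CHAR(10) + 'Height:' + CHAR(10) + '178cm'),
 (3, 'Height:' + CHAR(10) + '173cm' + CHAR(10) + 'Weight:' + CHAR(10) + '85kg');

WITH n AS (
    -- Ad hoc numbers 1..7001, enough for every position in a varchar(7000).
    SELECT TOP (7001) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS i
    FROM sys.all_objects AS a CROSS JOIN sys.all_objects AS b
),
lines AS (
    -- A line starts at position 1 or directly after a CHAR(10);
    -- it runs up to (not including) the next CHAR(10).
    SELECT t.row_id,
           n.i AS start_pos,
           SUBSTRING(t.txt, n.i,
                     CHARINDEX(CHAR(10), t.txt + CHAR(10), n.i) - n.i) AS line
    FROM @t AS t
    JOIN n ON n.i <= DATALENGTH(t.txt)
          AND (n.i = 1 OR SUBSTRING(t.txt, n.i - 1, 1) = CHAR(10))
),
paired AS (
    -- Attach the following line to each line within the same row.
    SELECT row_id, start_pos, line,
           LEAD(line) OVER (PARTITION BY row_id ORDER BY start_pos) AS next_line
    FROM lines
)
SELECT row_id,
       -- Drop the trailing ':'; NULLIF guards against empty lines.
       LEFT(line, NULLIF(LEN(line), 0) - 1) AS question,
       next_line AS answer
FROM paired
WHERE line LIKE '%:'          -- a line ending in ':' is taken to be a question
ORDER BY row_id, start_pos;
```

Two limitations to keep in mind: an answer line that itself ends in ':' would be misread as a question, and a multi-line answer would lose everything after its first line. For millions of rows, a persisted numbers table would probably be a better choice than the ad hoc one built from sys.all_objects.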
By way of background, the data comes into the database via a "message" format that could easily have been converted into relatively normalised tables/columns. For whatever reason the vendor decided to mangle the incoming messages into a format suitable for presentation instead of storage/retrieval... and once this is done, the incoming message is destroyed. Arrgh!!!! So now, in order to get meaning out of the "data" (I use the word loosely) we have to try and reverse engineer it. And it takes up at least 100 times more storage than it otherwise would.