Monday, March 19, 2012

Keyword Density/Count

Hi All,

let me try to explain what I'm trying to accomplish - I really hope someone
can help.

I have a table (tblArticles) which has the following:

vcrKeywords varchar(2000)
txtBody text(8000)
vcrType varchar(128)

and this is a sample of the data

vcrKeyWords || txtBody || vcrType
key1,key2,key4,key7,key9 || <snipped body|| Site5 News
key1,key3,key6,key8,key9 || <snipped body|| Site5 News
key1,key3,key4,key5,key9 || <snipped body|| Site5 News
key1,key2,key5,key7,key8 || <snipped body|| Site5 News

What I'm trying to accomplish is to return a keyword count based on the
content of vcrKeywords (i.e. each comma seperated entry as a count.

My SQL statement originally was :

SELECT vcrKeyWords, COUNT(vcrKeyWords) AS keycount FROM tblArticles
where vcrType LIKE 'site 5%' GROUP BY vcrKeyWords ORDER BY keycount DESC

However, this simply matches the entire vcrKeyword Column and not each comma
seperated value.

Is there a way in SQL that I can achieve this or do I need to use some kind
of scripting language to accomplish it...?

Regards,

Carl.Carl,

It would take you more time than it's worth to develop a procedure to
search for keywords on the fly in that mess. And once you do create
it, it's going to be really slow because it's not going to be able to
use any indexes. Let's look at a redesign.

Looks to me like you have a many to many relationship between Keywords
and Articles.

tbl_keywords:
keyword_id (unique-idx)
keyword_name (pk)

tbl_article:
article_id (unique-idx)
txtBody
vcrType

tbl_art_key:
keyword_id (Clust-PK)
article_id (Clust-PK)

Now, for your query above with the new design:
SELECT tk.keyword_name, count(tak.article_id)
FROM tbl_keyword tk JOIN tbl_art_key tak
ON tk.keyword_id = tak.keyword_id

Cheers,
Jason Lepack

On Apr 16, 5:15 am, "news.demon.co.uk" <c...@.ocvision.comwrote:

Quote:

Originally Posted by

Hi All,
>
let me try to explain what I'm trying to accomplish - I really hope someone
can help.
>
I have a table (tblArticles) which has the following:
>
vcrKeywords varchar(2000)
txtBody text(8000)
vcrType varchar(128)
>
and this is a sample of the data
>
vcrKeyWords || txtBody || vcrType
key1,key2,key4,key7,key9 || <snipped body|| Site5 News
key1,key3,key6,key8,key9 || <snipped body|| Site5 News
key1,key3,key4,key5,key9 || <snipped body|| Site5 News
key1,key2,key5,key7,key8 || <snipped body|| Site5 News
>
What I'm trying to accomplish is to return a keyword count based on the
content of vcrKeywords (i.e. each comma seperated entry as a count.
>
My SQL statement originally was :
>
SELECT vcrKeyWords, COUNT(vcrKeyWords) AS keycount FROM tblArticles
where vcrType LIKE 'site 5%' GROUP BY vcrKeyWords ORDER BY keycount DESC
>
However, this simply matches the entire vcrKeyword Column and not each comma
seperated value.
>
Is there a way in SQL that I can achieve this or do I need to use some kind
of scripting language to accomplish it...?
>
Regards,
>
Carl.

|||Ah, the joys of not proofreading...

Now, for your query above with the new design:
SELECT tk.keyword_name, count(tak.article_id) keycount
FROM tbl_keyword tk
JOIN tbl_art_key tak
ON tk.keyword_id = tak.keyword_id
JOIN tbl_article ta
ON ta.article_id = tak.article_id
WHERE ta.vcrType LIKE 'site 5%'
GROUP BY tk.keyword_name
ORDER BY keycount DESC

On Apr 16, 8:41 am, "Jason Lepack" <jlep...@.gmail.comwrote:

Quote:

Originally Posted by

Carl,
>
It would take you more time than it's worth to develop a procedure to
search for keywords on the fly in that mess. And once you do create
it, it's going to be really slow because it's not going to be able to
use any indexes. Let's look at a redesign.
>
Looks to me like you have a many to many relationship between Keywords
and Articles.
>
tbl_keywords:
keyword_id (unique-idx)
keyword_name (pk)
>
tbl_article:
article_id (unique-idx)
txtBody
vcrType
>
tbl_art_key:
keyword_id (Clust-PK)
article_id (Clust-PK)
>
Now, for your query above with the new design:
SELECT tk.keyword_name, count(tak.article_id)
FROM tbl_keyword tk JOIN tbl_art_key tak
ON tk.keyword_id = tak.keyword_id
>
Cheers,
Jason Lepack
>
On Apr 16, 5:15 am, "news.demon.co.uk" <c...@.ocvision.comwrote:
>
>
>

Quote:

Originally Posted by

Hi All,


>

Quote:

Originally Posted by

let me try to explain what I'm trying to accomplish - I really hope someone
can help.


>

Quote:

Originally Posted by

I have a table (tblArticles) which has the following:


>

Quote:

Originally Posted by

vcrKeywords varchar(2000)
txtBody text(8000)
vcrType varchar(128)


>

Quote:

Originally Posted by

and this is a sample of the data


>

Quote:

Originally Posted by

vcrKeyWords || txtBody || vcrType
key1,key2,key4,key7,key9 || <snipped body|| Site5 News
key1,key3,key6,key8,key9 || <snipped body|| Site5 News
key1,key3,key4,key5,key9 || <snipped body|| Site5 News
key1,key2,key5,key7,key8 || <snipped body|| Site5 News


>

Quote:

Originally Posted by

What I'm trying to accomplish is to return a keyword count based on the
content of vcrKeywords (i.e. each comma seperated entry as a count.


>

Quote:

Originally Posted by

My SQL statement originally was :


>

Quote:

Originally Posted by

SELECT vcrKeyWords, COUNT(vcrKeyWords) AS keycount FROM tblArticles
where vcrType LIKE 'site 5%' GROUP BY vcrKeyWords ORDER BY keycount DESC


>

Quote:

Originally Posted by

However, this simply matches the entire vcrKeyword Column and not each comma
seperated value.


>

Quote:

Originally Posted by

Is there a way in SQL that I can achieve this or do I need to use some kind
of scripting language to accomplish it...?


>

Quote:

Originally Posted by

Regards,


>

Quote:

Originally Posted by

Carl.- Hide quoted text -


>
- Show quoted text -

|||I agree with Jason that normalizing the design of the table for keywords
will be most beneficial. It adds so much power to what you can do with those
keywords.

A couple notes that may help too, especially if you are not in a position to
change table design:

- Scripting and client side languages have a very good support to deal with
lists and arrays. Most of them implement some sort of "split" function where
passing a list and delimiter as parameters will give you an array that will
be much easier to deal with. Plus they do it fast.

- Erland Sommarskog has excellent articles on Arrays and Lists in SQL Server
that can help you to normalize the existing data and use it more
efficiently:
http://www.sommarskog.se/arrays-in-sql-2005.html
http://www.sommarskog.se/arrays-in-sql-2000.html
- If you just need the count of keywords and your keyword delimiter is
always a comma, then you can write something like this to get the count:

SELECT vcrKeywords,
CASE
WHEN LEN(vcrKeyWords) = 0 THEN 0
ELSE
LEN(vcrKeyWords) - LEN(REPLACE(vcrKeyWords, ',', '')) + 1
END AS keycount
FROM tblArticles
WHERE vcrType LIKE 'site 5%'
GROUP BY vcrKeyWords
ORDER BY keycount DESC

Not sure if your keywords column is a blank string or NULL when there are no
keywords associated, so you can tune up the first CASE condition accordingly
to handle the case when there are 0 keywords.

HTH,

Plamen Ratchev
http://www.SQLStudio.com|||Oops, I just did a copy/paste on the bottom part of your query... Not sure
if you need the GROUP BY at all. If the intent is to summarize counts and
there are multiple occurrences of the same set of keywords, then you can
just add SUM for the CASE expression.

Plamen Ratchev
http://www.SQLStudio.com|||

Quote:

Originally Posted by

Is there a way in SQL that I can achieve this or do I need to use
some kind of scripting language to accomplish it...?
>
Regards,
>


Hi All,

Thanks for the pointers - I'd kind of figured it was too much hassle to be
worth it and unfortunately I've inheritied this database which is 3+ years
old and has *a lot* of content in it, so redesigning the schema would be a
huge undertaking.

I'm already using FTS Contains Clause to get the data out, so I'm going to
have to make do with that for now...

Thanks anyway,

Carl.|||news.demon.co.uk wrote:

Quote:

Originally Posted by

Thanks for the pointers - I'd kind of figured it was too much hassle to be
worth it and unfortunately I've inheritied this database which is 3+ years
old and has *a lot* of content in it, so redesigning the schema would be a
huge undertaking.


The question is not how much content, but how many tables and how many
things looking at those tables. (And if you can create views based on
the new tables that look like the old ones, then you can avoid having
to redesign the things-looking-at-them right away.)

No comments:

Post a Comment