With our Mini, I have been toying with some ways to search for substrings. What I'm finding is that dashes, dots, etc. don't allow for a substring match. However, there is one interesting observation that I have to note here.
The first special character is dropped.
That's right, be it a dash or a dot, the first one is dropped and ignored. Even when searching, the same behaviour occurs. What does this mean? How can I use it to my advantage?
Well, let's take a look at the UB181 example. If I were to make 2 variations of that in our "Additional Keywords" field (only shown to our Mini), simply making ub-181dz and ub181-dz (adding a dash where a letter meets a number, which is a simple regular expression) will not allow these to match a search for 181dz or ub181. Instead, I'd have to make ub-181-dz for a match of ub181 and u-b-181dz for a match of 181dz. That seems like an odd way to have to handle this, but again, a couple of simple regular expressions can make this happen.
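To sanity-check those four keyword variations, here is a toy model of what I'm observing. It is only a sketch of the behaviour described above, not how the Mini actually tokenizes anything. The function name `mini_tokens` and the assumption that only dashes and dots count as special characters are made up for illustration.

```python
import re

def mini_tokens(keyword):
    """Toy model of the observed behaviour (an assumption, not Google's code):
    the first dash or dot is dropped, the remaining ones split the keyword."""
    # Drop only the first special character.
    dropped_first = re.sub(r"[-.]", "", keyword.lower(), count=1)
    # Every remaining dash or dot acts as a word break.
    return [token for token in re.split(r"[-.]", dropped_first) if token]

for variant in ("ub-181dz", "ub181-dz", "ub-181-dz", "u-b-181dz"):
    print(variant, "->", mini_tokens(variant))
```

Under this model only ub-181-dz yields the token ub181 and only u-b-181dz yields 181dz, which lines up with what the searches are doing.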
I'm going to give this a shot and see how much more relevant I can make the searches. Most of the searches I notice are people searching for the ub181 example, so the ub-181-dz style would probably work well. I see similar searches for other types of tools, so this is probably something I can easily do globally to all of our SKUs for the optimal relevance. I may even feed that data to the big G and see how they like it. I can't see it hurting much of anything.
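A minimal sketch of doing that globally, assuming the SKUs are plain strings of letters, digits and the occasional dash. The helper names `dash_boundaries` and `keyword_variants` are hypothetical, and the real version would live wherever the "Additional Keywords" field gets populated.

```python
import re

def dash_boundaries(sku):
    """Insert a dash wherever a letter meets a digit, in either direction."""
    sku = sku.lower()
    sku = re.sub(r"([a-z])([0-9])", r"\1-\2", sku)   # letter followed by digit
    return re.sub(r"([0-9])([a-z])", r"\1-\2", sku)  # digit followed by letter

def keyword_variants(skus):
    """Map each SKU to its dashed keyword, ready to store alongside the product."""
    return {sku: dash_boundaries(sku) for sku in skus}

print(keyword_variants(["UB181DZ", "HG1100", "GV5000"]))
```

Two passes are used instead of one clever pattern so that the same two substitutions are easy to port to PHP or Perl.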
As one of the early adopters of developing eCommerce sites (first one in '96), I've seen a lot of changes on the web. Now working at an engineering company that produces cutting-edge electronics you've never heard of, I'm tackling some interesting issues in PHP, MySQL and Perl. Join my adventures in development and systems administration right here at my blog.
Thursday, October 27, 2005
Monday, October 24, 2005
Substring limitations
So I have some major beefs with the Mini and how Google's algo handles stemming / substring matches. This just isn't working right.
For example, one of our products is the "UB181DZ". People commonly search for the "UB181", since this is the base model. Well, the default results were 0 matches. That's not good, especially when we have the UB181DZ in stock, along with the DZK and the DZK-2 (we "invented" these kits by customer demand.)
So, a simple fix is to pre-fetch substring matches of any SKUs. I'm doing a "SELECT sku FROM products WHERE sku like '%$searchstring%';" and sending the search as "ub181 OR ub181dz OR ub181dzk OR ub181dzk-2" for now, but that's just not a good long-term solution. It also doesn't work when they search for multiple words. I'd have to make some funky syntax to get that working, and I'm feeling lazy right now.
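Here is a rough sketch of that pre-fetch for a single search term. It uses an in-memory SQLite table standing in for the real products table, so the schema and the function name `expand_sku_query` are assumptions for illustration. It also binds the search string as a parameter rather than pasting it into the SQL, which avoids quoting problems with whatever people type into the search box.

```python
import sqlite3

def expand_sku_query(conn, term):
    """Expand one search term into 'term OR sku1 OR sku2 ...' using substring
    matches against the sku column (single-word searches only)."""
    rows = conn.execute(
        "SELECT sku FROM products WHERE sku LIKE ? ORDER BY sku",
        (f"%{term}%",),  # bound parameter, never interpolated into the SQL
    ).fetchall()
    term = term.lower()
    matches = [sku.lower() for (sku,) in rows if sku.lower() != term]
    return " OR ".join([term] + matches)

# Hypothetical stand-in for the real products table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (sku TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?)",
    [("UB181DZ",), ("UB181DZK",), ("UB181DZK-2",), ("HG1100",)],
)
print(expand_sku_query(conn, "ub181"))
```

Note that a literal % or _ in the search term would still act as a LIKE wildcard here; escaping those is left out to keep the sketch short.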
As an experiment, I thought we could try dashes. The first attempt was with our HG1100. I had an additional keyword (shown only to our mini) added of "h-g-1-1-0-0". Well, that didn't work for a search on makita hg110 or makita 1100. Also, we have a match for makita 1,100 (it's even the right product since it heats up to 1100 degrees), but it still doesn't show up. I have entered a secondary test now of h.g.1.1.0.0. to see if that works at all. I'm not holding my breath.
This really makes me wonder how many searches at google.com don't work properly (or any of the engines) because they don't do partial matches and stem properly. I'm sure people search on our site similarly to how they'd search on Google.com, so does anything relevant come up for a search for makita 1100 there? No HG1100s listed. It is all the 1100W generator. (At least we show up as the first product result... ;-)
I was a bit annoyed by this at first, but now I'm mostly curious. What can I do to make this search work on our site AND at Google.com? When I have an answer, I'll post it here. I'll be playing with my IBLs as well to see what I can do since the mini factors those as well.
Oh, and in case anyone is wondering, my new "Fun" site is coming along nicely. If you haven't already heard about it, I'm sure you will soon enough.
Brian.
Labels:
Databases,
Google,
Programming,
SEO Strategies,
Site Search
Monday, October 17, 2005
Beginning insights from the Google Mini
We bought Google. Ok, so it was just a mini, but it's cleared up my understanding of Google a bit.
Upon setting it up, nothing really great came to mind. However, after doing some tweaking (it's powering the search on toolbarn.com right now), it has become clear that the more I learn about this box and its capabilities the more I understand Google.
For starters, I ended up having to cloak some pages to our mini to get our results to come out right. A search for makita drills gave me results of milwaukee drills as well because of our breadcrumb navigation and the cross-linking. Every page on our site was returned for power tools because it's in the main navigation. Some searches return poor results, such as makita 5000, which several people have searched for. I have a temp solution in place for that.
So, after playing with it and then sitting back to think about how it works / serves results, I figured something out that may end up being priceless.
Searches done on our site that the Google Mini returns 0 results for need site changes.
I'm logging how many results the mini returns for every search that is done on our site. What I'm seeing is patterns in the way people search that yield no results. Well guess what... people search for those same phrases at Google and many times get 0 relevant results there as well. Sure, they'll get results, but the relevancy isn't there.
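A small sketch of how that log can be boiled down to the phrases worth acting on. The log format here, a list of (query, result count) pairs, and the function name `zero_result_queries` are assumptions for illustration; the real log could just as easily be a database table.

```python
from collections import Counter

def zero_result_queries(search_log):
    """Count searches that returned 0 results, most frequent first.
    Queries are lowercased and whitespace-normalized before counting."""
    counts = Counter(
        " ".join(query.lower().split())
        for query, hits in search_log
        if hits == 0
    )
    return counts.most_common()

# Hypothetical log entries: (what was typed, how many results came back).
log = [
    ("makita 5000", 0),
    ("Makita  5000", 0),
    ("makita drills", 42),
    ("ub181", 0),
]
print(zero_result_queries(log))
```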
For example, the search for makita 5000 could be a GV5000 or an HR5000, accessories for either of those, or perhaps something else. The results at Google.com are 5000 RPM, 5000 staples per pack on a stapler page, or 5000 Watts for a generator. Why wouldn't I do some work to make my site come up #1 for makita 5000, since I've seen quite a few searches on our site for it (I'm sure I'll see more after hitting submit for this post) and the results are poor in the SERPs?
Now, that's not the only thing I've learned. Google's mini, while having some technical differences due to only being concerned with a small sampling of the web, gives me a sense of what optimizes better between 2 pages. For example, I can create a test result set and have 1 page using identical link text to point to 2 pages, then have their algo decide which is better optimized. Any SEO that just read that should be getting out their credit cards. How useful is that? I've seen some results from those experiments already within the Google SERPs. Oh, and I can suppress those pages from being served in the results, allow them for a few minutes to do my test, then hide them again. Very cool.
It also makes sense now why there is a delay between crawling and showing up in the SERPs.
There is a 3-step process that the mini uses.
1) Crawl.
2) Build Index.
3) Launch / Replicate Index.
While they've undoubtedly got more processing power and storage than thousands of these little guys for their primary engine (Dual PIII with 2GB of RAM in that little blue box), indexing our site takes it over 4 hours. By default, it tries to keep no more than 4 connections open at a time to any domain. Given how many pages our site is comprised of, 4 pages at a time makes for a very long crawl time.
Once everything is crawled, the index building takes it almost 30 minutes for our sites. That's just 25,000 pages that we index out of the billions that they index. We're limiting which pages the mini crawls and assigning it a cookie so it doesn't see 100,000 different checkout page URLs to evaluate. Talk about some major processing power to build an index on the data they gather - mind blowing. When this machine takes that long for 25,000 pages it's got to take a while for their index and that's got to take more processing power than I've ever considered building. =)
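For a rough feel of the crawl rate, here is the back-of-the-envelope arithmetic. It assumes the 4-hour crawl and the 25,000 indexed pages refer to the same run, and it treats "over 4 hours" as exactly 4, so the real rate is a bit slower than what this prints.

```python
pages = 25_000              # pages we let the mini index
crawl_seconds = 4 * 60 * 60 # "over 4 hours", taken as exactly 4
connections = 4             # default limit of open connections per domain

# Overall throughput across all connections.
pages_per_second = pages / crawl_seconds
# Average time each connection spends on a single page.
seconds_per_page_per_connection = connections * crawl_seconds / pages

print(f"{pages_per_second:.2f} pages/second overall")
print(f"{seconds_per_page_per_connection:.2f} seconds per page per connection")
```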
Then, after everything is crawled and an index is built, it replicates the index. It copies the old index to a new location, sets the copy active, then replaces the primary index with the new build, followed by a switch to the new index after testing for our required results. After considering the safeguards that it gives by having some test searches with required results, I'm sure they've got a ton of required results to make an index active in their web search. For example, searching for microsoft better give you microsoft.com somewhere in the top so many pages of the SERPs or you've got issues.
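The required-results safeguard can be sketched like this. It is only an illustration of the idea, not the mini's actual mechanism: an index is modelled as a plain function from a query to a ranked list of URLs, and the names `passes_required_results` and `launch_index` are made up.

```python
def passes_required_results(search, required, top_n=10):
    """search: callable taking a query and returning a ranked list of URLs.
    required: dict mapping a test query to a URL that must appear in the top_n."""
    for query, url in required.items():
        if url not in search(query)[:top_n]:
            return False
    return True

def launch_index(new_index, current_index, required, top_n=10):
    """Return the index that should be live: the new build only if it passes
    every required-result test, otherwise keep serving the current one."""
    if passes_required_results(new_index, required, top_n):
        return new_index
    return current_index

# Hypothetical indexes for illustration.
old_build = lambda q: {"microsoft": ["microsoft.com"]}.get(q, [])
broken_build = lambda q: []
required = {"microsoft": "microsoft.com"}
live = launch_index(broken_build, old_build, required)
print(live is old_build)
```

Keeping the old index around until the new one passes is what makes a failed build harmless: searches keep being served from the previous copy.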
I've got more, but I'm still pondering what useful information I can garner from the insight. Really, for under $4000 (we bought the extra year of upgrades and hardware replacement which is where the extra $1000 came in) it's probably going to be a worthwhile investment just for improving our rankings in the SERPs, let alone the search results it gives our customers.
Brian.
Labels:
Google,
Online Marketing,
Programming,
SEO Strategies,
Site Search
Wednesday, October 12, 2005
MP3 File Manipulation
So I started on a site that is going to do some manipulation of MP3 files dynamically, taking a preview out of the song that should be sort of representative of the song. I toyed with a few different ways of pulling out the section I wanted, looking for large dynamic changes, widest frequency range, flattest frequency pattern, breaks in the song, etc. I also tried a few different methods of making the snip. Here's what I found.
1) Regardless of how cool it is to be able to detect changes in dynamics, frequency ranges, patterns, breaks, or anything else, it just doesn't mean that the section will be representative.
2) MP3 players are fairly bulletproof and don't mind some abuse. There isn't any need to split properly across frames within the file - MP3 players (all that I've tested in Win and Linux) handle improperly split files just fine.
3) Bitrates don't translate perfectly. Just because you know how many bits per second a file is doesn't mean you can clip the file at seconds * bits/second and get the proper spot. Nor can you take that spot, add some number of seconds * bits/second and get a proper length snip.
4) Audio files are much more fun to work with than images or html files.
5) Samples at 48kb/s without a bandpass filter of any kind (high pass and low pass) don't sound very good.
6) Audio file format conversion is pretty simple. The documentation on formats is very thorough online, so converting from one well-documented format to another is easy.
7) The existing libraries for working with MP3 files rock. Even if you throw country, jazz, metal or blues at them, they rock.
After playing with the low-level stuff and having some fun, I ended up using the libraries since that'll take some of the more tedious work out of my hands - especially as formats evolve. But it was fun to play with while it lasted.
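For point 3, this is roughly what the naive math looks like. It is a sketch, not the code from the site, and the function names and the `header_bytes` parameter are made up for illustration. The estimate drifts for variable-bitrate files, is thrown off by any tag data stored ahead of the audio (an ID3v2 tag, for example), and almost never lands on a frame boundary.

```python
def naive_byte_offset(seconds, bitrate_kbps, header_bytes=0):
    """Estimate the byte position of a timestamp from a constant bitrate.
    bitrate_kbps is in kilobits per second; 8 bits per byte."""
    return header_bytes + int(seconds * bitrate_kbps * 1000 / 8)

def naive_snip_range(start_s, length_s, bitrate_kbps, header_bytes=0):
    """Estimated (start, end) byte range for a preview snip."""
    start = naive_byte_offset(start_s, bitrate_kbps, header_bytes)
    end = naive_byte_offset(start_s + length_s, bitrate_kbps, header_bytes)
    return start, end

# A 15-second preview starting 30 seconds into a 128 kb/s file.
print(naive_snip_range(30, 15, 128))
```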