4:4:4 10bit single CMOS HD project - Page 181 at DVinfo.net
Old April 1st, 2005, 01:29 PM   #2701
Regular Crew
 
Join Date: Feb 2005
Location: .
Posts: 52
If a thread is just calling a routine, there should be no additional latency. It should be as fast as if it were called by WinMain(), which is just a thread too.

BTW, Linux is neither simple nor foolproof (but a damned good OS). It is possible to write inefficient and buggy code on any platform, even on the Mac. ;-)
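
As a minimal sketch of that point (CaptureLoop is a hypothetical stand-in for whatever routine is being called; error handling omitted):

    #include <windows.h>

    void CaptureLoop();  // your existing routine (hypothetical name)

    DWORD WINAPI CaptureThread(LPVOID)
    {
        CaptureLoop();   // costs the same here as when called from WinMain()
        return 0;
    }

    // in WinMain():
    //   HANDLE h = CreateThread(NULL, 0, CaptureThread, NULL, 0, NULL);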
Kyle Granger is offline  
Old April 1st, 2005, 01:36 PM   #2702
Regular Crew
 
Join Date: Feb 2005
Location: .
Posts: 52
If your display is chewing up 60% of the CPU (is this also true when not writing?), you may want to skip every other frame on the display and bring it down to 30%.

60% is way high.
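
Something like this is all the frame skipping needs (DrawFrame is a hypothetical stand-in for your display call; capture and disk writes still see every frame):

    void DrawFrame(const unsigned char* frame);  // hypothetical display call

    void MaybeDisplay(const unsigned char* frame)
    {
        static unsigned n = 0;
        if ((n++ & 1) == 0)      // show frames 0, 2, 4, ... only
            DrawFrame(frame);    // display load is halved
    }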
Kyle Granger is offline  
Old April 1st, 2005, 09:39 PM   #2703
Inner Circle
 
Join Date: May 2003
Location: Australia
Posts: 2,762
<<<-- Originally posted by Kyle Granger : If a thread is just calling a routine, there should be no additional latency. It should be as fast as if it were called by WinMain(), which is just a thread too.
-->>>

Obin, what is in your inner loops? If you are calling routines each time you get a pixel, you will be wasting a lot of time on latency. One way around this is to flatten out the code (or, a simpler choice at this stage, compile inline), eliminating as many subroutine calls as possible by integrating them into one routine in the inner loops (a sketch follows below). If you have profiled your software properly, you will know which loops the program spends 90% of its execution time in. It helps a lot to do all the per-pixel work at once, in a single pass (capture has its own speed/timing, separate from storage, and the two can't conveniently be integrated). We would probably be surprised at how many projects don't model this behaviour properly, so it is worth a rescan. My memory has gone again, so I have forgotten the third and most vital thing; I will try to update if I remember.
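
A minimal sketch of the flattening, assuming a simple per-pixel pack (the routine name and the 12-to-10-bit shift are only examples):

    // Before: dst[i] = ProcessPixel(src[i]) -- one call per pixel.
    // After: the work sits directly in the loop, no call/return overhead.
    void PackFrame(const unsigned short* src, unsigned short* dst, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = (unsigned short)(src[i] >> 2);  // e.g. 12-bit -> 10-bit
    }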

I have been involved with Forth, and am aware of the large (unseen) latency problems in Windows PC systems. In the old days, hits of thousands of percent happened; I doubt much of that happens in XP, but from using XP it looks far from ideal. So 50% of your execution cycles could be slipping away, and those are just the ones you can prevent (why do you think the Mac always does so well?).

I think it is good to profile the weaknesses of your OS/PC and work around them; something like the timing routine below is a start.
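
A rough way to do that on Windows, using the high-resolution counter (this is standard Win32, nothing exotic):

    #include <windows.h>
    #include <stdio.h>

    void TimeIt(void (*fn)(void), const char* name)
    {
        LARGE_INTEGER f, t0, t1;
        QueryPerformanceFrequency(&f);
        QueryPerformanceCounter(&t0);
        fn();                                // the routine under test
        QueryPerformanceCounter(&t1);
        printf("%s: %.3f ms\n", name,
               1000.0 * (double)(t1.QuadPart - t0.QuadPart) / (double)f.QuadPart);
    }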
Wayne Morellini is offline  
Old April 1st, 2005, 10:13 PM   #2704
Inner Circle
 
Join Date: May 2003
Location: Australia
Posts: 2,762
<<<-- Originally posted by Kyle Granger : If your display is chewing up 60% of the CPU (this is also true when not writing?), you may want to skip every other frame on the display and bring it down to 30%.

60% is way high. -->>>

I forget whether Obin is using the 3.4GHz P4 or the 2GHz PM, but wouldn't 30% be high, even for a software solution?

I know Obin is using GPU programming for display, so I would expect it to be closer to 6%. What I said before about slow software emulation of missing GPU functions still applies, so I would check for that (and still keep those latency problems in mind).

Obin:

There was another thing I forgot (the sites I suggested about configuring a machine for best performance would help here): write the inner-loop code so you can force it to stay in the cache. If the code strays outside the cache, a page has to be read in, and another potentially written out, only for the process to be reversed when it strays somewhere else; that could easily consume 30% (and calling a foreign routine whose cache layout you have no control over might do just that, which could also be a problem with GPU software emulation). A page is big; that's a lot of cycles, and even a subroutine call can burn a lot of cycles before you hit new code. Subroutine-oriented languages tend to have a lot of problems on modern PCs (and other machines), partly because their high-speed memories are not made for low-latency, non-sequential instruction flow out of cache. A sketch of the chunked approach follows.
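
A sketch of the chunked approach (the 16 KB chunk size is an assumption; measure against your own CPU's cache):

    const int CHUNK = 16 * 1024 / sizeof(unsigned short);  // pixels per chunk

    void ProcessFrame(unsigned short* p, int n)
    {
        for (int base = 0; base < n; base += CHUNK)
        {
            int end = (base + CHUNK < n) ? base + CHUNK : n;
            // do ALL per-pixel steps on this chunk before moving on,
            // rather than making several full passes over the frame
            for (int i = base; i < end; i++)
                p[i] = (unsigned short)(p[i] >> 2);  // stand-in for real work
        }
    }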

I don't know how C compilers are in general nowadays, but the code they produced used to perform pretty poorly compared to the Intel compiler; I think MS eventually improved theirs (I don't know if to the same level as Intel). You could get a massive boost by switching to the best compiler back in those days. Worth finding out about.

As long as you have an active model in your head of how the machine (and the OS) actually physically works, plus the experience, you can see lots of issues you will never see in the code itself. I only have the physical machine sufficiently mapped in my mind, so I can make good guesses; I suggest buying advanced books on real-time games programming (with machine code as well as C) if you really get stuck.

I am going to take a hunch, knowing how lousy PCs can get, that the performance difference between unrefined code and the most refined code might be ten times on a Windows PC. So if you have improved your performance by ten times since you started coding, you are close to the maximum you can get. Does that sound possible, Kyle?
Wayne Morellini is offline  
Old April 2nd, 2005, 05:32 AM   #2705
Regular Crew
 
Join Date: Feb 2005
Location: .
Posts: 52
> so if you have improved your performance by ten times
> since you started coding, you are close to the maximum
> you can get. Does that sound possible Kyle?

I suppose a factor of ten could well be possible, but honestly, I haven't thought about it too much.

Obin,

A few more suggestions, just to get your application working:

1) Display one out of three images. This will give you 8 frames/sec and should bring your graphics CPU usage down to 20% (from 60%). This should let you work in peace.

2) Profile where your Display CPU usage is going. Is it in the processing of your RAW 16-bit data, or is it sending the data to the GPU and displaying it? These are clearly separate tasks, easy enough to comment out to profile separately.

3) Try displaying only one of the primaries. I.e., for every 2x2 square of Bayer pixels, display only one of the green pixels as a monochrome (luma) bitmap (sketched after this list).

4) Consider using OpenGL for the screen drawing. Sending a bitmap to the GPU and displaying it is only a few lines of code. There is a lot of introductory code available on the net. It should not be complicated at all.
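
A rough sketch of (3) and (4) together, assuming an RGGB Bayer layout, 8-bit data, and a GL context already created (the names are mine, not from Obin's code):

    #include <GL/gl.h>

    // one green per 2x2 block -> half-size luma bitmap
    void GreenToLuma(const unsigned char* bayer, unsigned char* luma,
                     int w, int h)              // w, h = full Bayer size
    {
        for (int y = 0; y < h; y += 2)
            for (int x = 0; x < w; x += 2)
                luma[(y/2)*(w/2) + x/2] = bayer[y*w + x + 1]; // G in the R-G row
    }

    void DrawLuma(const unsigned char* luma, int w2, int h2)  // half size
    {
        glPixelStorei(GL_UNPACK_ALIGNMENT, 1);
        glRasterPos2f(-1.0f, -1.0f);   // start at the window's lower-left
        glDrawPixels(w2, h2, GL_LUMINANCE, GL_UNSIGNED_BYTE, luma);
        // glDrawPixels treats row 0 as the bottom row, so the picture is
        // vertically flipped unless you account for that
    }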

Good luck!
Kyle Granger is offline  
Old April 3rd, 2005, 12:35 AM   #2706
Trustee
 
Join Date: Jan 2003
Location: Wilmington NC
Posts: 1,414
thank you Kyle..I am working on all your ideas
Obin Olson is offline  
Old April 3rd, 2005, 11:51 AM   #2707
Trustee
 
Join Date: Jan 2003
Location: Wilmington NC
Posts: 1,414
we are doing a bunch of re-coding now with the software to streamline things a bit...and I am going to get a new graphics card to see if that helps the CPU % overhead with the display..looks like the older GPU card I have may be spitting the tasks back out to the CPU, giving us the very high display CPU %
Obin Olson is offline  
Old April 3rd, 2005, 11:44 PM   #2708
Inner Circle
 
Join Date: May 2003
Location: Australia
Posts: 2,762
DirectX cards.

If you get a new graphics card to measure the results, get one that is closest to what your code and GPU shader package depend on. That will be one of the latest mid-to-high-end cards from Nvidia or ATI. Nvidia has had the most advanced shaders in its cards over the last year or so, just not always the fastest at the functions games have been using. ATI either has similar capability in its latest top cards now, or will by the time the Xbox 2 comes out (DX10 compliant).

Either Nvidia is a clear winner for you (some of their lower-end cards have the same shader functions), or the functions you use are all supported on ATI. Which one is a compromise, as ATI may have a low-cost DirectX 10 part by the end of the year (or maybe only in the Xbox 2). DirectX 10 would definitely outclass everything out now for shader programming. With DX10, or 11, you could whack most of the image code directly onto the card and only dump the results back to the PC to be saved, since it is meant to support most full program-flow capabilities. Some of us want to implement new true 3D raytracing software that will make ordinary 3D look second rate; that is difficult on a PC.

Go to tomshardware.com, www.digit-life.com, or extremetech.com to find articles on the current situation with cards and DirectX.

I don't know about the latest Intel GPU, but most integrated GPUs are a compromise and support limited hardware functionality. ATI or Nvidia integrated parts might have near-desktop functionality in the GPU, but they have problems with shared memory non-linearly stealing memory time (making memory loads jump from place to place, which is the worst thing to do, unless it is managed). Some integrated chips have their own memory, though; as long as it is big enough for you, the programs, and the OS to occupy at once, you will get the best efficiency.

What card were you using, Obin? You should be able to map the low-level functionality of a card against the instructions/functions you use, from its formal low-level specifications on the maker's web site, probably in a whitepaper-type PDF document (or email their development section). Vendors support the same functionality in different ways, but apart from Nvidia and ATI (and maybe the slower Matrox and Wildcat cards) there are no other cards worth looking at in terms of completeness and performance.

You can get around many issues with integrated graphics as well: find out what the chip does do, then separate the execution into one batch the GPU shaders support and put the stuff that has to be done in software into your own customised routines (bypassing DirectX), as much as is feasible for performance. A process of program code factoring; a caps check like the sketch below is the place to start.
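
A sketch of where that factoring starts, using the DirectX 9 caps query (the shader-model 2.0 threshold is my assumption, not a rule):

    #include <d3d9.h>

    bool GpuCanRunShaders(IDirect3D9* d3d)
    {
        D3DCAPS9 caps;
        d3d->GetDeviceCaps(D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL, &caps);
        return caps.PixelShaderVersion >= D3DPS_VERSION(2, 0);
    }

    // if false, run your own CPU routine instead of letting DirectX fall
    // back to slow software emulation behind your back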

Have a good day.

Wayne.
Wayne Morellini is offline  
Old April 4th, 2005, 08:27 PM   #2709
Trustee
 
Join Date: Jan 2003
Location: Wilmington NC
Posts: 1,414
we are now testing a bypassed method of image calculation without GPU support to see what the results will be...looks like our current setup has the GPU choking and shooting all the work BACK to the CPU...producing our 50-60% CPU numbers just for preview!

I will know more in the morning...would it be too much to ask for some PROGRESS!!? ;)
Obin Olson is offline  
Old April 5th, 2005, 04:25 AM   #2710
Inner Circle
 
Join Date: May 2003
Location: Australia
Posts: 2,762
Good move. What percentage are the bypassed routines using?

I have news on the next ATI chip with new shaders, due mid-year; the low-end or low-powered versions might come at the end of the year (I imagine something like this may turn up on main-boards). Whatever solution you go for, try to get involved with that vendor's official development section; they should have answers to many of these questions, hopefully in low-cost support documents (Intel/AMD and Microsoft are also good sources of development information, I think). http://www.gamedev.net/reference/ has good resources too, and igda.org and gamasutra are also spots that may help.

I should be posting links (God willing) about new silent coolers, storage, etc. in the technical thread in the next day or so. I should also be posting technical design tips, which I haven't done in times past because so much of that material is a potential source of patentable income, but some of it is not, or less so.
Wayne Morellini is offline  
Old April 5th, 2005, 06:40 PM   #2711
Trustee
 
Join Date: Jan 2003
Location: Wilmington NC
Posts: 1,414
well well..I get 36% CPU load now with the image resize being done by the CPU and then feeding that to the GPU...this is working very well, but we still get choked up with the save AND display at the same time...

I have a profiling test app now from my programmer that I will try. It will tell us what the HECK is going on in my dfi system here...he says things are working on his system but not mine..Kyle, any ideas why we would have display refresh issues when we start saving raw data to the disks? Display AND packing only take about 45-50% CPU, and I KNOW saving will not take 50%!! it's like the thing has timing issues..we did try your suggestions from before..any more ideas pop into your head? we are still using DirectDraw for display AFTER pixel packing and resize are done with the CPU
Obin Olson is offline  
Old April 5th, 2005, 07:44 PM   #2712
Inner Circle
 
Join Date: May 2003
Location: Australia
Posts: 2,762
<<<-- Originally posted by Obin Olson : well well..I get 36% cpu load now with the image resize being done by the cpu and then feeding that to the gpu...this is working very well but we still get choked up with the save AND display at the same time... -->>>

I am not Kyle, but I must say, that is more like it. Strange; image resize should be a basic function on most GPUs, so it should not hurt. Maybe it is the way the resize is done. I assume you are talking about a resize from one screen resolution to another. Cards nowadays should have hardware that automatically displays an image at a different resolution from the one it is stored at, virtually for free; no resize needed. Still, are we talking about a resize done after the GPU has finished, or a resize before the GPU work is done? If the GPU is stalling, then resizing pre versus post would explain the differences.

As for the rest of it (what did you mean by "dfi system", anyway?), it is most likely that memory access timing thing I mentioned last year: too many things competing for memory at the same time stall the memory pipeline, causing long delays in access to main memory. Keep things in cache and sequentialise everything that does not need to be parallel, then adjust it all to work around each other; a ring-buffer sketch is below.
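
A bare-bones sketch of that decoupling: capture fills a small ring of frame buffers and a separate thread drains them to disk, so save and display never fight over the same buffer (names and sizes are assumptions, and a real version must check that head - tail < RING before writing):

    #include <windows.h>

    const LONG RING = 8;
    unsigned char* ring[RING];             // preallocated frame buffers
    volatile LONG head = 0, tail = 0;      // capture bumps head, writer bumps tail

    void WriteFrameToDisk(const unsigned char*);  // your I/O call (hypothetical)

    DWORD WINAPI WriterThread(LPVOID)
    {
        for (;;)
        {
            while (tail == head) Sleep(1);        // nothing queued yet
            WriteFrameToDisk(ring[tail % RING]);
            InterlockedIncrement(&tail);
        }
    }

    // capture side: fill ring[head % RING], then InterlockedIncrement(&head)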
Wayne Morellini is offline  
Old April 6th, 2005, 03:04 AM   #2713
Regular Crew
 
Join Date: Feb 2005
Location: .
Posts: 52
Obin,
Wayne is absolutely correct when he says the GPU should be doing the resize: you get that for free.
What is the size of the bitmap you are creating? How are you doing the Bayer interpolation?
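
For reference, the free resize looks like this in OpenGL: upload the frame into a texture and draw it on a quad, and the card scales it to the window at no extra cost. A sketch, assuming the texture was already created with glTexImage2D at a size covering w x h and that the card handles those dimensions:

    #include <GL/gl.h>

    void DrawScaled(GLuint tex, const unsigned char* frame, int w, int h)
    {
        glEnable(GL_TEXTURE_2D);
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
                        GL_LUMINANCE, GL_UNSIGNED_BYTE, frame);
        glBegin(GL_QUADS);                        // the GPU scales this quad
        glTexCoord2f(0, 0); glVertex2f(-1, -1);   // to the window for free
        glTexCoord2f(1, 0); glVertex2f( 1, -1);
        glTexCoord2f(1, 1); glVertex2f( 1,  1);
        glTexCoord2f(0, 1); glVertex2f(-1,  1);
        glEnd();
    }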
Kyle Granger is offline  
Old April 6th, 2005, 08:26 AM   #2714
Trustee
 
Join Date: Jan 2003
Location: Wilmington NC
Posts: 1,414
about 960x540, or 1/4 the resolution of the 1080 image..this is what we do so that the picture will fit on a small 1024x768 screen
Obin Olson is offline  
Old April 6th, 2005, 08:32 AM   #2715
Trustee
 
Join Date: Jan 2003
Location: Wilmington NC
Posts: 1,414
we take the RGGB and make it one pixel instead of four..this is what the GPU was choking on and spitting back to the CPU
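
That 4:1 reduction is cheap enough on the CPU; a sketch assuming RGGB order and 8-bit data (each 2x2 block becomes one RGB pixel, greens averaged):

    void BayerQuarter(const unsigned char* b, unsigned char* rgb, int w, int h)
    {
        for (int y = 0; y < h; y += 2)
            for (int x = 0; x < w; x += 2)
            {
                const unsigned char* p = b + y*w + x;      // top-left of block
                unsigned char* o = rgb + 3*((y/2)*(w/2) + x/2);
                o[0] = p[0];                               // R
                o[1] = (unsigned char)((p[1] + p[w]) / 2); // average of two Gs
                o[2] = p[w + 1];                           // B
            }
    }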
Obin Olson is offline  